NPU function error: aclrtSynchronizeStream(stream_), error code is 107003
Quick fix:

Bash:
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:False

or Python:
os.environ['PYTORCH_NPU_ALLOC_CONF'] = 'expandable_segments:False'
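If you take the Python route, one likely caveat (assumed here by analogy with PYTORCH_CUDA_ALLOC_CONF; the source does not state it explicitly): allocator configuration is read when the allocator initializes, so the variable should be set before torch / torch_npu are imported. A minimal sketch:

```python
import os

# Assumption (by analogy with PYTORCH_CUDA_ALLOC_CONF): the NPU caching
# allocator reads PYTORCH_NPU_ALLOC_CONF at initialization, so set it
# BEFORE importing torch / torch_npu -- setting it afterwards may have no effect.
os.environ['PYTORCH_NPU_ALLOC_CONF'] = 'expandable_segments:False'

# import torch
# import torch_npu  # NPU-only imports, left commented so the sketch runs off-device
```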
1. Environment
- OS: EulerOS V2R13 (aarch64 GNU/Linux)
- Device: Ascend 910B3
- torch==2.1.0
- torch-npu==2.1.0.post3-20240523
- python==3.9.10
- CANN 8.0.T13
2. Problem (with error log context)

The error occurs while loading the model:

from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "/home/zsy/zsy/Models/Qwen2-VL-7B-Instruct", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("/home/zsy/zsy/Models/Qwen2-VL-7B-Instruct")

Error traceback:
(agent) [root@03e78bdcc010 Vid-RAG]# python test_env.py
tools module loaded
/home/ma-user/anaconda3/envs/agent/lib/python3.9/site-packages/torch/_utils.py:831: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
return self.fget.__get__(instance, owner)()
`Qwen2VLRotaryEmbedding` can now be fully parameterized by passing the model config through the `config` argument. All other arguments will be removed in v4.46
Loading checkpoint shards: 0%| | 0/5 [00:00<?, ?it/s]
Traceback (most recent call last):
File "/home/zsy/zsy/Agent-4-Video/Vid-RAG/test_env.py", line 43, in <module>
model = Qwen2VLForConditionalGeneration.from_pretrained(
File "/home/ma-user/anaconda3/envs/agent/lib/python3.9/site-packages/transformers/modeling_utils.py", line 4264, in from_pretrained
) = cls._load_pretrained_model(
File "/home/ma-user/anaconda3/envs/agent/lib/python3.9/site-packages/transformers/modeling_utils.py", line 4777, in _load_pretrained_model
new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
File "/home/ma-user/anaconda3/envs/agent/lib/python3.9/site-packages/transformers/modeling_utils.py", line 942, in _load_state_dict_into_meta_model
set_module_tensor_to_device(model, param_name, param_device, **set_module_kwargs)
File "/home/ma-user/anaconda3/envs/agent/lib/python3.9/site-packages/accelerate/utils/modeling.py", line 397, in set_module_tensor_to_device
clear_device_cache()
File "/home/ma-user/anaconda3/envs/agent/lib/python3.9/site-packages/accelerate/utils/memory.py", line 56, in clear_device_cache
torch.npu.empty_cache()
File "/home/ma-user/anaconda3/envs/agent/lib/python3.9/site-packages/torch_npu/npu/memory.py", line 143, in empty_cache
torch_npu._C._npu_emptyCache()
RuntimeError: unmapHandles:build/CMakeFiles/torch_npu.dir/compiler_depend.ts:401 NPU function error: aclrtSynchronizeStream(stream_), error code is 107003
[ERROR] 2025-01-03-22:31:30 (PID:91978, Device:0, RankID:-1) ERR00100 PTA call acl api failed
[Error]: The stream is not in the current context.
Check whether the context where the stream is located is the same as the current context.
EE9999: Inner Error!
EE9999: 2025-01-03-22:31:30.241.883 Stream synchronize failed, stream is not in current ctx, stream_id=2.[FUNC:StreamSynchronize][FILE:api_impl.cc][LINE:1018]
TraceBack (most recent call last):
rtStreamSynchronize execute failed, reason=[stream not in current context][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:53]
synchronize stream failed, runtime result = 107003[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]
3. Solution

huangyunlong (2 months ago) suggested disabling expandable segments (virtual memory) and retrying:

export PYTORCH_NPU_ALLOC_CONF=expandable_segments:False

Source: RuntimeError: unmapHandles:build/CMakeFiles/torch_npu.dir/compiler_depend.ts:401 NPU function error: aclrtSynchronizeStream(stream_), error code is 107003 · Issue #IB361T · Ascend/pytorch - Gitee.com
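For a one-off run, the variable can be exported in the shell session before relaunching the failing script. A minimal sketch (test_env.py is the script from the log in section 2 and needs an Ascend device, so that line is left commented):

```shell
# Disable expandable segments for this shell session, then rerun the loader
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:False
echo "$PYTORCH_NPU_ALLOC_CONF"
# python test_env.py   # the failing script from section 2 (requires an Ascend NPU)
```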