Quick fix:

Error when loading the model: NPU function error: aclrtSynchronizeStream(stream_), error code is 107003

Bash

export PYTORCH_NPU_ALLOC_CONF=expandable_segments:False

Or

Python

import os
os.environ['PYTORCH_NPU_ALLOC_CONF'] = 'expandable_segments:False'
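
Note: as far as I understand, the NPU caching allocator reads PYTORCH_NPU_ALLOC_CONF when it initializes, so the Python variant should run before torch / torch_npu (and transformers, which imports them). A minimal sketch under that assumption, reusing the model path from the reproduction below:

Python

import os
# Assumption: this must be set before torch / torch_npu is imported,
# so the NPU caching allocator picks it up at initialization.
os.environ['PYTORCH_NPU_ALLOC_CONF'] = 'expandable_segments:False'

import torch
import torch_npu  # registers the NPU backend
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor

model_path = "/home/zsy/zsy/Models/Qwen2-VL-7B-Instruct"  # path from this post
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_path, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_path)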

1. Environment

  • OS: aarch64 GNU/Linux, EulerOS V2R13

  • Device: Ascend 910B3

  • torch==2.1.0

  • torch-npu==2.1.0.post3-20240523

  • python=3.9.10

  • CANN 8.0.T13

2. Symptom (with error-log context)

Error occurs when loading the model:

# Load Qwen2-VL-7B-Instruct from a local checkpoint; the failure occurs here.
from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "/home/zsy/zsy/Models/Qwen2-VL-7B-Instruct", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("/home/zsy/zsy/Models/Qwen2-VL-7B-Instruct")

Error traceback:

(agent) [root@03e78bdcc010 Vid-RAG]# python test_env.py
tools module loaded
/home/ma-user/anaconda3/envs/agent/lib/python3.9/site-packages/torch/_utils.py:831: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  return self.fget.__get__(instance, owner)()
`Qwen2VLRotaryEmbedding` can now be fully parameterized by passing the model config through the `config` argument. All other arguments will be removed in v4.46
Loading checkpoint shards:   0%|                                                            | 0/5 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/home/zsy/zsy/Agent-4-Video/Vid-RAG/test_env.py", line 43, in <module>
    model = Qwen2VLForConditionalGeneration.from_pretrained(
  File "/home/ma-user/anaconda3/envs/agent/lib/python3.9/site-packages/transformers/modeling_utils.py", line 4264, in from_pretrained
    ) = cls._load_pretrained_model(
  File "/home/ma-user/anaconda3/envs/agent/lib/python3.9/site-packages/transformers/modeling_utils.py", line 4777, in _load_pretrained_model
    new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
  File "/home/ma-user/anaconda3/envs/agent/lib/python3.9/site-packages/transformers/modeling_utils.py", line 942, in _load_state_dict_into_meta_model
    set_module_tensor_to_device(model, param_name, param_device, **set_module_kwargs)
  File "/home/ma-user/anaconda3/envs/agent/lib/python3.9/site-packages/accelerate/utils/modeling.py", line 397, in set_module_tensor_to_device
    clear_device_cache()
  File "/home/ma-user/anaconda3/envs/agent/lib/python3.9/site-packages/accelerate/utils/memory.py", line 56, in clear_device_cache
    torch.npu.empty_cache()
  File "/home/ma-user/anaconda3/envs/agent/lib/python3.9/site-packages/torch_npu/npu/memory.py", line 143, in empty_cache
    torch_npu._C._npu_emptyCache()
RuntimeError: unmapHandles:build/CMakeFiles/torch_npu.dir/compiler_depend.ts:401 NPU function error: aclrtSynchronizeStream(stream_), error code is 107003
[ERROR] 2025-01-03-22:31:30 (PID:91978, Device:0, RankID:-1) ERR00100 PTA call acl api failed
[Error]: The stream is not in the current context.
        Check whether the context where the stream is located is the same as the current context.
EE9999: Inner Error!
EE9999: 2025-01-03-22:31:30.241.883  Stream synchronize failed, stream is not in current ctx, stream_id=2.[FUNC:StreamSynchronize][FILE:api_impl.cc][LINE:1018]
        TraceBack (most recent call last):
        rtStreamSynchronize execute failed, reason=[stream not in current context][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:53]
        synchronize stream failed, runtime result = 107003[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]

3. Solution

huangyunlong, 2 months ago (on the Gitee issue linked below):

Try turning off virtual memory (expandable segments) and see whether the error goes away:

export PYTORCH_NPU_ALLOC_CONF=expandable_segments:False
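
For context, the traceback fails inside torch_npu._C._npu_emptyCache() at unmapHandles, apparently while releasing expandable-segment mappings, which is why disabling the feature avoids the error. A minimal sanity-check sketch, assuming a single visible device npu:0 (a hypothetical test, not from the original issue):

Python

import os
os.environ['PYTORCH_NPU_ALLOC_CONF'] = 'expandable_segments:False'  # set before importing torch_npu

import torch
import torch_npu

# Allocate, free, then empty the cache; with expandable segments disabled,
# this empty_cache() should no longer raise error 107003.
x = torch.ones(1024, 1024, device="npu:0")
del x
torch.npu.empty_cache()
print("empty_cache OK")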

Source: RuntimeError: unmapHandles:build/CMakeFiles/torch_npu.dir/compiler_depend.ts:401 NPU function error: aclrtSynchronizeStream(stream_), error code is 107003 · Issue #IB361T · Ascend/pytorch - Gitee.com
