Deploying the Full-Size DeepSeek R1: A Pitfall Guide (vLLM 0.7.1)



m0_63171455 · Published 2025-02-05 21:13:35

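Today I saw the vLLM folks announce pipeline-parallel (PP) support for DeepSeek R1 on WeChat Moments, so I started tinkering right away. If the giant MoE I'm training ever goes live, I'd better have the serving side worked out in advance, right? Here are the pitfalls I hit, hopefully in plainer language than the official docs.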

Distributed Inference and Serving: https://docs.vllm.ai/en/latest/serving/distributed_serving.html#running-vllm-on-multiple-nodes

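Zhihu user @游凯超 said the whole process has to be silky smooth, so the two of us ran a few verifications together. You should now only need Step 0 and Step 3 to get it running. If you hit autoscaler-related errors, Step 1 should resolve them.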

Step 0 Prepare weights & Environment

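The weights are huge, so even with a fast connection I don't recommend downloading them directly: a direct download will most likely time out or saturate your public IP. Get a copy from HF or through a proxy first. Today's setup is multi-node, multi-GPU on 8xH20 (x2), which maps to TP size 8 and PP size 2, so you need two such machines. One assumption: the two machines can reach each other over the network (IB is not required) and storage is shared (NAS or OSS both work). Once that is ready, move on to the first step.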
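To make the sizing above concrete, here is a minimal sketch of the arithmetic behind "TP size 8, PP size 2, two machines". The function name `plan_layout` is my own and is not part of vLLM; it only multiplies the two parallel sizes and divides by the GPUs per node.

```python
import math


def plan_layout(tensor_parallel: int, pipeline_parallel: int, gpus_per_node: int) -> dict:
    """Return the GPU and node counts implied by a TP x PP layout (illustrative helper)."""
    if min(tensor_parallel, pipeline_parallel, gpus_per_node) < 1:
        raise ValueError("all sizes must be positive")
    # One GPU per (pipeline stage, tensor shard) pair.
    world_size = tensor_parallel * pipeline_parallel
    return {
        "world_size": world_size,
        "nodes": math.ceil(world_size / gpus_per_node),
        # TP traffic is the heaviest, so it is convenient when one TP group fits in one node.
        "tp_fits_in_one_node": tensor_parallel <= gpus_per_node,
    }


layout = plan_layout(tensor_parallel=8, pipeline_parallel=2, gpus_per_node=8)
print(layout)
```

With these numbers each pipeline stage sits on one machine, so only the stage-to-stage activations cross the network between the two nodes.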

Step 1 Set up Ray & Cluster

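The official docs skim over this part, but it is where I was stuck the longest. What the docs ask is that you prepare two nodes and use the ray start CLI to form a Ray cluster between them, because vLLM uses it later. Two things tripped me up. The first is that the startup command seems slightly off: in my first few attempts I kept hitting this Ray autoscaler error: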

    (autoscaler +1m19s) Error: No available node types can fulfill resource request {'node:33.18.26.153': 0.001, 'GPU': 1.0}. Add suitable node types to this cluster to resolve this issue.
    (autoscaler +1m54s) Error: No available node types can fulfill resource request {'GPU': 1.0, 'node:33.18.26.153': 0.001}. Add suitable node types to this cluster to resolve this issue.
    (autoscaler +2m29s) Error: No available node types can fulfill resource request {'GPU': 1.0, 'node:33.18.26.153': 0.001}. Add suitable node types to this cluster to resolve this issue.
    INFO 02-02 09:39:14 ray_utils.py:212] Waiting for creating a placement group of specs for 150 seconds. specs=[{'node:33.18.26.153': 0.001, 'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}]. Check `ray status` to see if you have enough resources.

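This looked odd, because the resource vLLM requests from the Ray cluster is a custom resource, 'node:33.18.26.153': 0.001, which can be read as vLLM preferring the driver node. As far as I recall, though, a resource like this has to be set by hand when starting Ray: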

https://docs.ray.io/en/latest/ray-core/scheduling/resources.html#custom-resources

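Only then does such a resource exist. The underlying cause is that a machine with multiple (virtual) NICs sits on multiple subnets, and vLLM assumes the pod IP is used to address the Ray master.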
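One way to confirm this mismatch before launching vLLM is to compare the IP that vLLM will request against the node resources Ray actually registered. The sketch below is my own helper, not part of vLLM or Ray. It works on a plain dict shaped like the output of `ray.cluster_resources()`, and it assumes Ray exposes per-node resources under keys of the form `node:<ip>`, as the error message above suggests.

```python
def check_driver_ip(cluster_resources: dict, host_ip: str) -> tuple:
    """Return (ip_is_known, known_ips) for a resources dict like ray.cluster_resources()."""
    known_ips = sorted(
        key.split(":", 1)[1]
        for key in cluster_resources
        # Skip internal markers such as "node:__internal_head__".
        if key.startswith("node:") and not key.startswith("node:__")
    )
    return host_ip in known_ips, known_ips


# Hand-written example data; on a live cluster you would pass ray.cluster_resources().
example_resources = {
    "GPU": 16.0,
    "node:10.0.0.1": 1.0,
    "node:10.0.0.2": 1.0,
    "node:__internal_head__": 1.0,
}
ok, known = check_driver_ip(example_resources, "33.18.26.153")
print("driver IP registered:", ok, "| Ray knows:", known)
```

If the first value is False, the placement group asks for a node IP that Ray never registered, which matches the symptom above. The IPs in `example_resources` are made up for illustration.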

Fix 1: Set VLLM_HOST_IP

    # Get local IP address and set on every node before Ray start
    VLLM_HOST_IP=$(hostname -I | awk '{print $1}')
    export VLLM_HOST_IP

Fix 2: Patch the Ray startup logic

import socket

import ray

# get_free_ports, get_master_addr, get_rank and get_addr are helpers
# you implement for your own cluster (see the note further down).


def get_actual_ip():
    """Get the actual IP address of the current machine."""
    try:
        # Open a UDP socket toward an external address (no packet is actually sent)
        s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        s.connect(('8.8.8.8', 80))
        ip = s.getsockname()[0]
        s.close()
        return ip
    except Exception:
        # Fallback to hostname-based IP resolution
        return socket.gethostbyname(socket.gethostname())


def start_ray_cluster():
    free_ports = get_free_ports()
    port = free_ports[0]
    node_manager_port = free_ports[1]
    master_addr = get_master_addr()
    rank = get_rank()
    node_ip = get_actual_ip()  # Use the new function to get actual IP

    # Define custom resource based on node IP
    resource_spec = f'--resources=\'{{"node:{node_ip}": 1}}\''

    if rank == 0:
        cmd = f"ray start --head --port={port} --node-ip-address={master_addr} --node-manager-port {node_manager_port} --node-name={master_addr} {resource_spec}"
    else:
        cmd = f"ray start --address={master_addr}:{port} --node-manager-port {node_manager_port} --node-name={get_addr()} {resource_spec}"

    if ray.is_initialized():
        print("Ray is already initialized, skipping node level init.")
    else:
        stop_cmd = "ray stop"
        execute(stop_cmd, check=True)
        print(f"Executing Ray start command: {cmd}")
        execute(cmd, check=True)

Here execute can be written like this:

import time
import subprocess


def execute(cmd, check=False, retry=1):
    ret = subprocess.run(cmd, shell=True, capture_output=True, text=True, check=check)
    state = ret.returncode == 0
    msg = ret.stdout if state else ret.stderr
    if not state and retry > 1:
        print(f"execute {cmd} got error {msg}, retry...")
        time.sleep(1)
        return execute(cmd, check, retry-1)
    return state, msg

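A quick note on Ray basics: you are usually not running Ray on bare metal. Most deep learning resources are handed out by k8s together with plugins such as kubeflow or volcano, and the environment variables carry information such as the current rank and the head node's master_addr. Implement these helper functions to suit your own setup. The tricky part, {resource_spec}, I have already sorted out for you here.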
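For reference, the helpers left to the reader might look like the sketch below. Everything here is an assumption about your scheduler: `RANK` and `MASTER_ADDR` are the names commonly injected by PyTorch-style job operators, while `POD_IP`, `CLUSTER_RAY_PORT` and `CLUSTER_NODE_MANAGER_PORT` are hypothetical names I made up for illustration. Because the worker must dial the head's port, this version of `get_free_ports` reads agreed-upon values instead of probing for free ports locally.

```python
import os
import socket


def get_rank() -> int:
    """Rank of this node; 0 is treated as the Ray head (assumes a RANK env var)."""
    return int(os.environ.get("RANK", "0"))


def get_master_addr() -> str:
    """Address of the head node (assumes a MASTER_ADDR env var)."""
    return os.environ.get("MASTER_ADDR", "127.0.0.1")


def get_addr() -> str:
    """Address of this node; POD_IP is a hypothetical variable name."""
    pod_ip = os.environ.get("POD_IP")
    if pod_ip:
        return pod_ip
    # Fall back to hostname resolution when the scheduler provides nothing.
    return socket.gethostbyname(socket.gethostname())


def get_free_ports() -> list:
    """Return [head_port, node_manager_port]; both env var names are hypothetical."""
    head_port = int(os.environ.get("CLUSTER_RAY_PORT", "6379"))
    node_manager_port = int(os.environ.get("CLUSTER_NODE_MANAGER_PORT", "6380"))
    return [head_port, node_manager_port]
```

Check which variables your own platform injects (for example with `env | sort` inside the pod) and rename accordingly.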

Step 2 Other small bugs

Two more errors came up along the way, and each took a little time to fix:

    Traceback (most recent call last):
      File "/usr/local/bin/vllm", line 5, in <module>
        from vllm.scripts import main
      File "/usr/local/lib/python3.10/dist-packages/vllm/__init__.py", line 4, in <module>
        from vllm.engine.async_llm_engine import AsyncLLMEngine
      File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 15, in <module>
        from vllm.engine.llm_engine import (DecoderPromptComponents, LLMEngine,
      File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 24, in <module>
        from vllm.engine.output_processor.interfaces import (
      File "/usr/local/lib/python3.10/dist-packages/vllm/engine/output_processor/interfaces.py", line 6, in <module>
        from vllm.engine.output_processor.stop_checker import StopChecker
      File "/usr/local/lib/python3.10/dist-packages/vllm/engine/output_processor/stop_checker.py", line 6, in <module>
        from vllm.transformers_utils.tokenizer import AnyTokenizer
      File "/usr/local/lib/python3.10/dist-packages/vllm/transformers_utils/tokenizer.py", line 13, in <module>
        from vllm.transformers_utils.tokenizers import (BaichuanTokenizer,
      File "/usr/local/lib/python3.10/dist-packages/vllm/transformers_utils/tokenizers/__init__.py", line 2, in <module>
        from vllm.transformers_utils.tokenizers.mistral import MistralTokenizer
      File "/usr/local/lib/python3.10/dist-packages/vllm/transformers_utils/tokenizers/mistral.py", line 9, in <module>
        from mistral_common.tokens.tokenizers.mistral import ChatCompletionRequest
      File "/usr/local/lib/python3.10/dist-packages/mistral_common/tokens/tokenizers/mistral.py", line 32, in <module>
        from mistral_common.tokens.tokenizers.multimodal import (
      File "/usr/local/lib/python3.10/dist-packages/mistral_common/tokens/tokenizers/multimodal.py", line 6, in <module>
        import cv2
      File "/usr/local/lib/python3.10/dist-packages/cv2/__init__.py", line 181, in <module>
        bootstrap()
      File "/usr/local/lib/python3.10/dist-packages/cv2/__init__.py", line 175, in bootstrap
        if __load_extra_py_code_for_module("cv2", submodule, DEBUG):
      File "/usr/local/lib/python3.10/dist-packages/cv2/__init__.py", line 28, in __load_extra_py_code_for_module
        py_module = importlib.import_module(module_name)
      File "/usr/lib/python3.10/importlib/__init__.py", line 126, in import_module
        return _bootstrap._gcd_import(name[level:], package, level)
      File "/usr/local/lib/python3.10/dist-packages/cv2/typing/__init__.py", line 171, in <module>
        LayerId = cv2.dnn.DictValue
    AttributeError: module 'cv2.dnn' has no attribute 'DictValue'

This is a leftover OpenCV compatibility problem. Pinning the opencv version fixes it:

pip install opencv-python-headless==4.5.4.58

The other is a TypeError raised after the model loads:

    [rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/deepseek_v3.py", line 472, in forward
    [rank0]:     kv_c, k_pe = self.kv_a_proj_with_mqa(hidden_states)[0].split(
    [rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    [rank0]:     return self._call_impl(*args, **kwargs)
    [rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    [rank0]:     return forward_call(*args, **kwargs)
    [rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/linear.py", line 246, in forward
    [rank0]:     output = self.quant_method.apply(self, x, bias)
    [rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/quantization/fp8.py", line 357, in apply
    [rank0]:     return apply_w8a8_block_fp8_linear(
    [rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/quantization/utils/fp8_utils.py", line 61, in apply_w8a8_block_fp8_linear
    [rank0]:     output = w8a8_block_fp8_matmul(q_input,
    [rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/quantization/utils/fp8_utils.py", line 470, in w8a8_block_fp8_matmul
    [rank0]:     configs = get_w8a8_block_fp8_configs(N, K, block_size[0], block_size[1])
    [rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/quantization/utils/fp8_utils.py", line 407, in get_w8a8_block_fp8_configs
    [rank0]:     device_name = current_platform.get_device_name().replace(" ", "_")
    [rank0]: TypeError: a bytes-like object is required, not 'str'

Upgrading pynvml fixes it:

pip install pynvml -U
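Since both fixes come down to package versions, it can help to print what is actually installed on each node before starting the server. This is a small sketch using only the standard library; the package list is my own guess at what matters for this guide, and `installed_version` is an illustrative helper name.

```python
from importlib import metadata


def installed_version(package: str):
    """Return the installed version string of a distribution, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Distribution names as published on PyPI; adjust to your environment.
for name in ("vllm", "ray", "opencv-python-headless", "pynvml"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```

Run it on both machines: a version mismatch between nodes is an easy thing to miss when the image was patched by hand on only one of them.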

Step 3 Run the model

This step turns out to be the simplest:

vllm serve /your/path/to_checkpoint_deepseek-r1/ --tensor-parallel-size 8 --pipeline-parallel-size 2 --trust-remote-code --host 0.0.0.0
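Loading weights of this size takes a while, so a client that starts too early just gets connection errors. Below is a minimal polling sketch. It assumes the server listens on the default port 8000 and answers on a `/health` route, which is how I understand vLLM's OpenAI-compatible server to behave; verify this against your version. The helper names are my own.

```python
import time
import urllib.error
import urllib.request


def http_ok(url: str, timeout: float = 5.0) -> bool:
    """Return True if a GET on `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or non-2xx status: not ready yet.
        return False


def wait_until_ready(probe, timeout_s: float = 1800.0, interval_s: float = 10.0,
                     sleep=time.sleep, clock=time.monotonic) -> bool:
    """Call `probe()` until it returns True or `timeout_s` elapses."""
    deadline = clock() + timeout_s
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

A typical call would be `wait_until_ready(lambda: http_ok("http://localhost:8000/health"))`, run from the same machine as the head node. The 30-minute default timeout is an arbitrary choice, not a measured load time.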

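With PP in place, those of you without IB can also try raising the sequence length and batch size a bit. To test it, use the Reasoning Output support contributed by gaoce, either on the same machine or from another machine after changing localhost: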

from openai import OpenAI

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

models = client.models.list()
model = models.data[0].id

# Round 1
messages = [{"role": "user", "content": "9.11 and 9.8, which is greater?"}]
response = client.chat.completions.create(model=model, messages=messages)

reasoning_content = response.choices[0].message.reasoning_content
content = response.choices[0].message.content

print("reasoning_content:", reasoning_content)
print("content:", content)

No, it is not hung; your wallet just isn't thick enough. Switch to the server logs and you can see what is happening for this prompt:

    INFO 02-02 14:18:52 metrics.py:453] Avg prompt throughput: 1.7 tokens/s, Avg generation throughput: 0.1 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
    INFO 02-02 14:18:57 metrics.py:453] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 20.7 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
    INFO 02-02 14:19:02 metrics.py:453] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 20.5 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
    INFO 02-02 14:19:07 metrics.py:453] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 20.5 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
    INFO 02-02 14:19:12 metrics.py:453] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 20.1 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
    INFO 02-02 14:19:17 metrics.py:453] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 19.8 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.
    INFO 02-02 14:19:22 metrics.py:453] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 19.4 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.
    INFO 02-02 14:19:27 metrics.py:453] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 19.1 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.
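If you want a single number out of log lines like these, a few lines of Python will do. This sketch assumes the log keeps the "Avg generation throughput: N tokens/s" wording shown above; the sample lines are abbreviated copies of two of the entries, and the function name is my own.

```python
import re

# Matches the generation-throughput field of the metrics log line.
GEN_RE = re.compile(r"Avg generation throughput: ([0-9.]+) tokens/s")


def generation_throughputs(log_text: str) -> list:
    """Extract every generation throughput value (tokens/s) from the log text."""
    return [float(value) for value in GEN_RE.findall(log_text)]


# Abbreviated sample taken from the log above.
sample = """\
INFO 02-02 14:18:57 metrics.py:453] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 20.7 tokens/s, Running: 1 reqs
INFO 02-02 14:19:02 metrics.py:453] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 20.5 tokens/s, Running: 1 reqs
"""
values = generation_throughputs(sample)
mean_tps = sum(values) / len(values)
print(f"{len(values)} samples, mean {mean_tps:.1f} tokens/s")
```

In the log above a single request decodes at roughly 19 to 21 tokens/s, so a long reasoning trace takes a while to finish.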

Wait a little and it will tell you that 9.8 is greater.

Happy tinkering, and thanks to the vLLM community for their work.

https://github.com/vllm-project/vllm/pull/12679

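凯超 is truly awesome, doing hands-on customer support right through the Spring Festival, haha. RL folks, whether you majored in the humanities or the sciences, go learn infra first.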
