Server Configuration

AI server: Huawei Atlas 800I A2 inference server

Component   Specification
CPU         Kunpeng 920 (5250)
NPU         Ascend 910B4 (8 × 32 GB)
Memory      1024 GB
Disks       System: 2 × 450 GB SSD, RAID 1
            Data: 4 × 3.5 TB NVMe SSD
OS          openEuler 22.03 LTS
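
Before deploying, it is worth confirming that the driver sees all eight NPUs. npu-smi is the standard Ascend management tool (it is also bind-mounted into the containers below); the output format varies with the driver version:

npu-smi info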

Installation

Pull the vLLM image

docker pull quay.io/ascend/vllm-ascend:v0.9.2rc1
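
A quick optional check that the image is now available locally:

docker images quay.io/ascend/vllm-ascend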

Deploy LLMs

Docker

Set environment variables

# Load models from ModelScope for faster downloads
export VLLM_USE_MODELSCOPE=True

# Set max_split_size_mb to reduce memory fragmentation and avoid running out of memory
export PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256

max_split_size_mb prevents the native allocator from splitting blocks larger than this size (in MB). This reduces memory fragmentation and may allow some borderline workloads to finish without running out of memory.
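
Note that these exports live on the host; the docker run command below passes them into the container with -e. Once the container is up, you can confirm they are visible inside it:

docker exec vllm env | grep -E 'VLLM_USE_MODELSCOPE|PYTORCH_NPU_ALLOC_CONF'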

Run the container

docker run -it --rm \
  --name vllm \
  --network host \
  --shm-size=1g \
  -e VLLM_USE_MODELSCOPE=$VLLM_USE_MODELSCOPE \
  -e PYTORCH_NPU_ALLOC_CONF=$PYTORCH_NPU_ALLOC_CONF \
  --device /dev/davinci_manager \
  --device /dev/hisi_hdc \
  --device /dev/devmm_svm \
  --device /dev/davinci0 \
  --device /dev/davinci1 \
  --device /dev/davinci2 \
  --device /dev/davinci3 \
  -v /usr/local/dcmi:/usr/local/dcmi \
  -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
  -v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
  -v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
  -v /etc/ascend_install.info:/etc/ascend_install.info \
  -v /root/.cache:/root/.cache \
  -v /models:/models \
  --entrypoint "vllm" \
  quay.io/ascend/vllm-ascend:v0.9.2rc1 \
  serve /models/Qwen/Qwen2.5-7B-Instruct \
        --served-model-name Qwen/Qwen2.5-7B-Instruct \
        --tensor-parallel-size 4 \
        --port 8000 \
        --max-model-len 26240

📌 Qwen2.5-7B-Instruct has 28 attention heads in total, and the head count (28) must be divisible by the tensor parallel size (--tensor-parallel-size 4). 2 and 4 work; 8 does not.

📌 --max-model-len 26240 must not exceed max_position_embeddings=32768 in the model's config.json.
The --max-model-len option is added to avoid a ValueError raised when the model's maximum sequence length (32768) exceeds the maximum number of tokens that fit in the KV cache (26240). The usable value differs across NPU series depending on HBM size; adjust it to match your NPU.
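
Both numbers referenced above (the attention head count and max_position_embeddings) can be read directly from the model's config.json before choosing --tensor-parallel-size and --max-model-len (the path follows the layout used in this post):

grep -E '"num_attention_heads"|"max_position_embeddings"' /models/Qwen/Qwen2.5-7B-Instruct/config.json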

Docker Compose Deployment

Write the compose.yml file

Model                         NPUs (32 GB each)
Qwen3-8B                      1
Qwen3-30B-A3B                 4
Qwen2.5-32B-Instruct          4
Qwen2.5-Coder-32B-Instruct    4
Qwen2.5-VL-32B-Instruct       4
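
The compose files below serve models from local paths under /models. If a model is not present yet, one way to fetch it from ModelScope is the modelscope CLI; a sketch, assuming the CLI is installed (adjust the model ID and target directory as needed):

pip install modelscope
modelscope download --model Qwen/Qwen3-8B --local_dir /models/Qwen/Qwen3-8B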

Qwen3-8B

services:
  vllm:
    image: quay.io/ascend/vllm-ascend:v0.9.2rc1
    container_name: vllm
    restart: unless-stopped

    init: true

    # Network & devices
    network_mode: host
    shm_size: 1g 
    devices:
      - /dev/davinci_manager
      - /dev/hisi_hdc
      - /dev/devmm_svm
      - /dev/davinci0

    # Volume mounts
    volumes:
      - /usr/local/dcmi:/usr/local/dcmi
      - /usr/local/bin/npu-smi:/usr/local/bin/npu-smi
      - /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/
      - /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info
      - /etc/ascend_install.info:/etc/ascend_install.info
      - /root/.cache:/root/.cache
      - /models:/models

    environment:
      - VLLM_USE_MODELSCOPE=True
      - PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256

    # Command
    entrypoint: ["vllm"]
    command: >
      serve /models/Qwen/Qwen3-8B
      --served-model-name Qwen/Qwen3-8B
      --port 8000
      --max-model-len 32768
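
With compose.yml in place, start the service in the background and follow the logs until the OpenAI-compatible server reports it is listening on port 8000:

docker compose up -d
docker compose logs -f vllm

# Stop and remove the container when finished
docker compose down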

Test

curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Qwen/Qwen3-8B",
        "prompt": "The future of AI is",
        "max_tokens": 7,
        "temperature": 0
    }'

Qwen3-30B-A3B

services:
  vllm:
    image: quay.io/ascend/vllm-ascend:v0.9.2rc1
    container_name: vllm
    restart: unless-stopped

    init: true

    # Network & devices
    network_mode: host
    shm_size: 1g 
    devices:
      - /dev/davinci_manager
      - /dev/hisi_hdc
      - /dev/devmm_svm
      - /dev/davinci0
      - /dev/davinci1
      - /dev/davinci2
      - /dev/davinci3

    # Volume mounts
    volumes:
      - /usr/local/dcmi:/usr/local/dcmi
      - /usr/local/bin/npu-smi:/usr/local/bin/npu-smi
      - /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/
      - /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info
      - /etc/ascend_install.info:/etc/ascend_install.info
      - /root/.cache:/root/.cache
      - /models:/models

    environment:
      - VLLM_USE_MODELSCOPE=True
      - PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256

    # Command
    entrypoint: ["vllm"]
    command: >
      serve /models/Qwen/Qwen3-30B-A3B
      --served-model-name Qwen/Qwen3-30B-A3B
      --tensor-parallel-size 4
      --enable_expert_parallel
      --port 8000
      --max-model-len 32768

Test

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Qwen/Qwen3-30B-A3B",
        "messages": [
            {"role": "user", "content": "Give me a short introduction to large language models."}
        ],
        "temperature": 0.6,
        "top_p": 0.95,
        "top_k": 20,
        "max_tokens": 4096
    }'
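
Qwen3 runs in thinking mode by default, which is why the sampling settings above are the recommended temperature 0.6, top_p 0.95 and top_k 20. Recent vLLM releases can switch thinking off per request through chat_template_kwargs; treat the following as a sketch and verify that your vllm-ascend build accepts the field:

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Qwen/Qwen3-30B-A3B",
        "messages": [
            {"role": "user", "content": "Give me a short introduction to large language models."}
        ],
        "chat_template_kwargs": {"enable_thinking": false},
        "temperature": 0.7,
        "top_p": 0.8,
        "top_k": 20,
        "max_tokens": 1024
    }'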

Qwen2.5-VL-32B-Instruct

services:
  vllm:
    image: quay.io/ascend/vllm-ascend:v0.9.2rc1
    container_name: vllm
    #restart: unless-stopped

    init: true

    # Network & devices
    network_mode: host
    shm_size: 1g 
    devices:
      - /dev/davinci_manager
      - /dev/hisi_hdc
      - /dev/devmm_svm
      - /dev/davinci0
      - /dev/davinci1
      - /dev/davinci2
      - /dev/davinci3

    # Volume mounts
    volumes:
      - /usr/local/dcmi:/usr/local/dcmi
      - /usr/local/bin/npu-smi:/usr/local/bin/npu-smi
      - /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/
      - /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info
      - /etc/ascend_install.info:/etc/ascend_install.info
      - /root/.cache:/root/.cache
      - /models:/models

    environment:
      - VLLM_USE_MODELSCOPE=True
      - PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256

    # Command
    entrypoint: ["vllm"]
    command: >
      serve /models/Qwen/Qwen2.5-VL-32B-Instruct
      --served-model-name Qwen/Qwen2.5-VL-32B-Instruct
      --tensor-parallel-size 4
      --port 8000
      --max-model-len 32768

Test

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Qwen/Qwen2.5-VL-32B-Instruct",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": [
                {"type": "image_url", "image_url": {"url": "https://modelscope.oss-cn-beijing.aliyuncs.com/resource/qwen.png"}},
                {"type": "text", "text": "What is the text in the illustrate?"}
            ]}
        ]
    }'
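
The same endpoint also accepts local images as base64 data URLs in the image_url field. A minimal sketch, assuming a local file ./demo.png and GNU base64:

IMG_B64=$(base64 -w0 ./demo.png)
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Qwen/Qwen2.5-VL-32B-Instruct",
        "messages": [
            {"role": "user", "content": [
                {"type": "image_url", "image_url": {"url": "data:image/png;base64,'"$IMG_B64"'"}},
                {"type": "text", "text": "Describe this image."}
            ]}
        ]
    }'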

Qwen3-30B-A3B & Qwen2.5-Coder-32B-Instruct

services:
  vllm-instance1:
    image: quay.io/ascend/vllm-ascend:v0.9.2rc1
    container_name: vllm1
    restart: unless-stopped

    init: true

    # Network & devices
    network_mode: host
    shm_size: 1g 
    devices:
      - /dev/davinci_manager
      - /dev/hisi_hdc
      - /dev/devmm_svm
      - /dev/davinci0
      - /dev/davinci1
      - /dev/davinci2
      - /dev/davinci3

    # Volume mounts
    volumes:
      - /usr/local/dcmi:/usr/local/dcmi
      - /usr/local/bin/npu-smi:/usr/local/bin/npu-smi
      - /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/
      - /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info
      - /etc/ascend_install.info:/etc/ascend_install.info
      - /root/.cache:/root/.cache
      - /models:/models

    environment:
      - VLLM_USE_MODELSCOPE=True
      - PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256

    # Command
    entrypoint: ["vllm"]
    command: >
      serve /models/Qwen/Qwen3-30B-A3B
      --served-model-name Qwen/Qwen3-30B-A3B
      --tensor-parallel-size 4
      --enable_expert_parallel
      --port 8001
      --max-model-len 32768

  vllm-instance2:
    image: quay.io/ascend/vllm-ascend:v0.9.2rc1
    container_name: vllm2
    restart: unless-stopped

    init: true

    # Network & devices
    network_mode: host
    shm_size: 1g 
    devices:
      - /dev/davinci_manager
      - /dev/hisi_hdc
      - /dev/devmm_svm
      - /dev/davinci4
      - /dev/davinci5
      - /dev/davinci6
      - /dev/davinci7

    # Volume mounts
    volumes:
      - /usr/local/dcmi:/usr/local/dcmi
      - /usr/local/bin/npu-smi:/usr/local/bin/npu-smi
      - /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/
      - /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info
      - /etc/ascend_install.info:/etc/ascend_install.info
      - /root/.cache:/root/.cache
      - /models:/models

    environment:
      - VLLM_USE_MODELSCOPE=True
      - PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256

    # Command
    entrypoint: ["vllm"]
    command: >
      serve /models/Qwen/Qwen2.5-Coder-32B-Instruct
      --served-model-name Qwen/Qwen2.5-Coder-32B-Instruct
      --tensor-parallel-size 4
      --port 8002
      --max-model-len 32768
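
Both instances use host networking, so they listen on different ports (8001 and 8002) and each claims its own set of four NPUs. After starting them, a quick way to confirm both containers are running and the NPUs are split as intended (npu-smi output format varies by driver version):

docker compose up -d
docker compose ps
npu-smi info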

Test

  • Qwen/Qwen3-30B-A3B
    curl http://localhost:8001/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
          "model": "Qwen/Qwen3-30B-A3B",
          "messages": [
              {"role": "user", "content": "你是谁?"}
          ]
      }'
    
  • Qwen/Qwen2.5-Coder-32B-Instruct
    curl http://localhost:8002/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
          "model": "Qwen/Qwen2.5-Coder-32B-Instruct",
          "messages": [
              {"role": "user", "content": "你是谁?"}
          ]
      }'
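
Listing the models behind each port confirms which instance answers where:

curl http://localhost:8001/v1/models
curl http://localhost:8002/v1/models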
    

DeepSeek-V3-Pruning

services:
  vllm:
    image: quay.io/ascend/vllm-ascend:v0.9.2rc1
    container_name: vllm
    #restart: unless-stopped

    init: true

    # Network & devices
    network_mode: host
    shm_size: 1g 
    devices:
      - /dev/davinci_manager
      - /dev/hisi_hdc
      - /dev/devmm_svm
      - /dev/davinci0
      - /dev/davinci1
      - /dev/davinci2
      - /dev/davinci3
      - /dev/davinci4
      - /dev/davinci5
      - /dev/davinci6
      - /dev/davinci7

    # Volume mounts
    volumes:
      - /usr/local/dcmi:/usr/local/dcmi
      - /usr/local/bin/npu-smi:/usr/local/bin/npu-smi
      - /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/
      - /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info
      - /etc/ascend_install.info:/etc/ascend_install.info
      - /root/.cache:/root/.cache
      - /models:/models

    environment:
      - VLLM_USE_MODELSCOPE=True
      - PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256

    # Command
    entrypoint: ["vllm"]
    command: >
      serve /models/vllm-ascend/DeepSeek-V3-Pruning
      --served-model-name DeepSeek-V3
      --tensor-parallel-size 8
      --enable_expert_parallel
      --port 8000
      --max-model-len 32768
      --enforce-eager

Test

curl 'http://localhost:8000/v1/chat/completions' \
    -H "Content-Type: application/json" \
    -d '{
        "model": "DeepSeek-V3",
        "max_tokens": 20,
        "messages": [ 
            { "role": "system", "content": "你是AI编码助手。" }, 
            { "role": "user", "content": "你是谁?" } 
        ]
    }'
{
  "id": "chatcmpl-0e99cef79a7a45baa7905491446c9bfa",
  "object": "chat.completion",
  "created": 1753604007,
  "model": "DeepSeek-V3",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "reasoning_content": null,
        "content": "弯 overwhelming香味示盯着自己的人raphPercentage为建设配置文件048 endometrial电气 Objective迁inher۵ bat diferenci李玉",
        "tool_calls": []
      },
      "logprobs": null,
      "finish_reason": "length",
      "stop_reason": null
    }
  ],
  "usage": {
    "prompt_tokens": 10,
    "total_tokens": 30,
    "completion_tokens": 20,
    "prompt_tokens_details": null
  },
  "prompt_logprobs": null,
  "kv_transfer_params": null
}

If the max_tokens parameter is not set, generation never stops, and every generated token is meaningless. This is expected: DeepSeek-V3-Pruning keeps only a pruned subset of the original weights and is meant for verifying the deployment pipeline rather than producing useful output.
