1. Build the Image
For ease of testing, the model files are packaged directly into the image.
Download the model weights (git-lfs is required):

```shell
git clone https://huggingface.co/Qwen/Qwen1.5-1.8B-Chat
cd Qwen1.5-1.8B-Chat && git lfs pull
rm -rf .git
cd ..
```
Write a Dockerfile that bakes the weights into the vLLM image (note that a multi-source COPY destination must end with a slash):

```shell
cat <<EOF > Dockerfile
FROM vllm/vllm-openai:latest
RUN mkdir -p /models/Qwen1.5-1.8B-Chat
COPY Qwen1.5-1.8B-Chat/* /models/Qwen1.5-1.8B-Chat/
EOF
```
```shell
nerdctl build --platform=amd64 -t registry-1.docker.io/shaowenchen/demo-vllm-qwen:1.5-1.8b-chat-amd64 .
```
```shell
nerdctl push --platform=amd64 registry-1.docker.io/shaowenchen/demo-vllm-qwen:1.5-1.8b-chat-amd64
```
For easier testing from clusters in mainland China, I also pushed the image to Alibaba Cloud Container Registry: registry.cn-beijing.aliyuncs.com/shaowenchen/demo-vllm-qwen
2. Inference Service on a Host
Mainland China:
```shell
export IMAGE=registry.cn-beijing.aliyuncs.com/shaowenchen/demo-vllm-qwen:1.5-1.8b-chat-amd64
```
Outside mainland China:
```shell
export IMAGE=registry-1.docker.io/shaowenchen/demo-vllm-qwen:1.5-1.8b-chat-amd64
```
```shell
nerdctl run --gpus "device=1" \
  -p 8000:8000 \
  --name Qwen1.5-1.8B-Chat-allinone \
  --ipc=host \
  $IMAGE \
  --model /models/Qwen1.5-1.8B-Chat \
  --dtype=half
```
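Model loading takes a while after the container starts, so the first request can race the server coming up. It can help to poll the /health endpoint exposed by the vLLM OpenAI-compatible server before sending traffic; a minimal sketch, assuming the port mapping above (the `wait_ready` helper is illustrative):

```python
import time
from urllib import request


def wait_ready(url: str = "http://127.0.0.1:8000/health",
               timeout: float = 120.0) -> bool:
    """Poll the vLLM server's health endpoint until it answers 200
    or the timeout expires. Returns True when the server is ready."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            # Server not accepting connections yet; retry shortly.
            pass
        time.sleep(2)
    return False
```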
```shell
curl http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/models/Qwen1.5-1.8B-Chat",
    "messages": [
      {"role": "user", "content": "什么是大模型"}
    ],
    "max_tokens": 1024
  }'
```
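The same request can be sent from Python with only the standard library; a minimal sketch, assuming the server above is reachable at 127.0.0.1:8000 (the helper names are illustrative):

```python
import json
from urllib import request

# Endpoint of the vLLM OpenAI-compatible server started above.
API_URL = "http://127.0.0.1:8000/v1/chat/completions"


def build_chat_request(model: str, prompt: str, max_tokens: int = 1024) -> dict:
    """Build an OpenAI-style chat completion payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


def chat(prompt: str) -> str:
    """POST the prompt to the server and return the assistant's reply."""
    payload = build_chat_request("/models/Qwen1.5-1.8B-Chat", prompt)
    req = request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# Once the server is up:
# chat("什么是大模型")
```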
Clean up the container when done:

```shell
nerdctl rm Qwen1.5-1.8B-Chat-allinone
```
3. Inference Service on a Cluster
Mainland China:
```shell
export IMAGE=registry.cn-beijing.aliyuncs.com/shaowenchen/demo-vllm-qwen:1.5-1.8b-chat-amd64
```
Outside mainland China:
```shell
export IMAGE=registry-1.docker.io/shaowenchen/demo-vllm-qwen:1.5-1.8b-chat-amd64
```
Set the target node via the NODE_NAME environment variable (referenced as nodeName in the manifest below), then create the Deployment:
```shell
kubectl apply -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo-vllm-qwen1.5-1.8b-chat
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: demo-vllm-qwen1.5-1.8b-chat
  template:
    metadata:
      labels:
        app: demo-vllm-qwen1.5-1.8b-chat
    spec:
      nodeName: $NODE_NAME
      containers:
        - name: demo-vllm-qwen
          image: $IMAGE
          args:
            - "--dtype"
            - "half"
            - "--model"
            - "/models/Qwen1.5-1.8B-Chat"
EOF
```
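The heredoc above relies on the shell to interpolate $NODE_NAME and $IMAGE. When scripting rollouts across several nodes or image tags, the same manifest can be rendered programmatically; a sketch using Python's string.Template (the helper name is illustrative):

```python
from string import Template

# Template mirroring the Deployment manifest above; $node_name and
# $image are filled in per deployment target.
MANIFEST = Template("""\
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo-vllm-qwen1.5-1.8b-chat
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: demo-vllm-qwen1.5-1.8b-chat
  template:
    metadata:
      labels:
        app: demo-vllm-qwen1.5-1.8b-chat
    spec:
      nodeName: $node_name
      containers:
        - name: demo-vllm-qwen
          image: $image
          args: ["--dtype", "half", "--model", "/models/Qwen1.5-1.8B-Chat"]
""")


def render_manifest(node_name: str, image: str) -> str:
    """Fill the target node and image into the Deployment template."""
    return MANIFEST.substitute(node_name=node_name, image=image)
```

The rendered string can then be piped to `kubectl apply -f -`.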
Open a shell in the Pod and test the service from inside:

```shell
kubectl exec -it deployment/demo-vllm-qwen1.5-1.8b-chat -- bash
```
```shell
curl http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/models/Qwen1.5-1.8B-Chat",
    "messages": [
      {"role": "user", "content": "什么是大模型"}
    ],
    "max_tokens": 1024
  }'
```
Clean up when done:

```shell
kubectl delete deployment demo-vllm-qwen1.5-1.8b-chat
```