1. Build the Image
For ease of testing, the model files are packaged directly into the image.
Download the model weights (git-lfs is required):

```shell
git clone https://huggingface.co/Qwen/Qwen1.5-1.8B-Chat
cd Qwen1.5-1.8B-Chat && git lfs pull
rm -rf .git
cd ..
```
Write a Dockerfile that bakes the weights into the vLLM image (note that a multi-source COPY destination must end with a slash):

```shell
cat <<EOF > Dockerfile
FROM vllm/vllm-openai:latest
RUN mkdir -p /models/Qwen1.5-1.8B-Chat
COPY Qwen1.5-1.8B-Chat/* /models/Qwen1.5-1.8B-Chat/
EOF
```
```shell
nerdctl build --platform=amd64 -t registry-1.docker.io/shaowenchen/demo-vllm-qwen:1.5-1.8b-chat-amd64 .
```
```shell
nerdctl push --platform=amd64 registry-1.docker.io/shaowenchen/demo-vllm-qwen:1.5-1.8b-chat-amd64
```
For easier testing from clusters in mainland China, I also pushed the image to Alibaba Cloud Container Registry: registry.cn-beijing.aliyuncs.com/shaowenchen/demo-vllm-qwen
2. Inference Service on a Host
Mainland China:
```shell
export IMAGE=registry.cn-beijing.aliyuncs.com/shaowenchen/demo-vllm-qwen:1.5-1.8b-chat-amd64
```
Outside mainland China:
```shell
export IMAGE=registry-1.docker.io/shaowenchen/demo-vllm-qwen:1.5-1.8b-chat-amd64
```
```shell
nerdctl run --gpus "device=1" \
  -p 8000:8000 \
  --name Qwen1.5-1.8B-Chat-allinone \
  --ipc=host \
  $IMAGE \
  --model /models/Qwen1.5-1.8B-Chat \
  --dtype=half
```
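Model loading takes a while after the container starts, so the first request can race the server coming up. It can help to poll the /health endpoint exposed by the vLLM OpenAI-compatible server before sending traffic; a minimal sketch, assuming the port mapping above (the `wait_ready` helper is illustrative):

```python
import time
from urllib import request


def wait_ready(url: str = "http://127.0.0.1:8000/health",
               timeout: float = 120.0) -> bool:
    """Poll the vLLM server's health endpoint until it answers 200
    or the timeout expires. Returns True when the server is ready."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            # Server not accepting connections yet; retry shortly.
            pass
        time.sleep(2)
    return False
```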
```shell
curl http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/models/Qwen1.5-1.8B-Chat",
    "messages": [
      {"role": "user", "content": "什么是大模型"}
    ],
    "max_tokens": 1024
  }'
```
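The same request can be sent from Python with only the standard library; a minimal sketch, assuming the server above is reachable at 127.0.0.1:8000 (the helper names are illustrative):

```python
import json
from urllib import request

# Endpoint of the vLLM OpenAI-compatible server started above.
API_URL = "http://127.0.0.1:8000/v1/chat/completions"


def build_chat_request(model: str, prompt: str, max_tokens: int = 1024) -> dict:
    """Build an OpenAI-style chat completion payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


def chat(prompt: str) -> str:
    """POST the prompt to the server and return the assistant's reply."""
    payload = build_chat_request("/models/Qwen1.5-1.8B-Chat", prompt)
    req = request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# Once the server is up:
# chat("什么是大模型")
```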
Clean up the container when done:

```shell
nerdctl rm Qwen1.5-1.8B-Chat-allinone
```
3. Inference Service on a Cluster
Mainland China:
```shell
export IMAGE=registry.cn-beijing.aliyuncs.com/shaowenchen/demo-vllm-qwen:1.5-1.8b-chat-amd64
```
Outside mainland China:
```shell
export IMAGE=registry-1.docker.io/shaowenchen/demo-vllm-qwen:1.5-1.8b-chat-amd64
```
Set the target node via the NODE_NAME environment variable (referenced as nodeName in the manifest below), then create the Deployment:
```shell
kubectl apply -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo-vllm-qwen1.5-1.8b-chat
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: demo-vllm-qwen1.5-1.8b-chat
  template:
    metadata:
      labels:
        app: demo-vllm-qwen1.5-1.8b-chat
    spec:
      nodeName: $NODE_NAME
      containers:
        - name: demo-vllm-qwen
          image: $IMAGE
          args:
            - "--dtype"
            - "half"
            - "--model"
            - "/models/Qwen1.5-1.8B-Chat"
EOF
```
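The heredoc above relies on the shell to interpolate $NODE_NAME and $IMAGE. When scripting rollouts across several nodes or image tags, the same manifest can be rendered programmatically; a sketch using Python's string.Template (the helper name is illustrative):

```python
from string import Template

# Template mirroring the Deployment manifest above; $node_name and
# $image are filled in per deployment target.
MANIFEST = Template("""\
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo-vllm-qwen1.5-1.8b-chat
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: demo-vllm-qwen1.5-1.8b-chat
  template:
    metadata:
      labels:
        app: demo-vllm-qwen1.5-1.8b-chat
    spec:
      nodeName: $node_name
      containers:
        - name: demo-vllm-qwen
          image: $image
          args: ["--dtype", "half", "--model", "/models/Qwen1.5-1.8B-Chat"]
""")


def render_manifest(node_name: str, image: str) -> str:
    """Fill the target node and image into the Deployment template."""
    return MANIFEST.substitute(node_name=node_name, image=image)
```

The rendered string can then be piped to `kubectl apply -f -`.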
Open a shell in the Pod and test the service from inside:

```shell
kubectl exec -it deployment/demo-vllm-qwen1.5-1.8b-chat -- bash
```
```shell
curl http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/models/Qwen1.5-1.8B-Chat",
    "messages": [
      {"role": "user", "content": "什么是大模型"}
    ],
    "max_tokens": 1024
  }'
```
Clean up when done:

```shell
kubectl delete deployment demo-vllm-qwen1.5-1.8b-chat
```