If you use nvidia-container-runtime, this one deserves your attention, especially if you also run JuiceFS.
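1. An alert out of nowhere, and I panicked
Over the weekend I was studying TensorRT LLM and, while I was at it, installed Dragonfly on our largest production cluster. Then I went out to buy groceries.
In the afternoon I noticed that the Dragonfly Daemon on one node had not come up and kept alerting, so I logged in to that node and restarted Kubelet.
About 10 minutes later, production alerts started arriving.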
Log analysis alert
Time: 2024-01-20 15:26:25
Title: HTTP 503 count in the past 20s exceeded the threshold: 547 >= 3
Details:
Top three failing endpoints:
path: /api/xxx/v2/models/xxx/versions/1/infer count: 237
path: /api/xxx/v2/models/xxx/versions/1/infer count: 188
path: /api/xxx/v2/models/xxx/versions/1/infer count: 122
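The alerts kept coming. I had only recently joined the AI department and had not even trained my first model, and now this. And I hadn't done anything, hmph!
First things first: fix the problem. I took a look at the Kubelet log: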
kubelet[4031671]: E0120 15:31:57.357711 4031671 remote_runtime.go:209] "RunPodSandbox from runtime service failed" err="rpc error: code = Unknown desc = failed to start sandbox container for pod \"dragonfly-dfdaemon-wznfl\": Error response from daemon: write /run/containerd/io.containerd.runtime.v2.task/moby/2b6a70e32f77707d88b73d28054bb83aed34d9ac90c0993df9d1209bd5402b84/config.json: no space left on device: unknown"
Check disk usage:
df -h
tmpfs 51G 51G 0 100% /run
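Since the symptom here was a tmpfs at 100%, a small check along the following lines could flag it before Kubelet starts failing. This is only a sketch and not part of the original incident tooling: the function name `usage_percent` and the 90% threshold in the usage note are illustrative assumptions, and the parsing relies on the POSIX `df -P` output format.

```shell
# usage_percent PATH: print the used-capacity percentage (0-100) of the
# filesystem that contains PATH, parsed from POSIX-format df output.
usage_percent() {
  # In `df -P` output the capacity column is second to last (e.g. "100%").
  df -P "$1" | awk 'NR == 2 { sub(/%/, "", $(NF-1)); print $(NF-1) }'
}
```

For example, `[ "$(usage_percent /run)" -ge 90 ] && echo "/run is nearly full"` could be run periodically from cron or a node health script.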
/run had run out of space, so I looked for large files under /run/containerd/io.containerd.runtime.v2.task/moby:
6.3G /run/containerd/io.containerd.runtime.v2.task/moby/6be974f9293ab553bf86f0eda38b7813f315a98e63895eddaccb6b290ef6a1ac/log.json
6.3G /run/containerd/io.containerd.runtime.v2.task/moby/811c9a0aabb50f5ca73e6ee529f41745c2e18568a160f42314caaba142562c6b/log.json
7.9G /run/containerd/io.containerd.runtime.v2.task/moby/398ecf3c1488b4f8ec0f0ad12ac0a1080355fbd8102f6ed21980c7d7637ec7d2/log.json
8.1G /run/containerd/io.containerd.runtime.v2.task/moby/ab7253b7bbd05c8fe017008de2ec4494b4c11f2d55b980927d2fdcb3b306c924/log.json
9.6G /run/containerd/io.containerd.runtime.v2.task/moby/6030f2ad8532162cfa0effb479a9cd3f31c894c2152dfe34ddc59244b53f6241/log.json
Case closed. I emptied the contents of these files and the service recovered immediately.
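The search for large files and the cleanup can both be done with standard tools. The following is a minimal sketch under assumptions, not the exact commands used during the incident: the function names `list_big_logs` and `truncate_big_logs` are made up for illustration, the size argument uses `find -size` syntax (for example `+100M`), and the files are truncated rather than deleted because that mirrors emptying their contents as described above.

```shell
# list_big_logs DIR SIZE: show log.json files under DIR larger than SIZE
# (SIZE uses find(1) syntax, e.g. +100M); sizes in KiB, smallest first.
list_big_logs() {
  find "$1" -type f -name log.json -size "$2" -exec du -k {} + | sort -n
}

# truncate_big_logs DIR SIZE: empty the same files in place, without deleting them.
truncate_big_logs() {
  find "$1" -type f -name log.json -size "$2" \
    -exec sh -c 'for f in "$@"; do : > "$f"; done' sh {} +
}
```

On a Docker node this might be invoked as `list_big_logs /run/containerd/io.containerd.runtime.v2.task/moby +100M`, followed by `truncate_big_logs` with the same arguments once the list looks right.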
2. Reproducing the log.json growth
2.1 Inspecting log.json and tracing its source
First, look at what the log contains:
{"level":"info","msg":"Running with config:\n{\n \"AcceptEnvvarUnprivileged\": true,\n \"NVIDIAContainerCLIConfig\": {\n \"Root\": \"\"\n },\n \"NVIDIACTKConfig\": {\n \"Path\": \"nvidia-ctk\"\n },\n \"NVIDIAContainerRuntimeConfig\": {\n \"DebugFilePath\": \"/dev/null\",\n \"LogLevel\": \"info\",\n \"Runtimes\": [\n \"docker-runc\",\n \"runc\"\n ],\n \"Mode\": \"auto\",\n \"Modes\": {\n \"CSV\": {\n \"MountSpecPath\": \"/etc/nvidia-container-runtime/host-files-for-container.d\"\n },\n \"CDI\": {\n \"SpecDirs\": null,\n \"DefaultKind\": \"nvidia.com/gpu\",\n \"AnnotationPrefixes\": [\n \"cdi.k8s.io/\"\n ]\n }\n }\n },\n \"NVIDIAContainerRuntimeHookConfig\": {\n \"Path\": \"/usr/bin/nvidia-container-runtime-hook\",\n \"SkipModeDetection\": false\n }\n}","time":"2024-01-23T09:43:36+08:00"}
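This does not look like application logging, and it is not. It is the runc log, or more precisely the log written by nvidia-container-runtime.
The info log level seen here should be configurable in the nvidia-container-runtime configuration file, so dropping info-level output is enough to silence these entries.
But is the problem limited to nvidia-container-runtime? No. It is a general problem that tends to be overlooked.
2.2 Building a test workload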
cat << EOF | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo-whomai
spec:
  replicas: 1
  selector:
    matchLabels:
      app: demo-whomai
  template:
    metadata:
      labels:
        app: demo-whomai
    spec:
      containers:
      - name: whomai
        image: shaowenchen/demo-whomai:latest
        readinessProbe:
          exec:
            command:
            - sh
            - -c
            - '[ -e /random/ ]'
          initialDelaySeconds: 1
          periodSeconds: 1
EOF
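After repeated testing I arrived at the workload above. Only two conditions are needed to make log.json grow without bound:
- A probe is configured.
- The probe command fails when it is executed.
Running sh in the shaowenchen/demo-whomai:latest image fails, so every probe attempt produces an error. With one probe per second, the log.json accumulated over two months can reach several GB.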
2.3 Creating the workload and testing
Create the workload from the YAML above, then check the Pod:
kubectl get pod -l app=demo-whomai
NAME READY STATUS RESTARTS AGE
demo-whomai-966dd7875-jvvzr 0/1 Running 0 48m
Because the health check (the readiness probe) keeps failing, the Pod never becomes Ready.
Get the container ID:
CONTAINER_ID=$(kubectl get pod -l app=demo-whomai -ojson | jq -r '.items[0].status.containerStatuses[0].containerID | sub("docker://"; "")')
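The jq expression above only strips the `docker://` prefix. On a containerd node the reported ID starts with `containerd://` instead, so a runtime-agnostic variant can be handy. The following is a small sketch; the helper name `strip_runtime_prefix` is an assumption for illustration, and it uses only POSIX parameter expansion.

```shell
# strip_runtime_prefix ID: drop the "<runtime>://" scheme that Kubernetes puts
# in front of a container ID (docker://..., containerd://...).
strip_runtime_prefix() {
  # "${1#*://}" removes the shortest leading match of "*://", if any.
  printf '%s\n' "${1#*://}"
}
```

It could be combined with kubectl's jsonpath output, for example `CONTAINER_ID=$(strip_runtime_prefix "$(kubectl get pod -l app=demo-whomai -o jsonpath='{.items[0].status.containerStatuses[0].containerID}')")`.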
With Docker, the container's bundle directory is under the moby namespace:
ls -alh /run/containerd/io.containerd.runtime.v2.task/moby/$CONTAINER_ID
total 1.4M
drwx------ 3 root root 240 Jan 23 10:40 .
drwx--x--x 162 root root 3.2K Jan 23 11:29 ..
-rw-rw-rw- 1 root root 89 Jan 23 10:40 address
-rw-r--r-- 1 root root 9.5K Jan 23 10:40 config.json
-rw-r--r-- 1 root root 7 Jan 23 10:40 init.pid
prwx------ 1 root root 0 Jan 23 10:40 log
-rw-r--r-- 1 root root 1.4M Jan 23 11:31 log.json
-rw------- 1 root root 23 Jan 23 10:40 options.json
drwxr-xr-x 1 root root 4.0K Jan 23 10:40 rootfs
-rw------- 1 root root 0 Jan 23 10:40 runtime
-rw------- 1 root root 32 Jan 23 10:40 shim-binary-path
lrwxrwxrwx 1 root root 118 Jan 23 10:40 work -> /data/containerd/io.containerd.runtime.v2.task/k8s.io/240ccf68446af4a761273e5db08f6aebc362715f64d51efea785c26900c569c5
With containerd, it is under the k8s.io namespace instead: ls -alh /run/containerd/io.containerd.runtime.v2.task/k8s.io/$CONTAINER_ID
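The Docker listing also gives a rough growth rate: the bundle was created at 10:40 and log.json had reached 1.4M by 11:31, about 51 minutes later. The following back-of-the-envelope sketch extrapolates that linearly. The function name `estimate_bytes` and the 60-day horizon are assumptions for illustration, and real growth depends on how long each error line is.

```shell
# estimate_bytes SIZE_BYTES ELAPSED_S HORIZON_S: linear extrapolation of file size.
estimate_bytes() {
  echo $(( $1 * $3 / $2 ))
}

# 1.4 MiB after 51 minutes, projected over 60 days.
size_bytes=1468006                 # 1.4 * 1024 * 1024, rounded down
elapsed_s=$(( 51 * 60 ))           # 10:40 -> 11:31
horizon_s=$(( 60 * 24 * 3600 ))    # two months, taken as 60 days
estimate_bytes "$size_bytes" "$elapsed_s" "$horizon_s"
```

The result is roughly 2.5 billion bytes (about 2.3 GiB), which is consistent with the "several GB over two months" estimate for a single container.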
Now read the log. With Docker, under the moby namespace:
cat /run/containerd/io.containerd.runtime.v2.task/moby/$CONTAINER_ID/log.json
{"level":"error","msg":"exec failed: unable to start container process: exec: \"sh\": executable file not found in $PATH","time":"2024-01-23T11:32:01+08:00"}
{"level":"error","msg":"exec failed: unable to start container process: exec: \"sh\": executable file not found in $PATH","time":"2024-01-23T11:32:01+08:00"}
{"level":"error","msg":"exec failed: unable to start container process: exec: \"sh\": executable file not found in $PATH","time":"2024-01-23T11:32:01+08:00"}
With containerd, under the k8s.io namespace: cat /run/containerd/io.containerd.runtime.v2.task/k8s.io/$CONTAINER_ID/log.json
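Since the only difference between the two runtimes here is the namespace directory, the lookup can be wrapped in a helper that tries both. This is a minimal sketch: the function name `find_bundle_dir` and its optional second argument are assumptions for illustration, and the default root is the path used throughout this article.

```shell
# find_bundle_dir ID [ROOT]: print the task bundle directory of container ID,
# checking the Docker (moby) and containerd (k8s.io) namespaces in turn.
find_bundle_dir() {
  root=${2:-/run/containerd/io.containerd.runtime.v2.task}
  for ns in moby k8s.io; do
    if [ -d "$root/$ns/$1" ]; then
      printf '%s\n' "$root/$ns/$1"
      return 0
    fi
  done
  return 1   # not found under either namespace
}
```

With it, `tail -n 3 "$(find_bundle_dir "$CONTAINER_ID")/log.json"` works on either kind of node.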
3. Solutions
3.1 Cleaning up directly
Install opscli:
curl -sfL https://ghproxy.chenshaowen.com/https://raw.githubusercontent.com/shaowenchen/ops/main/getcli.sh |VERSION=latest sh -
If it is already installed, run opscli upgrade to update it first.
With Docker:
opscli task -f ~/.ops/tasks/clear-biglog.yaml --logpath /run/containerd/io.containerd.runtime.v2.task/moby/ --logname "log.json" --size 100M -i ~/.kube/config --all
With containerd:
opscli task -f ~/.ops/tasks/clear-biglog.yaml --logpath /run/containerd/io.containerd.runtime.v2.task/k8s.io/ --logname "log.json" --size 100M -i ~/.kube/config --all
The commands above only list the matching files. Adding the --clear flag to them clears every log.json larger than 100M directly.
3.2 Changing the nvidia-container-runtime log level
Edit the nvidia-container-runtime configuration file:
vim /etc/nvidia-container-runtime/config.toml
[nvidia-container-runtime]
log-level = "info"
Changing log-level = "info" to log-level = "error" suppresses info-level entries such as the "Running with config" dump shown in section 2.1.
These entries appear under io.containerd.runtime.v2 for both Docker and containerd, but not under io.containerd.runtime.v1.
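On a fleet of GPU nodes the edit is easier to apply non-interactively. The following is a sketch using sed, assuming the file has the `[nvidia-container-runtime]` table and `log-level` key shown above; the function name `set_runtime_log_level` is made up for illustration. The address range limits the substitution to that table, so a `log-level` key in any other table is left untouched.

```shell
# set_runtime_log_level FILE LEVEL: set log-level inside the
# [nvidia-container-runtime] table of FILE, keeping a FILE.bak backup.
set_runtime_log_level() {
  # The range runs from the table header to the next line starting with "[".
  sed -i.bak \
    '/^\[nvidia-container-runtime\]/,/^\[/ s/^\([[:space:]]*log-level[[:space:]]*=[[:space:]]*\)"[^"]*"/\1"'"$2"'"/' \
    "$1"
}
```

For example: `set_runtime_log_level /etc/nvidia-container-runtime/config.toml error`. The previous version of the file remains next to it with a `.bak` suffix.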
3.3 Changing containerd's state directory
vim /etc/containerd/config.toml
state = "/run/containerd"
Change state = "/run/containerd" to state = "/data/containerd", and mount /data on a separate large disk. That way, even a very large log.json is unlikely to fill up the storage.