opscli Use Cases
Testing the disk I/O performance of a specific node from a kubectl pod
- Install opscli on Alpine
sed -i 's/dl-cdn.alpinelinux.org/mirrors.aliyun.com/g' /etc/apk/repositories
apk add curl
curl -sfL https://ghproxy.chenshaowen.com/https://raw.githubusercontent.com/shaowenchen/ops/main/getcli.sh |VERSION=latest sh -
- Install fio on the node (the command below uses apt-get, so it applies to Debian/Ubuntu nodes)
opscli shell --content "apt-get install fio -y" --nodename node1
- Test disk I/O performance on the node
opscli task -f ~/.ops/tasks/get-diskio-byfio.yaml --size 1g --filename=/tmp/testfile --nodename node1
Here --size is the size of the test file, --filename is the path of the test file, and --nodename is the name of the node under test.
(1/8) Rand_Read_Testing
read: IOPS=105k, BW=410MiB/s (430MB/s)(1024MiB/2498msec) -> 4k random read: 410 MiB/s
(2/8) Rand_Write_Testing
write: IOPS=55.9k, BW=218MiB/s (229MB/s)(1024MiB/4688msec) -> 4k random write: 218 MiB/s
(3/8) Sequ_Read_Testing
read: IOPS=51.8k, BW=6481MiB/s (6796MB/s)(1024MiB/158msec) -> 128k sequential read: 6481 MiB/s
(4/8) Sequ_Write_Testing
write: IOPS=30.7k, BW=3835MiB/s (4022MB/s)(1024MiB/267msec) -> 128k sequential write: 3835 MiB/s
(5/8) Rand_Read_IOPS_Testing
read: IOPS=80.4k, BW=314MiB/s (329MB/s)(1024MiB/3261msec) -> 4k read IOPS: 80.4k
(6/8) Rand_Write_IOPS_Testing
write: IOPS=83.4k, BW=326MiB/s (342MB/s)(1024MiB/3143msec) -> 4k write IOPS: 83.4k
(7/8) Rand_Read_Latency_Testing
lat (usec): min=34, max=457722, avg=57.78, stdev=1630.32 -> 4k average read latency: 57.78 us
(8/8) Rand_Write_Latency_Testing
lat (usec): min=35, max=664838, avg=385.12, stdev=5335.64 -> 4k average write latency: 385.12 us
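As a sanity check on results like these, bandwidth should be roughly IOPS multiplied by the block size. The sketch below is not part of opscli; the helper names fio_iops and implied_bw_mib are made up for illustration, and the 4k and 128k block sizes are taken from the annotations above.

```shell
#!/bin/sh
# Hypothetical helpers (not part of opscli or fio) for cross-checking
# the IOPS and BW figures in a fio summary line.

# Extract the IOPS figure from a fio summary line and expand a trailing
# "k" (fio prints 105k for roughly 105000).
fio_iops() {
  printf '%s\n' "$1" | awk '
    match($0, /IOPS=[0-9.]+k?/) {
      v = substr($0, RSTART + 5, RLENGTH - 5)
      if (v ~ /k$/) { sub(/k$/, "", v); v = v * 1000 }
      printf "%.0f\n", v
    }'
}

# Bandwidth in MiB/s implied by an IOPS figure and a block size in KiB.
implied_bw_mib() {
  awk -v iops="$1" -v bs_kib="$2" \
    'BEGIN { printf "%.1f\n", iops * bs_kib / 1024 }'
}

line='read: IOPS=105k, BW=410MiB/s (430MB/s)(1024MiB/2498msec)'
implied_bw_mib "$(fio_iops "$line")" 4   # prints 410.2
```

For the 4k random read, 105k IOPS implies about 410.2 MiB/s, which matches the reported 410 MiB/s. For the 128k sequential read, 51.8k IOPS implies about 6475 MiB/s against the reported 6481 MiB/s; the small gap comes from fio rounding the IOPS value it prints.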
Configuring inspections for the GPU hosts in a cluster
- Install opscli on all master nodes
opscli task -f ~/.ops/tasks/install-opscli.yaml -i master-ips.txt
- On a machine that can SSH into all nodes, create a secret from the SSH private key used to access the hosts
kubectl -n ops-system create secret generic host-secret --from-file=privatekey=/root/.ssh/id_rsa
- Add all task templates
kubectl apply -f ~/.ops/tasks/
- Discover hosts automatically
kubectl apply -f - <<EOF
apiVersion: crd.chenshaowen.com/v1
kind: TaskRun
metadata:
  name: auto-create-host
  namespace: ops-system
spec:
  crontab: 40 * * * *
  taskRef: auto-create-host
EOF
- Apply labels automatically
kubectl apply -f - <<EOF
apiVersion: crd.chenshaowen.com/v1
kind: TaskRun
metadata:
  name: alert-label-gpu
  namespace: ops-system
spec:
  crontab: 40 * * * *
  taskRef: alert-label-gpu
EOF
- GPU drop inspection (a GPU card that has disappeared from the host)
kubectl apply -f - <<EOF
apiVersion: crd.chenshaowen.com/v1
kind: TaskRun
metadata:
  name: alert-gpu-drop
  namespace: ops-system
spec:
  crontab: 40 * * * *
  taskRef: alert-gpu-drop
EOF
- GPU ECC inspection
kubectl apply -f - <<EOF
apiVersion: crd.chenshaowen.com/v1
kind: TaskRun
metadata:
  name: alert-gpu-ecc
  namespace: ops-system
spec:
  crontab: 40 * * * *
  taskRef: alert-gpu-ecc
EOF
- GPU Fabric inspection
kubectl apply -f - <<EOF
apiVersion: crd.chenshaowen.com/v1
kind: TaskRun
metadata:
  name: alert-gpu-fabric
  namespace: ops-system
spec:
  crontab: 40 * * * *
  taskRef: alert-gpu-fabric
EOF
- GPU Zombie inspection
kubectl apply -f - <<EOF
apiVersion: crd.chenshaowen.com/v1
kind: TaskRun
metadata:
  name: alert-gpu-zombie
  namespace: ops-system
spec:
  crontab: 40 * * * *
  taskRef: alert-gpu-zombie
EOF
- Clean up disks on a schedule
kubectl apply -f - <<EOF
apiVersion: crd.chenshaowen.com/v1
kind: TaskRun
metadata:
  name: clear-disk
  namespace: ops-system
spec:
  crontab: 0 1 * * *
  taskRef: clear-disk
EOF
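The TaskRun manifests above differ only in the task name and the schedule, so they could be generated from a small helper rather than written out one by one. The following is a minimal sketch, not something shipped with opscli: taskrun_manifest is a hypothetical function name, and the nesting of name and namespace under metadata and of crontab and taskRef under spec is assumed from the usual Kubernetes manifest layout. If the crontab field follows standard five-field cron syntax, 40 * * * * would fire at minute 40 of every hour and 0 1 * * * at 01:00 every day.

```shell
#!/bin/sh
# Hypothetical helper: print a TaskRun manifest whose metadata.name and
# spec.taskRef are both the given task name, as in the manifests above.
# The field nesting is an assumption based on common Kubernetes layout.
taskrun_manifest() {
  name=$1
  schedule=$2
  cat <<EOF
apiVersion: crd.chenshaowen.com/v1
kind: TaskRun
metadata:
  name: ${name}
  namespace: ops-system
spec:
  crontab: ${schedule}
  taskRef: ${name}
EOF
}

# Print the GPU inspection manifests as one multi-document YAML stream.
# The schedule is quoted so the shell does not expand the asterisks.
for task in alert-gpu-drop alert-gpu-ecc alert-gpu-fabric alert-gpu-zombie; do
  echo '---'
  taskrun_manifest "$task" '40 * * * *'
done
```

Saved as a script, the output can be reviewed first and then piped into kubectl apply -f - once it looks right.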