opscli use cases

Test the disk I/O performance of a specific node from a kubectl pod

  • Install opscli for Alpine
sed -i 's/dl-cdn.alpinelinux.org/mirrors.aliyun.com/g' /etc/apk/repositories
apk add curl
curl -sfL https://ghproxy.chenshaowen.com/https://raw.githubusercontent.com/shaowenchen/ops/main/getcli.sh |VERSION=latest sh -
  • Install fio on the node
opscli shell --content "apt-get install fio -y" --nodename node1
  • Test disk I/O performance on the node
opscli task -f ~/.ops/tasks/get-diskio-byfio.yaml --size 1g --filename=/tmp/testfile --nodename node1

Here size is the size of the test file, filename is the path of the test file, and nodename is the name of the node to test.

(1/8) Rand_Read_Testing

read: IOPS=105k, BW=410MiB/s (430MB/s)(1024MiB/2498msec) -> 4k random read: 410 MiB/s

(2/8) Rand_Write_Testing

write: IOPS=55.9k, BW=218MiB/s (229MB/s)(1024MiB/4688msec) -> 4k random write: 218 MiB/s

(3/8) Sequ_Read_Testing

read: IOPS=51.8k, BW=6481MiB/s (6796MB/s)(1024MiB/158msec) -> 128k sequential read: 6481 MiB/s

(4/8) Sequ_Write_Testing

write: IOPS=30.7k, BW=3835MiB/s (4022MB/s)(1024MiB/267msec) -> 128k sequential write: 3835 MiB/s

(5/8) Rand_Read_IOPS_Testing

read: IOPS=80.4k, BW=314MiB/s (329MB/s)(1024MiB/3261msec) -> 4k read IOPS: 80.4k

(6/8) Rand_Write_IOPS_Testing

write: IOPS=83.4k, BW=326MiB/s (342MB/s)(1024MiB/3143msec) -> 4k write IOPS: 83.4k

(7/8) Rand_Read_Latency_Testing

lat (usec): min=34, max=457722, avg=57.78, stdev=1630.32 -> 4k read latency: 57.78 us

(8/8) Rand_Write_Latency_Testing

lat (usec): min=35, max=664838, avg=385.12, stdev=5335.64 -> 4k write latency: 385.12 us
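
The summary lines above can be cross-checked with a little arithmetic: bandwidth is roughly IOPS times block size, and also total data divided by elapsed time. The sketch below is an illustration only and not part of opscli; the helper names bw_from_iops and bw_from_time are hypothetical, and it assumes awk is available (BusyBox awk on Alpine should be enough).

```shell
#!/bin/sh
# Hypothetical helpers for sanity-checking fio summary lines; not part of opscli.

# bw_from_iops IOPS BLOCK_KIB -> bandwidth in MiB/s (IOPS * block size)
bw_from_iops() {
  awk -v iops="$1" -v bs="$2" 'BEGIN { printf "%.1f\n", iops * bs / 1024 }'
}

# bw_from_time SIZE_MIB ELAPSED_MSEC -> bandwidth in MiB/s (data / elapsed time)
bw_from_time() {
  awk -v size="$1" -v ms="$2" 'BEGIN { printf "%.1f\n", size * 1000 / ms }'
}

# 4k random read from step (1/8): IOPS=105k, 1024MiB transferred in 2498msec
bw_from_iops 105000 4    # prints 410.2
bw_from_time 1024 2498   # prints 409.9
```

Both values land close to the 410 MiB/s that fio reports for step (1/8); the small gap comes from fio rounding IOPS to 105k.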

Configure inspections for the cluster's GPU hosts

  • Install opscli on all master nodes
opscli task -f ~/.ops/tasks/install-opscli.yaml -i master-ips.txt
  • On a machine that can SSH to all nodes, create the secret holding the SSH key used to access the hosts
kubectl -n ops-system create secret generic host-secret --from-file=privatekey=/root/.ssh/id_rsa
  • Add all task templates
kubectl apply -f ~/.ops/tasks/
  • Discover hosts automatically
kubectl apply -f - <<EOF
apiVersion: crd.chenshaowen.com/v1
kind: TaskRun
metadata:
  name: auto-create-host
  namespace: ops-system
spec:
  crontab: 40 * * * *
  taskRef: auto-create-host
EOF
  • Apply labels automatically
kubectl apply -f - <<EOF
apiVersion: crd.chenshaowen.com/v1
kind: TaskRun
metadata:
  name: alert-label-gpu
  namespace: ops-system
spec:
  crontab: 40 * * * *
  taskRef: alert-label-gpu
EOF
  • GPU drop inspection (GPU card no longer detected)
kubectl apply -f - <<EOF
apiVersion: crd.chenshaowen.com/v1
kind: TaskRun
metadata:
  name: alert-gpu-drop
  namespace: ops-system
spec:
  crontab: 40 * * * *
  taskRef: alert-gpu-drop
EOF
  • GPU ECC inspection
kubectl apply -f - <<EOF
apiVersion: crd.chenshaowen.com/v1
kind: TaskRun
metadata:
  name: alert-gpu-ecc
  namespace: ops-system
spec:
  crontab: 40 * * * *
  taskRef: alert-gpu-ecc
EOF
  • GPU Fabric inspection
kubectl apply -f - <<EOF
apiVersion: crd.chenshaowen.com/v1
kind: TaskRun
metadata:
  name: alert-gpu-fabric
  namespace: ops-system
spec:
  crontab: 40 * * * *
  taskRef: alert-gpu-fabric
EOF
  • GPU Zombie inspection
kubectl apply -f - <<EOF
apiVersion: crd.chenshaowen.com/v1
kind: TaskRun
metadata:
  name: alert-gpu-zombie
  namespace: ops-system
spec:
  crontab: 40 * * * *
  taskRef: alert-gpu-zombie
EOF
  • Scheduled disk cleanup
kubectl apply -f - <<EOF
apiVersion: crd.chenshaowen.com/v1
kind: TaskRun
metadata:
  name: clear-disk
  namespace: ops-system
spec:
  crontab: 0 1 * * *
  taskRef: clear-disk
EOF
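
The GPU inspection TaskRuns above differ only in their name, so they could also be generated in a loop. This is a minimal sketch, assuming each TaskRun is named after the task it references, as in the manifests above; render_taskrun is a hypothetical helper, not an opscli command. It quotes the crontab value, which YAML reads as the same string as the unquoted form used above but which also stays valid for schedules that begin with an asterisk.

```shell
#!/bin/sh
# render_taskrun NAME CRONTAB
# Hypothetical helper: prints a TaskRun manifest whose name and taskRef are both NAME.
render_taskrun() {
  cat <<EOF
apiVersion: crd.chenshaowen.com/v1
kind: TaskRun
metadata:
  name: $1
  namespace: ops-system
spec:
  crontab: "$2"
  taskRef: $1
EOF
}

# Emit one YAML document per GPU inspection task, separated by "---".
# "40 * * * *" runs at minute 40 of every hour, as in the manifests above.
for task in alert-gpu-drop alert-gpu-ecc alert-gpu-fabric alert-gpu-zombie; do
  render_taskrun "$task" "40 * * * *"
  echo "---"
done
```

As written, the script only prints the manifests so they can be reviewed first; piping its output into kubectl apply -f - should then create the four TaskRuns in one step.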
