Pods on a node cannot reach a Service because kube-proxy is failing

1. Problem description

  • Related Pods

kubectl -n istio-system get pod -o wide

NAME                      READY   STATUS    RESTARTS   AGE   IP               NODE             NOMINATED NODE   READINESS GATES
istiod-647c7c9d95-7n7n6   1/1     Running   0          77m   10.244.173.51    docs-ai-a800-4   <none>           <none>
istiod-647c7c9d95-k6l88   1/1     Running   0          30m   10.244.210.160   ai-a40-2         <none>           <none>
istiod-647c7c9d95-pj82r   1/1     Running   0          51m   10.244.229.217   docs-ai-a800-2   <none>           <none>
  • Related Service

kubectl -n istio-system get svc

NAME     TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)                                 AGE
istiod   ClusterIP   10.99.225.56   <none>        15010/TCP,15012/TCP,443/TCP,15014/TCP   645d

kubectl -n istio-system get endpoints

NAME     ENDPOINTS                                                                   AGE
istiod   10.244.173.51:15012,10.244.210.160:15012,10.244.229.217:15012 + 9 more...   645d

The Endpoints IPs match the Pod IPs.
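The comparison above was done by eye. As a minimal sketch, it could be scripted with a small helper; `missing_from_endpoints` is a hypothetical name, and the two `kubectl` invocations in the comments assume that every Pod in the namespace belongs to the `istiod` Service.

```shell
#!/usr/bin/env bash
# missing_from_endpoints POD_IPS ENDPOINT_IPS
# Both arguments are space-separated IP lists. Prints every Pod IP that
# is absent from the endpoint list; prints nothing when they agree.
missing_from_endpoints() {
    local pod_ips=$1 endpoint_ips=$2 ip
    for ip in $pod_ips; do
        case " $endpoint_ips " in
            *" $ip "*) ;;        # Pod IP is registered as an endpoint
            *) echo "$ip" ;;     # Pod IP has no matching endpoint
        esac
    done
}

# Possible usage (assumes all Pods in the namespace back the Service):
# pods=$(kubectl -n istio-system get pod -o jsonpath='{.items[*].status.podIP}')
# eps=$(kubectl -n istio-system get endpoints istiod -o jsonpath='{.subsets[*].addresses[*].ip}')
# missing_from_endpoints "$pods" "$eps"
```

An empty result matches what was observed here: the Endpoints object itself was correct, so the fault had to be further down the path.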

  • Test results

Run a Pod on the affected node to test network connectivity.

telnet 10.244.173.51 15012
Trying 10.244.173.51...
Connected to 10.244.173.51.
Escape character is '^]'.
^CConnection closed by foreign host.

telnet 10.244.210.160 15012
Trying 10.244.210.160...
Connected to 10.244.210.160.
Escape character is '^]'.
^CConnection closed by foreign host.

telnet 10.244.229.217 15012
Trying 10.244.229.217...
Connected to 10.244.229.217.
Escape character is '^]'.
^CConnection closed by foreign host.

The Pods backing the Service are reachable directly, but the Service itself is not.

telnet 10.99.225.56 15012
Trying 10.99.225.56...
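The telnet checks above hang until interrupted when the target does not answer. A non-interactive variant with a timeout might look like the sketch below; `tcp_check` is a hypothetical helper, and it assumes the test Pod's image ships both bash (for the `/dev/tcp` pseudo-device) and the coreutils `timeout` command.

```shell
#!/usr/bin/env bash
# tcp_check HOST PORT [TIMEOUT_SECONDS]
# Returns 0 if a TCP connection can be opened within the timeout,
# non-zero otherwise (refused, unreachable, or timed out).
tcp_check() {
    local host=$1 port=$2 wait=${3:-3}
    # The child bash receives host and port as $0 and $1.
    timeout "$wait" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}

# Possible usage with the addresses from this article:
# for target in 10.244.173.51 10.244.210.160 10.244.229.217 10.99.225.56; do
#     if tcp_check "$target" 15012; then echo "$target ok"; else echo "$target FAILED"; fi
# done
```

With the symptoms described here, the three Pod IPs would be expected to report ok and only the ClusterIP to fail.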

2. Problem analysis

2.1 Check the kube-apiserver logs

kubectl -n kube-system logs kube-apiserver-ai-kas-master-01 --tail 100 -f

E0214 07:03:19.604150       1 authentication.go:63] "Unable to authenticate the request" err="[invalid bearer token, service account token has expired]"

2.2 Check the kube-proxy logs on the node

kubectl -n kube-system logs kube-proxy-6c9gr -f

E0328 05:01:29.303620       1 reflector.go:138] k8s.io/client-go/informers/factory.go:134: Failed to watch *v1.Service: failed to list *v1.Service: Unauthorized
W0328 05:01:59.745555       1 reflector.go:324] k8s.io/client-go/informers/factory.go:134: failed to list *v1.EndpointSlice: Unauthorized
E0328 05:01:59.745603       1 reflector.go:138] k8s.io/client-go/informers/factory.go:134: Failed to watch *v1.EndpointSlice: failed to list *v1.EndpointSlice: Unauthorized
W0328 05:02:09.815386       1 reflector.go:324] k8s.io/client-go/informers/factory.go:134: failed to list *v1.Service: Unauthorized
E0328 05:02:09.815433       1 reflector.go:138] k8s.io/client-go/informers/factory.go:134: Failed to watch *v1.Service: failed to list *v1.Service: Unauthorized
W0328 05:02:34.999987       1 reflector.go:324] k8s.io/client-go/informers/factory.go:134: failed to list *v1.EndpointSlice: Unauthorized
E0328 05:02:35.000026       1 reflector.go:138] k8s.io/client-go/informers/factory.go:134: Failed to watch *v1.EndpointSlice: failed to list *v1.EndpointSlice: Unauthorized

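kube-proxy is rejected as Unauthorized whenever it lists or watches Service and EndpointSlice objects, so it apparently cannot sync Service information into the iptables rules. That would explain why Pods on this node cannot reach the Service.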
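One way to test that hypothesis is to look at what kube-proxy actually programmed on the node. The sketch below assumes kube-proxy runs in iptables mode and that its DNAT rules carry a `namespace/name:port` comment and a `--to-destination` target; `service_backends` is a hypothetical helper that reads `iptables-save -t nat` output on stdin.

```shell
#!/usr/bin/env bash
# service_backends NAMESPACE/NAME
# Reads `iptables-save -t nat` output on stdin and prints the unique
# DNAT targets (ip:port) of rules whose comment names the given Service.
service_backends() {
    local svc=$1
    awk -v svc="$svc" '
        # Match comments such as "ns/name:port" or "ns/name".
        index($0, "\"" svc ":") || index($0, "\"" svc "\"") {
            for (i = 1; i < NF; i++)
                if ($i == "--to-destination") print $(i + 1)
        }' | sort -u
}

# Possible usage, run as root on the affected node:
# iptables-save -t nat | service_backends istio-system/istiod
```

If the printed targets differ from the current Endpoints (10.244.173.51, 10.244.210.160, 10.244.229.217), the rules are stale, which is consistent with kube-proxy being unable to list or watch.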

2.3 Check the kube-proxy credentials

  • Check the certificate

kubectl -n kube-system exec kube-proxy-6c9gr -- cat /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
---
-----BEGIN CERTIFICATE-----
MIIC/jCCAeagAwIBAgIBADANBgkqhkiG9w0BAQsFADAVMRMwEQYDVQQDEwprdWJl
cm5ldGVzMB4XDTIzMDUxODA5MjUyOVoXDTMzMDUxNTA5MjUyOVowFTETMBEGA1UE
AxMKa3ViZXJuZXRlczCCASIwDQYJKoZIhvcNAQEBBQADggEPADCCAQoCggEBAK7m
ZOYNrulW7CrJrJIG1UAojwfVpbC4nT3sclaCLhn/RsdMWrCcjOzxVUVV7fNyhOU1
dGBuja8OVO8191FDioworcXebjdWtDt+35Tas8/J1z3qH4cuLK9T0SIWMnShAOp1
TvE9/gIbDuPDwlqPsCuPANW9DXDmCxbzGwMqFdLLeClEKASITc4a6cPOuFJP4/lp
tZDfA0VuKnXiUFnt31jmIefFaLtZDbY3v5ry+ubrIKxfSmw3PfN/u0/LR+eg1GEG
YGIGBp8Kix/QQzzxhcfNWLRbYmqBJuR5DsXv/qS2ILNR/Jbbfgm7HiA3JKP+7pDr
56jaVDb4LcTv/9bKQAsCAwEAAaNZMFcwDgYDVR0PAQH/BAQDAgKkMA8GA1UdEwEB
/wQFMAMBAf8wHQYDVR0OBBYEFOAoDQT7ZFaOU6QpRsUE0xGN0XDeMBUGA1UdEQQO
MAyCCmt1YmVybmV0ZXMwDQYJKoZIhvcNAQELBQADggEBAJ5IvgmCUPlwLL94Joll
i9YDla8pWFXemBub/aNsN7ub6bSerYH8vs1vS/ooerVSEojmC75HOPo1zq55s0iK
gpaLQtgmFYtt6GGDzhzjwg5BFEu4f7SO24aY2WCmbwsmYrLSNfeoVOw+02ammAw+
MwwdlaeNcV1UGYQSSYXM4L0F032SIqTVJgrM6uTKWHmdCutRIXAVPLgXGhIl1yaM
HXJVqstshnqR5GC/EVIx9e1onb518ItnpHwSnJaRZerV7itznu2SVYQQMksm1hTn
hElvYLbtWwM99NwWDVMz8F5TiO7y5xTa/3lUXzDvgIiTz8szOFgC5iJFtEnfEMXu
IOI=
-----END CERTIFICATE-----

Save the certificate as kube-proxy-temp.crt, then check its expiry date.

openssl x509 -in kube-proxy-temp.crt -noout -enddate

notAfter=May 15 09:25:29 2033 GMT
  • Check the token

kubectl -n kube-system exec kube-proxy-7jf78 -- cat /var/run/secrets/kubernetes.io/serviceaccount/token

eyJhbGciOiJSUzI1NiIsImtpZCI6IjFjSGk1VWwweHE5cUJiTmhWV0dJTEdnejFEc2xIa21JVjIwOXM3MWVFem8ifQ.eyJhdWQiOlsiaHR0cHM6Ly9rdWJlcm5ldGVzLmRlZmF1bHQuc3ZjLmNsdXN0ZXIubG9jYWwiXSwiZXhwIjoxNzI2NjU0MjU0LCJpYXQiOjE2OTUxMTgyNTQsImlzcyI6Imh0dHBzOi8va3ViZXJuZXRlcy5kZWZhdWx0LnN2Yy5jbHVzdGVyLmxvY2FsIiwia3ViZXJuZXRlcy5pbyI6eyJuYW1lc3BhY2UiOiJrdWJlLXN5c3RlbSIsInBvZCI6eyJuYW1lIjoia3ViZS1wcm94eS03amY3OCIsInVpZCI6ImIzOTkwNDVmLWFlZjMtNGM3Yy1iOWMwLTVmZmIyZDFhMTJmNCJ9LCJzZXJ2aWNlYWNjb3VudCI6eyJuYW1lIjoia3ViZS1wcm94eSIsInVpZCI6IjAyNmYzZTIyLTkxOTMtNDdkMS04M2IxLWVjNjVjYmY3YjA2NCJ9LCJ3YXJuYWZ0ZXIiOjE2OTUxMjE4NjF9LCJuYmYiOjE2OTUxMTgyNTQsInN1YiI6InN5c3RlbTpzZXJ2aWNlYWNjb3VudDprdWJlLXN5c3RlbTprdWJlLXByb3h5In0.WykJsfpP1wzjiI6Q6AMDmWLbcrPaSy6NGWhhP90Xfz5Oix3rVEEthAITJyjJHEcPBLNxgNuBc6OD3FYW10nBEeTjnv7dcTJnxxKy3q-u1aywOtOjherJOR3jimRclqFAmGf5TgnZ1qpI_UXRw4--K-WDIltRkz5EYXeNStCNsHMAoJdwY-H_l_ZT3MmEKo7zCmsgAuFarSKpuaffG3RirXNZ3SuzosIhbN6KpBQ_uzI9JZOanf7i5-n8fhGR6SMqxCEYhyFvBx4AwXNPjHfCXs7K3yVk3EzrJMr6aifxh86Xzpqs-mN7E1MJGxXilTa03Xd2YlfhCT45D6yjcTdqHQ

Save the token to a file named token, then check its expiry time.

cat token | cut -d "." -f 2 | base64 -d 2>/dev/null | jq .exp

1726654254

date -d @1726654254

Wed 18 Sep 2024 06:10:54 PM CST

The token has already expired. But why did the kubelet not refresh it automatically?
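The manual decode above works for this token, but JWT segments are base64url-encoded without padding, which a plain `base64 -d` does not always accept. A slightly more defensive sketch follows; `jwt_numeric_claim` is a hypothetical helper, it assumes GNU `base64`, and it extracts numeric claims with `grep` on the compact JSON payload (using `jq`, as in the article, would be more robust where it is installed).

```shell
#!/usr/bin/env bash
# jwt_numeric_claim TOKEN CLAIM
# Prints the value of a numeric claim (for example exp or iat) from the
# payload of a JWT. The signature is not verified.
jwt_numeric_claim() {
    local payload
    # The payload is the second dot-separated segment; map the base64url
    # alphabet back to standard base64.
    payload=$(printf '%s' "$1" | cut -d '.' -f 2 | tr '_-' '/+')
    # Restore the '=' padding that JWTs omit.
    while (( ${#payload} % 4 )); do payload+="="; done
    printf '%s' "$payload" | base64 -d | grep -o "\"$2\":[0-9]*" | cut -d ':' -f 2
}

# Possible usage with the saved token file:
# exp=$(jwt_numeric_claim "$(cat token)" exp)
# iat=$(jwt_numeric_claim "$(cat token)" iat)
# echo "lifetime: $(( (exp - iat) / 86400 )) days"
```

For the token shown above, `iat` is 1695118254 and `exp` is 1726654254, so the lifetime is 31536000 seconds, or 365 days. The token in the Pod had therefore not been replaced for at least a year.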

3. Resolution

3.1 Restart the affected Pod

kubectl -n kube-system delete pod kube-proxy-6c9gr

Deleting the affected Pod causes a fresh token to be issued for the replacement Pod, which works around the expired token.

3.2 Check the other kube-proxy Pods

kube-proxy directly affects traffic forwarding, so I went through the kube-proxy logs on every node again and found similar errors on other nodes.

kubectl -n kube-system logs -l k8s-app=kube-proxy -f --max-log-requests 999 --prefix | grep --line-buffered "Unauthorized"

[pod/kube-proxy-7jf78/kube-proxy] E0331 01:21:25.284912       1 reflector.go:138] k8s.io/client-go/informers/factory.go:134: Failed to watch *v1.Service: failed to list *v1.Service: Unauthorized
[pod/kube-proxy-872lj/kube-proxy] W0331 01:20:00.708171       1 reflector.go:324] k8s.io/client-go/informers/factory.go:134: failed to list *v1.Service: Unauthorized
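Because the command above follows the logs, the same Pods repeat indefinitely. To get a one-off list of affected Pods, the prefixed output could be summarised as sketched below; `unauthorized_pods` is a hypothetical helper that relies on the `[pod/NAME/CONTAINER]` prefix produced by `--prefix`.

```shell
#!/usr/bin/env bash
# unauthorized_pods
# Reads `kubectl logs --prefix` output on stdin and prints each Pod that
# logged "Unauthorized", followed by the number of matching lines.
unauthorized_pods() {
    awk '
        /Unauthorized/ && match($0, "^\\[pod/[^/]+/") {
            # Skip the 5-character "[pod/" prefix and the trailing "/".
            count[substr($0, 6, RLENGTH - 6)]++
        }
        END { for (pod in count) print pod, count[pod] }' | sort
}

# Possible usage (no -f, so the command terminates):
# kubectl -n kube-system logs -l k8s-app=kube-proxy --prefix --tail 200 \
#     --max-log-requests 999 | unauthorized_pods
```

The Pod names in the first column are the candidates for the restart described in 3.1.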

3.3 Check token expiry for all Pods

now=$(date +%s)
kubectl get pods -A --field-selector=status.phase=Running -o json | jq -r '.items[] | {pod: .metadata.name, namespace: .metadata.namespace, node: .spec.nodeName} | @base64' | while read -r line; do
    data=$(echo "$line" | base64 --decode)
    pod=$(echo "$data" | jq -r '.pod')
    namespace=$(echo "$data" | jq -r '.namespace')
    node=$(echo "$data" | jq -r '.node')
    # Pods without a mounted token, or without cat in the image, yield an empty token.
    token=$(kubectl exec -n "$namespace" "$pod" -- cat /var/run/secrets/kubernetes.io/serviceaccount/token 2>/dev/null)
    # The JWT payload is the second dot-separated field.
    exp=$(echo "$token" | cut -d "." -f2 | base64 -d 2>/dev/null | jq -r .exp 2>/dev/null)
    echo "Pod: $pod, Namespace: $namespace, Node: $node, Exp: $exp"
    # Compare only when exp is numeric; it is empty or "null" when no token was read.
    if [[ "$exp" =~ ^[0-9]+$ && "$exp" -lt "$now" ]]; then
        echo "Expired Token: Pod=$pod, Node=$node, Namespace=$namespace, Expiry=$(date -d "@$exp")"
    fi
done

The problem mainly shows up in Kubernetes system components (kube-proxy, kube-controller-manager, kube-scheduler), and restarting the affected Pods resolves it.

4. Summary

This post records a case where Pods on specific nodes could not reach a Service. The kube-apiserver and kube-proxy logs showed that an expired token was the cause.

There are two main fixes: restarting the Pod or restarting the kubelet. In this case, restarting the kubelet did not help.

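In addition, according to the relevant documentation, client-go versions older than v11.0.0 or v0.15.0 do not automatically reload the refreshed token, which creates a risk of the token expiring. The cluster here runs Kubernetes v1.23.6, which is outside that affected range.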

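Nodes running containerd did not have this problem; the affected nodes were all running Docker. For that reason I did not dig further into the root cause and only recorded the fix.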

By extension, if the kubelet hangs or misbehaves so that kube-proxy's token cannot be refreshed, traffic forwarding will break in the same way.
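A stalled refresh can in principle be noticed before the token expires. The Kubernetes documentation describes the kubelet as rotating a projected token once it is older than 80% of its lifetime or older than 24 hours, so a token whose `iat` claim is much older than a day suggests that rotation has stopped. A minimal sketch of such a check, with `token_is_stale` as a hypothetical helper and the 24-hour default taken from that documented behaviour:

```shell
#!/usr/bin/env bash
# token_is_stale IAT [NOW] [MAX_AGE_SECONDS]
# Succeeds (exit 0) when the token was issued more than MAX_AGE_SECONDS
# before NOW. NOW defaults to the current time, MAX_AGE to 24 hours.
token_is_stale() {
    local iat=$1 now=${2:-$(date +%s)} max_age=${3:-86400}
    (( now - iat > max_age ))
}

# Possible usage with the iat claim of the token from section 2.3:
# if token_is_stale 1695118254; then echo "token has not been rotated"; fi
```

Run against the token from this incident, the check would have flagged it roughly a day after it was issued, long before the Unauthorized errors appeared.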

