1. Problem Description
| kubectl -n istio-system get pod -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
istiod-647c7c9d95-7n7n6 1/1 Running 0 77m 10.244.173.51 docs-ai-a800-4 <none> <none>
istiod-647c7c9d95-k6l88 1/1 Running 0 30m 10.244.210.160 ai-a40-2 <none> <none>
istiod-647c7c9d95-pj82r 1/1 Running 0 51m 10.244.229.217 docs-ai-a800-2 <none> <none>
|
| kubectl -n istio-system get svc
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
istiod ClusterIP 10.99.225.56 <none> 15010/TCP,15012/TCP,443/TCP,15014/TCP 645d
|
| kubectl -n istio-system get endpoints
NAME ENDPOINTS AGE
istiod 10.244.173.51:15012,10.244.210.160:15012,10.244.229.217:15012 + 9 more... 645d
|
The Endpoints match the Pod IPs.
Start a Pod on the affected node to test network connectivity.
| telnet 10.244.173.51 15012
Trying 10.244.173.51...
Connected to 10.244.173.51.
Escape character is '^]'.
^CConnection closed by foreign host.
|
| telnet 10.244.210.160 15012
Trying 10.244.210.160...
Connected to 10.244.210.160.
Escape character is '^]'.
^CConnection closed by foreign host.
|
| telnet 10.244.229.217 15012
Trying 10.244.229.217...
Connected to 10.244.229.217.
Escape character is '^]'.
^CConnection closed by foreign host.
|
The Pods behind the Service can be reached directly, but the Service's ClusterIP cannot.
| telnet 10.99.225.56 15012
Trying 10.99.225.56...
|
2. Problem Analysis
2.1 Check the kube-apiserver logs
| kubectl -n kube-system logs kube-apiserver-ai-kas-master-01 --tail 100 -f
|
| E0214 07:03:19.604150 1 authentication.go:63] "Unable to authenticate the request" err="[invalid bearer token, service account token has expired]"
|
2.2 Check the kube-proxy logs on the node
| kubectl -n kube-system logs kube-proxy-6c9gr -f
|
| E0328 05:01:29.303620 1 reflector.go:138] k8s.io/client-go/informers/factory.go:134: Failed to watch *v1.Service: failed to list *v1.Service: Unauthorized
W0328 05:01:59.745555 1 reflector.go:324] k8s.io/client-go/informers/factory.go:134: failed to list *v1.EndpointSlice: Unauthorized
E0328 05:01:59.745603 1 reflector.go:138] k8s.io/client-go/informers/factory.go:134: Failed to watch *v1.EndpointSlice: failed to list *v1.EndpointSlice: Unauthorized
W0328 05:02:09.815386 1 reflector.go:324] k8s.io/client-go/informers/factory.go:134: failed to list *v1.Service: Unauthorized
E0328 05:02:09.815433 1 reflector.go:138] k8s.io/client-go/informers/factory.go:134: Failed to watch *v1.Service: failed to list *v1.Service: Unauthorized
W0328 05:02:34.999987 1 reflector.go:324] k8s.io/client-go/informers/factory.go:134: failed to list *v1.EndpointSlice: Unauthorized
E0328 05:02:35.000026 1 reflector.go:138] k8s.io/client-go/informers/factory.go:134: Failed to watch *v1.EndpointSlice: failed to list *v1.EndpointSlice: Unauthorized
|
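It looks like kube-proxy was rejected by the API server and therefore could not sync Service information into the iptables rules, so Pods on this node cannot reach the Service.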
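One way to check this hypothesis on the node is to compare the DNAT targets that kube-proxy last programmed with the current Endpoints. The sketch below assumes kube-proxy runs in iptables mode with its usual KUBE-SERVICES, KUBE-SVC-* and KUBE-SEP-* chains, and that the NAT table has been dumped to a file first. The function name `service_backends` and the dump path are hypothetical.

```shell
# Sketch only: list the backend "ip:port" targets that the saved NAT rules
# hold for one ClusterIP and port.
# Assumption: the dump was created on the node as root, for example with
#   iptables-save -t nat > /tmp/nat.rules
# Usage: service_backends /tmp/nat.rules 10.99.225.56 15012
service_backends() (
  rules=$1 ip=$2 port=$3

  # 1. Find the per-Service chain that KUBE-SERVICES jumps to.
  svc=$(grep -- '^-A KUBE-SERVICES ' "$rules" \
    | grep -F -- "-d $ip/32 " \
    | grep -F -- "--dport $port " \
    | grep -o 'KUBE-SVC-[A-Z0-9]*' \
    | head -n 1)

  # No rule at all for this ClusterIP and port.
  [ -n "$svc" ] || exit 1

  # 2. Follow every per-endpoint chain and print its DNAT destination.
  for sep in $(grep -- "^-A $svc " "$rules" | grep -o 'KUBE-SEP-[A-Z0-9]*'); do
    grep -- "^-A $sep " "$rules" \
      | sed -n 's/.*--to-destination \([0-9.:]*\).*/\1/p'
  done | sort
)
```

If the function fails, no rule exists for the Service. If it prints addresses that differ from the Endpoints shown earlier, the rules are stale. Either result would be consistent with kube-proxy being unable to list Services and EndpointSlices.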
2.3 Check the kube-proxy credentials
| kubectl -n kube-system exec kube-proxy-6c9gr -- cat /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
|
| ---
-----BEGIN CERTIFICATE-----
MIIC/jCCAeagAwIBAgIBADANBgkqhkiG9w0BAQsFADAVMRMwEQYDVQQDEwprdWJl
cm5ldGVzMB4XDTIzMDUxODA5MjUyOVoXDTMzMDUxNTA5MjUyOVowFTETMBEGA1UE
AxMKa3ViZXJuZXRlczCCASIwDQYJKoZIhvcNAQEBBQADggEPADCCAQoCggEBAK7m
ZOYNrulW7CrJrJIG1UAojwfVpbC4nT3sclaCLhn/RsdMWrCcjOzxVUVV7fNyhOU1
dGBuja8OVO8191FDioworcXebjdWtDt+35Tas8/J1z3qH4cuLK9T0SIWMnShAOp1
TvE9/gIbDuPDwlqPsCuPANW9DXDmCxbzGwMqFdLLeClEKASITc4a6cPOuFJP4/lp
tZDfA0VuKnXiUFnt31jmIefFaLtZDbY3v5ry+ubrIKxfSmw3PfN/u0/LR+eg1GEG
YGIGBp8Kix/QQzzxhcfNWLRbYmqBJuR5DsXv/qS2ILNR/Jbbfgm7HiA3JKP+7pDr
56jaVDb4LcTv/9bKQAsCAwEAAaNZMFcwDgYDVR0PAQH/BAQDAgKkMA8GA1UdEwEB
/wQFMAMBAf8wHQYDVR0OBBYEFOAoDQT7ZFaOU6QpRsUE0xGN0XDeMBUGA1UdEQQO
MAyCCmt1YmVybmV0ZXMwDQYJKoZIhvcNAQELBQADggEBAJ5IvgmCUPlwLL94Joll
i9YDla8pWFXemBub/aNsN7ub6bSerYH8vs1vS/ooerVSEojmC75HOPo1zq55s0iK
gpaLQtgmFYtt6GGDzhzjwg5BFEu4f7SO24aY2WCmbwsmYrLSNfeoVOw+02ammAw+
MwwdlaeNcV1UGYQSSYXM4L0F032SIqTVJgrM6uTKWHmdCutRIXAVPLgXGhIl1yaM
HXJVqstshnqR5GC/EVIx9e1onb518ItnpHwSnJaRZerV7itznu2SVYQQMksm1hTn
hElvYLbtWwM99NwWDVMz8F5TiO7y5xTa/3lUXzDvgIiTz8szOFgC5iJFtEnfEMXu
IOI=
-----END CERTIFICATE-----
|
Save the certificate as kube-proxy-temp.crt, then check its expiry date.
| openssl x509 -in kube-proxy-temp.crt -noout -enddate
|
| notAfter=May 15 09:25:29 2033 GMT
|
| kubectl -n kube-system exec kube-proxy-7jf78 -- cat /var/run/secrets/kubernetes.io/serviceaccount/token
eyJhbGciOiJSUzI1NiIsImtpZCI6IjFjSGk1VWwweHE5cUJiTmhWV0dJTEdnejFEc2xIa21JVjIwOXM3MWVFem8ifQ.eyJhdWQiOlsiaHR0cHM6Ly9rdWJlcm5ldGVzLmRlZmF1bHQuc3ZjLmNsdXN0ZXIubG9jYWwiXSwiZXhwIjoxNzI2NjU0MjU0LCJpYXQiOjE2OTUxMTgyNTQsImlzcyI6Imh0dHBzOi8va3ViZXJuZXRlcy5kZWZhdWx0LnN2Yy5jbHVzdGVyLmxvY2FsIiwia3ViZXJuZXRlcy5pbyI6eyJuYW1lc3BhY2UiOiJrdWJlLXN5c3RlbSIsInBvZCI6eyJuYW1lIjoia3ViZS1wcm94eS03amY3OCIsInVpZCI6ImIzOTkwNDVmLWFlZjMtNGM3Yy1iOWMwLTVmZmIyZDFhMTJmNCJ9LCJzZXJ2aWNlYWNjb3VudCI6eyJuYW1lIjoia3ViZS1wcm94eSIsInVpZCI6IjAyNmYzZTIyLTkxOTMtNDdkMS04M2IxLWVjNjVjYmY3YjA2NCJ9LCJ3YXJuYWZ0ZXIiOjE2OTUxMjE4NjF9LCJuYmYiOjE2OTUxMTgyNTQsInN1YiI6InN5c3RlbTpzZXJ2aWNlYWNjb3VudDprdWJlLXN5c3RlbTprdWJlLXByb3h5In0.WykJsfpP1wzjiI6Q6AMDmWLbcrPaSy6NGWhhP90Xfz5Oix3rVEEthAITJyjJHEcPBLNxgNuBc6OD3FYW10nBEeTjnv7dcTJnxxKy3q-u1aywOtOjherJOR3jimRclqFAmGf5TgnZ1qpI_UXRw4--K-WDIltRkz5EYXeNStCNsHMAoJdwY-H_l_ZT3MmEKo7zCmsgAuFarSKpuaffG3RirXNZ3SuzosIhbN6KpBQ_uzI9JZOanf7i5-n8fhGR6SMqxCEYhyFvBx4AwXNPjHfCXs7K3yVk3EzrJMr6aifxh86Xzpqs-mN7E1MJGxXilTa03Xd2YlfhCT45D6yjcTdqHQ
|
Save the token to a file named token, then check its expiry time.
| cat token | cut -d "." -f 2 | base64 -d 2>/dev/null | jq .exp
1726654254
|
| date -d @1726654254
Wed 18 Sep 2024 06:10:54 PM CST
|
The token has already expired. But why did kubelet not refresh it automatically?
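The decoding pipeline above depends on `base64 -d` tolerating the unpadded base64url payload of the JWT, which is why its errors are discarded. A slightly more defensive sketch follows, assuming GNU coreutils and jq as already used in this post. The helper names `jwt_payload` and `jwt_exp` are hypothetical.

```shell
# Sketch only: print the decoded payload (second segment) of a JWT stored in a file.
# Usage: jwt_payload token
jwt_payload() (
  # base64url uses '-' and '_' in place of '+' and '/'.
  seg=$(cut -d '.' -f 2 "$1" | tr -d '\n' | tr '_-' '/+')

  # base64url also drops the trailing '=' padding; restore it.
  case $(( ${#seg} % 4 )) in
    2) seg="${seg}==" ;;
    3) seg="${seg}=" ;;
  esac

  printf '%s' "$seg" | base64 -d
)

# Print only the expiry as a Unix timestamp (requires jq).
# Usage: jwt_exp token
jwt_exp() {
  jwt_payload "$1" | jq -r '.exp'
}
```

With the token file saved earlier, `date -d "@$(jwt_exp token)"` would print the expiry in local time, and `jwt_payload token | jq .` shows the remaining claims such as `iat`.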
3. Resolution
3.1 Restart the affected Pod
| kubectl -n kube-system delete pod kube-proxy-6c9gr
|
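After the affected Pod is deleted, the replacement Pod gets a freshly issued token, which works around the expired-token problem.
3.2 Check the other kube-proxy Pods
kube-proxy directly affects traffic forwarding, so I went through the kube-proxy logs again and found the same error on other nodes.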
| kubectl -n kube-system logs -l k8s-app=kube-proxy -f --max-log-requests 999 --prefix | grep --line-buffered "Unauthorized"
|
| [pod/kube-proxy-7jf78/kube-proxy] E0331 01:21:25.284912 1 reflector.go:138] k8s.io/client-go/informers/factory.go:134: Failed to watch *v1.Service: failed to list *v1.Service: Unauthorized
[pod/kube-proxy-872lj/kube-proxy] W0331 01:20:00.708171 1 reflector.go:324] k8s.io/client-go/informers/factory.go:134: failed to list *v1.Service: Unauthorized
|
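When many kube-proxy Pods are involved, the prefixed log stream is easier to act on once reduced to a list of affected Pod names. This is a minimal sketch that assumes the `[pod/<name>/<container>]` prefix format shown in the output above. The function name `unauthorized_pods` is hypothetical.

```shell
# Sketch only: read prefixed kubectl log lines on stdin and print the unique
# names of Pods that logged "Unauthorized".
# Example (no -f, so the pipeline terminates):
#   kubectl -n kube-system logs -l k8s-app=kube-proxy --prefix --tail 200 \
#     --max-log-requests 999 | unauthorized_pods
unauthorized_pods() {
  sed -n '/Unauthorized/s|^\[pod/\([^/]*\)/.*|\1|p' | sort -u
}
```

Each name in the result is a candidate for the restart described in 3.1.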
3.3 Check token expiry for all Pods
| kubectl get pods -A --field-selector=status.phase=Running -o json | jq -r '.items[] | {pod: .metadata.name, namespace: .metadata.namespace, node: .spec.nodeName} | @base64' | while read -r line; do
  data=$(echo "$line" | base64 --decode)
  pod=$(echo "$data" | jq -r '.pod')
  namespace=$(echo "$data" | jq -r '.namespace')
  node=$(echo "$data" | jq -r '.node')
  # Pods without a mounted token or without cat in the image yield an empty token.
  token=$(kubectl exec -n "$namespace" "$pod" -- cat /var/run/secrets/kubernetes.io/serviceaccount/token 2>/dev/null)
  exp=$(echo "$token" | cut -d "." -f2 | base64 -d 2>/dev/null | jq -r '.exp // empty' 2>/dev/null)
  now=$(date +%s)
  echo "Pod: $pod, Namespace: $namespace, Node: $node, Exp: $exp"
  # Compare only when exp is numeric, so empty or unparsable tokens are not reported as expired.
  if [[ "$exp" =~ ^[0-9]+$ && "$exp" -lt "$now" ]]; then
    echo "Expired Token: Pod=$pod, Node=$node, Namespace=$namespace, Expiry=$(date -d "@$exp")"
  fi
done
|
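The problem mainly shows up in Kubernetes system components such as kube-proxy, kube-controller-manager, and kube-scheduler. Their Pods need to be restarted to fix it.
4. Summary
This post records a case where Pods on specific nodes could not reach a Service. The kube-apiserver and kube-proxy logs showed that an expired token was the cause.
There are two commonly suggested fixes: restarting the Pod or restarting kubelet. In this case, restarting kubelet did not help.
In addition, according to the relevant documentation, if the client-go version is older than v11.0.0 or v0.15.0, the client does not automatically reload the refreshed token, which creates a risk of token expiry. The cluster here runs Kubernetes v1.23.6, which is outside that affected range.
Nodes running containerd did not have this problem. The affected nodes were all Docker-based, so I did not dig further into the root cause and only recorded the fix here.
By extension, if kubelet hangs or misbehaves so that the kube-proxy token cannot be refreshed, traffic forwarding will break in the same way.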
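Since a stalled refresh only becomes visible once requests start failing, one option is to flag tokens that are close to expiry during a periodic scan like the one in 3.3. The following is a minimal sketch of the classification step only. The function name `token_status` and the warning window are hypothetical choices, not something the cluster provides.

```shell
# Sketch only: classify a token by its expiry.
# $1 = exp claim (Unix seconds), $2 = current time (Unix seconds),
# $3 = warning window in seconds.
# Usage: token_status "$exp" "$(date +%s)" 86400
token_status() {
  if [ "$1" -le "$2" ]; then
    echo expired
  elif [ $(( $1 - $2 )) -le "$3" ]; then
    echo expiring
  else
    echo ok
  fi
}
```

An `expiring` result for a long-running Pod would suggest that the token is not being rotated on that node and that the Pod should be checked before it starts receiving Unauthorized responses.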