
K8S Troubleshooting: apiserver token expiry after upgrading K8S

Author: 雲原生知識星球

Background

After the K8S cluster had been running stably for a year, a restarted pod got stuck in the ContainerCreating state:

[root@node1 ~]# kubectl get pod -n kube-system -owide
NAME                                      READY   STATUS              RESTARTS   AGE    IP               NODE
calico-kube-controllers-cd96b6c89-bpjp6   1/1     Running             0          40h    10.10.0.1        node3
calico-node-ffsz8                         1/1     Running             0          14s    10.10.0.1        node3
calico-node-nsmwl                         1/1     Running             0          14s    10.10.0.2        node2
calico-node-w4ngt                         1/1     Running             0          14s    10.10.0.1        node1
coredns-55c8f5fd88-hw76t                  1/1     Running             1          260d   192.168.135.55   node3
xxx-55c8f5fd88-vqwbz                      1/1     ContainerCreating   1          319d   192.168.104.22   node2

Analysis

Inspecting the pod with describe

[root@node1 ~]# kubectl describe pod -n xxx xxx
Events:
  Type     Reason                  Age                 From               Message
  ----     ------                  ----                ----               -------
  Normal   Scheduled               52m                 default-scheduler  Successfully assigned xxx/xxx to node1
  Warning  FailedCreatePodSandBox  52m                 kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "xxx" network for pod "xxx": networkPlugin cni failed to set up pod "xxx" network: connection is unauthorized: Unauthorized, failed to clean up sandbox container "xxx" network for pod "xxx": networkPlugin cni failed to teardown pod "xxx" network: error getting ClusterInformation: connection is unauthorized: Unauthorized]
  Normal   SandboxChanged          50m (x10 over 52m)  kubelet            Pod sandbox changed, it will be killed and re-created.           

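The Unauthorized in the events means the CNI plugin was not permitted to fetch the information it needs from kube-apiserver. Decoding the token used by the corresponding pod shows that it does carry an expiration time: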

{
 alg: "RS256",
 kid: "nuXGyK2zjFNBRnO1ayeOxJDm_luMf4eqQFnqJbsVl7I"
}.
{
 aud: [
  "https://kubernetes.default.svc.cluster.local"
 ],
 exp: 1703086264, // expiration time: the token expires one year after it was issued
 iat: 1671550264,
 nbf: 1671550264,
 iss: "https://kubernetes.default.svc.cluster.local",
 kubernetes.io: {
  namespace: "kube-system",
  pod: {
   name: "xxx",
   uid: "c7300d73-c716-4bbc-ad2b-80353d99073b"
  },
  serviceaccount: {
   name: "multus",
   uid: "1600e098-6a86-4296-8410-2051d45651ce"
  },
  warnafter: 1671553871
 },
 sub: "system:serviceaccount:kube-system:xxx"
}.           
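
The decoded claims above can be reproduced with a few lines of Python. This is only a minimal sketch using the standard library: the function names `decode_jwt_payload` and `token_lifetime` are made up for illustration, and the signature is deliberately not verified, since the goal is only to read `iat` and `exp`.

```python
import base64
import json
from datetime import datetime, timezone


def decode_jwt_payload(token: str) -> dict:
    """Decode the payload (second segment) of a JWT without verifying the signature."""
    payload_b64 = token.split(".")[1]
    # JWT segments are unpadded base64url; restore the padding before decoding.
    payload_b64 += "=" * (-len(payload_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(payload_b64))


def token_lifetime(payload: dict) -> dict:
    """Summarise the iat/exp claims: issue time, expiry time, lifetime, expired or not."""
    issued = datetime.fromtimestamp(payload["iat"], tz=timezone.utc)
    expires = datetime.fromtimestamp(payload["exp"], tz=timezone.utc)
    return {
        "issued": issued,
        "expires": expires,
        "lifetime_days": (expires - issued).days,
        "expired": datetime.now(timezone.utc) >= expires,
    }
```

For the token shown above, `exp - iat` is 1703086264 - 1671550264 = 31536000 seconds, so `lifetime_days` comes out as 365, which matches the one-year default described in the KEP.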

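The related issues [1,2,3] largely confirm that this is **caused by the k8s version upgrade**: to provide a more secure token mechanism, the BoundServiceAccountTokenVolume feature moved to beta in v1.21 and is enabled by default from that version on.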

Solution

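  1. If you do not want to use this feature, disable it through the feature gate on kube-apiserver and kube-controller-manager (for example `--feature-gates=BoundServiceAccountTokenVolume=false`), following the method in [4]. This only works on versions where the gate can still be switched off; the feature graduated to GA in v1.22.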
1. How can this feature be enabled / disabled in a live cluster?
 Feature gate name: BoundServiceAccountTokenVolume
 Components depending on the feature gate: kube-apiserver and kube-controller-manager
 Will enabling / disabling the feature require downtime of the control plane? yes, need to restart kube-apiserver and kube-controller-manager.
 Will enabling / disabling the feature require downtime or reprovisioning of a node? no.
2. Does enabling the feature change any default behavior? yes, pods' service account tokens will expire after 1 year by default and are not stored as Secrets any more.           
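  2. If you want to keep the feature, the party consuming the token has to be adapted: either do not cache the token, or automatically reload a fresh token into memory once the cached one becomes invalid. Recent versions of the client-go and fabric8 clients are known to support this.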
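
The second option can be sketched as a small wrapper that re-reads the projected token file instead of holding the first value forever. This is only an illustration and not how client-go or fabric8 implement it: the class name `ReloadingToken`, the 60-second `max_age_seconds` default, and the `invalidate` hook are assumptions. The path is the standard in-pod service account token location.

```python
import time
from pathlib import Path

# Standard location of the projected service account token inside a pod.
TOKEN_PATH = Path("/var/run/secrets/kubernetes.io/serviceaccount/token")


class ReloadingToken:
    """Return the service account token, re-reading the file once it is too old."""

    def __init__(self, path=TOKEN_PATH, max_age_seconds=60.0, clock=time.monotonic):
        self._path = Path(path)
        self._max_age = max_age_seconds  # hypothetical refresh interval
        self._clock = clock
        self._token = None
        self._loaded_at = 0.0

    def get(self) -> str:
        now = self._clock()
        # Reload on first use, after invalidate(), or when the cached copy is stale.
        if self._token is None or now - self._loaded_at >= self._max_age:
            self._token = self._path.read_text().strip()
            self._loaded_at = now
        return self._token

    def invalidate(self) -> None:
        """Drop the cached token, e.g. after the API server answered 401 Unauthorized."""
        self._token = None
```

A caller would use `token.get()` for every request's Authorization header and call `token.invalidate()` when it receives Unauthorized, so the next request picks up the token that the kubelet has rotated on disk.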

References

  1. https://github.com/k8snetworkplumbingwg/multus-cni/issues/852
  2. https://github.com/projectcalico/calico/issues/5712
  3. https://www.cnblogs.com/bystander/p/rancher-jian-kong-bu-xian-shi-jian-kong-shu-ju-wen.html
  4. https://github.com/kubernetes/enhancements/blob/master/keps/sig-auth/1205-bound-service-account-tokens/README.md
