
Hands-on sharing: a deep dive into troubleshooting a K8s control-plane (management node) failure

Author: LinkSLA intelligent operation and maintenance manager

Cluster and environment information:

  • k8s v1.18.4
  • 3 master nodes, each with 8 cores, 16 GB RAM, and a 50 GiB SSD
  • 19 minion (worker) nodes with varying configurations
  • Control-plane components (kube-apiserver, etcd, kube-controller-manager, kube-scheduler) deployed as static pods
  • A VIP load-balances traffic across the 3 kube-apiservers
  • Tencent Cloud SSD throughput is roughly 130 MB/s

Fault description

On the afternoon of September 10, 2021, "weird" things started to happen: kubectl would occasionally hang when performing CRUD and watch operations on standard resources (Pods, Nodes, etc.). At that point we realized that some of the kube-apiservers were not working properly, so we tried to log in to the hosts running the 3 kube-apiservers and found the following abnormalities:

Information on the spot

kube-apiserver pod status on the control plane:

$ kubectl get pods -n kube-system kube-apiserver-x.x.x.x -o yaml
...
  containerStatuses:
  - containerID: docker://xxxxx
    ....
    lastState:
      terminated:
        containerID: docker://yyyy
        exitCode: 137
        finishedAt: "2021-09-10T09:29:02Z"
        reason: OOMKilled
        startedAt: "2020-12-09T07:02:23Z"
    name: kube-apiserver
    ready: true
    restartCount: 1
    started: true
    state:
      running:
        startedAt: "2021-09-10T09:29:08Z"
...
This shows that on September 10 the kube-apiserver container was killed by the OOM killer (exit code 137, reason OOMKilled).

Peripheral monitoring

The IaaS layer provides black-box monitoring for the hosts where the control plane runs:

[Figure: IaaS-layer monitoring charts for the control-plane hosts]

Key observations:

  • Memory usage, CPU usage, and disk reads rose in lockstep (positively correlated), then dropped sharply on September 10, after which all metrics returned to normal

Prometheus monitoring for kube-apiserver:

[Figure: Prometheus metrics for kube-apiserver]

Key observations:

  • kube-apiserver was having I/O problems; for a stretch in the middle of the incident Prometheus could not scrape kube-apiserver's metrics at all
  • kube-apiserver's memory usage grew monotonically, and the ADD rate of its internal workqueues was high

Real-time debug information

[Figures: live debugging output captured on the master nodes during the incident]

Key observations (gathered with the kind of on-host commands sketched below):

  • Memory usage on two of the master nodes was at roughly 80%~90%
  • Most of that memory was being eaten by the kube-apiserver process
  • One master was in especially bad shape: besides memory, its CPU was also saturated, with a large share of CPU time spent in kernel mode and very high iowait (wa)
  • Almost every process on that machine was frantically issuing disk reads, and the shell we logged into was barely usable
  • On the only master with comparatively low memory usage (8 GiB at the time), kube-apiserver had already been OOM-killed once
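
For reference, these on-host observations come from standard Linux tools; a sketch of the sort of commands involved (exact flags may vary by distribution):

$ free -h                             # overall memory usage
$ top -o %MEM                         # which processes are consuming memory/CPU
$ vmstat 1                            # run queue, swapping, and iowait over time
$ iostat -x 1                         # per-device read/write throughput and utilization
$ dmesg -T | grep -i 'out of memory'  # confirm OOM-kill events in the kernel log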

Open questions and conjectures

Why does kube-apiserver consume a lot of memory?

  1. Some client is doing full List operations on core resources
  2. etcd cannot serve requests normally, so kube-apiserver cannot provide leader election for the other control-plane components; kube-controller-manager and kube-scheduler keep restarting, and their constant ListAndWatch calls drag the whole machine down
  3. There is a bug in the kube-apiserver code that leaks memory

Why is the etcd cluster not working?

  1. Network jitter inside the etcd cluster
  2. Disk performance has degraded to the point where etcd can no longer do its normal work
  3. The host running etcd is starved of compute resources (CPU, RAM), so etcd gets only a small share of CPU time slices and the network-deadline timers set in its code keep expiring

Why are kube-controller-manager and kube-scheduler, which are stateless services, reading so much from disk?

  1. kube-controller-manager and kube-scheduler are reading their local configuration files
  2. When operating system memory is extremely tight, memory pages of processes with large code segments are evicted so that the process currently scheduled can run; when an evicted process is scheduled again, its pages have to be read back in from disk, which drives up I/O (one way to check this conjecture is sketched below)
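
A hedged way to test conjecture 2: if code pages are being evicted and re-read, the affected processes should show a high rate of major page faults. Standard sysstat tools expose this; <pid> below is a placeholder for the process being examined:

$ sar -B 1                 # system-wide paging activity; watch the majflt/s column
$ pidstat -r -p <pid> 1    # per-process memory stats, including majflt/s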

Some logs

kube-apiserver related:

I0907 07:04:17.611412       1 trace.go:116] Trace[1140445702]: "Get" url:/apis/storage.k8s.io/v1/volumeattachments/csi-b8d912ae5f2cd6d9cfaecc515568c455afdc729fd4c721e4a6ed24ae09d9bcb6,user-agent:kube-controller-manager/v1.18.4 (linux/amd64) kubernetes/f8797eb/system:serviceaccount:kube-system:attachdetach-controller,client:10.0.0.42 (started: 2021-09-07 07:04:16.635225967 +0800 CST m=+23472113.230522917) (total time: 976.1773ms):

Trace[1140445702]: [976.164659ms] [976.159045ms] About to write a response

I0907 07:04:17.611478       1 trace.go:116] Trace[630463685]: "Get" url:/apis/storage.k8s.io/v1/volumeattachments/csi-30c942388d7a835b685e5d5a44a15943bfc87b228e3b9ed2fa01ed071fbc1ada,user-agent:kube-controller-manager/v1.18.4 (linux/amd64) kubernetes/f8797eb/system:serviceaccount:kube-system:attachdetach-controller,client:10.0.0.42 (started: 2021-09-07 07:04:16.627643544 +0800 CST m=+23472113.222940496) (total time: 983.823847ms):

Trace[630463685]: [983.812225ms] [983.803546ms] About to write a response

I0907 07:04:17.611617       1 trace.go:116] Trace[1538026391]: "Get" url:/apis/storage.k8s.io/v1/volumeattachments/csi-3360a8de64a82e17aa96f373f6dc185489bbc653824b3099c81a6fce754a3f6f,user-agent:kube-controller-manager/v1.18.4 (linux/amd64) kubernetes/f8797eb/system:serviceaccount:kube-system:attachdetach-controller,client:10.0.0.42 (started: 2021-09-07 07:04:16.565151702 +0800 CST m=+23472113.160448667) (total time: 1.046416117s):

Trace[1538026391]: [1.046397533s] [1.046390931s] About to write a response




# ....N similar log lines omitted....

I0907 07:04:24.171900       1 trace.go:116] Trace[1679666311]: "Get" url:/apis/storage.k8s.io/v1/volumeattachments/csi-b5ee4d68917fbf1262cc0a8047d9c37133a4a766f53f3566c40f868edd8a0bf0,user-agent:kube-controller-manager/v1.18.4 (linux/amd64) kubernetes/f8797eb/system:serviceaccount:kube-system:attachdetach-controller,client:10.0.0.42 (started: 2021-09-07 07:04:17.116692376 +0800 CST m=+23472113.711989328) (total time: 5.467094616s):

Trace[1679666311]: [4.69843064s] [4.698422403s] About to write a response

Trace[1679666311]: [5.467093266s] [768.662626ms] Transformed response object

I0907 07:04:24.142154       1 trace.go:116] Trace[2029984150]: "Get" url:/apis/storage.k8s.io/v1/volumeattachments/csi-29407a58f021e9911d3ed9b5745dddfd40e61d64e1ecc62cf71ab9e037f83107,user-agent:kube-controller-manager/v1.18.4 (linux/amd64) kubernetes/f8797eb/system:serviceaccount:kube-system:attachdetach-controller,client:10.0.0.42 (started: 2021-09-07 07:04:17.102181932 +0800 CST m=+23472113.697478883) (total time: 5.50528966s):

Trace[2029984150]: [4.704869664s] [4.704865103s] About to write a response

Trace[2029984150]: [5.505288975s] [800.419311ms] Transformed response object

E0907 07:04:37.327057       1 authentication.go:53] Unable to authenticate the request due to an error: [invalid bearer token, context canceled]

I0907 07:04:40.391602       1 trace.go:116] Trace[2032502471]: "Update" url:/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/10.0.0.8,user-agent:kubelet/v1.18.4 (linux/amd64) kubernetes/f8797eb,client:10.0.0.8 (started: 2021-09-07 07:04:17.392952999 +0800 CST m=+23472113.988249966) (total time: 20.164329656s):

Trace[2032502471]: [286.281637ms] [286.281637ms] About to convert to expected version

Trace[2032502471]: [481.650367ms] [195.36873ms] Conversion done

Trace[2032502471]: [674.825685ms] [193.175318ms] About to store object in database

Trace[2032502471]: [20.164329656s] [19.489503971s] END

I0907 07:04:41.502454       1 trace.go:116] Trace[951110924]: "Get" url:/apis/storage.k8s.io/v1/volumeattachments/csi-0938f8852ff15a3937d619216756f9bfa5b0d91c0eb95a881b3caccbf8e87fb2,user-agent:kube-controller-manager/v1.18.4 (linux/amd64) kubernetes/f8797eb/system:serviceaccount:kube-system:attachdetach-controller,client:10.0.0.42 (started: 2021-09-07 07:04:17.29109778 +0800 CST m=+23472113.886395069) (total time: 20.41128212s):

Trace[951110924]: [12.839487235s] [12.839478549s] About to write a response

Trace[951110924]: [20.411280569s] [7.571793334s] Transformed response object




W0907 07:10:39.496915       1 clientconn.go:1208] grpc: addrConn.createTransport failed to connect to {https://etcd0:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing context deadline exceeded". Reconnecting...

I0907 07:10:39.514215       1 clientconn.go:882] blockingPicker: the picked transport is not ready, loop back to repick

E0907 07:10:39.514473       1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"context canceled"}

E0907 07:10:39.514538       1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"context canceled"}

I0907 07:10:39.514662       1 clientconn.go:882] blockingPicker: the picked transport is not ready, loop back to repick

E0907 07:10:39.514679       1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"context canceled"}

E0907 07:10:39.514858       1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"context canceled"}

W0907 07:10:39.514925       1 clientconn.go:1208] grpc: addrConn.createTransport failed to connect to {https://etcd0:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing context deadline exceeded". Reconnecting...           
You can see that operations against etcd were getting slower and slower, and in the end the connection to etcd was lost entirely. At this point it is worth checking etcd's health directly; a sketch follows.
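
A hedged example of checking the etcd endpoints with etcdctl (v3 API); the endpoint list and certificate paths below are placeholders that depend on how the cluster was deployed:

$ ETCDCTL_API=3 etcdctl \
    --endpoints=https://etcd0:2379,https://etcd1:2379,https://etcd2:2379 \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key \
    endpoint health
$ ETCDCTL_API=3 etcdctl --endpoints=https://etcd0:2379 endpoint status -w table   # leader, DB size, raft term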

etcd logs (the logs from September 7 had already been lost):

{"level":"warn","ts":"2021-09-10T17:14:50.559+0800","caller":"embed/config_logging.go:279","msg":"rejected connection","remote-addr":"10.0.0.42:49824","server-name":"","error":"read tcp 10.0.0.8:2380->10.0.0.42:49824: i/o timeout"}

{"level":"warn","ts":"2021-09-10T17:14:58.993+0800","caller":"embed/config_logging.go:279","msg":"rejected connection","remote-addr":"10.0.0.6:54656","server-name":"","error":"EOF"}

{"level":"warn","ts":"2021-09-10T17:15:03.961+0800","caller":"embed/config_logging.go:279","msg":"rejected connection","remote-addr":"10.0.0.6:54642","server-name":"","error":"EOF"}

{"level":"warn","ts":"2021-09-10T17:15:32.253+0800","caller":"embed/config_logging.go:279","msg":"rejected connection","remote-addr":"10.0.0.6:54658","server-name":"","error":"EOF"}

{"level":"warn","ts":"2021-09-10T17:15:49.299+0800","caller":"embed/config_logging.go:279","msg":"rejected connection","remote-addr":"10.0.0.42:49690","server-name":"","error":"EOF"}

{"level":"warn","ts":"2021-09-10T17:15:55.930+0800","caller":"embed/config_logging.go:279","msg":"rejected connection","remote-addr":"10.0.0.42:49736","server-name":"","error":"EOF"}

{"level":"warn","ts":"2021-09-10T17:16:29.446+0800","caller":"embed/config_logging.go:279","msg":"rejected connection","remote-addr":"10.0.0.6:54664","server-name":"","error":"EOF"}

{"level":"warn","ts":"2021-09-10T17:16:24.026+0800","caller":"embed/config_logging.go:279","msg":"rejected connection","remote-addr":"10.0.0.42:49696","server-name":"","error":"EOF"}

{"level":"warn","ts":"2021-09-10T17:16:36.625+0800","caller":"embed/config_logging.go:279","msg":"rejected connection","remote-addr":"10.0.0.42:49714","server-name":"","error":"EOF"}

{"level":"warn","ts":"2021-09-10T17:17:54.241+0800","caller":"embed/config_logging.go:279","msg":"rejected connection","remote-addr":"10.0.0.42:49878","server-name":"","error":"read tcp 10.0.0.8:2380->10.0.0.42:49878: i/o timeout"}

{"level":"warn","ts":"2021-09-10T17:18:31.712+0800","caller":"embed/config_logging.go:279","msg":"rejected connection","remote-addr":"10.0.0.42:49728","server-name":"","error":"EOF"}

{"level":"warn","ts":"2021-09-10T17:18:31.785+0800","caller":"embed/config_logging.go:279","msg":"rejected connection","remote-addr":"10.0.0.42:49800","server-name":"","error":"EOF"}           
It can be seen that communication between this node's etcd and its peers in the cluster was also abnormal, so etcd could not serve clients normally. Given the disk-performance conjecture above, etcd's own disk-latency metrics are also worth checking; see the sketch below.
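
A hedged way to check whether slow disks were hurting etcd is to look at its latency histograms on the metrics endpoint (the certificate paths are placeholders; the metric names are standard etcd metrics):

$ curl -s --cacert /etc/kubernetes/pki/etcd/ca.crt \
       --cert /etc/kubernetes/pki/etcd/server.crt \
       --key /etc/kubernetes/pki/etcd/server.key \
       https://127.0.0.1:2379/metrics \
  | grep -E 'etcd_disk_wal_fsync_duration_seconds|etcd_disk_backend_commit_duration_seconds'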

Dig deeper

Checking the kube-apiserver heap profile

[Figure: kube-apiserver heap profile]

We grabbed a recent heap profile of kube-apiserver and found that a large amount of memory was being held by registry.(*Store).DeleteCollection. DeleteCollection first performs a List and then deletes each item concurrently, so it is reasonable for it to eat a lot of memory for a short burst. Maybe we were just unlucky and happened to catch it in the middle of a storm of Delete calls, so we took a ten-minute break and captured the profile again: DeleteCollection was still holding the memory.

What is going on? How can DeleteCollection keep holding this much memory? At this point we started to suspect a goroutine leak in kube-apiserver.

The heap profile alone did not give enough information to pin the problem down, so we went on to grab a goroutine profile of kube-apiserver (both profiles can be pulled as sketched below).
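
For reference, kube-apiserver exposes the Go pprof endpoints as long as profiling is enabled (--profiling defaults to true), so the profiles can be pulled through the API itself, given sufficient RBAC. A hedged sketch:

# heap profile, analyzed with the standard Go tooling
$ kubectl get --raw /debug/pprof/heap > apiserver-heap.pprof
$ go tool pprof -top apiserver-heap.pprof

# full goroutine dump with stacks
$ kubectl get --raw '/debug/pprof/goroutine?debug=2' > apiserver-goroutines.txt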

Checking the kube-apiserver goroutine profile

Around 18,000 goroutines in total:

goroutine 18970952966 [chan send, 429 minutes]:

k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/registry/generic/registry.(*Store).DeleteCollection.func1(0xc00f082ba0, 0xc04e059b90, 0x9, 0x9, 0xc235e37570)

--

goroutine 18971918521 [chan send, 394 minutes]:

k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/registry/generic/registry.(*Store).DeleteCollection.func1(0xc01f5a7c20, 0xc025cf8750, 0x9, 0x9, 0xc128042620)

--

goroutine 18975541850 [chan send, 261 minutes]:

k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/registry/generic/registry.(*Store).DeleteCollection.func1(0xc08bb1c780, 0xc095222240, 0x9, 0x9, 0xc128667340)

--

goroutine 18972733441 [chan send, 363 minutes]:

k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/registry/generic/registry.(*Store).DeleteCollection.func1(0xc015c2c9c0, 0xc0db776240, 0x9, 0x9, 0xc124b0f570)

--

goroutine 18973170460 [chan send, 347 minutes]:

k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/registry/generic/registry.(*Store).DeleteCollection.func1(0xc07a804c60, 0xc04ee83200, 0x9, 0x9, 0xc23e555c00)

--

goroutine 18974040729 [chan send, 316 minutes]:

k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/registry/generic/registry.(*Store).DeleteCollection.func1(0xc028d40780, 0xc243ea0cf0, 0x9, 0x9, 0xc14a6be770)

--

goroutine 18974695712 [chan send, 292 minutes]:

k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/registry/generic/registry.(*Store).DeleteCollection.func1(0xc04573b5c0, 0xc04f0af290, 0x9, 0x9, 0xc1982732d0)

--

goroutine 18974885774 [chan send, 285 minutes]:

k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/registry/generic/registry.(*Store).DeleteCollection.func1(0xc26065b9e0, 0xc06f025710, 0x9, 0x9, 0xc229945f80)

--

goroutine 18971511296 [chan send, 408 minutes]:

k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/registry/generic/registry.(*Store).DeleteCollection.func1(0xc021e74480, 0xc0313a54d0, 0x9, 0x9, 0xc1825177a0)

--

goroutine 18975459144 [chan send, 263 minutes]:

k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/registry/generic/registry.(*Store).DeleteCollection.func1(0xc26cb9d1a0, 0xc17672b440, 0x9, 0x9, 0xc156fcd810)

--

goroutine 18973661056 [chan send, 331 minutes]:

k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/registry/generic/registry.(*Store).DeleteCollection.func1(0xc2ba3f1680, 0xc19df40870, 0x9, 0x9, 0xc24b888a10)

--

goroutine 18973633598 [chan send, 331 minutes]:

k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/registry/generic/registry.(*Store).DeleteCollection.func1(0xc24a0ef5c0, 0xc05472a900, 0x9, 0x9, 0xc01ec14af0)

--

goroutine 18974804879 [chan send, 288 minutes]:

k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/registry/generic/registry.(*Store).DeleteCollection.func1(0xc0863f9ce0, 0xc04fcee090, 0x9, 0x9, 0xc2257a1d50)

--

goroutine 18974203255 [chan send, 310 minutes]:

k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/registry/generic/registry.(*Store).DeleteCollection.func1(0xc2727ed8c0, 0xc021d4ae10, 0x9, 0x9, 0xc173af3570)

# ...many more, not listed here...

Sure enough, a huge number of goroutines were all blocked in "chan send", and they had been blocked for a very long time. The DeleteCollection API is normally called by kube-controller-manager's namespace deleter; if a call to kube-apiserver fails, the controller backs off and retries. The anomaly here is that the controller's communication with kube-apiserver was fine, but kube-apiserver could not interact with etcd normally.

View the logs for kube-controller-manager

E1027 15:15:01.016712       1 leaderelection.go:320] error retrieving resource lock kube-system/kube-controller-manager: etcdserver: request timed out

I1027 15:15:01.950682       1 leaderelection.go:277] failed to renew lease kube-system/kube-controller-manager: timed out waiting for the condition

F1027 15:15:01.950760       1 controllermanager.go:279] leaderelection lost

I1027 15:15:01.962210       1 certificate_controller.go:131] Shutting down certificate controller "csrsigning"           

Although there are no log lines for the DeleteCollection calls themselves, kube-controller-manager could not refresh the resource lock it uses for leader election (kube-system/kube-controller-manager), and the error makes it very clear that kube-apiserver's requests to etcd were timing out.

kube-apiserver DeleteCollection implementation

func (e *Store) DeleteCollection(ctx context.Context, deleteValidation rest.ValidateObjectFunc, options *metav1.DeleteOptions, listOptions *metainternalversion.ListOptions) (runtime.Object, error) {
    ...
    // List everything that matches, then delete the items concurrently.
    listObj, err := e.List(ctx, listOptions)
    if err != nil {
        return nil, err
    }
    items, err := meta.ExtractList(listObj)
    ...
    wg := sync.WaitGroup{}
    toProcess := make(chan int, 2*workersNumber)
    errs := make(chan error, workersNumber+1)

    // Distributor goroutine: feeds item indexes to the workers.
    go func() {
        defer utilruntime.HandleCrash(func(panicReason interface{}) {
            errs <- fmt.Errorf("DeleteCollection distributor panicked: %v", panicReason)
        })
        for i := 0; i < len(items); i++ {
            toProcess <- i
        }
        close(toProcess)
    }()

    // Worker pool: each worker deletes items until the channel is drained,
    // or returns early if a Delete fails.
    wg.Add(workersNumber)
    for i := 0; i < workersNumber; i++ {
        go func() {
            // panics don't cross goroutine boundaries
            defer utilruntime.HandleCrash(func(panicReason interface{}) {
                errs <- fmt.Errorf("DeleteCollection goroutine panicked: %v", panicReason)
            })
            defer wg.Done()

            for index := range toProcess {
                ...
                if _, _, err := e.Delete(ctx, accessor.GetName(), deleteValidation, options.DeepCopy()); err != nil && !apierrors.IsNotFound(err) {
                    klog.V(4).InfoS("Delete object in DeleteCollection failed", "object", klog.KObj(accessor), "err", err)
                    errs <- err
                    return
                }
            }
        }()
    }
    wg.Wait()
    select {
    case err := <-errs:
        return nil, err
    default:
        return listObj, nil
    }
}
If e.Delete returns an error (the etcd failure in our scenario), the worker goroutines return early, but the distributor goroutine does not: it stays blocked on the send into the toProcess channel, because no worker is left to drain it. As long as that goroutine cannot terminate, the items obtained from etcd via the List call cannot be garbage-collected, so these lists pile up in memory with every failed DeleteCollection and eventually cause the OOM. A minimal reproduction of the pattern, and one way to avoid it, is sketched below.
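
To make the leak concrete, here is a minimal, self-contained Go sketch of the same pattern (an illustration of the mechanism, not the actual upstream kube-apiserver patch): when every worker bails out early, the producer blocks forever on its channel send and pins whatever data it has captured. One way out, shown under the hypothetical stopLeak flag, is to let the producer also watch a "workers are done" signal.

package main

import (
    "errors"
    "fmt"
    "runtime"
    "sync"
    "time"
)

// deleteCollection mimics the DeleteCollection structure: a distributor feeds
// item indexes into a channel while a small worker pool consumes them. With
// stopLeak=false, the distributor blocks forever once all workers have
// returned early on error (the leak seen in the goroutine dump above).
func deleteCollection(items []int, deleteFn func(int) error, stopLeak bool) error {
    const workers = 2
    toProcess := make(chan int, 2*workers)
    errs := make(chan error, workers+1)
    workersDone := make(chan struct{})

    go func() { // distributor
        defer close(toProcess)
        for i := range items {
            if stopLeak {
                select {
                case toProcess <- i:
                case <-workersDone: // all workers gone: stop feeding, let items be GCed
                    return
                }
            } else {
                toProcess <- i // blocks forever if no worker is left to read
            }
        }
    }()

    var wg sync.WaitGroup
    wg.Add(workers)
    for w := 0; w < workers; w++ {
        go func() {
            defer wg.Done()
            for i := range toProcess {
                if err := deleteFn(i); err != nil {
                    errs <- err
                    return // early return without draining toProcess
                }
            }
        }()
    }
    wg.Wait()
    close(workersDone)

    select {
    case err := <-errs:
        return err
    default:
        return nil
    }
}

func main() {
    failing := func(int) error { return errors.New("etcdserver: request timed out") }
    for i := 0; i < 5; i++ {
        _ = deleteCollection(make([]int, 1000), failing, false) // leaky variant
    }
    time.Sleep(100 * time.Millisecond)
    fmt.Println("goroutines after leaky calls:", runtime.NumGoroutine()) // ~5 blocked distributors remain
    for i := 0; i < 5; i++ {
        _ = deleteCollection(make([]int, 1000), failing, true) // guarded variant
    }
    time.Sleep(100 * time.Millisecond)
    fmt.Println("goroutines after guarded calls:", runtime.NumGoroutine()) // no additional leaks
}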

Summary

1. Before troubleshooting, you need a reasonably clear baseline. What does "normal" look like?

Based on prior hands-on experience: with roughly 100 Nodes, 1400 Pods, 50 ConfigMaps, and 300 Events, kube-apiserver consumes about 2 GB of memory and about 10% of a single core.

2. The troubleshooting process was roughly:

  • Notice the abnormal symptoms and confirm that a problem really exists
  • Roughly identify the failing component and pull its state through the management tooling
  • Locate the time of the anomaly in the monitoring systems and extract the relevant signals: CPU, RAM, disk consumption, and so on
  • Form hypotheses about what triggered the failure
  • Use the logs and profiles of the related components to confirm or reject those hypotheses

3. How to prevent this kind of cascading control-plane failure

  • Explicitly set CPU and memory requests/limits for kube-apiserver (see the sketch below), so that a runaway kube-apiserver cannot starve etcd of resources when they are co-located
  • Deploy the etcd cluster separately from the other control-plane components
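
A hedged example of the first point, assuming a kubeadm-style static-pod manifest (typically /etc/kubernetes/manifests/kube-apiserver.yaml); the numbers are illustrative and should be sized to the actual cluster:

# excerpt from the kube-apiserver static-pod manifest (illustrative values)
spec:
  containers:
  - name: kube-apiserver
    ...
    resources:
      requests:
        cpu: "2"
        memory: 4Gi
      limits:
        cpu: "4"
        memory: 8Gi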

Original link: https://github.com/k8s-club/k8s-club/blob/main/articles/%E6%8A%93%E8%99%AB%E6%97%A5%E8%AE%B0%20-%20kube-apiserver.md