
Scale Kubernetes to over 4k nodes and 200k pods

Author | Abdul Qadeer

Translated by | Hirakawa

Planning | Tina

At PayPal, we recently started testing the waters with Kubernetes. Most of our workloads run on Apache Mesos, and as part of the migration we needed to understand the performance characteristics of a Kubernetes cluster together with PayPal-specific control-plane components. Above all, we wanted to understand the scalability of the platform and find out where it could be improved by tuning the cluster.

This article was originally published on the PayPal Technology Blog.

Unlike Apache Mesos, which scales to 10,000 nodes without any modifications, scaling Kubernetes is challenging. The scalability of Kubernetes is measured not only in the number of nodes and pods, but also in many other dimensions, such as the number of resources created, the number of containers per pod, the total number of services, and pod-deployment throughput. This article describes some of the challenges we faced during scaling and how we addressed them.

Cluster topology

Our production environment has clusters of various sizes, containing thousands of nodes. Our test setup consists of three master nodes and an external three-node etcd cluster, all running on Google Cloud Platform (GCP). A load balancer sits in front of the control plane, and all worker nodes belong to the same zone as the control plane.

Workload

For performance testing, we used the open-source workload generator k-bench and modified it for our scenarios. The resource objects we used were simple pods and deployments, which we created in successive batches with varying batch sizes and inter-batch intervals, as sketched below.
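To illustrate the batching pattern only (this is not k-bench's own configuration; the image name, batch size, interval, and deployment count are placeholders), a minimal shell sketch might look like this:

# Create 100 deployments in batches of 10, pausing 30 seconds between batches.
# Image name, batch size, and interval are illustrative only.
BATCH=10
INTERVAL=30
for i in $(seq 1 100); do
  kubectl create deployment kbench-dep-$i \
    --image=registry.example.com/stateless-app:latest --replicas=250
  if [ $((i % BATCH)) -eq 0 ]; then
    sleep "$INTERVAL"
  fi
done

k-bench drives a similar pattern through its own configuration files, varying the batch size and interval per run.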

Scaling up

In the beginning, the numbers of pods and nodes were relatively small. Through stress testing, we identified areas for improvement and kept scaling up the cluster as we observed performance gains. Each worker node had four CPU cores and could accommodate up to 40 pods. We scaled up to roughly 4,100 nodes. The application used for benchmarking was a stateless service running on 100 millicores with Guaranteed quality of service (QoS).
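For reference, a pod gets the Guaranteed QoS class when every container's resource requests equal its limits. A minimal sketch of such a spec, assuming 100 millicores of CPU as in our benchmark workload (the pod name, image, and memory value are placeholders), looks like this:

apiVersion: v1
kind: Pod
metadata:
  name: bench-app                                       # illustrative name
spec:
  containers:
  - name: app
    image: registry.example.com/stateless-app:latest    # placeholder image
    resources:
      requests:
        cpu: 100m          # 100 millicores
        memory: 128Mi      # assumed value; requests must equal limits for Guaranteed QoS
      limits:
        cpu: 100m
        memory: 128Mi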

We started with 2,000 pods on 1,000 nodes, then 16,000 pods, then 32,000 pods. After that, we jumped to 4,100 nodes with 150,000 pods, and then 200,000 pods. We had to increase the number of cores per node to accommodate more pods.

API server

The API server proved to be a bottleneck, with several connections to it returning 504 gateway timeouts, in addition to local client-side throttling (exponential backoff). These problems grew exponentially as we scaled up:

I0504 17:54:55.731559 1 request.go:655] Throttling request took 1.005397106s, request: POST:https://:443/api/v1/namespaces/kbench-deployment-namespace-14/Pods..

I0504 17:55:05.741655 1 request.go:655] Throttling request took 7.38390786s, request: POST:https://:443/api/v1/namespaces/kbench-deployment-namespace-13/Pods..

I0504 17:55:15.749891 1 request.go:655] Throttling request took 13.522138087s, request: POST:https://:443/api/v1/namespaces/kbench-deployment-namespace-13/Pods..

I0504 17:55:25.759662 1 request.go:655] Throttling request took 19.202229311s, request: POST:https://:443/api/v1/namespaces/kbench-deployment-namespace-20/Pods..

I0504 17:55:35.760088 1 request.go:655] Throttling request took 25.409325008s, request: POST:https://:443/api/v1/namespaces/kbench-deployment-namespace-13/Pods..

I0504 17:55:45.769922 1 request.go:655] Throttling request took 31.613720059s, request: POST:https://:443/api/v1/namespaces/kbench-deployment-namespace-6/Pods..

The size of the rate-limiting queues on the API server is controlled by max-mutating-requests-inflight and max-requests-inflight. The API Priority and Fairness feature, introduced as beta in version 1.20, splits the total queue capacity governed by these two flags across different queue classes; for example, leader-election requests get higher priority than pod requests. Within each priority level, there are configurable queues for fairness. In the future, this can be tuned further through the PriorityLevelConfiguration and FlowSchema API objects.
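As a rough sketch of raising those queue sizes (the values below are illustrative, not the ones we settled on; these flags are added to the existing kube-apiserver invocation), this is what the change might look like:

# Illustrative values only; the v1.20 defaults are 400 and 200, respectively.
kube-apiserver \
  --max-requests-inflight=3000 \
  --max-mutating-requests-inflight=1000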

Controller Manager

The Controller Manager is responsible for the controllers of native resources such as replica sets and namespaces, along with a large number of deployments (which are managed by replica sets). The rate at which the controller manager can synchronize its state with the API server is limited. Several knobs tune this behavior (a sample invocation follows the list):

kube-api-qps - the number of queries the controller manager can make to the API server per second.

kube-api-burst - the controller manager's burst allowance, i.e., the number of additional concurrent calls allowed on top of kube-api-qps.

concurrent-deployment-syncs - the concurrency of synchronization calls for objects such as deployments and replica sets.
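A minimal sketch of how these knobs might be raised on the controller manager (the values are illustrative; the v1.20 defaults are 20, 30, and 5, respectively):

# Illustrative values only, added to the existing kube-controller-manager invocation.
kube-controller-manager \
  --kube-api-qps=400 \
  --kube-api-burst=800 \
  --concurrent-deployment-syncs=50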

Scheduler

When tested separately as a standalone component, the scheduler can sustain a high throughput of 1,000 pods per second. However, when we deployed the scheduler into a live cluster, we noticed that the actual throughput was lower. A slow etcd instance increased the scheduler's binding latency, which let the pending queue grow to thousands of pods. The goal was to keep that number below 100 during test runs, because a larger backlog affects pod-startup latency. We also ended up tuning the leader-election parameters to ride out short network partitions or spurious restarts caused by network congestion.
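For the leader-election tuning, a sketch of the relevant kube-scheduler flags follows; the durations are illustrative (not our exact values) and must satisfy lease-duration > renew-deadline > retry-period:

# Illustrative durations to tolerate brief network partitions; defaults are 15s, 10s, and 2s.
kube-scheduler \
  --leader-elect=true \
  --leader-elect-lease-duration=30s \
  --leader-elect-renew-deadline=20s \
  --leader-elect-retry-period=5s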

etcd

etcd is the most critical part of a Kubernetes cluster. This is evident from the sheer number of problems that surface throughout the cluster, in different guises, when etcd struggles. After very careful investigation, we found the root causes and scaled etcd to match the cluster size we were targeting.

During scale-up, many Raft proposals started to fail:

[Figure: etcd dashboard showing failed Raft proposals during scale-up]

Through investigation, we found that GCP caps the throughput of a PD-SSD disk at about 100 MB per second (as shown in the figure below) for our 100G disk size. GCP does not offer a way to raise the throughput limit directly; it only grows with the size of the disk. Although the etcd node needs less than 10G of space, we first tried a 1TB PD-SSD. However, even the larger disk became a bottleneck when all 4,100 nodes joined the Kubernetes control plane at the same time. We decided to use local SSDs, which offer very high throughput at the cost of a slightly higher chance of data loss on failure, since they are not persistent.

[Figure: GCP PD-SSD throughput capped at roughly 100 MB/s for our disk size]

After migrating to local SSDs, we did not see the expected performance from the fastest SSD available. We ran some benchmarks directly against the disk with FIO, and the numbers were as expected. However, an etcd benchmark of concurrent writes from all members told a different story:

LOCAL SSD

Summary:

Total: 8.1841 secs.

Slowest: 0.5171 secs.

Fastest: 0.0332 secs.

Average: 0.0815 secs.

Stddev: 0.0259 secs.

Requests/sec: 12218.8374

PD SSD

Summary:

Total: 4.6773 secs.

Slowest: 0.3412 secs.

Fastest: 0.0249 secs.

Average: 0.0464 secs.

Stddev: 0.0187 secs.

Requests/sec: 21379.7235

The local SSDs performed worse! After further investigation, this turned out to be caused by the write-barrier cache commits of the ext4 file system. Because etcd uses a write-ahead log and calls fsync every time it commits to the Raft log, the write barrier can be disabled. In addition, we have DB backup jobs at the file-system level and the application level for DR. With this change, the numbers for local SSD improved to a level comparable to PD-SSD:

LOCAL SSD

Summary:

Total: 4.1823 secs.

Slowest: 0.2182 secs.

Fastest: 0.0266 secs.

Average: 0.0416 secs.

Stddev: 0.0153 secs.

Requests/sec: 23910.3658
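Disabling the barrier is a file-system mount option. A minimal sketch, assuming the etcd data volume is an ext4 partition on /dev/nvme0n1 mounted at /var/lib/etcd (the device and mount point are assumptions, and newer kernels may reject the nobarrier option):

# Remount the existing etcd data volume without write barriers.
mount -o remount,nobarrier /var/lib/etcd

# Or make it persistent via /etc/fstab:
# /dev/nvme0n1  /var/lib/etcd  ext4  defaults,nobarrier  0  2

This trades a small amount of crash safety for throughput, which is why the file-system-level and application-level backups mentioned above matter.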

This improvement is reflected in etcd's WAL fsync duration and backend commit latency: as shown in the figure below, both dropped by more than 90% around the 15:55 mark.

The default MVCC database size in etcd is 2GB. We raised this to a maximum of 8GB after DB out-of-space alarms were triggered. With that database at about 60% utilization, we were able to scale to 200,000 stateless pods.
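The database size limit is controlled by etcd's --quota-backend-bytes flag; a sketch of raising it to 8 GiB (added to the existing etcd invocation) looks like this:

# Raise the backend database quota from the ~2GB default to 8 GiB.
etcd --quota-backend-bytes=8589934592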

After these optimizations, the cluster was more stable at the target scale; however, our SLIs for API latency were still well off the mark.

The etcd server also restarted occasionally, and a single restart can skew the benchmark results, especially the P99 values. A closer look revealed a liveness-probe bug in the v1.20 etcd YAML. As a workaround, we increased the failure-threshold count.
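The workaround amounts to editing the livenessProbe stanza inside the etcd container spec of the static-pod manifest. A rough sketch follows, where the probe endpoint and the numbers are assumptions rather than our exact values:

livenessProbe:
  httpGet:
    host: 127.0.0.1
    path: /health
    port: 2381
  initialDelaySeconds: 10
  timeoutSeconds: 15
  failureThreshold: 24   # raised so transient probe failures do not restart etcd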

After exhausting the options for scaling etcd vertically, mainly in terms of resources (CPU, memory, disk), we found that etcd's performance was hurt by range queries. etcd does not cope well with a large number of range queries, and writes to the Raft log suffer as well, increasing cluster latency. Here are the counts of range queries per Kubernetes resource that affected performance in one test run:

etcd$ sudo grep -ir "events" 0.log.20210525-035918 | wc -l
130830
etcd$ sudo grep -ir "Pods" 0.log.20210525-035918 | wc -l
107737
etcd$ sudo grep -ir "configmap" 0.log.20210525-035918 | wc -l
86274
etcd$ sudo grep -ir "deployments" 0.log.20210525-035918 | wc -l
6755
etcd$ sudo grep -ir "leases" 0.log.20210525-035918 | wc -l
4853
etcd$ sudo grep -ir "nodes" 0.log.20210525-035918 | wc -l

Because these queries are expensive, etcd's backend latency was heavily affected. After sharding the etcd cluster on the events resource, we saw improved cluster stability under high pod churn. In the future, the etcd cluster could be sharded further on the pod resource. It is straightforward to configure the API server to contact the relevant etcd cluster for a sharded resource.
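For example, kube-apiserver's --etcd-servers-overrides flag can point a single resource at its own etcd cluster. A sketch for events, where the hostnames are placeholders:

# Keep most resources in the main etcd cluster, but store events in a dedicated one.
kube-apiserver \
  --etcd-servers=https://etcd-main-0:2379,https://etcd-main-1:2379,https://etcd-main-2:2379 \
  --etcd-servers-overrides='/events#https://etcd-events-0:2379;https://etcd-events-1:2379;https://etcd-events-2:2379'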

Results

After optimizing and tuning these Kubernetes components, we observed a significant improvement in latency. The following graph shows the performance gains achieved over time as we worked toward the SLO. The workload was 150,000 pods, with 250 replicas per deployment and 10 concurrent workers. We met our Kubernetes SLO as long as the P99 pod-startup latency stayed within 5 seconds.

[Figure: P99 pod-startup latency improving over time toward the SLO]

The following diagram shows that API call latency remains within the SLO when the cluster holds 200,000 pods.

We also achieved a P99 pod-startup latency of around 5 seconds with 200,000 pods, at a pod-deployment rate far higher than the 3,000 pods/minute that Kubernetes claims to have tested for 5,000-node clusters.


Summary

Kubernetes is a complex system, and scaling each of its components requires a deep understanding of the control plane. We learned a lot through this exercise and will continue to optimize our clusters.

View the original English text:

https://medium.com/paypal-tech/scaling-kubernetes-to-over-4k-nodes-and-200k-pods-29988fad6ed?
