
The art of resource management with Kubernetes


1. Preamble

In the current wave of cloud-native development, Kubernetes has become the undisputed leader: its container orchestration capabilities greatly simplify the work of developers and operations teams.

However, the technology underpinning Kubernetes is more complex than it seems. Cgroups, an important feature of the Linux kernel, is used extensively by Kubernetes for resource management at every level. Through Cgroups, Kubernetes exercises fine-grained control over resources such as CPU and memory, which significantly improves the availability, stability, and performance of the system.

This article will provide an in-depth discussion of how Kubernetes leverages Cgroups to manage resources and provide some recommended configurations and best practices.

2. Introduction to Cgroups

Cgroups, short for Control Groups, let you limit the computing resources available to a specified process or group of processes, including but not limited to CPU time, memory usage, and I/O to block devices.

What are the advantages of Cgroups? Before the advent of Cgroups, any process could create hundreds, if not thousands, of threads, which could easily drain all of a computer's CPU and memory resources.

However, with the introduction of Cgroups, we were able to limit the resource consumption of a single process or a group of processes.

Cgroups throttle different resources through different subsystems, each of which is responsible for restricting one particular resource. The subsystems all work in a similar way: related processes are assigned to a control group and organized in a tree-like hierarchy, with each group carrying its own resource-control parameters.

Subsystem   Function
cpu         Controls the CPU usage of processes in the cgroup
cpuacct     Accounts for the CPU usage of processes in the cgroup
cpuset      Assigns dedicated CPUs and memory nodes to tasks in the cgroup
memory      Controls the memory usage of processes in the cgroup
blkio       Controls the block-device I/O of processes in the cgroup
devices     Controls access by processes in the cgroup to specific devices
...         ...

Table 1 Common cgroups subsystems
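
For a concrete feel of how a subsystem is used, here is a minimal sketch of throttling CPU by hand through the cgroupfs interface. It assumes cgroup v1 mounted under /sys/fs/cgroup (as in the mount listing below); the group name demo is purely illustrative.

$ sudo mkdir /sys/fs/cgroup/cpu/demo                                # create a child cgroup in the cpu subsystem
$ echo 100000 | sudo tee /sys/fs/cgroup/cpu/demo/cpu.cfs_period_us  # 100 ms scheduling period
$ echo 20000  | sudo tee /sys/fs/cgroup/cpu/demo/cpu.cfs_quota_us   # at most 20 ms of CPU per period (~0.2 core)
$ echo $$     | sudo tee /sys/fs/cgroup/cpu/demo/cgroup.procs       # move the current shell into the group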

On the host, we can use the following command to see where the Cgroups subsystems are mounted:

$ mount | grep cgroup


tmpfs on /sys/fs/cgroup type tmpfs (ro,nosuid,nodev,noexec,mode=755)
cgroup on /sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd)
cgroup on /sys/fs/cgroup/pids type cgroup (rw,nosuid,nodev,noexec,relatime,pids)
cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory)
cgroup on /sys/fs/cgroup/net_cls,net_prio type cgroup (rw,nosuid,nodev,noexec,relatime,net_prio,net_cls)
cgroup on /sys/fs/cgroup/blkio type cgroup (rw,nosuid,nodev,noexec,relatime,blkio)
cgroup on /sys/fs/cgroup/perf_event type cgroup (rw,nosuid,nodev,noexec,relatime,perf_event)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (rw,nosuid,nodev,noexec,relatime,cpuacct,cpu)
cgroup on /sys/fs/cgroup/freezer type cgroup (rw,nosuid,nodev,noexec,relatime,freezer)
cgroup on /sys/fs/cgroup/cpuset type cgroup (rw,nosuid,nodev,noexec,relatime,cpuset)
cgroup on /sys/fs/cgroup/hugetlb type cgroup (rw,nosuid,nodev,noexec,relatime,hugetlb)
cgroup on /sys/fs/cgroup/devices type cgroup (rw,nosuid,nodev,noexec,relatime,devices)           

In the Cgroups root directory, we can see a directory for each subsystem:

$ ll /sys/fs/cgroup


total 0
drwxr-xr-x 6 root root  0 Apr 16 05:13 blkio
lrwxrwxrwx 1 root root 11 Apr 16 05:13 cpu -> cpu,cpuacct
lrwxrwxrwx 1 root root 11 Apr 16 05:13 cpuacct -> cpu,cpuacct
drwxr-xr-x 6 root root  0 Apr 16 05:13 cpu,cpuacct
drwxr-xr-x 5 root root  0 Apr 16 05:13 cpuset
drwxr-xr-x 5 root root  0 Apr 16 05:13 devices
drwxr-xr-x 4 root root  0 Apr 16 05:13 freezer
drwxr-xr-x 5 root root  0 Apr 16 05:13 hugetlb
drwxr-xr-x 6 root root  0 Apr 16 05:13 memory
lrwxrwxrwx 1 root root 16 Apr 16 05:13 net_cls -> net_cls,net_prio
drwxr-xr-x 4 root root  0 Apr 16 05:13 net_cls,net_prio
lrwxrwxrwx 1 root root 16 Apr 16 05:13 net_prio -> net_cls,net_prio
drwxr-xr-x 4 root root  0 Apr 16 05:13 perf_event
drwxr-xr-x 6 root root  0 Apr 16 05:13 pids
drwxr-xr-x 6 root root  0 Apr 16 05:13 systemd           

3. Cgroups Drivers and How to Choose One

3.1 Introduction to Cgroups Drivers

There are two main types of Cgroups drivers:

  • systemd
  • cgroupfs

systemd:

  • systemd is a system and service manager that is widely used as an init system for many modern Linux distributions.
  • systemd manages cgroups directly through its own control interface and provides cgroups management and monitoring via tools such as systemctl, systemd-cgls, and systemd-cgtop.

cgroupfs:

  • cgroupfs manages cgroups through file system APIs.
  • cgroupfs mounts cgroups to the /sys/fs/cgroup directory as a file system, allowing users to manage cgroups through file and directory operations, similar to operations on a normal file system.

The main difference between systemd drivers and cgroupfs drivers is their management interfaces. systemd provides a full set of specialized tools and interfaces to manage cgroups, while cgroupfs manages through the file system interface. In practice, although most modern Linux distributions use systemd to manage cgroups, cgroupfs still exists as an option supported by the Linux kernel.
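
To make the difference tangible, the following commands (a sketch, assuming a systemd-based host with cgroup v1 mounted as shown earlier) look at the same cgroup tree through both interfaces:

$ systemd-cgls --no-pager | head -n 20     # hierarchical view of cgroups via systemd's tooling
$ systemd-cgtop -n 1                       # per-cgroup resource usage, a single iteration
$ ls /sys/fs/cgroup/cpu/                   # the raw cgroupfs view of the cpu subsystem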

3.2 Choosing a Cgroups Driver in Kubernetes

By default, Kubelet uses cgroupfs as the Cgroups driver. However, if you deploy a Kubernetes cluster (version 1.22 or later) via kubeadm, systemd will be used as the Cgroups driver by default.

We cannot use both cgroupfs and systemd at the same time; one of them has to be chosen. So how should we choose?

The answer is simple: choose whichever Cgroups driver the operating system itself uses.

If a system ends up with two Cgroups managers, say the kubelet and the container runtime (e.g. containerd) use cgroupfs while the operating system uses systemd, the node gets two inconsistent views of resource usage and may become unstable as resource pressure increases.

Most modern Linux distributions, such as RHEL, CentOS, Fedora, Ubuntu, and Debian, use systemd to manage cgroups. On these distributions, the kubelet and the container runtime should therefore use systemd as the Cgroups driver.
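
As a rough guide, here is how the two components are commonly aligned on the systemd driver. This is a sketch under typical defaults: /etc/containerd/config.toml and /var/lib/kubelet/config.yaml are the usual paths for containerd 1.x and a kubeadm-managed kubelet, but they may differ on your system.

# containerd (1.x): enable SystemdCgroup for the runc runtime, then restart
#   [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
#     SystemdCgroup = true
$ sudo sed -i 's/SystemdCgroup = false/SystemdCgroup = true/' /etc/containerd/config.toml
$ sudo systemctl restart containerd

# kubelet: make sure its configuration file sets the matching driver, then restart
$ grep cgroupDriver /var/lib/kubelet/config.yaml
cgroupDriver: systemd
$ sudo systemctl restart kubelet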

4. How Kubernetes uses Cgroups to manage resources

In Kubernetes, Cgroups are used to manage resources at different levels, from largest to smallest:

  • Node
  • Pod
  • Container

Next, we will describe the management of Cgroups in Kubernetes at the three levels of node, pod, and container. We will also discuss how Kubernetes leverages QoS (Quality of Service) policies to optimize resource allocation and management.

4.1 Node Level

In a node of a Kubernetes cluster, resources can be divided into three parts:

  • Resources used by system components: processes belonging to the operating system itself, such as sshd and systemd
  • Resources used by Kubernetes components: e.g. the kubelet and the container runtime (such as Docker or containerd)
  • Resources used by pods: the resources consumed by the pods running in the cluster

If we do not manage the resource usage of these three parts, node resources may become strained, which can set off a cascading failure across the cluster.

Imagine the following scenario:

  • A greedy pod on node A (call it glutton) eats up all of the node's resources
  • Due to the resulting resource contention, system components and Kubernetes components on node A crash because they cannot get the resources they need
  • Node A's status changes to NotReady
  • Kubernetes reschedules glutton from node A onto another node that is still Ready, say node B
  • glutton then exhausts the resources on node B as well
  • ...

At first both the cluster and its operators could keep running; in the end only the operators are left running (away).

So, for the health of the cluster (and, more importantly, for the health of your job), you may have already thought of the obvious remedy: reserve resources for system components and Kubernetes components in advance, so that they can never be completely consumed by misbehaving or abnormal pods.

Coincidentally, the designers of Kubernetes thought so too.

4.1.1 Node-level resource management

At the Node level, Kubernetes allows us to modify the following configurations to reserve and limit resources:

  • systemReserved
  • kubeReserved
  • evictionHard
  • enforceNodeAllocatable

systemReserved

Resources reserved for system components, such as sshd and systemd.

If this option is configured, you also need to configure systemReservedCgroup to specify the Cgroups path used for system components.

You must also make sure that the systemReservedCgroup path exists before the kubelet starts; otherwise the kubelet will fail to start.

Example:

systemReserved:
  cpu: 100m
  memory: 2048Mi
  ephemeral-storage: 1Gi
  pid: "1000"
systemReservedCgroup: /system.slice           

kubeReserved

Resources reserved for Kubernetes components, such as the kubelet and the container runtime.

If this option is configured, you need to configure kubeReservedCgroup to specify the Cgroups path of the Kubernetes component.

Likewise, the kubeReservedCgroup path must exist before the kubelet starts, otherwise the kubelet will fail to start (a sketch of pre-creating it follows the example below).

Example:

kubeReserved:
  cpu: 100m
  memory: 2048Mi
  ephemeral-storage: 1Gi
  pid: "1000"
kubeReservedCgroup: /kube.slice           
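
With the cgroupfs driver, pre-creating the reserved cgroup paths can be as simple as the sketch below. The subsystem list and the kube.slice / system.slice names follow the examples above and are assumptions; with the systemd driver you would instead define matching systemd slices.

# Create the reserved cgroup directories in each enforced subsystem before starting the kubelet.
for subsys in cpu cpuacct memory pids; do
  sudo mkdir -p /sys/fs/cgroup/${subsys}/kube.slice /sys/fs/cgroup/${subsys}/system.slice
done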

evictionHard

These are the thresholds for node-pressure eviction. The kubelet monitors resources such as memory, disk space, and file-system inodes on the node; when one or more of them crosses the configured threshold, the kubelet proactively terminates pods to reclaim that resource.

Example:

evictionHard:
  imagefs.available: 10%
  memory.available: 500Mi
  nodefs.available: 5%
  nodefs.inodesFree: 5%           

enforceNodeAllocatable

Specifies which groups of resources the kubelet actually enforces limits on. The default is pods, i.e. only the resources used by pods on the node are capped.

In addition, it is possible to enforce the reservations for system components (system-reserved) and Kubernetes components (kube-reserved).

To keep the core components running stably, it is generally recommended not to enforce limits on system components and Kubernetes components, so that these components never find themselves under-resourced.

Example:

enforceNodeAllocatable:
- pods

With the above configuration in place, the total amount of resources that pods on a node can use (Allocatable) is:

Allocatable = Capacity (total node resources) - evictionHard - kubeReserved - systemReserved

As shown below:


Figure 4.1 The total amount of resources that can be used by pods on a node
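
You can verify this arithmetic on a live cluster. A sketch, where <node-name> is a placeholder: with the example reservations above on an 8-CPU, 32 Gi node, CPU Allocatable would be 8000m - 100m - 100m = 7800m, and memory Allocatable roughly 32 Gi - 2 Gi - 2 Gi - 500 Mi, about 27.5 Gi.

$ kubectl describe node <node-name> | grep -A 6 -E '^(Capacity|Allocatable):'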

4.1.2 Node-level Cgroups hierarchy

The Cgroups hierarchy directory is as follows (taking the CPU subsystem as an example)


Figure 4.2 Cgroups hierarchical directory

In the image above:

  • /system.slice: manages the resource usage of system components
  • /kube.slice: manages the resource usage of Kubernetes components, such as the kubelet and the container runtime
  • /kubepods.slice: manages the resource usage of the pods of the Kubernetes cluster on this node (see the listing below)
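
A quick way to see these three top-level groups is to list them in the cpu subsystem. This is a sketch: the names match the figure and a systemd-driver setup; with the cgroupfs driver the pod hierarchy is usually called kubepods instead of kubepods.slice, and kube.slice only exists if kubeReservedCgroup is configured.

$ ls -d /sys/fs/cgroup/cpu/system.slice /sys/fs/cgroup/cpu/kube.slice /sys/fs/cgroup/cpu/kubepods.slice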

4.2 QoS level

QoS stands for Quality of Service.

Kubernetes classifies all pods and assigns them to specific QoS levels based on their resource requests and limits.

In the case of node resource pressure, Kubernetes decides which pods should be evicted based on the QoS level.

4.2.1 QoS classes

There are three QoS classes:

  • Guaranteed
  • Burstable
  • BestEffort

Guaranteed

Every container in the pod (excluding ephemeral containers) has both Requests and Limits set for CPU and memory, and for each resource the Request equals the Limit.

Burstable

The Guaranteed conditions are not met, but at least one container has a Request or Limit set for CPU or memory.

BestEffort

Neither the Guaranteed nor the Burstable conditions are met, i.e. no container in the pod sets any Request or Limit for CPU or memory.

The relative importance of pods at the three QoS levels is:

Guaranteed > Burstable > BestEffort

That is, when node resources are exhausted, BestEffort pods are evicted first, then Burstable pods, and finally Guaranteed pods. For evictions triggered by node resource pressure, only pods whose usage exceeds their requests are candidates for eviction.
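
The class a pod lands in can be checked directly from its status. Below is a sketch: the pod and container names (qos-demo, app) and the image are illustrative, and because the single container sets equal requests and limits for CPU and memory, the pod is classified as Guaranteed.

$ cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: qos-demo
spec:
  containers:
  - name: app
    image: nginx
    resources:
      requests:
        cpu: 500m
        memory: 256Mi
      limits:
        cpu: 500m
        memory: 256Mi
EOF
$ kubectl get pod qos-demo -o jsonpath='{.status.qosClass}'
Guaranteed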

4.2.2 QoS-level Cgroups hierarchy

The Cgroups hierarchy directory is as follows (taking the CPU subsystem as an example)


Figure 4.3 Cgroups hierarchical directory

As the figure shows, the Cgroups directories of Guaranteed pods sit directly under the kubepods directory, while Burstable and BestEffort pods get their own subdirectories one level down.
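
On a node using the systemd driver, this layering is visible as sub-slices of kubepods.slice. A sketch; the pod-level names contain real pod UIDs and will differ on your node:

$ ls /sys/fs/cgroup/cpu/kubepods.slice/
kubepods-besteffort.slice  kubepods-burstable.slice  kubepods-pod<uid>.slice  ...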

4.3 Pod Level

4.3.1 Pod-level resource management

A pod can contain one or more containers. However, the total resource consumption of a pod is not limited to the sum of the resource consumption of all containers within the pod. In fact, the resource usage of pods also includes additional overhead, including:

  • Pod basic resource overhead: Each pod incurs some basic resource overhead, which mainly includes network configuration and resource consumption of some basic services running inside the pod.
  • Kubernetes infrastructure containers: Kubernetes runs a special container in each pod, commonly known as the pause container. It holds the pod's shared Linux namespaces (most notably the network namespace) that the other containers join. Although the pause container consumes very few resources, it still needs to be taken into account when calculating the total resource consumption of a pod.

In order to effectively manage the resource consumption of pods, Kubernetes establishes a Cgroups layer at the pod level, which can more precisely control and monitor the resource usage of pods and their containers.

4.3.2 Pod-level Cgroups hierarchy

The Cgroups hierarchy directory is as follows (taking the CPU subsystem as an example)


Figure 4.4 Cgroups hierarchical directory
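
One way to see the pod-level layer from the node is to check which cgroup a pause process belongs to. A sketch; the path shown in the comment is illustrative of a Burstable pod under the systemd driver, and the actual slice names contain real pod UIDs:

$ pgrep -f /pause | head -n 1                                    # pick one pause (sandbox) process on the node
$ cat /proc/$(pgrep -f /pause | head -n 1)/cgroup | grep 'cpu,cpuacct'
# e.g. 4:cpu,cpuacct:/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod<uid>.slice/...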

4.4 Container Level

4.4.1 Resource management at the container level

Resource isolation at the container level is implemented by the underlying container runtime, such as containerd. In Kubernetes, resource requests and limits are configured on a container, and this configuration is passed to the container runtime by the kubelet. The hand-off relies on Kubernetes' Container Runtime Interface (CRI), which provides a standardized protocol for the interaction between the kubelet and the container runtime.

After receiving the resource configuration information, the container runtime uses Cgroups to limit and isolate the resource usage of each container. By configuring Cgroups, the container runtime can precisely control resources such as CPU and memory. Specifically, it configures parameters such as cpu.shares, cpu.cfs_period_us, cpu.cfs_quota_us, and memory.limit_in_bytes to ensure that the container's resource consumption does not exceed the preset limit.

In short, the kubelet passes the resource configuration along, while the actual isolation and throttling are implemented by the container runtime through cgroups. This division of labour lets Kubernetes manage container resources efficiently and safely.
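
If you want to see what the runtime actually applied, the CRI debugging tool can show it. A sketch, assuming containerd with crictl installed; the container name my-app is illustrative and the exact JSON layout varies across runtime versions:

$ crictl ps --name my-app -q                                  # look up the container ID
$ crictl inspect <container-id> | grep -A 10 '"resources"'    # the cgroup-related settings applied by the runtime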

4.4.2 CPU Resource Configuration

  • CPU request: implemented through cpu.shares in Cgroups. When a container's CPU request is x millicores, its cpu.shares value is x * 1024 / 1000. For example, a CPU request of 1000 millicores (one CPU core) yields cpu.shares = 1024, which gives the container a scheduling weight equivalent to one core when CPU is under contention.
  • CPU limit: implemented through cpu.cfs_period_us and cpu.cfs_quota_us in Cgroups. Kubernetes sets cpu.cfs_period_us to 100,000 microseconds (100 ms) and derives cpu.cfs_quota_us from the CPU limit (quota = limit in cores x period). Together they hard-cap the container's CPU usage so that the configured limit is never exceeded. If only a request is specified without a limit, cpu.cfs_quota_us is set to -1 (no cap). If neither a request nor a limit is specified, cpu.shares is set to 2, meaning the container gets only a minimal share of CPU (see the sketch after this list).
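
As a concrete sketch of that mapping, assume a container with requests.cpu=250m and limits.cpu=500m; the cgroup path below is a placeholder whose exact form depends on the driver, QoS class, and pod/container IDs:

# Expected values for this container:
#   cpu.shares        = 250 * 1024 / 1000 = 256
#   cpu.cfs_period_us = 100000
#   cpu.cfs_quota_us  = 0.5 * 100000 = 50000
$ CG=/sys/fs/cgroup/cpu/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod<uid>.slice/cri-containerd-<cid>.scope
$ cat $CG/cpu.shares $CG/cpu.cfs_period_us $CG/cpu.cfs_quota_us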

4.4.3 Allocation of memory resources

  • Memory request: Kubernetes uses memory requests when making scheduling decisions. A request is not directly reflected in the container-level Cgroups configuration, but it guarantees that the pod is only placed on a node with enough allocatable memory.
  • Memory limit: the memory limit is implemented through the memory.limit_in_bytes parameter of Cgroups, which defines the maximum amount of memory the container's processes may use. If the container exceeds this limit, the OOM Killer is triggered, which may cause the container to be killed and restarted. If no memory limit is specified, memory.limit_in_bytes is set to a very large value, which effectively means the container can use as much memory as the node can provide (see the sketch after this list).
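
A matching sketch for memory: a limits.memory of 256Mi becomes 256 * 1024 * 1024 = 268435456 bytes. Again, the cgroup path is a placeholder:

$ CG=/sys/fs/cgroup/memory/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod<uid>.slice/cri-containerd-<cid>.scope
$ cat $CG/memory.limit_in_bytes
268435456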

In a nutshell, Kubernetes implements resource isolation for containers through Cgroups configuration, ensuring that each container gets a reasonable allocation of resources according to its requests and limits. This mechanism not only improves the efficiency of resource utilization, but also prevents excessive resource consumption of any single container, thus ensuring the overall stability of the system.

4.4.4 Container-level Cgroups hierarchy

The Cgroups hierarchy directory is as follows (taking the CPU subsystem as an example)


Figure 4.5 Cgroups hierarchical directory

Finally, let's take a look at the Cgroup directory structure of the CPU subsystem.


Figure 4.6 Cgroups hierarchy of the CPU subsystem

Resources

[1] Kubernetes official documentation (https://kubernetes.io/docs/home/)

Author: Liuyue

Source-WeChat public account: Xinye Technology shoots black rice

Source: https://mp.weixin.qq.com/s/BiMFUkZ47wo5znBhgAnuPA
