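A Brief Analysis of the Kubelet Eviction Mechanism

This article is based on a reading of the Kubernetes v1.22.1 source code.

To protect the node, the Kubelet can evict Pods running on it when node resources run low. I recently spent some time studying the Kubelet eviction mechanism and found a lot worth learning from, so I am summarizing it here to share.

Kubelet Configuration

Eviction has to be enabled in the Kubelet configuration, where the eviction thresholds are also set. The eviction-related parameters in the Kubelet configuration are: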

type KubeletConfiguration struct {
    ...
    // Map of signal names to quantities that defines hard eviction thresholds. For example: {"memory.available": "300Mi"}.
    EvictionHard map[string]string
    // Map of signal names to quantities that defines soft eviction thresholds.  For example: {"memory.available": "300Mi"}.
    EvictionSoft map[string]string
    // Map of signal names to quantities that defines grace periods for each soft eviction signal. For example: {"memory.available": "30s"}.
    EvictionSoftGracePeriod map[string]string
    // Duration for which the kubelet has to wait before transitioning out of an eviction pressure condition.
    EvictionPressureTransitionPeriod metav1.Duration
    // Maximum allowed grace period (in seconds) to use when terminating pods in response to a soft eviction threshold being met.
    EvictionMaxPodGracePeriod int32
    // Map of signal names to quantities that defines minimum reclaims, which describe the minimum
    // amount of a given resource the kubelet will reclaim when performing a pod eviction while
    // that resource is under pressure. For example: {"imagefs.available": "2Gi"}
    EvictionMinimumReclaim map[string]string
    ...
}

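EvictionHard defines hard eviction: once a threshold is reached, Pods are evicted immediately. EvictionSoft defines soft eviction: each soft threshold has a grace period, set through EvictionSoftGracePeriod, and eviction only starts once the threshold has been exceeded for longer than that period. EvictionMinimumReclaim sets the minimum amount of a resource, for example imagefs, that the Kubelet reclaims when it evicts Pods because that resource is under pressure.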

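The eviction signals that can be configured are:

memory.available: node.status.capacity[memory] - node.stats.memory.workingSet, the memory available on the node
nodefs.available: node.stats.fs.available, the available capacity of the filesystem used by the Kubelet
nodefs.inodesFree: node.stats.fs.inodesFree, the number of free inodes on the filesystem used by the Kubelet
imagefs.available: node.stats.runtime.imagefs.available, the available capacity of the filesystem the container runtime uses for images and container writable layers
imagefs.inodesFree: node.stats.runtime.imagefs.inodesFree, the number of free inodes on the filesystem the container runtime uses for images and container writable layers
allocatableMemory.available: the memory still available for allocation to Pods
pid.available: node.stats.rlimit.maxpid - node.stats.rlimit.curproc, the PIDs still available for allocation to Pods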
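As a small worked example of the memory.available formula in the list above, the following sketch computes the signal from two byte counts and compares it with a threshold. The function names are hypothetical and the numbers are invented for illustration; the Kubelet reads the real values from its stats provider.

```go
package main

import "fmt"

// memoryAvailable applies the formula from the list above:
// node capacity minus the working set, both in bytes.
func memoryAvailable(capacityBytes, workingSetBytes uint64) uint64 {
	if workingSetBytes > capacityBytes {
		return 0 // guard against unsigned underflow
	}
	return capacityBytes - workingSetBytes
}

// thresholdMet reports whether the available amount has dropped
// below the configured eviction threshold.
func thresholdMet(availableBytes, thresholdBytes uint64) bool {
	return availableBytes < thresholdBytes
}

func main() {
	const mi = 1 << 20
	// Invented numbers: 8192Mi of capacity, 7992Mi in the working set.
	avail := memoryAvailable(8192*mi, 7992*mi)
	// With memory.available set to 300Mi, 200Mi available meets the threshold.
	fmt.Println(avail/mi, thresholdMet(avail, 300*mi)) // prints "200 true"
}
```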

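How the Eviction Manager Works

The Eviction Manager does most of its work in the synchronize function. Two things trigger synchronize: a monitor task that fires every 10s, and the notifier tasks, which are started according to the eviction signals the user configured and listen for kernel events.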

notifier

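A notifier is started by a thresholdNotifier in the eviction manager. Each eviction signal the user configures has its own thresholdNotifier, and the thresholdNotifier and the notifier communicate over a channel: whenever the notifier sends a message on the channel, the corresponding thresholdNotifier triggers one run of the synchronize logic.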
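The channel handoff described here can be sketched in a few lines. This is only an illustration of the pattern, assuming a hypothetical runThresholdNotifier consumer and a counter standing in for synchronize; the Kubelet's actual types carry more state.

```go
package main

import "fmt"

// runThresholdNotifier is a hypothetical stand-in for the consumer side:
// it calls handler once for every event received, until events is closed.
func runThresholdNotifier(events <-chan struct{}, handler func()) {
	for range events {
		handler()
	}
}

// countSyncs simulates a notifier sending n events and returns how many
// times the handler (standing in for synchronize) was invoked.
func countSyncs(n int) int {
	events := make(chan struct{})
	done := make(chan struct{})
	count := 0

	go func() {
		runThresholdNotifier(events, func() { count++ })
		close(done)
	}()

	// Producer side: in the Kubelet this would be the kernel-event loop.
	for i := 0; i < n; i++ {
		events <- struct{}{}
	}
	close(events)
	<-done // wait for the consumer to finish before reading count
	return count
}

func main() {
	fmt.Println("synchronize triggered", countSyncs(3), "times")
}
```

An empty struct is used as the message because the event carries no payload; the fact that something arrived is the whole signal.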

The notifier relies on the kernel's cgroups memory thresholds. cgroups lets a user-space process use an eventfd to ask the kernel to notify it when memory.usage_in_bytes reaches a given threshold. This is done by writing "<event_fd> <fd of memory.usage_in_bytes> <threshold>" to cgroup.event_control.

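The notifier's initialization code is shown below (with unrelated code removed for readability). It opens memory.usage_in_bytes to get the file descriptor watchfd, opens cgroup.event_control to get controlfd, and then registers the cgroup memory threshold.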

func NewCgroupNotifier(path, attribute string, threshold int64) (CgroupNotifier, error) {
    var watchfd, eventfd, epfd, controlfd int
    var err error // the error checks after each call are omitted in this excerpt

    watchfd, err = unix.Open(fmt.Sprintf("%s/%s", path, attribute), unix.O_RDONLY|unix.O_CLOEXEC, 0)
    defer unix.Close(watchfd)
    
    controlfd, err = unix.Open(fmt.Sprintf("%s/cgroup.event_control", path), unix.O_WRONLY|unix.O_CLOEXEC, 0)
    defer unix.Close(controlfd)
    
    eventfd, err = unix.Eventfd(0, unix.EFD_CLOEXEC)
    defer func() {
        // Close eventfd if we get an error later in initialization
        if err != nil {
            unix.Close(eventfd)
        }
    }()
    
    epfd, err = unix.EpollCreate1(unix.EPOLL_CLOEXEC)
    defer func() {
        // Close epfd if we get an error later in initialization
        if err != nil {
            unix.Close(epfd)
        }
    }()
    
    config := fmt.Sprintf("%d %d %d", eventfd, watchfd, threshold)
    _, err = unix.Write(controlfd, []byte(config))

    return &linuxCgroupNotifier{
        eventfd: eventfd,
        epfd:    epfd,
        stop:    make(chan struct{}),
    }, nil
}

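When it starts, the notifier also uses epoll to watch the eventfd above. When it sees an event from the kernel, memory usage has crossed the threshold, so it sends a signal on the channel.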

func (n *linuxCgroupNotifier) Start(eventCh chan<- struct{}) {
    err := unix.EpollCtl(n.epfd, unix.EPOLL_CTL_ADD, n.eventfd, &unix.EpollEvent{
        Fd:     int32(n.eventfd),
        Events: unix.EPOLLIN,
    })

    for {
        select {
        case <-n.stop:
            return
        default:
        }
        event, err := wait(n.epfd, n.eventfd, notifierRefreshInterval)
        if err != nil {
            klog.InfoS("Eviction manager: error while waiting for memcg events", "err", err)
            return
        } else if !event {
            // Timeout on wait.  This is expected if the threshold was not crossed
            continue
        }
        // Consume the event from the eventfd
        buf := make([]byte, eventSize)
        _, err = unix.Read(n.eventfd, buf)
        if err != nil {
            klog.InfoS("Eviction manager: error reading memcg events", "err", err)
            return
        }
        eventCh <- struct{}{}
    }
}

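Each time the synchronize logic runs, it checks whether the notifier has been updated within the last 10s and, if not, updates the threshold and restarts the notifier. The cgroup memory threshold is computed as total memory minus the eviction threshold the user configured.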
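The threshold arithmetic and the registration string can be put together in a short sketch. It follows the simplified description in this article (total memory minus the configured eviction threshold); the function names are hypothetical and the file descriptor numbers are invented.

```go
package main

import "fmt"

// memcgThreshold follows the article's description: the usage level at which
// the kernel should notify is total memory minus the eviction threshold.
func memcgThreshold(capacityBytes, evictionThresholdBytes int64) int64 {
	return capacityBytes - evictionThresholdBytes
}

// eventControlLine builds the "<event_fd> <fd of memory.usage_in_bytes> <threshold>"
// string that gets written to cgroup.event_control.
func eventControlLine(eventfd, watchfd int, thresholdBytes int64) string {
	return fmt.Sprintf("%d %d %d", eventfd, watchfd, thresholdBytes)
}

func main() {
	const gi, mi = int64(1) << 30, int64(1) << 20
	// Invented example: 8Gi of memory and memory.available set to 300Mi.
	threshold := memcgThreshold(8*gi, 300*mi)
	// The fd numbers 5 and 4 are placeholders for this illustration.
	fmt.Println(eventControlLine(5, 4, threshold)) // prints "5 4 8275361792"
}
```

Note the inversion: the eviction signal is expressed as memory still available, while the cgroup threshold is expressed as memory in use, which is why the subtraction is needed.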

synchronize

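The Eviction Manager's main logic, synchronize, has a lot of detail, so I will not paste the source here. It boils down to the following steps:

Build a sort function for each signal;
Update the thresholds and restart the notifier;
Get the node's current resource usage (from cgroup information) and all active pods;
For each signal, determine whether the node's current resource usage has reached the eviction threshold; if none has, exit the current loop;
Sort all signals by priority, with memory-related signals handled first;
Send an eviction event to the apiserver;
Sort all active pods by eviction priority;
Evict pods in the sorted order.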
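The signal-ordering step in the list above can be illustrated with a stable sort that moves memory-related signals to the front. This is a sketch of the idea only; isMemorySignal and sortByEvictionPriority are hypothetical helpers operating on plain strings, not the Kubelet's own threshold types.

```go
package main

import (
	"fmt"
	"sort"
)

// isMemorySignal reports whether a signal name is one of the
// memory-related signals listed earlier in the article.
func isMemorySignal(signal string) bool {
	return signal == "memory.available" || signal == "allocatableMemory.available"
}

// sortByEvictionPriority puts memory-related signals first and keeps the
// relative order of everything else unchanged (stable sort).
func sortByEvictionPriority(signals []string) {
	sort.SliceStable(signals, func(i, j int) bool {
		return isMemorySignal(signals[i]) && !isMemorySignal(signals[j])
	})
}

func main() {
	signals := []string{
		"nodefs.available",
		"memory.available",
		"imagefs.available",
		"allocatableMemory.available",
	}
	sortByEvictionPriority(signals)
	fmt.Println(signals)
}
```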

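Computing the Eviction Order

The order in which pods are evicted depends mainly on three factors:

whether the pod's resource usage exceeds its requests;
the pod's priority value;
the pod's memory usage.

The three factors are evaluated in the order in which they are registered with orderedBy. The multi-level sort behind orderedBy is another implementation in Kubernetes worth learning from (or borrowing); interested readers can look it up in the source.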

// rankMemoryPressure orders the input pods for eviction in response to memory pressure.
// It ranks by whether or not the pod's usage exceeds its requests, then by priority, and
// finally by memory usage above requests.
func rankMemoryPressure(pods []*v1.Pod, stats statsFunc) {
    orderedBy(exceedMemoryRequests(stats), priority, memory(stats)).Sort(pods)
}
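For readers who want to see the shape of such a multi-level sort without opening the Kubernetes tree, here is a self-contained sketch of the technique. It is a simplified re-implementation for illustration, not the upstream code: the pod struct, its fields, and the three comparators are hypothetical stand-ins for the real Pod objects and stats lookups.

```go
package main

import (
	"fmt"
	"sort"
)

// pod is a simplified, hypothetical stand-in for *v1.Pod plus its stats.
type pod struct {
	name            string
	exceedsRequests bool  // memory usage is above requests
	priority        int32 // pod priority value
	memOverRequests int64 // bytes of usage above requests
}

// cmpFunc returns a negative value if a should be evicted before b,
// a positive value if after, and 0 if this level cannot decide.
type cmpFunc func(a, b *pod) int

type multiSorter struct {
	pods []*pod
	cmps []cmpFunc
}

// orderedBy registers the comparators in the order they should be consulted.
func orderedBy(cmps ...cmpFunc) *multiSorter { return &multiSorter{cmps: cmps} }

func (ms *multiSorter) Sort(pods []*pod) { ms.pods = pods; sort.Sort(ms) }
func (ms *multiSorter) Len() int         { return len(ms.pods) }
func (ms *multiSorter) Swap(i, j int)    { ms.pods[i], ms.pods[j] = ms.pods[j], ms.pods[i] }

// Less walks the comparators in order and stops at the first that decides.
func (ms *multiSorter) Less(i, j int) bool {
	for _, cmp := range ms.cmps {
		if r := cmp(ms.pods[i], ms.pods[j]); r != 0 {
			return r < 0
		}
	}
	return false
}

// Pods exceeding their requests are evicted first.
func exceedRequests(a, b *pod) int {
	if a.exceedsRequests == b.exceedsRequests {
		return 0
	}
	if a.exceedsRequests {
		return -1
	}
	return 1
}

// Lower priority is evicted first.
func priority(a, b *pod) int {
	switch {
	case a.priority < b.priority:
		return -1
	case a.priority > b.priority:
		return 1
	}
	return 0
}

// Larger usage above requests is evicted first.
func memory(a, b *pod) int {
	switch {
	case a.memOverRequests > b.memOverRequests:
		return -1
	case a.memOverRequests < b.memOverRequests:
		return 1
	}
	return 0
}

// evictionOrder sorts the pods and returns their names in eviction order.
func evictionOrder(pods []*pod) []string {
	orderedBy(exceedRequests, priority, memory).Sort(pods)
	names := make([]string, 0, len(pods))
	for _, p := range pods {
		names = append(names, p.name)
	}
	return names
}

func main() {
	pods := []*pod{
		{"a", true, 100, 50},
		{"b", false, 0, 500},
		{"c", true, 0, 10},
		{"d", true, 0, 200},
	}
	fmt.Println(evictionOrder(pods)) // prints "[d c a b]"
}
```

The appeal of the pattern is that each comparator only knows about one criterion, and the tie-breaking order is decided in a single place, the argument list of orderedBy.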

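Evicting Pods

Next comes the actual eviction. When the Eviction Manager evicts a Pod it simply kills it; the implementation details are not analyzed here. What is worth noting is a check that runs before the eviction: if IsCriticalPod returns true, the Pod is not evicted.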

func (m *managerImpl) evictPod(pod *v1.Pod, gracePeriodOverride int64, evictMsg string, annotations map[string]string) bool {
    // If the pod is marked as critical and static, and support for critical pod annotations is enabled,
    // do not evict such pods. Static pods are not re-admitted after evictions.
    // https://github.com/kubernetes/kubernetes/issues/40573 has more details.
    if kubelettypes.IsCriticalPod(pod) {
        klog.ErrorS(nil, "Eviction manager: cannot evict a critical pod", "pod", klog.KObj(pod))
        return false
    }
    // record that we are evicting the pod
    m.recorder.AnnotatedEventf(pod, annotations, v1.EventTypeWarning, Reason, evictMsg)
    // this is a blocking call and should only return when the pod and its containers are killed.
    klog.V(3).InfoS("Evicting pod", "pod", klog.KObj(pod), "podUID", pod.UID, "message", evictMsg)
    err := m.killPodFunc(pod, true, &gracePeriodOverride, func(status *v1.PodStatus) {
        status.Phase = v1.PodFailed
        status.Reason = Reason
        status.Message = evictMsg
    })
    if err != nil {
        klog.ErrorS(err, "Eviction manager: pod failed to evict", "pod", klog.KObj(pod))
    } else {
        klog.InfoS("Eviction manager: pod is evicted successfully", "pod", klog.KObj(pod))
    }
    return true
}

Now look at the code of IsCriticalPod:

func IsCriticalPod(pod *v1.Pod) bool {
    if IsStaticPod(pod) {
        return true
    }
    if IsMirrorPod(pod) {
        return true
    }
    if pod.Spec.Priority != nil && IsCriticalPodBasedOnPriority(*pod.Spec.Priority) {
        return true
    }
    return false
}

// IsMirrorPod returns true if the passed Pod is a Mirror Pod.
func IsMirrorPod(pod *v1.Pod) bool {
    _, ok := pod.Annotations[ConfigMirrorAnnotationKey]
    return ok
}

// IsStaticPod returns true if the pod is a static pod.
func IsStaticPod(pod *v1.Pod) bool {
    source, err := GetPodSource(pod)
    return err == nil && source != ApiserverSource
}

func IsCriticalPodBasedOnPriority(priority int32) bool {
    return priority >= scheduling.SystemCriticalPriority
}

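According to the code, Static, Mirror, and Critical Pods are never evicted. Static and Mirror Pods are both identified from the Pod's annotations, while a Critical Pod is identified by its Priority value: a Pod whose priority is at or above scheduling.SystemCriticalPriority, which is the case for the system-cluster-critical and system-node-critical priority classes, is a Critical Pod.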
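The priority comparison can be tried out in isolation with the sketch below. The numeric values are the upstream defaults as far as I know (2000000000 for the critical cutoff and system-cluster-critical, 2000001000 for system-node-critical); treat them as assumptions and check them against your Kubernetes version. The map and function names are hypothetical.

```go
package main

import "fmt"

// systemCriticalPriority is assumed to match scheduling.SystemCriticalPriority.
const systemCriticalPriority int32 = 2000000000

// priorityClassValues holds the assumed default values of the two built-in
// priority classes mentioned in the article.
var priorityClassValues = map[string]int32{
	"system-cluster-critical": 2000000000,
	"system-node-critical":    2000001000,
}

// isCriticalPriority mirrors the comparison in IsCriticalPodBasedOnPriority.
func isCriticalPriority(priority int32) bool {
	return priority >= systemCriticalPriority
}

func main() {
	for _, name := range []string{"system-cluster-critical", "system-node-critical"} {
		fmt.Println(name, isCriticalPriority(priorityClassValues[name]))
	}
	// A user-defined priority below the cutoff is not critical.
	fmt.Println("user-defined 1000000", isCriticalPriority(1000000))
}
```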

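One thing worth noting, though: the official documentation on Critical Pods says that a non-static Pod marked as critical is not fully guaranteed to escape eviction: https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods. So it may well be that the community has not yet settled whether such Pods should be evicted, and this logic could change later. It is also possible that the documentation simply has not been updated 🌚.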

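Summary

This article analyzed the Kubelet's Eviction Manager, including how it listens for Linux cgroup events and how it decides the order in which Pods are evicted. With this understanding, we can set priorities according to how important our own applications are, or even make them Critical Pods.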
