An In-Depth Analysis of Kubernetes Critical Pods (Part 3)

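This article explains how the kubelet preempts resources for a CriticalPod during the Predicate Admit admission check, and how the Priority Admission Controller gives special treatment to a CriticalPod's PriorityClassName.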

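The In-Depth Analysis of Kubernetes Critical Pods series:

An In-Depth Analysis of Kubernetes Critical Pods (Part 1)
An In-Depth Analysis of Kubernetes Critical Pods (Part 2)
An In-Depth Analysis of Kubernetes Critical Pods (Part 3)
An In-Depth Analysis of Kubernetes Critical Pods (Part 4)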

How Kubelet Predicate Admit Handles Resource Preemption for Critical Pods

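During the Predicate Admit flow, the kubelet runs a series of predicate admission checks against each Pod, including GeneralPredicates, which checks whether the node has enough cpu, mem, and gpu resources. If the GeneralPredicates check fails, a non-critical Pod simply fails admission. A CriticalPod, however, triggers kubelet preemption: the kubelet kills some Pods according to certain rules to free resources, and if the preemption succeeds, the Pod is admitted.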

The flow starts at kubelet initialization.

pkg/kubelet/kubelet.go:315

// NewMainKubelet instantiates a new Kubelet object along with all the required internal modules.
// No initialization of Kubelet and its modules should happen here.
func NewMainKubelet(...) (*Kubelet, error) {
    ...
    criticalPodAdmissionHandler := preemption.NewCriticalPodAdmissionHandler(klet.GetActivePods, killPodNow(klet.podWorkers, kubeDeps.Recorder), kubeDeps.Recorder)
    klet.admitHandlers.AddPodAdmitHandler(lifecycle.NewPredicateAdmitHandler(klet.getNodeAnyWay, criticalPodAdmissionHandler, klet.containerManager.UpdatePluginResources))
    // apply functional Option's
    for _, opt := range kubeDeps.Options {
        opt(klet)
    }

    ...
    return klet, nil
}

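When NewMainKubelet initializes the kubelet, it creates criticalPodAdmissionHandler and passes it to the predicate admit handler that is registered through AddPodAdmitHandler. What is special about admitting a CriticalPod lives in criticalPodAdmissionHandler.

Next, let's step into the kubelet's predicateAdmitHandler flow and see what happens after GeneralPredicates fails.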

pkg/kubelet/lifecycle/predicate.go:58

func (w *predicateAdmitHandler) Admit(attrs *PodAdmitAttributes) PodAdmitResult {
    ...

    fit, reasons, err := predicates.GeneralPredicates(podWithoutMissingExtendedResources, nil, nodeInfo)
    if err != nil {
        message := fmt.Sprintf("GeneralPredicates failed due to %v, which is unexpected.", err)
        glog.Warningf("Failed to admit pod %v - %s", format.Pod(pod), message)
        return PodAdmitResult{
            Admit:   fit,
            Reason:  "UnexpectedAdmissionError",
            Message: message,
        }
    }
    if !fit {
        fit, reasons, err = w.admissionFailureHandler.HandleAdmissionFailure(pod, reasons)
        if err != nil {
            message := fmt.Sprintf("Unexpected error while attempting to recover from admission failure: %v", err)
            glog.Warningf("Failed to admit pod %v - %s", format.Pod(pod), message)
            return PodAdmitResult{
                Admit:   fit,
                Reason:  "UnexpectedAdmissionError",
                Message: message,
            }
        }
    }
    ...
    return PodAdmitResult{
        Admit: true,
    }
}

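When predicateAdmitHandler runs GeneralPredicates against a Pod and the Pod does not fit, for example because cpu, mem, or gpu resources are insufficient, it goes on to call HandleAdmissionFailure for additional handling. As mentioned earlier, kubelet initialization supplied criticalPodAdmissionHandler as the admissionFailureHandler, so the call lands in that handler's HandleAdmissionFailure method.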
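The delegation described above can be sketched in a few lines of standalone Go. This is only an illustration of the control flow, not the kubelet's code: the `failureHandler` interface, the `criticalOnly` type, and the `admit` function are hypothetical names, pods are reduced to plain strings, and the predicate result is passed in as an argument instead of being computed.

```go
package main

import (
	"fmt"
	"strings"
)

// failureHandler is a hypothetical, simplified version of the hook that the
// predicate admit handler consults when a pod does not fit.
type failureHandler interface {
	HandleAdmissionFailure(pod string, reasons []string) (bool, []string, error)
}

// criticalOnly rescues only the pods listed as critical (an assumption of this sketch).
type criticalOnly struct {
	critical map[string]bool
}

func (h criticalOnly) HandleAdmissionFailure(pod string, reasons []string) (bool, []string, error) {
	if !h.critical[pod] {
		return false, reasons, nil // non-critical pods keep their rejection
	}
	// A real handler would try to free resources here; the sketch assumes it succeeds.
	return true, nil, nil
}

// admit takes the predicate result as input and delegates to the handler on failure.
func admit(pod string, fit bool, reasons []string, h failureHandler) (bool, string) {
	if !fit {
		var err error
		fit, reasons, err = h.HandleAdmissionFailure(pod, reasons)
		if err != nil {
			return false, "UnexpectedAdmissionError"
		}
		if !fit {
			return false, strings.Join(reasons, ", ")
		}
	}
	return true, ""
}

func main() {
	h := criticalOnly{critical: map[string]bool{"kube-dns": true}}
	for _, pod := range []string{"kube-dns", "web"} {
		ok, reason := admit(pod, false, []string{"Insufficient cpu"}, h)
		fmt.Printf("%s admitted=%v reason=%q\n", pod, ok, reason)
	}
}
```

Injecting the handler as an interface keeps the admit path unaware of what "recovering" means, which is why the critical-pod logic can live entirely in a separate package.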

The CriticalPodAdmissionHandler struct is defined as follows:

pkg/kubelet/preemption/preemption.go:41

type CriticalPodAdmissionHandler struct {
    getPodsFunc eviction.ActivePodsFunc
    killPodFunc eviction.KillPodFunc
    recorder    record.EventRecorder
}

The HandleAdmissionFailure method of CriticalPodAdmissionHandler is where the CriticalPod-specific logic lives.

pkg/kubelet/preemption/preemption.go:66

// HandleAdmissionFailure gracefully handles admission rejection, and, in some cases,
// to allow admission of the pod despite its previous failure.
func (c *CriticalPodAdmissionHandler) HandleAdmissionFailure(pod *v1.Pod, failureReasons []algorithm.PredicateFailureReason) (bool, []algorithm.PredicateFailureReason, error) {
    if !kubetypes.IsCriticalPod(pod) || !utilfeature.DefaultFeatureGate.Enabled(features.ExperimentalCriticalPodAnnotation) {
        return false, failureReasons, nil
    }
    // InsufficientResourceError is not a reason to reject a critical pod.
    // Instead of rejecting, we free up resources to admit it, if no other reasons for rejection exist.
    nonResourceReasons := []algorithm.PredicateFailureReason{}
    resourceReasons := []*admissionRequirement{}
    for _, reason := range failureReasons {
        if r, ok := reason.(*predicates.InsufficientResourceError); ok {
            resourceReasons = append(resourceReasons, &admissionRequirement{
                resourceName: r.ResourceName,
                quantity:     r.GetInsufficientAmount(),
            })
        } else {
            nonResourceReasons = append(nonResourceReasons, reason)
        }
    }
    if len(nonResourceReasons) > 0 {
        // Return only reasons that are not resource related, since critical pods cannot fail admission for resource reasons.
        return false, nonResourceReasons, nil
    }
    err := c.evictPodsToFreeRequests(admissionRequirementList(resourceReasons))
    // if no error is returned, preemption succeeded and the pod is safe to admit.
    return err == nil, nil, err
}

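If the Pod is not a CriticalPod, or the ExperimentalCriticalPodAnnotation Feature Gate is disabled, the method returns false immediately, which means admission fails.
Otherwise it inspects failureReasons. If any reason is not a predicates.InsufficientResourceError, admission still fails and only those non-resource reasons are returned. If every reason is an InsufficientResourceError, it calls evictPodsToFreeRequests to trigger kubelet preemption. Note that this preemption is different from scheduler preemption; do not confuse the two.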
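The decision can be illustrated with a small standalone sketch. The types below (`failureReason`, `insufficientResource`, `otherFailure`) and the functions `splitReasons` and `shouldPreempt` are hypothetical stand-ins invented for this example; they only imitate the type-switch on the failure reasons and are not the Kubernetes types.

```go
package main

import "fmt"

// failureReason is a hypothetical stand-in for a predicate failure reason.
type failureReason interface {
	GetReason() string
}

// insufficientResource models a resource shortage (illustrative only).
type insufficientResource struct {
	resource string
	missing  int64
}

func (e *insufficientResource) GetReason() string { return "Insufficient " + e.resource }

// otherFailure models any failure that is not about resources.
type otherFailure struct{ reason string }

func (e *otherFailure) GetReason() string { return e.reason }

// splitReasons separates resource shortages from every other kind of failure.
func splitReasons(reasons []failureReason) (resource []*insufficientResource, other []failureReason) {
	for _, r := range reasons {
		if ir, ok := r.(*insufficientResource); ok {
			resource = append(resource, ir)
		} else {
			other = append(other, r)
		}
	}
	return resource, other
}

// shouldPreempt reports whether preemption is attempted: the pod must be
// critical and every failure reason must be a resource shortage.
func shouldPreempt(isCritical bool, reasons []failureReason) bool {
	if !isCritical {
		return false
	}
	_, other := splitReasons(reasons)
	return len(other) == 0
}

func main() {
	cpu := &insufficientResource{resource: "cpu", missing: 500}
	port := &otherFailure{reason: "host port conflict"} // made-up reason text
	fmt.Println(shouldPreempt(true, []failureReason{cpu}))
	fmt.Println(shouldPreempt(true, []failureReason{cpu, port}))
	fmt.Println(shouldPreempt(false, []failureReason{cpu}))
}
```

The point of the split is that killing other Pods can only cure a resource shortage; a failure such as a port conflict would remain after eviction, so preempting for it would be wasted damage.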

evictPodsToFreeRequests implements the resource preemption logic of kubelet preemption. At its core, it calls getPodsToPreempt to pick suitable Pods to kill (podsToPreempt).

pkg/kubelet/preemption/preemption.go:121

// getPodsToPreempt returns a list of pods that could be preempted to free requests >= requirements
func getPodsToPreempt(pods []*v1.Pod, requirements admissionRequirementList) ([]*v1.Pod, error) {
    bestEffortPods, burstablePods, guaranteedPods := sortPodsByQOS(pods)

    // make sure that pods exist to reclaim the requirements
    unableToMeetRequirements := requirements.subtract(append(append(bestEffortPods, burstablePods...), guaranteedPods...)...)
    if len(unableToMeetRequirements) > 0 {
        return nil, fmt.Errorf("no set of running pods found to reclaim resources: %v", unableToMeetRequirements.toString())
    }
    // find the guaranteed pods we would need to evict if we already evicted ALL burstable and besteffort pods.
    guarateedToEvict, err := getPodsToPreemptByDistance(guaranteedPods, requirements.subtract(append(bestEffortPods, burstablePods...)...))
    if err != nil {
        return nil, err
    }
    // Find the burstable pods we would need to evict if we already evicted ALL besteffort pods, and the required guaranteed pods.
    burstableToEvict, err := getPodsToPreemptByDistance(burstablePods, requirements.subtract(append(bestEffortPods, guarateedToEvict...)...))
    if err != nil {
        return nil, err
    }
    // Find the besteffort pods we would need to evict if we already evicted the required guaranteed and burstable pods.
    bestEffortToEvict, err := getPodsToPreemptByDistance(bestEffortPods, requirements.subtract(append(burstableToEvict, guarateedToEvict...)...))
    if err != nil {
        return nil, err
    }
    return append(append(bestEffortToEvict, burstableToEvict...), guarateedToEvict...), nil
}

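When kubelet preemption picks the Pods to kill, the logic is as follows:

If the Pod's request quantity for some resource exceeds the combined request quantity for that resource of all current bestEffortPods, burstablePods, and guaranteedPods, podsToPreempt is nil, meaning no set of Pods can be released to make room.
If releasing all bestEffortPods and burstablePods is still not enough, guaranteedPods are selected (guarateedToEvict). The selection rules are:
- Rule 1: the fewer Pods released, the better;
- Rule 2: the fewer resources released, the better;
- Rule 1 takes precedence over Rule 2.
If releasing all bestEffortPods plus guarateedToEvict is still not enough, burstablePods are selected (burstableToEvict) by the same rules.
If releasing burstableToEvict plus guarateedToEvict is still not enough, bestEffortPods are selected (bestEffortToEvict) by the same rules.

In other words, Pods with a lower resource QoS level are preempted first, and within a single QoS level Pods are selected by the two rules above: release as few Pods as possible, and among equal choices release as few resources as possible.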
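To make the tier ordering concrete, here is a simplified sketch that follows the three-step structure of getPodsToPreempt for a single resource. It is an illustration under stated assumptions, not the kubelet's algorithm: the `pod` type, `pickWithinTier`, and `podsToPreempt` are hypothetical, only one resource (think CPU millicores) is modeled, and the per-tier choice is a simple greedy pick that prefers the Pod leaving the least uncovered, with ties going to the smaller request.

```go
package main

import (
	"errors"
	"fmt"
)

// pod is a hypothetical stand-in for a running pod with one resource request.
type pod struct {
	name    string
	request int64 // e.g. CPU millicores
}

func sum(pods []pod) int64 {
	var total int64
	for _, p := range pods {
		total += p.request
	}
	return total
}

// remaining returns how much of need is still uncovered after freeing `freed`.
func remaining(need, freed int64) int64 {
	if freed >= need {
		return 0
	}
	return need - freed
}

// pickWithinTier greedily picks pods from one QoS tier until need is covered.
func pickWithinTier(pods []pod, need int64) ([]pod, error) {
	candidates := append([]pod(nil), pods...) // copy: leave the caller's slice intact
	var picked []pod
	for need > 0 {
		if len(candidates) == 0 {
			return nil, errors.New("not enough pods in this tier")
		}
		best := 0
		for i, p := range candidates {
			left := remaining(need, p.request)
			bestLeft := remaining(need, candidates[best].request)
			if left < bestLeft || (left == bestLeft && p.request < candidates[best].request) {
				best = i
			}
		}
		need = remaining(need, candidates[best].request)
		picked = append(picked, candidates[best])
		candidates = append(candidates[:best], candidates[best+1:]...)
	}
	return picked, nil
}

// podsToPreempt touches a higher tier only for what the lower tiers cannot cover.
func podsToPreempt(bestEffort, burstable, guaranteed []pod, need int64) ([]pod, error) {
	if total := sum(bestEffort) + sum(burstable) + sum(guaranteed); total < need {
		return nil, fmt.Errorf("cannot free %d: running pods request only %d", need, total)
	}
	g, err := pickWithinTier(guaranteed, remaining(need, sum(bestEffort)+sum(burstable)))
	if err != nil {
		return nil, err
	}
	b, err := pickWithinTier(burstable, remaining(need, sum(bestEffort)+sum(g)))
	if err != nil {
		return nil, err
	}
	be, err := pickWithinTier(bestEffort, remaining(need, sum(b)+sum(g)))
	if err != nil {
		return nil, err
	}
	out := append([]pod{}, be...)
	out = append(out, b...)
	return append(out, g...), nil
}

func main() {
	bestEffort := []pod{{"be", 0}}
	burstable := []pod{{"b1", 200}, {"b2", 400}}
	guaranteed := []pod{{"g1", 300}}
	victims, err := podsToPreempt(bestEffort, burstable, guaranteed, 700)
	fmt.Println(victims, err)
}
```

Working from the most protected tier downward may look backwards, but it is what guarantees that a guaranteed Pod is only chosen for the portion that evicting every lower-tier Pod could not cover.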

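Special Handling of CriticalPods by the Priority Admission Controller

Let's first look at the special, system-reserved kinds of CriticalPod:

ClusterCriticalPod: a Pod whose PriorityClass name is system-cluster-critical.
NodeCriticalPod: a Pod whose PriorityClass name is system-node-critical.

If the Priority Admission Controller is enabled among the admission controllers, the Priority check performed at Pod creation also treats CriticalPods specially.

The main job of the Priority Admission Controller is to resolve the PriorityClassName specified in the Pod into the corresponding Spec.Priority value.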

plugin/pkg/admission/priority/admission.go:138

// admitPod makes sure a new pod does not set spec.Priority field. It also makes sure that the PriorityClassName exists if it is provided and resolves the pod priority from the PriorityClassName.
func (p *priorityPlugin) admitPod(a admission.Attributes) error {
    operation := a.GetOperation()
    pod, ok := a.GetObject().(*api.Pod)
    if !ok {
        return errors.NewBadRequest("resource was marked with kind Pod but was unable to be converted")
    }

    // Make sure that the client has not set `priority` at the time of pod creation.
    if operation == admission.Create && pod.Spec.Priority != nil {
        return admission.NewForbidden(a, fmt.Errorf("the integer value of priority must not be provided in pod spec. Priority admission controller populates the value from the given PriorityClass name"))
    }
    if utilfeature.DefaultFeatureGate.Enabled(features.PodPriority) {
        var priority int32
        // TODO: @ravig - This is for backwards compatibility to ensure that critical pods with annotations just work fine.
        // Remove when no longer needed.
        if len(pod.Spec.PriorityClassName) == 0 &&
            utilfeature.DefaultFeatureGate.Enabled(features.ExperimentalCriticalPodAnnotation) &&
            kubelettypes.IsCritical(a.GetNamespace(), pod.Annotations) {
            pod.Spec.PriorityClassName = scheduling.SystemClusterCritical
        }
        if len(pod.Spec.PriorityClassName) == 0 {
            var err error
            priority, err = p.getDefaultPriority()
            if err != nil {
                return fmt.Errorf("failed to get default priority class: %v", err)
            }
        } else {
            // Try resolving the priority class name.
            pc, err := p.lister.Get(pod.Spec.PriorityClassName)
            if err != nil {
                if errors.IsNotFound(err) {
                    return admission.NewForbidden(a, fmt.Errorf("no PriorityClass with name %v was found", pod.Spec.PriorityClassName))
                }

                return fmt.Errorf("failed to get PriorityClass with name %s: %v", pod.Spec.PriorityClassName, err)
            }

            priority = pc.Value
        }
        pod.Spec.Priority = &priority
    }
    return nil
}

When all of the following conditions hold, the Pod's Spec.PriorityClassName is set to system-cluster-critical, that is, the Pod is treated as a ClusterCriticalPod:

Both the ExperimentalCriticalPodAnnotation and PodPriority Feature Gates are enabled;
The Pod does not specify a PriorityClassName;
The Pod belongs to the kube-system namespace;
The Pod carries the scheduler.alpha.kubernetes.io/critical-pod="" annotation.
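The four conditions can be restated as a small standalone function. This is a sketch for illustration only: the `gates` struct, `isCritical`, and `effectivePriorityClassName` are hypothetical names, and the feature gates are modeled as plain booleans instead of the real feature-gate machinery. The namespace, annotation key, and class name are the ones quoted in the text above.

```go
package main

import "fmt"

const (
	criticalPodAnnotation = "scheduler.alpha.kubernetes.io/critical-pod"
	systemNamespace       = "kube-system"
	systemClusterCritical = "system-cluster-critical"
)

// gates is a hypothetical stand-in for the two feature gates involved.
type gates struct {
	podPriority        bool
	criticalAnnotation bool
}

// isCritical reports whether the pod is in kube-system and carries the
// critical-pod annotation with an empty value.
func isCritical(namespace string, annotations map[string]string) bool {
	if namespace != systemNamespace {
		return false
	}
	val, ok := annotations[criticalPodAnnotation]
	return ok && val == ""
}

// effectivePriorityClassName returns the class name the pod ends up with.
func effectivePriorityClassName(g gates, namespace, className string, annotations map[string]string) string {
	if !g.podPriority {
		return className // the priority logic is skipped entirely
	}
	if className == "" && g.criticalAnnotation && isCritical(namespace, annotations) {
		return systemClusterCritical
	}
	return className
}

func main() {
	g := gates{podPriority: true, criticalAnnotation: true}
	ann := map[string]string{criticalPodAnnotation: ""}
	fmt.Printf("%q\n", effectivePriorityClassName(g, "kube-system", "", ann))
	fmt.Printf("%q\n", effectivePriorityClassName(g, "default", "", ann))
	fmt.Printf("%q\n", effectivePriorityClassName(g, "kube-system", "my-class", ann))
}
```

Note that an explicitly set PriorityClassName always wins: the annotation path is a backward-compatibility fallback, as the TODO comment in admitPod indicates.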

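Summary

This article explained how the kubelet preempts resources for a CriticalPod during the Predicate Admit admission check, and how the Priority Admission Controller gives special treatment to a CriticalPod's PriorityClassName. The next article covers the last place where Kubernetes gives CriticalPods special treatment: the DaemonSet Controller.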
