
kube-scheduler source code analysis (3): preemption scheduling

Author: 可愛的雯雯子


Introduction to kube-scheduler

kube-scheduler is one of the core components of Kubernetes and is responsible for scheduling pod objects. Concretely, based on its scheduling algorithms (the predicate algorithms and the priority algorithms), it places each unscheduled pod onto the most suitable node.

kube-scheduler architecture diagram

The rough composition and processing flow of kube-scheduler is shown in the diagram below. kube-scheduler list/watches pods, nodes, and other objects. Through informers it puts unscheduled pods into the pending-pod queue and builds the scheduler cache (used to quickly fetch nodes and other objects). The core scheduling logic lives in the sched.scheduleOne method: it takes one pod from the pending queue, runs the predicate and priority algorithms, and selects the best node. If all of these steps succeed, it updates the cache and asynchronously performs the bind operation, i.e. sets the pod's nodeName field; on failure it enters the preemption logic. This completes the scheduling of one pod.

(figure: kube-scheduler architecture diagram)

Overview of kube-scheduler preemption scheduling

Priority and preemption address the question of what to do when a pod fails to schedule.

Normally, when a pod fails to schedule it is temporarily "shelved" in the Pending state; the scheduler only retries it after the pod is updated or the cluster state changes.

Sometimes, however, we want to rank pods, i.e. give them priorities. When a high-priority pod fails to schedule, instead of being shelved it can "squeeze out" some low-priority pods on a node, guaranteeing that the high-priority pod gets scheduled first.

For details on pod priority, see: https://kubernetes.io/zh/docs/concepts/scheduling-eviction/pod-priority-preemption/
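
As a quick illustration (all names below are hypothetical, not from the source), a PriorityClass and a pod referencing it might look like:

```yaml
# Hypothetical PriorityClass; a larger value means higher priority.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority          # hypothetical name
value: 1000000
globalDefault: false
description: "For latency-critical workloads."
---
apiVersion: v1
kind: Pod
metadata:
  name: important-app          # hypothetical name
spec:
  priorityClassName: high-priority
  containers:
  - name: app
    image: nginx               # placeholder image
```

If important-app cannot be scheduled, pods whose priority is lower than 1000000 become preemption candidates.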

Preemption is always triggered by a high-priority pod failing to schedule; we call that pod the "preemptor" and the evicted pods the "victims".

PDB overview

PDB is short for PodDisruptionBudget. It can be understood as the Kubernetes object that guarantees a minimum number of available replicas for Deployments, StatefulSets, and other controllers during disruptions.

For details, see:

https://kubernetes.io/zh/docs/concepts/workloads/pods/disruptions/

https://kubernetes.io/zh/docs/tasks/run-application/configure-pdb/
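
As a minimal sketch (names hypothetical), a PDB that keeps at least 2 matching pods available during voluntary disruptions, and therefore constrains which pods preemption may evict, might look like:

```yaml
apiVersion: policy/v1beta1     # PDB API group/version in the v1.17 era
kind: PodDisruptionBudget
metadata:
  name: app-pdb                # hypothetical name
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: important-app       # hypothetical label
```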

Enabling and disabling preemption scheduling

kube-scheduler's preemption feature is enabled by default.

In Kubernetes 1.15+, if NonPreemptingPriority is enabled (kube-scheduler startup flag --feature-gates=NonPreemptingPriority=true), a PriorityClass can set preemptionPolicy: Never, and pods of that PriorityClass will not run the preemption logic after a scheduling failure.
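
For example, a non-preempting PriorityClass might be declared as follows (name hypothetical):

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority-nonpreempting   # hypothetical name
value: 1000000
preemptionPolicy: Never   # pods of this class never trigger preemption
```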

In addition, in Kubernetes 1.11+, preemption can also be disabled through the kube-scheduler configuration file (note: it cannot be set through a command-line startup flag):

apiVersion: kubescheduler.config.k8s.io/v1alpha1
kind: KubeSchedulerConfiguration
...
disablePreemption: true

The configuration file is passed to kube-scheduler via the --config startup flag.

kube-scheduler startup flags reference: https://kubernetes.io/zh/docs/reference/command-line-tools-reference/kube-scheduler/

kube-scheduler configuration file reference: https://kubernetes.io/zh/docs/reference/scheduling/config/

The analysis of the kube-scheduler component is split into three parts:

(1) kube-scheduler initialization and startup;

(2) kube-scheduler core processing logic;

(3) kube-scheduler preemption scheduling logic.

This article covers the preemption scheduling logic.

3. kube-scheduler preemption scheduling logic

Based on tag v1.17.4:

https://github.com/kubernetes/kubernetes/releases/tag/v1.17.4

Entry point: scheduleOne

We take the scheduleOne method as the entry point for analyzing kube-scheduler preemption; here we only look at the preemption-related logic inside scheduleOne:

(1) call sched.Algorithm.Schedule to schedule the pod;

(2) after the pod fails to schedule, check sched.DisablePreemption to see whether this kube-scheduler has preemption disabled;

(3) if preemption is not disabled, call sched.preempt to run the preemption logic.

// pkg/scheduler/scheduler.go
func (sched *Scheduler) scheduleOne(ctx context.Context) {
    ...
    // schedule the pod
    scheduleResult, err := sched.Algorithm.Schedule(schedulingCycleCtx, state, pod)
	if err != nil {
		...
		if fitError, ok := err.(*core.FitError); ok {
		    // check whether preemption scheduling is disabled
			if sched.DisablePreemption {
				klog.V(3).Infof("Pod priority feature is not enabled or preemption is disabled by scheduler configuration." +
					" No preemption is performed.")
			} else {
				// preemption scheduling logic
				preemptionStartTime := time.Now()
				sched.preempt(schedulingCycleCtx, state, fwk, pod, fitError)
				...
			}
			...
	}
	...

sched.preempt

sched.preempt contains kube-scheduler's preemption handling logic. Main steps:

(1) call sched.Algorithm.Preempt to simulate the preemption, which returns the node the pod can preempt on, the list of victim pods, and the list of pods whose NominatedNodeName field should be cleared;

(2) call sched.podPreemptor.setNominatedNodeName, which asks the apiserver to record the chosen node's name in the pod's NominatedNodeName field; the pod then re-enters the pending-pod queue to wait for another scheduling attempt;

(3) iterate over the victim pods, deleting each one via the apiserver;

(4) iterate over the pods whose NominatedNodeName must be cleared, updating each one via the apiserver to remove the field.

Note: the preemption logic does not immediately schedule the failed pod onto a node. Based on the simulated preemption result it deletes the victim pods to free up the required resources, and leaves the failed pod to a later scheduling cycle.

// pkg/scheduler/scheduler.go
func (sched *Scheduler) preempt(ctx context.Context, state *framework.CycleState, fwk framework.Framework, preemptor *v1.Pod, scheduleErr error) (string, error) {
	...
	// (1) simulate preemption: returns the node the pod can preempt on, the victim pods, and the pods whose NominatedNodeName should be cleared
	node, victims, nominatedPodsToClear, err := sched.Algorithm.Preempt(ctx, state, preemptor, scheduleErr)
	if err != nil {
		klog.Errorf("Error preempting victims to make room for %v/%v: %v", preemptor.Namespace, preemptor.Name, err)
		return "", err
	}
	var nodeName = ""
	if node != nil {
		nodeName = node.Name
		
		sched.SchedulingQueue.UpdateNominatedPodForNode(preemptor, nodeName)

		// (2) ask the apiserver to record the chosen node's name in the pod's NominatedNodeName field; the pod then re-enters the pending queue to be scheduled again
		err = sched.podPreemptor.setNominatedNodeName(preemptor, nodeName)
		if err != nil {
			klog.Errorf("Error in preemption process. Cannot set 'NominatedPod' on pod %v/%v: %v", preemptor.Namespace, preemptor.Name, err)
			sched.SchedulingQueue.DeleteNominatedPodIfExists(preemptor)
			return "", err
		}
        
		// (3) iterate over the victim pods, deleting each one via the apiserver
		for _, victim := range victims {
			if err := sched.podPreemptor.deletePod(victim); err != nil {
				klog.Errorf("Error preempting pod %v/%v: %v", victim.Namespace, victim.Name, err)
				return "", err
			}
			// If the victim is a WaitingPod, send a reject message to the PermitPlugin
			if waitingPod := fwk.GetWaitingPod(victim.UID); waitingPod != nil {
				waitingPod.Reject("preempted")
			}
			sched.Recorder.Eventf(victim, preemptor, v1.EventTypeNormal, "Preempted", "Preempting", "Preempted by %v/%v on node %v", preemptor.Namespace, preemptor.Name, nodeName)

		}
		metrics.PreemptionVictims.Observe(float64(len(victims)))
	}
	// (4) iterate over the pods whose NominatedNodeName must be cleared, updating each one via the apiserver to remove the field
	for _, p := range nominatedPodsToClear {
		rErr := sched.podPreemptor.removeNominatedNodeName(p)
		if rErr != nil {
			klog.Errorf("Cannot remove 'NominatedPod' field of pod: %v", rErr)
			// We do not return as this error is not critical.
		}
	}
	return nodeName, err
}

sched.Algorithm.Preempt

The sched.Algorithm.Preempt method simulates the preemption process and returns the node the pod can preempt on, the list of victim pods, and the list of pods whose NominatedNodeName field should be cleared. Main steps:

(1) call nodesWherePreemptionMightHelp to find the nodes that failed the predicates but might satisfy the scheduling conditions once some pods are removed;

(2) list the PodDisruptionBudget objects, used later when filtering the candidate nodes (see the PDB references above for usage);

(3) call g.selectNodesForPreemption to select the nodes that can be preempted, together with the minimal set of victims on each;

(4) iterate over the scheduler-extenders (a webhook extension mechanism of kube-scheduler), running each extender's preemption handler to further filter the candidate nodes;

(5) call pickOneNodeForPreemption to pick one node from the candidates;

(6) call g.getLowerPriorityNominatedPods to get the pods nominated to the chosen node (non-empty NominatedNodeName) whose priority is lower than the preemptor's.

// pkg/scheduler/core/generic_scheduler.go
func (g *genericScheduler) Preempt(ctx context.Context, state *framework.CycleState, pod *v1.Pod, scheduleErr error) (*v1.Node, []*v1.Pod, []*v1.Pod, error) {
	// Scheduler may return various types of errors. Consider preemption only if
	// the error is of type FitError.
	fitError, ok := scheduleErr.(*FitError)
	if !ok || fitError == nil {
		return nil, nil, nil, nil
	}
	if !podEligibleToPreemptOthers(pod, g.nodeInfoSnapshot.NodeInfoMap, g.enableNonPreempting) {
		klog.V(5).Infof("Pod %v/%v is not eligible for more preemption.", pod.Namespace, pod.Name)
		return nil, nil, nil, nil
	}
	if len(g.nodeInfoSnapshot.NodeInfoMap) == 0 {
		return nil, nil, nil, ErrNoNodesAvailable
	}
	// (1) find nodes that failed the predicates but might fit the pod once some pods are removed
	potentialNodes := nodesWherePreemptionMightHelp(g.nodeInfoSnapshot.NodeInfoMap, fitError)
	if len(potentialNodes) == 0 {
		klog.V(3).Infof("Preemption will not help schedule pod %v/%v on any node.", pod.Namespace, pod.Name)
		// In this case, we should clean-up any existing nominated node name of the pod.
		return nil, nil, []*v1.Pod{pod}, nil
	}
	var (
		pdbs []*policy.PodDisruptionBudget
		err  error
	)
	// (2) list the PodDisruptionBudget objects, used later when filtering candidate nodes
	if g.pdbLister != nil {
		pdbs, err = g.pdbLister.List(labels.Everything())
		if err != nil {
			return nil, nil, nil, err
		}
	}
	// (3) select the nodes that can be preempted, with the minimal set of victims on each
	nodeToVictims, err := g.selectNodesForPreemption(ctx, state, pod, potentialNodes, pdbs)
	if err != nil {
		return nil, nil, nil, err
	}
	
	// (4) run each scheduler-extender's preemption handler (a webhook extension mechanism) to further filter the candidate nodes
	// We will only check nodeToVictims with extenders that support preemption.
	// Extenders which do not support preemption may later prevent preemptor from being scheduled on the nominated
	// node. In that case, scheduler will find a different host for the preemptor in subsequent scheduling cycles.
	nodeToVictims, err = g.processPreemptionWithExtenders(pod, nodeToVictims)
	if err != nil {
		return nil, nil, nil, err
	}
    
	// (5) pick one node from the candidates
	candidateNode := pickOneNodeForPreemption(nodeToVictims)
	if candidateNode == nil {
		return nil, nil, nil, nil
	}
    
	// (6) get the lower-priority pods nominated to this node, whose NominatedNodeName should be cleared
	// Lower priority pods nominated to run on this node, may no longer fit on
	// this node. So, we should remove their nomination. Removing their
	// nomination updates these pods and moves them to the active queue. It
	// lets scheduler find another place for them.
	nominatedPods := g.getLowerPriorityNominatedPods(pod, candidateNode.Name)
	if nodeInfo, ok := g.nodeInfoSnapshot.NodeInfoMap[candidateNode.Name]; ok {
		return nodeInfo.Node(), nodeToVictims[candidateNode].Pods, nominatedPods, nil
	}

	return nil, nil, nil, fmt.Errorf(
		"preemption failed: the target node %s has been deleted from scheduler cache",
		candidateNode.Name)
}

3.1 nodesWherePreemptionMightHelp

The nodesWherePreemptionMightHelp function returns the nodes that failed the predicates but might satisfy the scheduling conditions once some pods are removed.

How do we know whether a node that failed the predicates might fit after removing some pods? The key logic is the predicates.UnresolvablePredicateExists method.

// pkg/scheduler/core/generic_scheduler.go
func nodesWherePreemptionMightHelp(nodeNameToInfo map[string]*schedulernodeinfo.NodeInfo, fitErr *FitError) []*v1.Node {
	potentialNodes := []*v1.Node{}
	for name, node := range nodeNameToInfo {
		if fitErr.FilteredNodesStatuses[name].Code() == framework.UnschedulableAndUnresolvable {
			continue
		}
		failedPredicates := fitErr.FailedPredicates[name]

		// If we assume that scheduler looks at all nodes and populates the failedPredicateMap
		// (which is the case today), the !found case should never happen, but we'd prefer
		// to rely less on such assumptions in the code when checking does not impose
		// significant overhead.
		// Also, we currently assume all failures returned by extender as resolvable.
		if predicates.UnresolvablePredicateExists(failedPredicates) == nil {
			klog.V(3).Infof("Node %v is a potential node for preemption.", name)
			potentialNodes = append(potentialNodes, node.Node())
		}
	}
	return potentialNodes
}

3.1.1 predicates.UnresolvablePredicateExists

A node that failed the predicates might fit after removing some pods only when none of its failure reasons belongs to unresolvablePredicateFailureErrors.

unresolvablePredicateFailureErrors covers reasons such as node selector mismatch, pod affinity rules not matching, taints not tolerated, node in NotReady state, node under memory pressure, and so on.

// pkg/scheduler/algorithm/predicates/error.go
var unresolvablePredicateFailureErrors = map[PredicateFailureReason]struct{}{
	ErrNodeSelectorNotMatch:      {},
	ErrPodAffinityRulesNotMatch:  {},
	ErrPodNotMatchHostName:       {},
	ErrTaintsTolerationsNotMatch: {},
	ErrNodeLabelPresenceViolated: {},
	// Node conditions won't change when scheduler simulates removal of preemption victims.
	// So, it is pointless to try nodes that have not been able to host the pod due to node
	// conditions. These include ErrNodeNotReady, ErrNodeUnderPIDPressure, ErrNodeUnderMemoryPressure, ....
	ErrNodeNotReady:            {},
	ErrNodeNetworkUnavailable:  {},
	ErrNodeUnderDiskPressure:   {},
	ErrNodeUnderPIDPressure:    {},
	ErrNodeUnderMemoryPressure: {},
	ErrNodeUnschedulable:       {},
	ErrNodeUnknownCondition:    {},
	ErrVolumeZoneConflict:      {},
	ErrVolumeNodeConflict:      {},
	ErrVolumeBindConflict:      {},
}

// UnresolvablePredicateExists checks if there is at least one unresolvable predicate failure reason, if true
// returns the first one in the list.
func UnresolvablePredicateExists(reasons []PredicateFailureReason) PredicateFailureReason {
	for _, r := range reasons {
		if _, ok := unresolvablePredicateFailureErrors[r]; ok {
			return r
		}
	}
	return nil
}

3.2 g.selectNodesForPreemption

The g.selectNodesForPreemption method selects the nodes that can be preempted and returns the minimal set of victims on each node. Main steps:

(1) define the checkNode function, which calls g.selectVictimsOnNode; that method reports whether a node is suitable for preemption and returns the minimal set of victims on the node together with the number of victims whose eviction would violate a PDB;

(2) spin up 16 goroutines that call checkNode concurrently, checking each predicate-failed candidate node for suitability.

// pkg/scheduler/core/generic_scheduler.go
// selectNodesForPreemption finds all the nodes with possible victims for
// preemption in parallel.
func (g *genericScheduler) selectNodesForPreemption(
	ctx context.Context,
	state *framework.CycleState,
	pod *v1.Pod,
	potentialNodes []*v1.Node,
	pdbs []*policy.PodDisruptionBudget,
) (map[*v1.Node]*extenderv1.Victims, error) {
	nodeToVictims := map[*v1.Node]*extenderv1.Victims{}
	var resultLock sync.Mutex
    
	// (1) define the checkNode function
	// We can use the same metadata producer for all nodes.
	meta := g.predicateMetaProducer(pod, g.nodeInfoSnapshot)
	checkNode := func(i int) {
		nodeName := potentialNodes[i].Name
		if g.nodeInfoSnapshot.NodeInfoMap[nodeName] == nil {
			return
		}
		nodeInfoCopy := g.nodeInfoSnapshot.NodeInfoMap[nodeName].Clone()
		var metaCopy predicates.Metadata
		if meta != nil {
			metaCopy = meta.ShallowCopy()
		}
		stateCopy := state.Clone()
		stateCopy.Write(migration.PredicatesStateKey, &migration.PredicatesStateData{Reference: metaCopy})
		// selectVictimsOnNode reports whether the node can be preempted and returns the minimal set of victims plus the number of PDB violations
		pods, numPDBViolations, fits := g.selectVictimsOnNode(ctx, stateCopy, pod, metaCopy, nodeInfoCopy, pdbs)
		if fits {
			resultLock.Lock()
			victims := extenderv1.Victims{
				Pods:             pods,
				NumPDBViolations: int64(numPDBViolations),
			}
			nodeToVictims[potentialNodes[i]] = &victims
			resultLock.Unlock()
		}
	}
	// (2) run checkNode concurrently with 16 goroutines over the candidate nodes
	workqueue.ParallelizeUntil(context.TODO(), 16, len(potentialNodes), checkNode)
	return nodeToVictims, nil
}

3.2.1 g.selectVictimsOnNode

The g.selectVictimsOnNode method decides whether a node is suitable for preemption, and returns the minimal set of victims on the node together with the number of victims whose eviction would violate a PDB.

Main steps:

(1) first, tentatively remove from the node every pod whose priority is lower than the preemptor's, then run the predicates; if the preemptor still does not fit, the node is unsuitable for preemption and the method returns;

(2) sort the removed lower-priority pods by priority, highest priority first;

(3) split the sorted list into two lists according to whether evicting the pod would violate a PDB;

(4) walk the PDB-violating list first, trying to add each pod back ("reprieve" it); if the preemptor still fits with the pod restored, the pod is spared, otherwise it is removed again and recorded as a victim (counted as a PDB violation);

(5) then walk the non-violating list the same way, sparing every pod whose restoration still leaves the preemptor schedulable and keeping the rest as victims;

(6) finally, return the victim list, the number of PDB-violating victims, and true (the node is suitable for preemption).

Note: the removals above are not real deletions; they only simulate whether the preemptor would fit. The victims are actually deleted later, after the node to preempt has been chosen.

// pkg/scheduler/core/generic_scheduler.go
func (g *genericScheduler) selectVictimsOnNode(
	ctx context.Context,
	state *framework.CycleState,
	pod *v1.Pod,
	meta predicates.Metadata,
	nodeInfo *schedulernodeinfo.NodeInfo,
	pdbs []*policy.PodDisruptionBudget,
) ([]*v1.Pod, int, bool) {
	var potentialVictims []*v1.Pod

	removePod := func(rp *v1.Pod) error {
		if err := nodeInfo.RemovePod(rp); err != nil {
			return err
		}
		if meta != nil {
			if err := meta.RemovePod(rp, nodeInfo.Node()); err != nil {
				return err
			}
		}
		status := g.framework.RunPreFilterExtensionRemovePod(ctx, state, pod, rp, nodeInfo)
		if !status.IsSuccess() {
			return status.AsError()
		}
		return nil
	}
	addPod := func(ap *v1.Pod) error {
		nodeInfo.AddPod(ap)
		if meta != nil {
			if err := meta.AddPod(ap, nodeInfo.Node()); err != nil {
				return err
			}
		}
		status := g.framework.RunPreFilterExtensionAddPod(ctx, state, pod, ap, nodeInfo)
		if !status.IsSuccess() {
			return status.AsError()
		}
		return nil
	}
	// (1) tentatively remove all lower-priority pods; if the preemptor still does not fit afterwards, the node is unsuitable and we return
	// As the first step, remove all the lower priority pods from the node and
	// check if the given pod can be scheduled.
	podPriority := podutil.GetPodPriority(pod)
	for _, p := range nodeInfo.Pods() {
		if podutil.GetPodPriority(p) < podPriority {
			potentialVictims = append(potentialVictims, p)
			if err := removePod(p); err != nil {
				return nil, 0, false
			}
		}
	}
	// If the new pod does not fit after removing all the lower priority pods,
	// we are almost done and this node is not suitable for preemption. The only
	// condition that we could check is if the "pod" is failing to schedule due to
	// inter-pod affinity to one or more victims, but we have decided not to
	// support this case for performance reasons. Having affinity to lower
	// priority pods is not a recommended configuration anyway.
	if fits, _, _, err := g.podFitsOnNode(ctx, state, pod, meta, nodeInfo, false); !fits {
		if err != nil {
			klog.Warningf("Encountered error while selecting victims on node %v: %v", nodeInfo.Node().Name, err)
		}

		return nil, 0, false
	}
	var victims []*v1.Pod
	numViolatingVictim := 0
	// (2) sort the potential victims by priority, highest priority first
	sort.Slice(potentialVictims, func(i, j int) bool { return util.MoreImportantPod(potentialVictims[i], potentialVictims[j]) })
	// Try to reprieve as many pods as possible. We first try to reprieve the PDB
	// violating victims and then other non-violating ones. In both cases, we start
	// from the highest priority victims.
	// (3) split the sorted list into PDB-violating and non-violating victims
	violatingVictims, nonViolatingVictims := filterPodsWithPDBViolation(potentialVictims, pdbs)
	reprievePod := func(p *v1.Pod) (bool, error) {
		if err := addPod(p); err != nil {
			return false, err
		}
		fits, _, _, _ := g.podFitsOnNode(ctx, state, pod, meta, nodeInfo, false)
		if !fits {
			if err := removePod(p); err != nil {
				return false, err
			}
			victims = append(victims, p)
			klog.V(5).Infof("Pod %v/%v is a potential preemption victim on node %v.", p.Namespace, p.Name, nodeInfo.Node().Name)
		}
		return fits, nil
	}
	// (4) try to reprieve (add back) the PDB-violating victims first; a pod whose restoration stops the preemptor from fitting stays a victim and counts as a PDB violation
	for _, p := range violatingVictims {
		if fits, err := reprievePod(p); err != nil {
			klog.Warningf("Failed to reprieve pod %q: %v", p.Name, err)
			return nil, 0, false
		} else if !fits {
			numViolatingVictim++
		}
	}
	// (5) then try to reprieve the non-violating victims in the same way
	// Now we try to reprieve non-violating victims.
	for _, p := range nonViolatingVictims {
		if _, err := reprievePod(p); err != nil {
			klog.Warningf("Failed to reprieve pod %q: %v", p.Name, err)
			return nil, 0, false
		}
	}
	// (6) return the victims, the number of PDB-violating victims, and true (the node is suitable)
	return victims, numViolatingVictim, true
}

3.3 pickOneNodeForPreemption

The pickOneNodeForPreemption function picks one node from the list of candidate nodes. It applies the following rules in order, until some rule yields a single node:

(1) a node that needs no victims at all is chosen immediately;

(2) prefer the node(s) with the fewest victims whose eviction violates a PDB;

(3) prefer the node(s) whose highest-priority victim has the lowest priority;

(4) prefer the node(s) with the smallest sum of victim priorities;

(5) prefer the node(s) with the fewest victims;

(6) prefer the node whose victims' earliest start time is the latest, i.e. whose victims have been running for the shortest time;

(7) if there is still a tie, return the last node that satisfies all of the above.
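
The real implementation follows below. As a warm-up, here is a toy sketch of rules (1) and (2) using simplified, hypothetical types (not the kube-scheduler code); unlike the real function it returns a single node instead of carrying the tied set forward to the next rule:

```go
package main

import "fmt"

// victims summarizes the simulated preemption result for one node
// (a hypothetical, simplified stand-in for extenderv1.Victims).
type victims struct {
	pods             int   // number of pods that would be evicted
	numPDBViolations int64 // how many of those evictions violate a PDB
}

// pickNode applies rules (1) and (2): return a node that needs no
// victims immediately; otherwise keep the node with the fewest PDB
// violations seen so far.
func pickNode(nodeToVictims map[string]victims) string {
	best := ""
	bestViolations := int64(1) << 62
	for node, v := range nodeToVictims {
		if v.pods == 0 {
			return node // rule (1): no eviction needed at all
		}
		if v.numPDBViolations < bestViolations { // rule (2)
			bestViolations = v.numPDBViolations
			best = node
		}
	}
	return best
}

func main() {
	fmt.Println(pickNode(map[string]victims{
		"node-a": {pods: 2, numPDBViolations: 1},
		"node-b": {pods: 3, numPDBViolations: 0},
	}))
}
```

Here node-b is chosen: although it has more victims, evicting them violates no PDB, and the victim count only matters later, at rule (5).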

// pkg/scheduler/core/generic_scheduler.go
func pickOneNodeForPreemption(nodesToVictims map[*v1.Node]*extenderv1.Victims) *v1.Node {
	if len(nodesToVictims) == 0 {
		return nil
	}
	minNumPDBViolatingPods := int64(math.MaxInt32)
	var minNodes1 []*v1.Node
	lenNodes1 := 0
	for node, victims := range nodesToVictims {
		// (1) a node that needs no victims is chosen immediately
		if len(victims.Pods) == 0 {
			// We found a node that doesn't need any preemption. Return it!
			// This should happen rarely when one or more pods are terminated between
			// the time that scheduler tries to schedule the pod and the time that
			// preemption logic tries to find nodes for preemption.
			return node
		}
		// (2) prefer the node with the fewest PDB-violating victims
		numPDBViolatingPods := victims.NumPDBViolations
		if numPDBViolatingPods < minNumPDBViolatingPods {
			minNumPDBViolatingPods = numPDBViolatingPods
			minNodes1 = nil
			lenNodes1 = 0
		}
		if numPDBViolatingPods == minNumPDBViolatingPods {
			minNodes1 = append(minNodes1, node)
			lenNodes1++
		}
	}
	if lenNodes1 == 1 {
		return minNodes1[0]
	}
    
	// (3) prefer the node whose highest-priority victim has the lowest priority
	// There are more than one node with minimum number PDB violating pods. Find
	// the one with minimum highest priority victim.
	minHighestPriority := int32(math.MaxInt32)
	var minNodes2 = make([]*v1.Node, lenNodes1)
	lenNodes2 := 0
	for i := 0; i < lenNodes1; i++ {
		node := minNodes1[i]
		victims := nodesToVictims[node]
		// highestPodPriority is the highest priority among the victims on this node.
		highestPodPriority := podutil.GetPodPriority(victims.Pods[0])
		if highestPodPriority < minHighestPriority {
			minHighestPriority = highestPodPriority
			lenNodes2 = 0
		}
		if highestPodPriority == minHighestPriority {
			minNodes2[lenNodes2] = node
			lenNodes2++
		}
	}
	if lenNodes2 == 1 {
		return minNodes2[0]
	}
    
	// (4) prefer the node with the smallest sum of victim priorities
	// There are a few nodes with minimum highest priority victim. Find the
	// smallest sum of priorities.
	minSumPriorities := int64(math.MaxInt64)
	lenNodes1 = 0
	for i := 0; i < lenNodes2; i++ {
		var sumPriorities int64
		node := minNodes2[i]
		for _, pod := range nodesToVictims[node].Pods {
			// We add MaxInt32+1 to all priorities to make all of them >= 0. This is
			// needed so that a node with a few pods with negative priority is not
			// picked over a node with a smaller number of pods with the same negative
			// priority (and similar scenarios).
			sumPriorities += int64(podutil.GetPodPriority(pod)) + int64(math.MaxInt32+1)
		}
		if sumPriorities < minSumPriorities {
			minSumPriorities = sumPriorities
			lenNodes1 = 0
		}
		if sumPriorities == minSumPriorities {
			minNodes1[lenNodes1] = node
			lenNodes1++
		}
	}
	if lenNodes1 == 1 {
		return minNodes1[0]
	}
    
	// (5) prefer the node with the fewest victims
	// There are a few nodes with minimum highest priority victim and sum of priorities.
	// Find one with the minimum number of pods.
	minNumPods := math.MaxInt32
	lenNodes2 = 0
	for i := 0; i < lenNodes1; i++ {
		node := minNodes1[i]
		numPods := len(nodesToVictims[node].Pods)
		if numPods < minNumPods {
			minNumPods = numPods
			lenNodes2 = 0
		}
		if numPods == minNumPods {
			minNodes2[lenNodes2] = node
			lenNodes2++
		}
	}
	if lenNodes2 == 1 {
		return minNodes2[0]
	}
    
	// (6) prefer the node whose victims' earliest start time is latest (shortest running time)
	// There are a few nodes with same number of pods.
	// Find the node that satisfies latest(earliestStartTime(all highest-priority pods on node))
	latestStartTime := util.GetEarliestPodStartTime(nodesToVictims[minNodes2[0]])
	if latestStartTime == nil {
		// If the earliest start time of all pods on the 1st node is nil, just return it,
		// which is not expected to happen.
		klog.Errorf("earliestStartTime is nil for node %s. Should not reach here.", minNodes2[0])
		return minNodes2[0]
	}
	nodeToReturn := minNodes2[0]
	for i := 1; i < lenNodes2; i++ {
		node := minNodes2[i]
		// Get earliest start time of all pods on the current node.
		earliestStartTimeOnNode := util.GetEarliestPodStartTime(nodesToVictims[node])
		if earliestStartTimeOnNode == nil {
			klog.Errorf("earliestStartTime is nil for node %s. Should not reach here.", node)
			continue
		}
		if earliestStartTimeOnNode.After(latestStartTime.Time) {
			latestStartTime = earliestStartTimeOnNode
			nodeToReturn = node
		}
	}
    
	// (7) return the last node that satisfies all the above
	return nodeToReturn
}

Summary


Flow diagram of the kube-scheduler preemption logic

The flow diagram below shows the core processing steps of kube-scheduler's preemption logic. Before the preemption logic starts, the scheduler first checks whether the preemption feature is enabled.

(figure: kube-scheduler preemption logic flow diagram)

Original article: https://www.cnblogs.com/lianngkyle/p/16000742.html
