kube-scheduler source code analysis (3): preemption scheduling
Introduction to kube-scheduler
The kube-scheduler component is one of the core components of Kubernetes. It is responsible for scheduling pod objects: based on its scheduling algorithms (a filtering/predicates phase followed by a scoring/priorities phase), it assigns each unscheduled pod to the most suitable node.
kube-scheduler architecture diagram
The overall structure and processing flow of kube-scheduler are shown in the figure below. kube-scheduler lists/watches pods, nodes, and other objects; its informers place unscheduled pods into a scheduling queue and build the scheduler cache (used for fast access to nodes and other objects). The core scheduling logic lives in sched.scheduleOne: it takes one pod from the queue, runs the predicates and priorities algorithms, and selects the best node. If every step succeeds, it updates the cache and asynchronously performs the bind operation (i.e., sets the pod's nodeName field); if scheduling fails, it enters the preemption logic. That completes the scheduling of one pod.
![](https://img.laitimes.com/img/__Qf2AjLwojIjJCLyojI0JCLiMGc902byZ2P4YTOkFzNjVTZ1QmZ5kzN4MDM0QzNmFjZ2ETO2IjY0QzLcBza5QTcsJja2FXLp1ibj1ycvR3Lc5Wanlmcv9CXt92YucWbp9WYpRXdvRnL5A3Lc9CX6MHc0RHaiojIsJye.jpg)
Overview of kube-scheduler preemption
Priority and preemption address the question of what to do when a pod fails to schedule.
Normally, when a pod fails to schedule, it is temporarily "shelved" in the Pending state; the scheduler retries it only after the pod is updated or the cluster state changes.
Sometimes, however, we want to rank pods by priority. When a high-priority pod fails to schedule, instead of being shelved it can "evict" some lower-priority pods from a node, guaranteeing that the high-priority pod gets scheduled first.
For details on pod priority, see: https://kubernetes.io/zh/docs/concepts/scheduling-eviction/pod-priority-preemption/
Preemption is always triggered by a high-priority pod failing to schedule. We call that pod the "preemptor" and the pods it evicts the "victims".
PDB overview
PDB stands for PodDisruptionBudget. It can be understood as a Kubernetes object that limits voluntary disruptions so that a minimum number of replicas of a Deployment, StatefulSet, or other controller remain available in the cluster.
For details, see:
https://kubernetes.io/zh/docs/concepts/workloads/pods/disruptions/
https://kubernetes.io/zh/docs/tasks/run-application/configure-pdb/
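For illustration, a PDB that keeps at least two replicas of a hypothetical app available might look like the following sketch (the name and label are made up; policy/v1beta1 matches the v1.17 code analyzed here, policy/v1 from Kubernetes 1.21+):

```yaml
apiVersion: policy/v1beta1   # policy/v1 from Kubernetes 1.21+
kind: PodDisruptionBudget
metadata:
  name: example-pdb          # illustrative name
spec:
  minAvailable: 2            # keep at least 2 matching pods available
  selector:
    matchLabels:
      app: example           # illustrative label
```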
Enabling and disabling preemption
Preemption is enabled by default in kube-scheduler.
In Kubernetes 1.15+, if the NonPreemptingPriority feature gate is enabled (kube-scheduler startup flag --feature-gates=NonPreemptingPriority=true), a PriorityClass can set preemptionPolicy: Never; pods of that PriorityClass will then never trigger preemption when they fail to schedule.
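As an illustration, a non-preempting PriorityClass might look like the following sketch (the name, value, and description are illustrative):

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority-nonpreempting   # illustrative name
value: 1000000
preemptionPolicy: Never               # pods with this class never trigger preemption
globalDefault: false
description: "High priority, but will not preempt running pods."
```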
In addition, in Kubernetes 1.11+, preemption can be disabled through the kube-scheduler configuration file (note: it cannot be disabled via a command-line flag):
apiVersion: kubescheduler.config.k8s.io/v1alpha1
kind: KubeSchedulerConfiguration
...
disablePreemption: true
The configuration file is passed to kube-scheduler via the --config startup flag.
kube-scheduler startup flags: https://kubernetes.io/zh/docs/reference/command-line-tools-reference/kube-scheduler/
kube-scheduler configuration file: https://kubernetes.io/zh/docs/reference/scheduling/config/
The analysis of the kube-scheduler component is divided into three parts:
(1) kube-scheduler initialization and startup;
(2) kube-scheduler core scheduling logic;
(3) kube-scheduler preemption logic.
This post covers the preemption logic.
3. kube-scheduler preemption logic analysis
Based on tag v1.17.4:
https://github.com/kubernetes/kubernetes/releases/tag/v1.17.4
Entry point: scheduleOne
We use the scheduleOne method as the entry point for analyzing preemption. Only the preemption-related logic in scheduleOne is shown here:
(1) call sched.Algorithm.Schedule to schedule the pod;
(2) if scheduling fails, check sched.DisablePreemption to see whether preemption has been disabled;
(3) if preemption is enabled, call sched.preempt to run the preemption logic.
// pkg/scheduler/scheduler.go
func (sched *Scheduler) scheduleOne(ctx context.Context) {
...
// Schedule the pod
scheduleResult, err := sched.Algorithm.Schedule(schedulingCycleCtx, state, pod)
if err != nil {
...
if fitError, ok := err.(*core.FitError); ok {
// Check whether preemption has been disabled
if sched.DisablePreemption {
klog.V(3).Infof("Pod priority feature is not enabled or preemption is disabled by scheduler configuration." +
" No preemption is performed.")
} else {
// Preemption logic
preemptionStartTime := time.Now()
sched.preempt(schedulingCycleCtx, state, fwk, pod, fitError)
...
}
...
}
...
sched.preempt
sched.preempt contains kube-scheduler's preemption handling. Its main steps:
(1) call sched.Algorithm.Preempt to simulate preemption and return the node the pod can preempt onto, the list of victim pods, and the list of pods whose NominatedNodeName field needs to be cleared;
(2) call sched.podPreemptor.setNominatedNodeName to ask the apiserver to set the chosen node's name in the pod's NominatedNodeName field; the pod then re-enters the scheduling queue and waits to be scheduled again;
(3) iterate over the victim pods and ask the apiserver to delete each one;
(4) iterate over the pods whose NominatedNodeName needs clearing and ask the apiserver to update each pod, removing its NominatedNodeName field.
Note: the preemption logic does not immediately schedule the failed pod onto the node. Based on the simulated preemption result, it deletes the victim pods to free up the corresponding resources and leaves the failed pod to be handled in a subsequent scheduling cycle.
// pkg/scheduler/scheduler.go
func (sched *Scheduler) preempt(ctx context.Context, state *framework.CycleState, fwk framework.Framework, preemptor *v1.Pod, scheduleErr error) (string, error) {
...
// (1) simulate preemption; return the node the pod can preempt onto, the victim pods, and the pods whose NominatedNodeName needs to be cleared
node, victims, nominatedPodsToClear, err := sched.Algorithm.Preempt(ctx, state, preemptor, scheduleErr)
if err != nil {
klog.Errorf("Error preempting victims to make room for %v/%v: %v", preemptor.Namespace, preemptor.Name, err)
return "", err
}
var nodeName = ""
if node != nil {
nodeName = node.Name
sched.SchedulingQueue.UpdateNominatedPodForNode(preemptor, nodeName)
// (2) ask the apiserver to set the chosen node's name in the pod's NominatedNodeName field; the pod then re-enters the scheduling queue and waits to be scheduled again
err = sched.podPreemptor.setNominatedNodeName(preemptor, nodeName)
if err != nil {
klog.Errorf("Error in preemption process. Cannot set 'NominatedPod' on pod %v/%v: %v", preemptor.Namespace, preemptor.Name, err)
sched.SchedulingQueue.DeleteNominatedPodIfExists(preemptor)
return "", err
}
// (3) iterate over the victim pods and ask the apiserver to delete each one
for _, victim := range victims {
if err := sched.podPreemptor.deletePod(victim); err != nil {
klog.Errorf("Error preempting pod %v/%v: %v", victim.Namespace, victim.Name, err)
return "", err
}
// If the victim is a WaitingPod, send a reject message to the PermitPlugin
if waitingPod := fwk.GetWaitingPod(victim.UID); waitingPod != nil {
waitingPod.Reject("preempted")
}
sched.Recorder.Eventf(victim, preemptor, v1.EventTypeNormal, "Preempted", "Preempting", "Preempted by %v/%v on node %v", preemptor.Namespace, preemptor.Name, nodeName)
}
metrics.PreemptionVictims.Observe(float64(len(victims)))
}
// (4) iterate over the pods whose NominatedNodeName needs clearing and ask the apiserver to remove the field
for _, p := range nominatedPodsToClear {
rErr := sched.podPreemptor.removeNominatedNodeName(p)
if rErr != nil {
klog.Errorf("Cannot remove 'NominatedPod' field of pod: %v", rErr)
// We do not return as this error is not critical.
}
}
return nodeName, err
}
sched.Algorithm.Preempt
sched.Algorithm.Preempt simulates the preemption process and returns the node the pod can preempt onto, the list of victim pods, and the list of pods whose NominatedNodeName field needs to be cleared. Its main steps:
(1) call nodesWherePreemptionMightHelp to get the nodes that failed the predicates but might become schedulable after some pods are removed;
(2) list the PodDisruptionBudget objects, used later when selecting preemptible nodes;
(3) call g.selectNodesForPreemption to find the preemptible nodes and, for each, the minimal set of victim pods;
(4) iterate over the scheduler-extenders (kube-scheduler's webhook extension mechanism) and run each extender's preemption logic, filtering the list of preemptible nodes accordingly;
(5) call pickOneNodeForPreemption to pick one node from the preemptible nodes;
(6) call g.getLowerPriorityNominatedPods to get the pods nominated to the chosen node whose priority is lower than the preemptor's.
// pkg/scheduler/core/generic_scheduler.go
func (g *genericScheduler) Preempt(ctx context.Context, state *framework.CycleState, pod *v1.Pod, scheduleErr error) (*v1.Node, []*v1.Pod, []*v1.Pod, error) {
// Scheduler may return various types of errors. Consider preemption only if
// the error is of type FitError.
fitError, ok := scheduleErr.(*FitError)
if !ok || fitError == nil {
return nil, nil, nil, nil
}
if !podEligibleToPreemptOthers(pod, g.nodeInfoSnapshot.NodeInfoMap, g.enableNonPreempting) {
klog.V(5).Infof("Pod %v/%v is not eligible for more preemption.", pod.Namespace, pod.Name)
return nil, nil, nil, nil
}
if len(g.nodeInfoSnapshot.NodeInfoMap) == 0 {
return nil, nil, nil, ErrNoNodesAvailable
}
// (1) get the nodes that failed the predicates but might become schedulable after some pods are removed
potentialNodes := nodesWherePreemptionMightHelp(g.nodeInfoSnapshot.NodeInfoMap, fitError)
if len(potentialNodes) == 0 {
klog.V(3).Infof("Preemption will not help schedule pod %v/%v on any node.", pod.Namespace, pod.Name)
// In this case, we should clean-up any existing nominated node name of the pod.
return nil, nil, []*v1.Pod{pod}, nil
}
var (
pdbs []*policy.PodDisruptionBudget
err error
)
// (2) list the PodDisruptionBudget objects, used later when selecting preemptible nodes
if g.pdbLister != nil {
pdbs, err = g.pdbLister.List(labels.Everything())
if err != nil {
return nil, nil, nil, err
}
}
// (3) find the nodes that can be preempted
nodeToVictims, err := g.selectNodesForPreemption(ctx, state, pod, potentialNodes, pdbs)
if err != nil {
return nil, nil, nil, err
}
// (4) run each scheduler-extender's preemption logic and filter the preemptible nodes accordingly
// We will only check nodeToVictims with extenders that support preemption.
// Extenders which do not support preemption may later prevent preemptor from being scheduled on the nominated
// node. In that case, scheduler will find a different host for the preemptor in subsequent scheduling cycles.
nodeToVictims, err = g.processPreemptionWithExtenders(pod, nodeToVictims)
if err != nil {
return nil, nil, nil, err
}
// (5) pick one node from the preemptible nodes
candidateNode := pickOneNodeForPreemption(nodeToVictims)
if candidateNode == nil {
return nil, nil, nil, nil
}
// (6) get the pods nominated to the chosen node whose priority is lower than the preemptor's
// Lower priority pods nominated to run on this node, may no longer fit on
// this node. So, we should remove their nomination. Removing their
// nomination updates these pods and moves them to the active queue. It
// lets scheduler find another place for them.
nominatedPods := g.getLowerPriorityNominatedPods(pod, candidateNode.Name)
if nodeInfo, ok := g.nodeInfoSnapshot.NodeInfoMap[candidateNode.Name]; ok {
return nodeInfo.Node(), nodeToVictims[candidateNode].Pods, nominatedPods, nil
}
return nil, nil, nil, fmt.Errorf(
"preemption failed: the target node %s has been deleted from scheduler cache",
candidateNode.Name)
}
3.1 nodesWherePreemptionMightHelp
The nodesWherePreemptionMightHelp function returns the nodes that failed the predicates but might become schedulable after some pods are removed.
How do we decide that a node which failed the predicates might become schedulable once some pods are removed? The key logic is in predicates.UnresolvablePredicateExists.
// pkg/scheduler/core/generic_scheduler.go
func nodesWherePreemptionMightHelp(nodeNameToInfo map[string]*schedulernodeinfo.NodeInfo, fitErr *FitError) []*v1.Node {
potentialNodes := []*v1.Node{}
for name, node := range nodeNameToInfo {
if fitErr.FilteredNodesStatuses[name].Code() == framework.UnschedulableAndUnresolvable {
continue
}
failedPredicates := fitErr.FailedPredicates[name]
// If we assume that scheduler looks at all nodes and populates the failedPredicateMap
// (which is the case today), the !found case should never happen, but we'd prefer
// to rely less on such assumptions in the code when checking does not impose
// significant overhead.
// Also, we currently assume all failures returned by extender as resolvable.
if predicates.UnresolvablePredicateExists(failedPredicates) == nil {
klog.V(3).Infof("Node %v is a potential node for preemption.", name)
potentialNodes = append(potentialNodes, node.Node())
}
}
return potentialNodes
}
3.1.1 predicates.UnresolvablePredicateExists
A node that failed the predicates is considered potentially schedulable after removing some pods only if none of its failure reasons belong to unresolvablePredicateFailureErrors.
unresolvablePredicateFailureErrors includes reasons such as node selector mismatch, unsatisfied pod affinity rules, untolerated taints, the node being NotReady, the node being under memory pressure, and so on.
// pkg/scheduler/algorithm/predicates/error.go
var unresolvablePredicateFailureErrors = map[PredicateFailureReason]struct{}{
ErrNodeSelectorNotMatch: {},
ErrPodAffinityRulesNotMatch: {},
ErrPodNotMatchHostName: {},
ErrTaintsTolerationsNotMatch: {},
ErrNodeLabelPresenceViolated: {},
// Node conditions won't change when scheduler simulates removal of preemption victims.
// So, it is pointless to try nodes that have not been able to host the pod due to node
// conditions. These include ErrNodeNotReady, ErrNodeUnderPIDPressure, ErrNodeUnderMemoryPressure, ....
ErrNodeNotReady: {},
ErrNodeNetworkUnavailable: {},
ErrNodeUnderDiskPressure: {},
ErrNodeUnderPIDPressure: {},
ErrNodeUnderMemoryPressure: {},
ErrNodeUnschedulable: {},
ErrNodeUnknownCondition: {},
ErrVolumeZoneConflict: {},
ErrVolumeNodeConflict: {},
ErrVolumeBindConflict: {},
}
// UnresolvablePredicateExists checks if there is at least one unresolvable predicate failure reason, if true
// returns the first one in the list.
func UnresolvablePredicateExists(reasons []PredicateFailureReason) PredicateFailureReason {
for _, r := range reasons {
if _, ok := unresolvablePredicateFailureErrors[r]; ok {
return r
}
}
return nil
}
3.2 g.selectNodesForPreemption
The g.selectNodesForPreemption method finds the nodes that can be preempted and, for each, returns the minimal set of victim pods. Its main steps:
(1) define the checkNode function, which calls g.selectVictimsOnNode; that method reports whether a node is suitable for preemption and returns the node's minimal set of victim pods together with the number of victims whose eviction violates a PDB;
(2) spawn 16 goroutines that call checkNode concurrently, checking each node that failed the predicates for preemption suitability.
// pkg/scheduler/core/generic_scheduler.go
// selectNodesForPreemption finds all the nodes with possible victims for
// preemption in parallel.
func (g *genericScheduler) selectNodesForPreemption(
ctx context.Context,
state *framework.CycleState,
pod *v1.Pod,
potentialNodes []*v1.Node,
pdbs []*policy.PodDisruptionBudget,
) (map[*v1.Node]*extenderv1.Victims, error) {
nodeToVictims := map[*v1.Node]*extenderv1.Victims{}
var resultLock sync.Mutex
// (1) define the checkNode function
// We can use the same metadata producer for all nodes.
meta := g.predicateMetaProducer(pod, g.nodeInfoSnapshot)
checkNode := func(i int) {
nodeName := potentialNodes[i].Name
if g.nodeInfoSnapshot.NodeInfoMap[nodeName] == nil {
return
}
nodeInfoCopy := g.nodeInfoSnapshot.NodeInfoMap[nodeName].Clone()
var metaCopy predicates.Metadata
if meta != nil {
metaCopy = meta.ShallowCopy()
}
stateCopy := state.Clone()
stateCopy.Write(migration.PredicatesStateKey, &migration.PredicatesStateData{Reference: metaCopy})
// call g.selectVictimsOnNode, which reports whether the node is suitable for preemption and returns its minimal set of victim pods plus the number of victims whose eviction violates a PDB
pods, numPDBViolations, fits := g.selectVictimsOnNode(ctx, stateCopy, pod, metaCopy, nodeInfoCopy, pdbs)
if fits {
resultLock.Lock()
victims := extenderv1.Victims{
Pods: pods,
NumPDBViolations: int64(numPDBViolations),
}
nodeToVictims[potentialNodes[i]] = &victims
resultLock.Unlock()
}
}
// (2) spawn 16 goroutines that call checkNode concurrently, checking each node that failed the predicates for preemption suitability
workqueue.ParallelizeUntil(context.TODO(), 16, len(potentialNodes), checkNode)
return nodeToVictims, nil
}
3.2.1 g.selectVictimsOnNode
The g.selectVictimsOnNode method decides whether a node is suitable for preemption and returns the node's minimal set of victim pods together with the number of victims whose eviction violates a PDB.
Main steps:
(1) first, assume all pods on the node with lower priority than the preemptor are removed, then run the predicates to see whether the preemptor now fits on the node; if it still does not fit, the node is not suitable for preemption, so return immediately;
(2) sort all the lower-priority pods by importance, most important first;
(3) split the sorted list into two lists, depending on whether evicting the pod would violate a PDB;
(4) first walk the PDB-violating list and try to add each pod back ("reprieve" it); after each re-add, run the predicates to check whether the preemptor still fits; a pod that makes the preemptor stop fitting is removed again and recorded as a victim;
(5) then walk the non-violating list in the same way, re-adding each pod and keeping it only if the preemptor still fits, otherwise recording it as a victim;
(6) finally return the victim list, the number of PDB-violating victims, and that the node is suitable for preemption.
Note: the pod removals above are not real deletions; they only simulate whether the preemptor would fit. The victims are actually deleted later, after the node to preempt has been chosen.
// pkg/scheduler/core/generic_scheduler.go
func (g *genericScheduler) selectVictimsOnNode(
ctx context.Context,
state *framework.CycleState,
pod *v1.Pod,
meta predicates.Metadata,
nodeInfo *schedulernodeinfo.NodeInfo,
pdbs []*policy.PodDisruptionBudget,
) ([]*v1.Pod, int, bool) {
var potentialVictims []*v1.Pod
removePod := func(rp *v1.Pod) error {
if err := nodeInfo.RemovePod(rp); err != nil {
return err
}
if meta != nil {
if err := meta.RemovePod(rp, nodeInfo.Node()); err != nil {
return err
}
}
status := g.framework.RunPreFilterExtensionRemovePod(ctx, state, pod, rp, nodeInfo)
if !status.IsSuccess() {
return status.AsError()
}
return nil
}
addPod := func(ap *v1.Pod) error {
nodeInfo.AddPod(ap)
if meta != nil {
if err := meta.AddPod(ap, nodeInfo.Node()); err != nil {
return err
}
}
status := g.framework.RunPreFilterExtensionAddPod(ctx, state, pod, ap, nodeInfo)
if !status.IsSuccess() {
return status.AsError()
}
return nil
}
// (1) first, assume all pods on the node with lower priority than the preemptor are removed, then run the predicates; if the preemptor still does not fit, the node is not suitable for preemption
// As the first step, remove all the lower priority pods from the node and
// check if the given pod can be scheduled.
podPriority := podutil.GetPodPriority(pod)
for _, p := range nodeInfo.Pods() {
if podutil.GetPodPriority(p) < podPriority {
potentialVictims = append(potentialVictims, p)
if err := removePod(p); err != nil {
return nil, 0, false
}
}
}
// If the new pod does not fit after removing all the lower priority pods,
// we are almost done and this node is not suitable for preemption. The only
// condition that we could check is if the "pod" is failing to schedule due to
// inter-pod affinity to one or more victims, but we have decided not to
// support this case for performance reasons. Having affinity to lower
// priority pods is not a recommended configuration anyway.
if fits, _, _, err := g.podFitsOnNode(ctx, state, pod, meta, nodeInfo, false); !fits {
if err != nil {
klog.Warningf("Encountered error while selecting victims on node %v: %v", nodeInfo.Node().Name, err)
}
return nil, 0, false
}
var victims []*v1.Pod
numViolatingVictim := 0
// (2)将所有比搶占pod優先級低的pod按優先級高低進行排序,優先級最低的排在最前面;
sort.Slice(potentialVictims, func(i, j int) bool { return util.MoreImportantPod(potentialVictims[i], potentialVictims[j]) })
// Try to reprieve as many pods as possible. We first try to reprieve the PDB
// violating victims and then other non-violating ones. In both cases, we start
// from the highest priority victims.
// (3)将排好序的pod清單按是否定義了PDB分成兩個pod清單;
violatingVictims, nonViolatingVictims := filterPodsWithPDBViolation(potentialVictims, pdbs)
reprievePod := func(p *v1.Pod) (bool, error) {
if err := addPod(p); err != nil {
return false, err
}
fits, _, _, _ := g.podFitsOnNode(ctx, state, pod, meta, nodeInfo, false)
if !fits {
if err := removePod(p); err != nil {
return false, err
}
victims = append(victims, p)
klog.V(5).Infof("Pod %v/%v is a potential preemption victim on node %v.", p.Namespace, p.Name, nodeInfo.Node().Name)
}
return fits, nil
}
// (4) try to reprieve (add back) the PDB-violating pods first; a pod that makes the preemptor stop fitting is removed again and counted as a victim
for _, p := range violatingVictims {
if fits, err := reprievePod(p); err != nil {
klog.Warningf("Failed to reprieve pod %q: %v", p.Name, err)
return nil, 0, false
} else if !fits {
numViolatingVictim++
}
}
// (5) then try to reprieve the non-violating pods in the same way
// Now we try to reprieve non-violating victims.
for _, p := range nonViolatingVictims {
if _, err := reprievePod(p); err != nil {
klog.Warningf("Failed to reprieve pod %q: %v", p.Name, err)
return nil, 0, false
}
}
// (6) return the victim list, the number of PDB-violating victims, and that the node is suitable for preemption
return victims, numViolatingVictim, true
}
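As a rough, single-resource sketch of the remove-then-reprieve idea above (ignoring PDBs, predicate metadata, and the framework hooks; the pod type and selectVictims helper are hypothetical):

```go
package main

import (
	"fmt"
	"sort"
)

// pod is a hypothetical, minimal stand-in for the scheduler's view of a pod.
type pod struct {
	name     string
	priority int32
	cpu      int64 // requested CPU, in millicores
}

// selectVictims sketches the reprieve loop of selectVictimsOnNode for a
// single resource dimension: first assume every lower-priority pod is
// removed, then add pods back (most important first) while the preemptor
// still fits; pods that cannot be re-added become victims.
func selectVictims(preemptor pod, running []pod, capacity int64) ([]pod, bool) {
	used := int64(0)
	var potential []pod
	for _, p := range running {
		if p.priority < preemptor.priority {
			potential = append(potential, p)
		} else {
			used += p.cpu
		}
	}
	// Even with every lower-priority pod gone, the preemptor does not fit:
	// this node cannot be preempted.
	if used+preemptor.cpu > capacity {
		return nil, false
	}
	// Reprieve from the highest-priority potential victim down.
	sort.Slice(potential, func(i, j int) bool { return potential[i].priority > potential[j].priority })
	var victims []pod
	for _, p := range potential {
		if used+p.cpu+preemptor.cpu <= capacity {
			used += p.cpu // reprieved: the pod stays on the node
		} else {
			victims = append(victims, p)
		}
	}
	return victims, true
}

func main() {
	running := []pod{
		{"victim-low", 0, 500},
		{"victim-mid", 10, 500},
		{"system", 100, 500},
	}
	victims, fits := selectVictims(pod{"preemptor", 50, 600}, running, 1500)
	fmt.Println(fits, len(victims))
}
```

In the example, the two 500m pods below priority 50 cannot both be reprieved once the 600m preemptor is accounted for, so both end up as victims while the higher-priority "system" pod is untouched.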
3.3 pickOneNodeForPreemption
The pickOneNodeForPreemption function picks one node from the preemptible nodes. It applies the following rules in order, stopping as soon as a rule selects exactly one node:
(1) a node with no victims at all is chosen immediately;
(2) the node with the fewest PDB-violating victims;
(3) the node whose highest-priority victim has the lowest priority;
(4) the node with the smallest sum of victim priorities;
(5) the node with the fewest victims;
(6) the node whose highest-priority victims have the latest earliest start time (i.e., have been running for the shortest time);
(7) if several nodes still remain, return the first of them.
// pkg/scheduler/core/generic_scheduler.go
func pickOneNodeForPreemption(nodesToVictims map[*v1.Node]*extenderv1.Victims) *v1.Node {
if len(nodesToVictims) == 0 {
return nil
}
minNumPDBViolatingPods := int64(math.MaxInt32)
var minNodes1 []*v1.Node
lenNodes1 := 0
for node, victims := range nodesToVictims {
// (1) a node with no victims is chosen immediately
if len(victims.Pods) == 0 {
// We found a node that doesn't need any preemption. Return it!
// This should happen rarely when one or more pods are terminated between
// the time that scheduler tries to schedule the pod and the time that
// preemption logic tries to find nodes for preemption.
return node
}
// (2) the node with the fewest PDB-violating victims
numPDBViolatingPods := victims.NumPDBViolations
if numPDBViolatingPods < minNumPDBViolatingPods {
minNumPDBViolatingPods = numPDBViolatingPods
minNodes1 = nil
lenNodes1 = 0
}
if numPDBViolatingPods == minNumPDBViolatingPods {
minNodes1 = append(minNodes1, node)
lenNodes1++
}
}
if lenNodes1 == 1 {
return minNodes1[0]
}
// (3) the node whose highest-priority victim has the lowest priority
// There are more than one node with minimum number PDB violating pods. Find
// the one with minimum highest priority victim.
minHighestPriority := int32(math.MaxInt32)
var minNodes2 = make([]*v1.Node, lenNodes1)
lenNodes2 := 0
for i := 0; i < lenNodes1; i++ {
node := minNodes1[i]
victims := nodesToVictims[node]
// highestPodPriority is the highest priority among the victims on this node.
highestPodPriority := podutil.GetPodPriority(victims.Pods[0])
if highestPodPriority < minHighestPriority {
minHighestPriority = highestPodPriority
lenNodes2 = 0
}
if highestPodPriority == minHighestPriority {
minNodes2[lenNodes2] = node
lenNodes2++
}
}
if lenNodes2 == 1 {
return minNodes2[0]
}
// (4) the node with the smallest sum of victim priorities
// There are a few nodes with minimum highest priority victim. Find the
// smallest sum of priorities.
minSumPriorities := int64(math.MaxInt64)
lenNodes1 = 0
for i := 0; i < lenNodes2; i++ {
var sumPriorities int64
node := minNodes2[i]
for _, pod := range nodesToVictims[node].Pods {
// We add MaxInt32+1 to all priorities to make all of them >= 0. This is
// needed so that a node with a few pods with negative priority is not
// picked over a node with a smaller number of pods with the same negative
// priority (and similar scenarios).
sumPriorities += int64(podutil.GetPodPriority(pod)) + int64(math.MaxInt32+1)
}
if sumPriorities < minSumPriorities {
minSumPriorities = sumPriorities
lenNodes1 = 0
}
if sumPriorities == minSumPriorities {
minNodes1[lenNodes1] = node
lenNodes1++
}
}
if lenNodes1 == 1 {
return minNodes1[0]
}
// (5) the node with the fewest victims
// There are a few nodes with minimum highest priority victim and sum of priorities.
// Find one with the minimum number of pods.
minNumPods := math.MaxInt32
lenNodes2 = 0
for i := 0; i < lenNodes1; i++ {
node := minNodes1[i]
numPods := len(nodesToVictims[node].Pods)
if numPods < minNumPods {
minNumPods = numPods
lenNodes2 = 0
}
if numPods == minNumPods {
minNodes2[lenNodes2] = node
lenNodes2++
}
}
if lenNodes2 == 1 {
return minNodes2[0]
}
// (6) the node whose highest-priority victims started most recently
// There are a few nodes with same number of pods.
// Find the node that satisfies latest(earliestStartTime(all highest-priority pods on node))
latestStartTime := util.GetEarliestPodStartTime(nodesToVictims[minNodes2[0]])
if latestStartTime == nil {
// If the earliest start time of all pods on the 1st node is nil, just return it,
// which is not expected to happen.
klog.Errorf("earliestStartTime is nil for node %s. Should not reach here.", minNodes2[0])
return minNodes2[0]
}
nodeToReturn := minNodes2[0]
for i := 1; i < lenNodes2; i++ {
node := minNodes2[i]
// Get earliest start time of all pods on the current node.
earliestStartTimeOnNode := util.GetEarliestPodStartTime(nodesToVictims[node])
if earliestStartTimeOnNode == nil {
klog.Errorf("earliestStartTime is nil for node %s. Should not reach here.", node)
continue
}
if earliestStartTimeOnNode.After(latestStartTime.Time) {
latestStartTime = earliestStartTimeOnNode
nodeToReturn = node
}
}
// (7) return the node selected by the rules above
return nodeToReturn
}
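The cascading tie-break above can be sketched as successive min-filters over per-node victim statistics. The candidate type and selectByMin helper below are simplified assumptions covering only the first three rules:

```go
package main

import "fmt"

// candidate is a hypothetical, simplified view of a preemption target:
// just the per-node victim statistics the tie-breakers look at.
type candidate struct {
	name             string
	numPDBViolations int64
	highestPriority  int32
	numVictims       int
}

// selectByMin keeps only the candidates minimizing key, mirroring how
// pickOneNodeForPreemption repeatedly narrows minNodes1/minNodes2.
func selectByMin(cands []candidate, key func(candidate) int64) []candidate {
	best := key(cands[0])
	for _, c := range cands[1:] {
		if k := key(c); k < best {
			best = k
		}
	}
	var out []candidate
	for _, c := range cands {
		if key(c) == best {
			out = append(out, c)
		}
	}
	return out
}

func main() {
	cands := []candidate{
		{"node-a", 1, 100, 3},
		{"node-b", 0, 100, 2},
		{"node-c", 0, 50, 4},
	}
	// Apply the first three tie-breakers in order; stop once one node remains.
	criteria := []func(candidate) int64{
		func(c candidate) int64 { return c.numPDBViolations },       // fewest PDB violations
		func(c candidate) int64 { return int64(c.highestPriority) }, // lowest highest-priority victim
		func(c candidate) int64 { return int64(c.numVictims) },      // fewest victims
	}
	for _, crit := range criteria {
		cands = selectByMin(cands, crit)
		if len(cands) == 1 {
			break
		}
	}
	fmt.Println(cands[0].name)
}
```

Here node-a is eliminated by the PDB rule, and node-c wins over node-b because its highest-priority victim (50) is lower than node-b's (100), so the loop stops before the victim-count rule is needed.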
Summary
kube-scheduler preemption flow chart
The flow chart below shows the core steps of kube-scheduler's preemption logic. Before the preemption logic begins, the scheduler first checks whether preemption is enabled.
Original article: https://www.cnblogs.com/lianngkyle/p/16000742.html