1. Role of the Scheduler
The scheduler uses Kubernetes's watch mechanism to discover newly created Pods in the cluster that have not yet been scheduled onto a Node. The scheduler then schedules each unscheduled Pod it finds onto a suitable Node to run.
kube-scheduler is the default scheduler for Kubernetes clusters and is part of the cluster control plane. If you really want or need to, kube-scheduler is designed so that you can write your own scheduling component and replace the original kube-scheduler.
Factors considered when making scheduling decisions include: individual and overall resource requests, hardware/software/policy constraints, affinity and anti-affinity requirements, data locality, inter-workload interference, and so on.
Default policies: https://kubernetes.io/zh/docs/concepts/scheduling/kube-scheduler/
Scheduling framework: https://kubernetes.io/zh/docs/concepts/configuration/scheduling-framework/
2. nodeName
nodeName is the simplest form of node selection constraint, but it is generally not recommended.
If nodeName is specified in the PodSpec, it takes precedence over all other node selection methods.
Some limitations of using nodeName to select nodes:
If the named node does not exist, the pod will not run.
If the named node does not have enough resources to accommodate the pod, the pod fails to run.
Node names in cloud environments are not always predictable or stable.
Example
[root@server2 ~]# cd sduler/
[root@server2 sduler]# vim pod.yml
[root@server2 sduler]# cat pod.yml
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  containers:
  - name: nginx
    image: nginx
  nodeName: server3   ## schedule the pod to node server3
[root@server2 sduler]# kubectl apply -f pod.yml
pod/nginx created
[root@server2 sduler]# kubectl get pod -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
nginx 1/1 Running 0 13s 10.244.1.30 server3 <none> <none>
[root@server2 sduler]#
Test: if the named node does not have the resources to accommodate the pod, the pod fails to run.
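One way to reproduce that failure is to request more resources than the node can offer. Because nodeName bypasses the scheduler entirely, the kubelet on server3 itself rejects the pod (with a status such as OutOfcpu rather than Pending). The resource figures below are assumptions for illustration:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx-toobig
spec:
  containers:
  - name: nginx
    image: nginx
    resources:
      requests:
        cpu: "100"        # assumed to exceed what server3 can provide
        memory: "200Gi"
  nodeName: server3       # bypasses the scheduler entirely
```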
3. nodeSelector
nodeSelector is the simplest recommended form of node selection constraint.
Add a label to the chosen node:
kubectl label nodes server2 disktype=ssd
Then add a nodeSelector field to the pod configuration:
[root@server2 sduler]# kubectl label nodes server3 disktype=ssd ## add the label to node server3
node/server3 labeled
[root@server2 sduler]# kubectl get node --show-labels
NAME STATUS ROLES AGE VERSION LABELS
server2 Ready master 5d15h v1.18.5 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,disktype=ssd,kubernetes.io/arch=amd64,kubernetes.io/hostname=server2,kubernetes.io/os=linux,node-role.kubernetes.io/master=
server3 Ready <none> 5d15h v1.18.5 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,disktype=ssd,kubernetes.io/arch=amd64,kubernetes.io/hostname=server3,kubernetes.io/os=linux
server4 Ready <none> 5d15h v1.18.5 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=server4,kubernetes.io/os=linux
[root@server2 sduler]# vim pod.yml
[root@server2 sduler]# cat pod.yml
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  containers:
  - name: nginx
    image: nginx
  nodeSelector:   ## select a node by the label added above
    disktype: ssd
[root@server2 sduler]# kubectl apply -f pod.yml
pod/nginx created
[root@server2 sduler]# kubectl get pod -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
nginx 1/1 Running 0 19s 10.244.1.31 server3 <none> <none>
[root@server2 sduler]#
If no node matches the label, the pod stays in the Pending state.
4. Affinity and Anti-Affinity
nodeSelector provides a very simple way to constrain pods to nodes with specific labels.
The affinity/anti-affinity feature greatly expands the kinds of constraints you can express.
Rules can be marked as "soft"/"preferred" rather than hard requirements, so that if the scheduler cannot satisfy them, the pod is still scheduled.
You can constrain against labels on other pods running on a node, rather than against the node's own labels, to control which pods can or cannot be placed together.
Reference: https://kubernetes.io/zh/docs/concepts/configuration/assign-pod-node/
4.1 Node Affinity
requiredDuringSchedulingIgnoredDuringExecution: must be satisfied
preferredDuringSchedulingIgnoredDuringExecution: preferred, satisfied if possible
IgnoredDuringExecution means that if a Node's labels change while the Pod is running, so that the affinity rule is no longer satisfied, the Pod keeps running on that node.
nodeAffinity supports several operators for matching rules:
In: the label's value is in the list
NotIn: the label's value is not in the list
Gt: the label's value is greater than the given value (not supported for pod affinity)
Lt: the label's value is less than the given value (not supported for pod affinity)
Exists: the label exists
DoesNotExist: the label does not exist
Node affinity pod example one:
[root@server2 sduler]# vim pod.yml
[root@server2 sduler]# cat pod.yml
apiVersion: v1
kind: Pod
metadata:
  name: node-affinity
spec:
  containers:
  - name: nginx
    image: nginx
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: disktype
            operator: In
            values:
            - ssd
[root@server2 sduler]# kubectl apply -f pod.yml
pod/node-affinity created
[root@server2 sduler]# kubectl get pod -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
node-affinity 1/1 Running 0 10s 10.244.1.32 server3 <none> <none>
[root@server2 sduler]#
Example two:
[root@server2 sduler]# vim pod.yml
[root@server2 sduler]# cat pod.yml
apiVersion: v1
kind: Pod
metadata:
  name: node-affinity
spec:
  containers:
  - name: nginx
    image: nginx
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:   ## must be satisfied
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/hostname
            operator: NotIn
            values:
            - server1
      preferredDuringSchedulingIgnoredDuringExecution:   ## preferred
      - weight: 1
        preference:
          matchExpressions:
          - key: disktype
            operator: In
            values:
            - ssd
[root@server2 sduler]# kubectl apply -f pod.yml
pod/node-affinity created
[root@server2 sduler]# kubectl get pod -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
node-affinity 1/1 Running 0 5s 10.244.1.33 server3 <none> <none>
[root@server2 sduler]#
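The Gt and Lt operators compare a label's value as an integer. A sketch of how this might look, assuming the nodes carry a hypothetical cpu-count label (not set anywhere in this lesson):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: node-affinity-gt
spec:
  containers:
  - name: nginx
    image: nginx
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: cpu-count    # hypothetical label, e.g. kubectl label nodes server3 cpu-count=8
            operator: Gt      # only nodes whose cpu-count is greater than 4 qualify
            values:
            - "4"
```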
4.2 Pod Affinity and Anti-Affinity
podAffinity decides which pods a pod may be co-located with in the same topology domain (a topology domain is defined by node labels; it can be a single host, or a group of hosts forming a cluster, zone, and so on).
podAntiAffinity decides which pods a pod must not be co-located with in the same topology domain. Both deal with pod-to-pod relationships inside the Kubernetes cluster.
Inter-pod affinity and anti-affinity can be even more useful when combined with higher-level collections such as ReplicaSets, StatefulSets, and Deployments: you can easily configure a set of workloads that should be placed in the same defined topology (for example, on the same node).
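As a sketch of a higher-level use with a zone-wide topology domain rather than a single host: a cache Deployment can be kept in the same zone as the pods of an app it serves. The app=web label and the assumption that nodes carry the topology.kubernetes.io/zone label are illustrations, not part of this lesson's cluster:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cache
spec:
  replicas: 2
  selector:
    matchLabels:
      app: cache
  template:
    metadata:
      labels:
        app: cache
    spec:
      containers:
      - name: redis
        image: redis
      affinity:
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - web
            topologyKey: topology.kubernetes.io/zone   # co-locate per zone, not per host
```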
Pod affinity example:
apiVersion: v1
kind: Pod
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  containers:
  - name: nginx
    image: nginx
---
apiVersion: v1
kind: Pod
metadata:
  name: mysql
  labels:
    app: mysql
spec:
  containers:
  - name: mysql
    image: mysql
    env:
    - name: "MYSQL_ROOT_PASSWORD"
      value: "westos"
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - nginx
        topologyKey: kubernetes.io/hostname
[root@server2 sduler]# vim pod.yml
[root@server2 sduler]# kubectl apply -f pod.yml
pod/nginx created
pod/mysql created
[root@server2 sduler]# kubectl get pod -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
mysql 1/1 Running 0 25s 10.244.1.36 server3 <none> <none>
nginx 1/1 Running 0 25s 10.244.1.35 server3 <none> <none>
[root@server2 sduler]#
With this scheduling setup, the mysql pod follows the nginx pod: it is placed on the node running the pod that carries the matching label.
Pod anti-affinity example:
[root@server2 sduler]# vim pod.yml
[root@server2 sduler]# cat pod.yml
apiVersion: v1
kind: Pod
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  containers:
  - name: nginx
    image: nginx
  nodeName: server3
---
apiVersion: v1
kind: Pod
metadata:
  name: mysql
  labels:
    app: mysql
spec:
  containers:
  - name: mysql
    image: mysql:5.7
    env:
    - name: "MYSQL_ROOT_PASSWORD"
      value: "westos"
  affinity:
    podAntiAffinity:   ## anti-affinity
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - nginx
        topologyKey: "kubernetes.io/hostname"
[root@server2 sduler]# kubectl apply -f pod.yml
pod/nginx created
pod/mysql created
[root@server2 sduler]# kubectl get pod -o wide ## mysql and nginx are not on the same node
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
mysql 1/1 Running 0 14s 10.244.2.32 server4 <none> <none>
nginx 1/1 Running 0 14s 10.244.1.37 server3 <none> <none>
[root@server2 sduler]#
5. Taints (a Node Property) and Tolerations (a Pod Property)
5.1 Overview of Taints and Tolerations
NodeAffinity is a property defined on a Pod that lets us schedule the Pod onto a chosen Node. Taints are the opposite: they let a Node refuse to run Pods, or even evict them.
Taints are a property of a Node; once a taint is set, Kubernetes will not schedule Pods onto that Node. To get around this, Kubernetes gives Pods a Tolerations property: as long as a Pod tolerates the taints on a Node, Kubernetes ignores those taints and can (but is not required to) schedule the Pod there.
You can add a taint to a node with the kubectl taint command:
kubectl taint nodes node1 key=value:NoSchedule //create
kubectl describe nodes server1 |grep Taints //query
kubectl taint nodes node1 key:NoSchedule- //delete
The [effect] can be one of: [ NoSchedule | PreferNoSchedule | NoExecute ]
NoSchedule: Pods will not be scheduled onto a node marked with this taint.
PreferNoSchedule: the soft-policy version of NoSchedule.
NoExecute: once this taint takes effect, any Pods already running on the node that have no matching toleration are evicted immediately.
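For NoExecute taints, a toleration can also set tolerationSeconds to delay eviction rather than prevent it. A sketch (the key below is one of Kubernetes's built-in node condition taints, used here only as an example):

```yaml
tolerations:
- key: "node.kubernetes.io/unreachable"
  operator: "Exists"
  effect: "NoExecute"
  tolerationSeconds: 300   # evicted after 5 minutes instead of immediately
```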
[root@server2 sduler]# kubectl describe nodes server2 | grep Taints
Taints: node-role.kubernetes.io/master:NoSchedule
[root@server2 sduler]# kubectl describe nodes server3 | grep Taints
Taints: <none>
[root@server2 sduler]# kubectl describe nodes server4 | grep Taints
Taints: <none>
[root@server2 sduler]#
Example: deploying a myapp Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: deployment-v1
spec:
  replicas: 2
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
      - name: myapp
        image: myapp:v1
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - myapp
            topologyKey: kubernetes.io/hostname
[root@server2 sduler]# vim deployment.yml
[root@server2 sduler]# kubectl apply -f deployment.yml
deployment.apps/deployment-v1 created
[root@server2 sduler]# kubectl get pod -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
deployment-v1-6498765b4b-59ncg 1/1 Running 0 22s 10.244.1.39 server3 <none> <none>
deployment-v1-6498765b4b-rkpc5 1/1 Running 0 22s 10.244.2.34 server4 <none> <none>
mysql 1/1 Running 0 82m 10.244.2.32 server4 <none> <none>
nginx 1/1 Running 0 82m 10.244.1.37 server3 <none> <none>
[root@server2 sduler]#
5.2 Adding Taints and Configuring Tolerations
The key, value, and effect defined in tolerations must match the taint set on the node:
If operator is Exists, value can be omitted.
If operator is Equal, the key and value must both match exactly.
If operator is not specified, it defaults to Equal.
There are also two special cases:
An empty key combined with Exists matches every key and value, i.e. tolerates all taints.
An empty effect matches every effect.
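As a sketch, a toleration matching the taint created earlier with kubectl taint nodes node1 key=value:NoSchedule could be written in either form (both are illustrations of the rules above):

```yaml
tolerations:
- key: "key"             # Equal: key, value, and effect must all match the taint
  operator: "Equal"
  value: "value"
  effect: "NoSchedule"
# or, ignoring the taint's value:
- key: "key"
  operator: "Exists"     # Exists: value is omitted
  effect: "NoSchedule"
```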
Add a taint:
[root@server2 sduler]# kubectl taint node server3 node-role.kubernetes.io/master:NoSchedule ## taint node server3
node/server3 tainted
[root@server2 sduler]# kubectl describe nodes server3 |grep Taints
Taints: node-role.kubernetes.io/master:NoSchedule
[root@server2 sduler]# kubectl apply -f deployment.yml
deployment.apps/deployment-v1 created
[root@server2 sduler]# kubectl get pod -o wide ## one replica can no longer be placed on server3
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
deployment-v1-6498765b4b-ds7s7 1/1 Running 0 8s 10.244.2.36 server4 <none> <none>
deployment-v1-6498765b4b-vqcld 0/1 Pending 0 8s <none> <none> <none> <none>
With server3 tainted NoSchedule, the second replica (forced onto a different node by the anti-affinity rule) has nowhere to go and stays Pending.
Set a toleration in the PodSpec:
tolerations:
- operator: "Exists"
  effect: "NoSchedule"
After the toleration is set on the Pod, server3 can run Pods again.
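For context, the fragment above belongs under spec.template.spec of the Deployment. A complete sketch based on the earlier deployment.yml (the placement is the point here; the rest is unchanged):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: deployment-v1
spec:
  replicas: 2
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
      - name: myapp
        image: myapp:v1
      tolerations:           # tolerate any NoSchedule taint, regardless of key
      - operator: "Exists"
        effect: "NoSchedule"
```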
6. Commands That Affect Pod Scheduling
Other commands that affect Pod scheduling are cordon, drain, and delete. After any of them, newly created pods will not be scheduled onto the node, but the commands differ in how drastic they are.
cordon — stop scheduling:
The mildest option. It only marks the node SchedulingDisabled: newly created pods are not scheduled onto it, while the node's existing pods are unaffected and keep serving traffic normally.
[root@server2 sduler]# kubectl cordon server3
node/server3 cordoned
[root@server2 sduler]# kubectl get no
NAME STATUS ROLES AGE VERSION
server2 Ready master 5d18h v1.18.5
server3 Ready,SchedulingDisabled <none> 5d18h v1.18.5
server4 Ready <none> 5d18h v1.18.5
[root@server2 sduler]#
apiVersion: apps/v1
kind: Deployment
metadata:
  name: deployment-v1
spec:
  replicas: 3
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
      - name: myapp
        image: myapp:v1
[root@server2 sduler]# vim deployment.yml
[root@server2 sduler]# kubectl apply -f deployment.yml
deployment.apps/deployment-v1 created
[root@server2 sduler]# kubectl get pod -o wide ## nothing is scheduled onto server3
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
deployment-v1-7449b5b68f-5zvj6 1/1 Running 0 12s 10.244.2.38 server4 <none> <none>
deployment-v1-7449b5b68f-89bn5 1/1 Running 0 12s 10.244.2.40 server4 <none> <none>
deployment-v1-7449b5b68f-rpqb4 1/1 Running 0 12s 10.244.2.39 server4 <none> <none>
[root@server2 sduler]#
After applying the yaml file, none of the pods are scheduled onto the cordoned server3 node.
恢複server3節點的工作狀态
[root@server2 sduler]# kubectl uncordon server3
node/server3 uncordoned
[root@server2 sduler]# kubectl get no
NAME STATUS ROLES AGE VERSION
server2 Ready master 5d18h v1.18.5
server3 Ready <none> 5d18h v1.18.5
server4 Ready <none> 5d18h v1.18.5
[root@server2 sduler]#
drain — drain the node:
First evicts the pods on the node (they are recreated on other nodes), then marks the node SchedulingDisabled.
[root@server2 sduler]# kubectl drain server3 --ignore-daemonsets
node/server3 cordoned
WARNING: ignoring DaemonSet-managed Pods: kube-system/kube-flannel-ds-amd64-zx97k, kube-system/kube-proxy-l2cz5
evicting pod kube-system/coredns-bd97f9cd9-vzw6w
pod/coredns-bd97f9cd9-vzw6w evicted
node/server3 evicted
[root@server2 sduler]# kubectl get nodes
NAME STATUS ROLES AGE VERSION
server2 Ready master 5d18h v1.18.5
server3 Ready,SchedulingDisabled <none> 5d18h v1.18.5
server4 Ready <none> 5d18h v1.18.5
[root@server2 sduler]# kubectl get pod -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
deployment-v1-7449b5b68f-5zvj6 1/1 Running 0 7m31s 10.244.2.38 server4 <none> <none>
deployment-v1-7449b5b68f-89bn5 1/1 Running 0 7m31s 10.244.2.40 server4 <none> <none>
deployment-v1-7449b5b68f-rpqb4 1/1 Running 0 7m31s 10.244.2.39 server4 <none> <none>
[root@server2 sduler]#
恢複server3節點的工作狀态
delete — delete the node:
The most drastic option. The pods on the node are first evicted and recreated on other nodes; then the node is removed from the master, which loses control over it. To bring the node back into scheduling, log in to the node and restart the kubelet service so it re-registers with the cluster:
kubectl delete node server3 //on the master
systemctl restart kubelet //on server3, to re-register the node