
Linux CFS Scheduler

  • Learn the Completely Fair Scheduler (CFS)

1. Linux Scheduler

  From Linux’s first version in 1991 through the 2.4 kernel series, the Linux scheduler was simple in design. It was easy to understand, but scaled poorly in light of many runnable processes or many processors.

  During the 2.5 kernel development series, the O(1) scheduler solved the shortcomings of the previous Linux scheduler and introduced powerful new features and performance characteristics. By introducing a constant-time algorithm for timeslice calculation and per-processor runqueues, it rectified the design limitations of the earlier scheduler.

  However, the O(1) scheduler had several pathological failures related to scheduling latency-sensitive applications (interactive processes). Thus, although the O(1) scheduler was ideal for large server workloads, which lack interactive processes, it performed below par on desktop systems, where interactive applications are the raison d’être.

  Beginning in the 2.6 kernel series, developers introduced new process schedulers aimed at improving the interactive performance of the O(1) scheduler. The most notable of these was the Rotating Staircase Deadline scheduler, which introduced the concept of fair scheduling, borrowed from queuing theory, to Linux’s process scheduler. This concept was the inspiration for the O(1) scheduler’s eventual replacement in kernel version 2.6.23, the Completely Fair Scheduler (CFS).

2. Policy

  Policy is the behavior of the scheduler that determines what runs when.

2.1. I/O-Bound Versus Processor-Bound Processes

  Processes can be classified as either I/O-bound or processor-bound.

  • An I/O-bound process spends much of its time submitting and waiting on I/O requests. Such a process is runnable for only short durations, because it eventually blocks waiting on more I/O.
    • “I/O” means any type of blockable resource, such as keyboard input or network I/O, and not just disk I/O. Most graphical user interface (GUI) applications are I/O-bound, even if they never read from or write to the disk, because they spend most of their time waiting on user interaction via the keyboard and mouse.
  • Processor-bound processes spend much of their time executing code. They tend to run until they are preempted because they do not block on I/O requests very often. System response does not dictate that the scheduler run them often. A scheduler policy for processor-bound processes tends to run such processes less frequently but for longer durations.
    • Examples of processor-bound processes include: a program executing an infinite loop, ssh-keygen, MATLAB.

  These classifications are not mutually exclusive. Processes can exhibit both behaviors simultaneously:

  • The X Window server is both processor and I/O intense.
  • A word processor can be I/O-bound but dive into periods of intense processor action.

The scheduling policy in a system must attempt to satisfy two conflicting goals:

  • Fast process response time (low latency)
  • Maximal system utilization (high throughput)

Favoring I/O-bound over processor-bound

  Schedulers often employ complex algorithms to determine the most worthwhile process to run while not compromising fairness to other processes with lower priority.

  • The scheduler policy in Unix systems tends to explicitly favor I/O-bound processes, thus providing good process response time.
  • Linux, aiming to provide good interactive response and desktop performance, optimizes for process response (low latency), thus favoring I/O-bound processes over processor-bound processes. This is done in a creative manner that does not neglect processor-bound processes.

2.2. Process Priority

  Priority-based scheduling is a common type of scheduling algorithm, although Linux does not implement it in exactly this form. The idea is that processes with a higher priority run before those with a lower priority, while processes with the same priority are scheduled round-robin (one after the next, repeating). On some systems, processes with a higher priority also receive a longer timeslice. The runnable process with timeslice remaining and the highest priority always runs.

nice value and real-time priority

  The Linux kernel implements two separate priority ranges:

  • nice value (from –20 to +19 with a default of 0) is the standard priority range.
    • Processes with a lower nice value (higher priority) receive a larger proportion of the system’s processor, and vice versa.
    • In Linux, the nice value is a control over the proportion of timeslice.
    • The command “ps -el” lists processes with their nice values.
  • Real-time priority (configurable values that by default range from 0 to 99)
    • Higher real-time priority values correspond to a greater priority.
    • All real-time processes are at a higher priority than normal processes.
    • The command “ps -eo state,uid,pid,ppid,rtprio,time,comm” lists processes and their real-time priority. A value of “-” means the process is not real-time.

2.3. Timeslice

  The timeslice is the numeric value that represents how long a task can run until it is preempted.

  • Too long a timeslice causes the system to have poor interactive performance.
  • Too short a timeslice causes significant amounts of processor time to be wasted on the overhead of switching processes.

  The conflicting goals of I/O bound versus processor-bound processes:

  • I/O-bound processes do not need longer timeslices (although they do like to run often)
  • Processor-bound processes crave long timeslices (to keep their caches hot).

Timeslice on Linux

  Linux’s CFS scheduler does not directly assign timeslices to processes, but assigns processes a proportion of the processor. The amount of processor time that a process receives is a function of the load of the system. This assigned proportion is further affected by each process’s nice value. The nice value acts as a weight, changing the proportion of the processor time each process receives. Processes with higher nice values (a lower priority) receive a deflationary weight, yielding them a smaller proportion of the processor, and vice versa.

  With the CFS scheduler, whether a newly runnable process runs immediately (preempting the currently running process) is a function of how much of its proportion of the processor the newly runnable process has consumed. If it has consumed a smaller proportion of the processor than the currently executing process, it runs immediately.
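  The sketch below illustrates this preemption rule in isolation. The structure and field names are illustrative only, not the kernel's; CFS expresses the "proportion consumed" as a weighted virtual runtime, covered in section 3.2:

struct task_sketch {
    unsigned long long consumed_vruntime;   /* weighted CPU time consumed so far */
};

/* A newly woken task preempts the current one if it has consumed a
 * smaller share of the processor than the task currently running. */
static int should_preempt(const struct task_sketch *newly_runnable,
                          const struct task_sketch *currently_running)
{
    return newly_runnable->consumed_vruntime < currently_running->consumed_vruntime;
}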

2.4. Priority


2.4.1. User-Space Priority

  From the user-space point of view, process priority has two meanings: the nice value and the scheduling priority.

  • Normal processes: the nice value ranges over [-20, 19]; the smaller the value, the higher the priority. nice sets a process's initial nice value when it is launched, while renice adjusts the nice value of a process that is already running.
  • Real-time processes: the priority ranges over [1, 99]; the larger the value, the higher the priority. The valid range can be queried with sched_get_priority_min and sched_get_priority_max. Normal processes also have a scheduling priority, which is fixed at 0.

2.4.2. Kernel Priority

  The kernel uses a different scale for process priority, defining the range as [0, 139]: real-time tasks occupy [0, 99] and normal processes occupy [100, 139]. The smaller the value, the sooner the kernel schedules the task.

//include/linux/sched.h
struct task_struct {
    ...
    int             prio;          /* dynamic (effective) priority used by the scheduler */
    int             static_prio;   /* static priority, derived from the nice value */
    int             normal_prio;   /* normalized priority, derived from static_prio or rt_priority */
    unsigned int    rt_priority;   /* real-time priority, [0, 99] */
    unsigned int    policy;        /* scheduling policy (SCHED_NORMAL, SCHED_FIFO, ...) */
    ...
};

1. static_prio (static priority)

  nice() corresponds to the system call sys_nice(), which maps the value in [-20, 19] passed to nice() into a value in [100, 139]. The nice value and static_prio are related by the following mapping:

static_prio = nice + 20 + MAX_RT_PRIO

  • The range is 100-139 (the default is 120) and it applies to normal processes.
  • The smaller the value, the higher the process priority.
  • User space can change this value via nice() or setpriority() and read it via getpriority().
  • A newly created process inherits its parent's static priority.

The kernel defines two macros to implement this conversion:

//include/linux/sched/prio.h
#define MAX_NICE        19
#define MIN_NICE        -20
#define NICE_WIDTH      (MAX_NICE - MIN_NICE + 1)

#define MAX_USER_RT_PRIO    100
#define MAX_RT_PRIO         MAX_USER_RT_PRIO

#define MAX_PRIO        (MAX_RT_PRIO + NICE_WIDTH)
#define DEFAULT_PRIO    (MAX_RT_PRIO + NICE_WIDTH / 2)  /* default static priority: 120 */

/*
 * Convert user-nice values [ -20 ... 0 ... 19 ]
 * to static priority [ MAX_RT_PRIO..MAX_PRIO-1 ],
 * and back.
 */
#define NICE_TO_PRIO(nice)  ((nice) + DEFAULT_PRIO)
#define PRIO_TO_NICE(prio)  ((prio) - DEFAULT_PRIO)

2. rt_priority

  Its range is [0, 99]; the larger the value, the higher the priority. User space sets it via the sched_setscheduler() system call and reads it back via sched_getparam().
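  For illustration, here is a minimal user-space sketch that sets and reads back a real-time priority. The value 50 and the SCHED_FIFO policy are arbitrary example choices, and the call normally requires root or CAP_SYS_NICE:

#include <sched.h>
#include <stdio.h>

int main(void)
{
    struct sched_param param = { .sched_priority = 50 };  /* rt_priority in [1, 99] */

    /* Switch the calling process (pid 0) to the SCHED_FIFO real-time policy. */
    if (sched_setscheduler(0, SCHED_FIFO, &param) == -1) {
        perror("sched_setscheduler");
        return 1;
    }

    /* Read the real-time priority back. */
    if (sched_getparam(0, &param) == -1) {
        perror("sched_getparam");
        return 1;
    }
    printf("rt_priority = %d\n", param.sched_priority);
    return 0;
}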

3. normal_prio

  normal_prio is computed from static_prio or rt_priority. static_prio and rt_priority are the "static priorities" of normal and real-time processes respectively and represent a process's inherent attributes. Because a smaller value means a higher priority for one of them while a larger value means a higher priority for the other, normal_prio unifies the two into a single convention in which a smaller value always means a higher priority.

//kernel/sched/core.c
/*
 * __normal_prio - return the priority that is based on the static prio
 */
static inline int __normal_prio(struct task_struct *p)
{
    return p->static_prio;
}

/*
 * Calculate the expected normal priority: i.e. priority
 * without taking RT-inheritance into account. Might be
 * boosted by interactivity modifiers. Changes upon fork,
 * setprio syscalls, and whenever the interactivity
 * estimator recalculates.
 */
static inline int normal_prio(struct task_struct *p)
{
    int prio;

    if (task_has_rt_policy(p))
        prio = MAX_RT_PRIO-1 - p->rt_priority;
    else
        prio = __normal_prio(p);
    return prio;
}

In summary:

  • For a real-time process, normal_prio is converted so that a smaller value means a higher priority; its range remains [0, 99].
  • For a normal process, normal_prio is simply its static_prio.
4. prio (dynamic priority)

  prio is the process's effective priority, the value the scheduler actually consults when it compares process priorities in the kernel. Its range is [0, 139], and the smaller the value, the higher the priority. It is also called the "dynamic priority" and is computed by the function effective_prio().
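  For reference, the logic of effective_prio() in kernel/sched/core.c is roughly the following (lightly abridged): the normal priority is recomputed, and a task that currently holds a real-time priority (including one boosted to it) keeps its prio unchanged:

static int effective_prio(struct task_struct *p)
{
    p->normal_prio = normal_prio(p);
    /*
     * If we are RT tasks or we were boosted to RT priority,
     * keep the priority unchanged. Otherwise, update priority
     * to the normal priority:
     */
    if (!rt_prio(p->prio))
        return p->normal_prio;
    return p->prio;
}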

2.5. Priority-to-Weight Conversion Table

  CFS introduces the notion of weight; the weight represents a process's priority, and CPU time is divided among processes in proportion to their weights.

  For example, take two processes A and B, where A has weight 1024 and B has weight 2048. A's share of CPU time is 1024/(1024+2048) = 33.3%, and B's share is 2048/(1024+2048) = 66.7%. The larger the weight, the larger the allocated share of time, which amounts to a higher priority.

The actual run time a process receives within one scheduling period:

run time allocated to a process = scheduling period * process weight / total weight of all runnable processes
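  A minimal sketch of this division, reusing the weights from the example above and assuming a 6 ms scheduling period purely for illustration:

#include <stdio.h>

/* slice = period * weight / total weight of all runnable processes */
static unsigned long long slice_ns(unsigned long long period_ns,
                                   unsigned long weight,
                                   unsigned long total_weight)
{
    return period_ns * weight / total_weight;
}

int main(void)
{
    unsigned long wA = 1024, wB = 2048;       /* example weights from the text */
    unsigned long long period = 6000000ULL;   /* assumed 6 ms scheduling period */

    printf("A: %llu ns\n", slice_ns(period, wA, wA + wB));  /* 2000000 ns, 33.3% */
    printf("B: %llu ns\n", slice_ns(period, wB, wA + wB));  /* 4000000 ns, 66.7% */
    return 0;
}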

  Each time the nice value decreases by one, the weight increases and the process gains roughly 10% more CPU time; each time the nice value increases by one, the weight decreases and the process gives up roughly 10% of its CPU time. To implement this policy, the kernel converts priorities into weights and provides a priority-to-weight conversion table, sched_prio_to_weight. Besides the load weight itself, the kernel also stores a second value used when dividing by the load weight, the sched_prio_to_wmult array; the entries of the two arrays correspond one to one.

const int sched_prio_to_weight[40] = {
/* -20 */     88761,     71755,     56483,     46273,     36291,
/* -15 */     29154,     23254,     18705,     14949,     11916,
/* -10 */      9548,      7620,      6100,      4904,      3906,
/*  -5 */      3121,      2501,      1991,      1586,      1277,
/*   0 */      1024,       820,       655,       526,       423,
/*   5 */       335,       272,       215,       172,       137,
/*  10 */       110,        87,        70,        56,        45,
/*  15 */        36,        29,        23,        18,        15,
};

const u32 sched_prio_to_wmult[40] = {
/* -20 */     48388,     59856,     76040,     92818,    118348,
/* -15 */    147320,    184698,    229616,    287308,    360437,
/* -10 */    449829,    563644,    704093,    875809,   1099582,
/*  -5 */   1376151,   1717300,   2157191,   2708050,   3363326,
/*   0 */   4194304,   5237765,   6557202,   8165337,  10153587,
/*   5 */  12820798,  15790321,  19976592,  24970740,  31350126,
/*  10 */  39045157,  49367440,  61356676,  76695844,  95443717,
/*  15 */ 119304647, 148102320, 186737708, 238609294, 286331153,
};
           

The formula relating weight and nice:

weight ≈ 1024 / 1.25^nice   (each step in nice changes the weight by a factor of about 1.25)

For example:

  Two processes A and B both run at nice level 0, i.e. static priority 120, so they each get the same 50% share of the CPU. A nice-0 process has weight 1024 according to the table, so each process's share is 1024/(1024+1024) = 0.5, i.e. 50%.

  If process B's nice value is raised by 1 (lowering its priority) to nice = 1, its CPU share should drop by about 10%; in other words, process A should now receive roughly 55% of the CPU and process B roughly 45%. Raising the nice value by one reduces B's weight to 1024/1.25 ≈ 820, while A keeps weight 1024. A's share therefore becomes 1024/(1024+820) ≈ 0.55 and B's becomes 820/(1024+820) ≈ 0.45, which produces exactly the intended 10% difference.

The relationship between sched_prio_to_wmult and sched_prio_to_weight:

sched_prio_to_wmult[i] = 2^32 / sched_prio_to_weight[i]
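  The precomputed inverse weight lets the kernel replace a division by the weight with a multiplication and a shift. A minimal sketch of the idea (not the actual kernel helper, which handles overflow and other details more carefully):

#include <stdint.h>

#define NICE_0_LOAD 1024ULL

/* delta * NICE_0_LOAD / weight, computed as
 * delta * NICE_0_LOAD * (2^32 / weight) >> 32
 * using the precomputed inverse weight wmult = 2^32 / weight.
 * Overflow handling is omitted for clarity. */
static uint64_t scale_delta(uint64_t delta_exec_ns, uint32_t wmult)
{
    return (delta_exec_ns * NICE_0_LOAD * wmult) >> 32;
}

/* Example: scaling 1 ms by the weight of a nice-5 task
 * (weight 335, wmult 12820798) gives about 3.06 ms of weighted time. */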

3. Basic Principles of CFS

  The basic idea: if there are currently n processes that need to run, the scheduler should, within a relatively small window of time, schedule all n of them once and let them share the CPU time equally, which yields fair scheduling for every process. That relatively small window is the maximum scheduling delay of any runnable (R-state) process; in other words, any runnable process is guaranteed to be scheduled within that window. This window is called the scheduling period (sched_latency_ns). The more processes there are, the smaller each process's equal share of the period becomes.

  To judge whether a process has so far received too much or too little CPU time, the scheduler only needs to maintain, per process, an accumulated measure of CPU time consumed; this number is recorded in each process's vruntime. All runnable processes are kept, keyed by vruntime, in a queue implemented as a red-black tree, and the process picked to run each time is the one at the leftmost node of that tree, i.e., the process with the smallest vruntime. This guarantees relative fairness among all processes.

3.1. Scheduler Configuration Parameters

1. Scheduling latency

  The CFS scheduler's scheduling latency is not fixed. When the number of runnable processes in the system is below a threshold (default 8), the scheduling latency stays at a fixed value (default 6 ms). When the number of runnable processes exceeds that threshold, each process is instead guaranteed to run for at least a certain minimum amount of time before yielding the CPU. This minimum is called the minimum granularity; in the CFS defaults it is 0.75 ms, recorded in the variable sysctl_sched_min_granularity. The scheduling period is therefore a dynamically changing value, computed by the function __sched_period().

  • sched_latency_ns: This tuneable decides the scheduler period, the period in which all run-queue tasks are scheduled at least once.
  • sched_min_granularity_ns: This tuneable decides the minimum time a task will be allowed to run on the CPU before being preempted.

The scheduler period is computed in two cases:

  • number of runnable tasks <= sched_nr_latency

    scheduler period = sched_latency_ns

  • number of runnable tasks > sched_nr_latency

    scheduler period = number_of_running_tasks * sched_min_granularity_ns

//kernel/sched/fair.c
unsigned int sysctl_sched_latency            = 6000000ULL;
unsigned int normalized_sysctl_sched_latency = 6000000ULL;

unsigned int sysctl_sched_min_granularity            = 750000ULL;
unsigned int normalized_sysctl_sched_min_granularity = 750000ULL;

static unsigned int sched_nr_latency = 8;

static u64 __sched_period(unsigned long nr_running)
{
    if (unlikely(nr_running > sched_nr_latency))
        return nr_running * sysctl_sched_min_granularity;
    else
        return sysctl_sched_latency;
}

3.2. Vruntime

  Virtual run time is the weighted time a task has run on the CPU. vruntime is measured in nanoseconds.

The relationship between a process's actual run time and its virtual run time is:

virtual_runtime = wall_time (actual run time) * (NICE_0_LOAD / weight)

(NICE_0_LOAD = 1024, the weight of a nice-0 process)

As shown below, the larger a process's weight, the more slowly its vruntime grows for the same amount of actual run time.

[Figure: vruntime growth versus process weight for the same wall-clock run time]
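A minimal numerical sketch of the same relation, using weights from the table above:

#include <stdint.h>
#include <stdio.h>

#define NICE_0_LOAD 1024ULL

/* vruntime advances by the wall-clock delta scaled by NICE_0_LOAD / weight. */
static uint64_t vruntime_delta(uint64_t wall_ns, uint64_t weight)
{
    return wall_ns * NICE_0_LOAD / weight;
}

int main(void)
{
    /* Both tasks run 10 ms of wall-clock time; the heavier task accrues less vruntime. */
    printf("nice  0 (weight 1024): +%llu ns\n",
           (unsigned long long)vruntime_delta(10000000ULL, 1024));  /* 10,000,000 ns */
    printf("nice -5 (weight 3121): +%llu ns\n",
           (unsigned long long)vruntime_delta(10000000ULL, 3121));  /* ~3,280,999 ns */
    return 0;
}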

The virtual run time a process accumulates within one scheduling period is:

vruntime = (actual run time within one scheduling period) * NICE_0_LOAD / process weight

= (scheduling period * process weight / total weight of all runnable processes) * NICE_0_LOAD / process weight

= scheduling period * NICE_0_LOAD / total weight of all runnable processes

  As shown above, the vruntime a process accumulates within one scheduling period does not depend on that process's own weight, so every process accumulates the same amount of vruntime per period.

3.3. Run Queue (runqueue)

  Each CPU in the system has a global run queue (cpu runqueue), described by struct rq; it is a per-cpu structure, i.e., there is one struct rq per CPU. Each scheduling class also has a run queue of its own. For example, struct cfs_rq is the CFS scheduling class's run queue; it manages the runnable struct sched_entity scheduling entities, and the pick_next_task interface later selects from it the most suitable entity to run (the one with the smallest virtual run time). struct rt_rq is the real-time scheduler's run queue, and struct dl_rq is the Deadline scheduler's run queue.
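  A deliberately simplified sketch of the "always pick the smallest vruntime" rule follows. The kernel keeps struct sched_entity nodes in a red-black tree and caches the leftmost node; a sorted list is used here only to keep the illustration short, and all names are illustrative rather than the kernel's:

struct se_sketch {
    unsigned long long vruntime;
    struct se_sketch *next;              /* next entity in vruntime order */
};

struct cfs_rq_sketch {
    struct se_sketch *leftmost;          /* smallest vruntime; runs next */
};

/* Enqueue keeping the queue sorted by vruntime (analogue of the rb-tree insert). */
static void enqueue_entity_sketch(struct cfs_rq_sketch *rq, struct se_sketch *se)
{
    struct se_sketch **pos = &rq->leftmost;

    while (*pos && (*pos)->vruntime <= se->vruntime)
        pos = &(*pos)->next;
    se->next = *pos;
    *pos = se;
}

/* Analogue of pick_next_task for CFS: take the leftmost (smallest vruntime) entity. */
static struct se_sketch *pick_next_entity_sketch(struct cfs_rq_sketch *rq)
{
    struct se_sketch *se = rq->leftmost;

    if (se)
        rq->leftmost = se->next;
    return se;
}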

References:

https://notes.shichao.io/lkd/ch4/

http://www.wowotech.net/process_management/process-priority.html
