Fair warning: this is a long read~~
The scheduler is one of the most fundamental subsystems of a modern operating system, especially a multitasking OS. The system may run on a single-core or multi-core CPU, and processes may be running or sitting in memory in a runnable, waiting state. Letting many tasks share resources while still giving users prompt, interactive responses and sustaining highly concurrent workloads is a huge challenge for the design of a modern OS, and the Linux scheduler likewise has to meet these seemingly conflicting requirements and adapt to very different usage scenarios.
Linux is a complex modern operating system whose subsystems must cooperate closely to get work done efficiently. This article is organized around the scheduler subsystem: it introduces the core scheduling concepts and examines how they relate to the other kernel components involved, in particular the interrupt subsystem (hardirq and softirq) and the timers that scheduling depends on, giving a thorough, end-to-end picture of the scheduling-related concepts and how they fit together.
Since the author has recently been debugging PowerPC-based chips, the examples and kernel source excerpts below are taken from that architecture. The code is from the Linux 4.4 stable release; readers can follow along in the source.
1. Related concepts
To understand the scheduler subsystem we first need an overview of the scheduling flow; with that bird's-eye view in place, we can then drill into each part of the flow and fill in the details.
Early during boot the kernel registers its hardware interrupts, and the timer (tick) interrupt is one of the most important of them: scheduling requires constantly refreshing each process's accounting state and setting the rescheduling flag that decides whether the running process should be preempted, and the tick interrupt does exactly this, periodically. This brings up another design idea of modern OS scheduling, preemption; its counterpart is non-preemptive, or cooperative, multitasking, and the difference between the two is described below. The tick is a hardware interrupt, and Linux does not allow interrupt handlers to nest, so local interrupts are disabled (local_irq_disable) while it runs. To respond to other pending hardware events as quickly as possible, the handler must finish fast and re-enable interrupts, which is what leads to the interrupt "bottom half", i.e. the softirq mechanism. Likewise, many timers (timer/hrtimer) are armed during scheduling to carry out related work. When a scheduling decision is made, different scheduling policies apply depending on the kind of resources a process needs, which is why there are several scheduling classes implementing different algorithms for different workloads. This article therefore looks at scheduling from the angles of interrupts and softirqs, timers and high-resolution timers, preemptive versus cooperative scheduling, real-time versus normal process scheduling, and locks and concurrency, and ties these concepts together into a reading of the Linux kernel scheduler subsystem.
1.1 Preemptive
On a preemptive multitasking system the scheduler decides when a running process is stopped and switched out so that another process can run; this is called preemption. Before being preempted a process normally runs for a pre-assigned timeslice. The size of the timeslice is related to process priority and is determined by the scheduling class in use (scheduling classes are described later). The timer interrupt handler updates the process's runtime accounting (vruntime under CFS); if the process has used up its timeslice, the TIF_NEED_RESCHED flag is set in the current process's thread_info flags. At the next scheduling entry point, need_resched() tests this flag and, if it is set, the kernel enters the scheduler, switches out the current process and picks a new one to run. The scheduling entry points are covered in detail in a later section.
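For reference, here are the flag setter and the test involved, abridged from include/linux/sched.h of Linux 4.4 (the surrounding helpers are defined elsewhere in the same headers):
/* Abridged from include/linux/sched.h (Linux 4.4): marking a task for
 * rescheduling, and the test performed at every scheduling entry point. */
static inline void set_tsk_need_resched(struct task_struct *tsk)
{
	set_tsk_thread_flag(tsk, TIF_NEED_RESCHED);
}

static __always_inline bool need_resched(void)
{
	return unlikely(tif_need_resched());
}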
1.2 Cooperative
The defining property of non-preemptive, cooperative multitasking is that another process is scheduled only when the running process voluntarily decides to give up the CPU, which is called yielding. The scheduler has no global control over how long each process runs, and the biggest drawback is that a hung process can bring the whole system to a halt because nothing else can be scheduled. A process typically gives up the CPU and goes to sleep when it has to wait for a particular signal or event, entering the scheduler by calling schedule() itself.
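A minimal sketch of voluntary yielding in kernel code; the condition variable here is purely illustrative, not taken from any particular driver:
/* Illustrative only: sleep until a hypothetical condition becomes true,
 * voluntarily giving up the CPU via schedule() on each iteration. */
set_current_state(TASK_INTERRUPTIBLE);
while (!condition) {
	schedule();				/* yield the CPU */
	set_current_state(TASK_INTERRUPTIBLE);
}
__set_current_state(TASK_RUNNING);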
1.3 Nice
An ordinary process is assigned a nice value that determines its priority; user space can change it with the nice system call. Nice values range from -20 to +19. A process's timeslice is generally scaled by its nice value: the higher the nice value, the smaller the timeslice it tends to receive. Nice values can be inspected with ps -el. A handy mnemonic: a high nice value means being "nice" to other processes.
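From user space the nice value is usually changed through nice(2) or setpriority(2); a small illustrative program (not part of the kernel sources discussed here):
#include <stdio.h>
#include <unistd.h>
#include <sys/time.h>
#include <sys/resource.h>

int main(void)
{
	/* Be "nicer" to other processes: lower our priority by 10 levels. */
	int val = nice(10);

	printf("nice() returned %d\n", val);

	/* The same thing via setpriority/getpriority. */
	setpriority(PRIO_PROCESS, 0, 15);
	printf("getpriority() now reports %d\n", getpriority(PRIO_PROCESS, 0));
	return 0;
}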
1.4 Real-time priority
A process's real-time priority is a dimension separate from nice. It ranges from 0 to 99, larger values meaning higher priority; real-time processes generally take precedence over normal processes. ps -eo state,uid,pid,ppid,rtprio,time,comm shows the details: a '-' in the rtprio column means the process is not real-time, a number is its real-time priority.
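Real-time priority is normally set from user space with sched_setscheduler(2); a hedged example (requires root or CAP_SYS_NICE, and the priority value 50 is arbitrary):
#include <stdio.h>
#include <sched.h>

int main(void)
{
	/* Request SCHED_FIFO with real-time priority 50 (valid range 1..99). */
	struct sched_param sp = { .sched_priority = 50 };

	if (sched_setscheduler(0, SCHED_FIFO, &sp) == -1) {
		perror("sched_setscheduler");
		return 1;
	}
	printf("running as SCHED_FIFO, rtprio %d\n", sp.sched_priority);
	return 0;
}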
2. Scheduler types
Based on their resource demands, tasks can be divided into I/O-bound and processor-bound. I/O-bound should be read broadly — network devices, keyboard and mouse, and so on — where responsiveness matters but CPU usage is usually light. Processor-bound tasks, such as encryption/decryption or image processing, use the CPU intensively. Scheduling the two kinds differently gives a better experience: interactive work stays responsive while plenty of CPU is still left for compute-heavy processes. The scheduler therefore uses fairly elaborate algorithms to deliver both low latency and high throughput.
There are five scheduling classes (an abridged struct sched_class follows this list):
fair_sched_class: on current kernels this is CFS (the Completely Fair Scheduler), the main scheduler for normal tasks on Linux; its group-scheduling support is controlled by CONFIG_FAIR_GROUP_SCHED.
rt_sched_class: the real-time scheduling class; its group-scheduling support is controlled by CONFIG_RT_GROUP_SCHED.
dl_sched_class: the deadline class, a higher-priority member of the real-time family, used for tasks whose deadlines must be honoured.
stop_sched_class: the highest-priority class, itself a real-time type; the stopper tasks used to halt a CPU's normal work run under it (see the registration path in section 4.4).
idle_sched_class: the lowest priority, chosen only when the CPU has nothing else to run; at boot the only such task is the init task init_task, which marks itself as the idle process once initialization completes and does no further work.
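Each class implements a common set of hooks that the core scheduler calls; heavily abridged from kernel/sched/sched.h in 4.4, struct sched_class looks roughly like this (only the hooks discussed in this article are kept):
struct sched_class {
	const struct sched_class *next;	/* next lower-priority class */

	void (*enqueue_task)(struct rq *rq, struct task_struct *p, int flags);
	void (*dequeue_task)(struct rq *rq, struct task_struct *p, int flags);

	void (*check_preempt_curr)(struct rq *rq, struct task_struct *p, int flags);

	struct task_struct *(*pick_next_task)(struct rq *rq,
					      struct task_struct *prev);
	void (*put_prev_task)(struct rq *rq, struct task_struct *p);

	void (*task_tick)(struct rq *rq, struct task_struct *p, int queued);
	/* ... many more hooks omitted ... */
};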
3. Initialization of the scheduler subsystem
start_kernel() calls sched_init() to initialize the scheduler. It first allocates alloc_size bytes of memory and initializes root_task_group, the default task group to which every process belongs during boot; note that several members of root_task_group are per-CPU. After the per-CPU runqueues are set up, init_task is marked as the idle process. See the comments in the function below.
void __init sched_init(void)
{
	int i, j;
	unsigned long alloc_size = 0, ptr;

	/* Calculate the size to be allocated for the root_task_group items.
	 * Some items in struct task_group are per-cpu fields, so use
	 * nr_cpu_ids here.
	 */
#ifdef CONFIG_FAIR_GROUP_SCHED
	alloc_size += 2 * nr_cpu_ids * sizeof(void **);
#endif
#ifdef CONFIG_RT_GROUP_SCHED
	alloc_size += 2 * nr_cpu_ids * sizeof(void **);
#endif
	if (alloc_size) {
		/* allocate mem here. */
		ptr = (unsigned long)kzalloc(alloc_size, GFP_NOWAIT);

#ifdef CONFIG_FAIR_GROUP_SCHED
		root_task_group.se = (struct sched_entity **)ptr;
		ptr += nr_cpu_ids * sizeof(void **);

		root_task_group.cfs_rq = (struct cfs_rq **)ptr;
		ptr += nr_cpu_ids * sizeof(void **);
#endif /* CONFIG_FAIR_GROUP_SCHED */
#ifdef CONFIG_RT_GROUP_SCHED
		root_task_group.rt_se = (struct sched_rt_entity **)ptr;
		ptr += nr_cpu_ids * sizeof(void **);

		root_task_group.rt_rq = (struct rt_rq **)ptr;
		ptr += nr_cpu_ids * sizeof(void **);
#endif /* CONFIG_RT_GROUP_SCHED */
	}

#ifdef CONFIG_CPUMASK_OFFSTACK
	/* Use dynamic allocation for cpumask_var_t, instead of putting
	 * them on the stack.  This is a bit more expensive, but avoids
	 * stack overflow.  Allocate load_balance_mask for every cpu.
	 */
	for_each_possible_cpu(i) {
		per_cpu(load_balance_mask, i) = (cpumask_var_t)kzalloc_node(
			cpumask_size(), GFP_KERNEL, cpu_to_node(i));
	}
#endif /* CONFIG_CPUMASK_OFFSTACK */

	/* Init the real-time task group cpu time percentage.
	 * The hrtimer of def_rt_bandwidth is initialized here.
	 */
	init_rt_bandwidth(&def_rt_bandwidth,
			global_rt_period(), global_rt_runtime());
	/* Init the deadline task group cpu time percentage. */
	init_dl_bandwidth(&def_dl_bandwidth,
			global_rt_period(), global_rt_runtime());

#ifdef CONFIG_SMP
	/* Initialize the default root scheduling domain.  A sched domain
	 * contains one or more CPUs; load balancing runs within a domain,
	 * and domains are isolated from one another. */
	init_defrootdomain();
#endif

#ifdef CONFIG_RT_GROUP_SCHED
	init_rt_bandwidth(&root_task_group.rt_bandwidth,
			global_rt_period(), global_rt_runtime());
#endif /* CONFIG_RT_GROUP_SCHED */

#ifdef CONFIG_CGROUP_SCHED
	/* Add the allocated and initialized root_task_group to the global
	 * task_groups list. */
	list_add(&root_task_group.list, &task_groups);
	INIT_LIST_HEAD(&root_task_group.children);
	INIT_LIST_HEAD(&root_task_group.siblings);
	/* Initialize autogrouping. */
	autogroup_init(&init_task);
#endif /* CONFIG_CGROUP_SCHED */

	/* Walk every CPU's runqueue and initialize it. */
	for_each_possible_cpu(i) {
		struct rq *rq;

		rq = cpu_rq(i);
		raw_spin_lock_init(&rq->lock);
		/* Number of schedulable entities (sched_entity) on this runqueue. */
		rq->nr_running = 0;
		/* CPU load accounting. */
		rq->calc_load_active = 0;
		/* Time of the next load update. */
		rq->calc_load_update = jiffies + LOAD_FREQ;
		/* Initialize the cfs, rt and dl runqueues of this CPU. */
		init_cfs_rq(&rq->cfs);
		init_rt_rq(&rq->rt);
		init_dl_rq(&rq->dl);
#ifdef CONFIG_FAIR_GROUP_SCHED
		/* Total CPU share of the root task group. */
		root_task_group.shares = ROOT_TASK_GROUP_LOAD;
		INIT_LIST_HEAD(&rq->leaf_cfs_rq_list);
		/*
		 * How much cpu bandwidth does root_task_group get?
		 *
		 * In case of task-groups formed thr' the cgroup filesystem, it
		 * gets 100% of the cpu resources in the system. This overall
		 * system cpu resource is divided among the tasks of
		 * root_task_group and its child task-groups in a fair manner,
		 * based on each entity's (task or task-group's) weight
		 * (se->load.weight).
		 *
		 * In other words, if root_task_group has 10 tasks of weight
		 * 1024) and two child groups A0 and A1 (of weight 1024 each),
		 * then A0's share of the cpu resource is:
		 *
		 *	A0's bandwidth = 1024 / (10*1024 + 1024 + 1024) = 8.33%
		 *
		 * We achieve this by letting root_task_group's tasks sit
		 * directly in rq->cfs (i.e root_task_group->se[] = NULL).
		 */
		/* Initialize cfs_bandwidth (the CPU bandwidth available to
		 * normal tasks) and the hrtimers used by the class. */
		init_cfs_bandwidth(&root_task_group.cfs_bandwidth);
		/* Make root_task_group the task_group of this CPU's cfs_rq,
		 * point cfs_rq->rq back at this runqueue, record this CPU's
		 * cfs_rq in root_task_group; the sched_entity se is NULL. */
		init_tg_cfs_entry(&root_task_group, &rq->cfs, NULL, i, NULL);
#endif /* CONFIG_FAIR_GROUP_SCHED */

		rq->rt.rt_runtime = def_rt_bandwidth.rt_runtime;
#ifdef CONFIG_RT_GROUP_SCHED
		/* Same kind of cross-linking as init_tg_cfs_entry above. */
		init_tg_rt_entry(&root_task_group, &rq->rt, NULL, i, NULL);
#endif

		/* Initialize the load history kept by this runqueue. */
		for (j = 0; j < CPU_LOAD_IDX_MAX; j++)
			rq->cpu_load[j] = 0;
		/* Time this runqueue's CPU load was last updated. */
		rq->last_load_update_tick = jiffies;

#ifdef CONFIG_SMP
		/* Initialize the load-balancing related fields. */
		rq->sd = NULL;
		rq->rd = NULL;
		rq->cpu_capacity = rq->cpu_capacity_orig = SCHED_CAPACITY_SCALE;
		rq->balance_callback = NULL;
		rq->active_balance = 0;
		rq->next_balance = jiffies;
		rq->push_cpu = 0;
		rq->cpu = i;
		rq->online = 0;
		rq->idle_stamp = 0;
		rq->avg_idle = 2*sysctl_sched_migration_cost;
		rq->max_idle_balance_cost = sysctl_sched_migration_cost;

		INIT_LIST_HEAD(&rq->cfs_tasks);

		/* Attach this runqueue to the default root domain. */
		rq_attach_root(rq, &def_root_domain);
#ifdef CONFIG_NO_HZ_COMMON
		/* Dynamic-tick flags; the dynamic tick is not in use yet. */
		rq->nohz_flags = 0;
#endif
#ifdef CONFIG_NO_HZ_FULL
		/* Records the time of the last scheduler tick. */
		rq->last_sched_tick = 0;
#endif
#endif /* CONFIG_SMP */
		/* Initialize the runqueue hrtick; it is not active yet. */
		init_rq_hrtick(rq);
		atomic_set(&rq->nr_iowait, 0);
	}

	/* Set the load weight of the init task. */
	set_load_weight(&init_task);

#ifdef CONFIG_PREEMPT_NOTIFIERS
	/* Initialize init_task's preemption notifier list. */
	INIT_HLIST_HEAD(&init_task.preempt_notifiers);
#endif

	/*
	 * The boot idle thread does lazy MMU switching as well:
	 */
	atomic_inc(&init_mm.mm_count);
	enter_lazy_tlb(&init_mm, current);

	/*
	 * During early bootup we pretend to be a normal task:
	 */
	/* The init task uses the fair scheduling class for now. */
	current->sched_class = &fair_sched_class;

	/*
	 * Make us the idle thread. Technically, schedule() should not be
	 * called from this thread, however somewhere below it might be,
	 * but because we are the idle thread, we just pick up running again
	 * when this runqueue becomes "idle".
	 */
	/* Turn the current task into the idle task: reinitialize its state
	 * and switch its scheduling class to the idle class. */
	init_idle(current, smp_processor_id());

	calc_load_update = jiffies + LOAD_FREQ;

#ifdef CONFIG_SMP
	zalloc_cpumask_var(&sched_domains_tmpmask, GFP_NOWAIT);
	/* May be allocated at isolcpus cmdline parse time */
	if (cpu_isolated_map == NULL)
		zalloc_cpumask_var(&cpu_isolated_map, GFP_NOWAIT);
	idle_thread_set_boot_cpu();
	set_cpu_rq_start_time();
#endif
	/* Initialize the fair scheduling class.  In practice this registers
	 * run_rebalance_domains as the handler of the SCHED_SOFTIRQ softirq,
	 * which performs load balancing.  The open question: when is
	 * SCHED_SOFTIRQ actually raised? */
	init_sched_fair_class();

	/* Mark the scheduler as running.  At this point the system only has
	 * init_task, which is now the idle task, and the timer has not been
	 * started yet, so no other task can be scheduled; we simply return
	 * to start_kernel and carry on with initialization. */
	scheduler_running = 1;
}
After sched_init() returns, execution continues in start_kernel(); the scheduling-related steps that follow are:
init_IRQ
This function sets up the IRQ stacks and the bookkeeping for all software and hardware interrupts in the system. The tick interrupt, the driving force behind scheduling, consists of a hardware interrupt and a softirq bottom half, and both are initialized here. Interrupts are covered in detail in a later chapter; for now it is enough to know where this step sits in the overall initialization flow and what it accomplishes.
init_timers
This initializes the timer subsystem and registers run_timer_softirq as the callback of the TIMER_SOFTIRQ softirq (softirqs themselves are covered at the end of this article). Given that the softirq is registered here, where does it get raised, and what does it actually do?
As the section on registering the tick interrupt will show, tick_handle_periodic is the clock-event handler: time_init() installs it as the callback of the tick clock-event device, so it runs whenever a tick interrupt is taken. It eventually calls tick_periodic(), which calls update_process_times(), which in turn calls run_local_timers() to raise TIMER_SOFTIRQ; run_local_timers() also calls hrtimer_run_queues() to run expired high-resolution timers. This is the classic interrupt split: the hardware handler does only the critical work, raises a softirq, re-enables interrupts, and the bulk of the processing happens in the softirq bottom half. The exact job of this softirq is described later; for now, note that it fires all expired timers.
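For reference, the tick-side path that raises the softirq looks like this in 4.4 (abridged; the IRQ-work handling has been left out):
/* Abridged from kernel/time/timer.c (Linux 4.4); reached from the tick via
 * tick_periodic() -> update_process_times(). */
void update_process_times(int user_tick)
{
	struct task_struct *p = current;

	account_process_tick(p, user_tick);	/* charge system/user time */
	run_local_timers();			/* hrtimers + TIMER_SOFTIRQ */
	rcu_check_callbacks(user_tick);
	scheduler_tick();			/* scheduling-class tick hook */
	run_posix_cpu_timers(p);
}

void run_local_timers(void)
{
	hrtimer_run_queues();
	raise_softirq(TIMER_SOFTIRQ);	/* bottom half: run_timer_softirq */
}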
time_init
Performs the clock-related initialization. As shown later, the hardware interrupt vector table is registered in the early assembly stage of boot, but the clock-event device and its event handler are not set up there. Here init_decrementer_clockevent() initializes the decrementer clock-event device and installs tick_handle_periodic as its handler, and tick_setup_hrtimer_broadcast() registers the hrtimer broadcast device and its callback, which will actually run when the interrupt fires. From this point on the hardware timer interrupt is live.
sched_clock_postinit and sched_clock_init
Start the timers that periodically update the scheduler clock information.
4. The scheduling process
4.1 The schedule() interface
schedule() first disables preemption to prevent the scheduler from re-entering itself, then calls __schedule(). __schedule() deals with the current task first: if it has pending signals its state is set back to TASK_RUNNING, and if it is going to sleep, deactivate_task() removes it from the runqueue so it can be put on the corresponding wait queue. pick_next_task() then chooses the next task to run, and context_switch() switches to it.
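The schedule() wrapper itself is short; abridged from kernel/sched/core.c (Linux 4.4):
asmlinkage __visible void __sched schedule(void)
{
	struct task_struct *tsk = current;

	sched_submit_work(tsk);		/* flush plugged block I/O first */
	do {
		preempt_disable();		/* no re-entry while switching */
		__schedule(false);		/* false: not a preemption path */
		sched_preempt_enable_no_resched();
	} while (need_resched());
}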
4.2 pick_next_task
It first checks whether the current task's sched_class is fair_sched_class, i.e. CFS. If so, and if the total number of schedulable entities on this CPU's runqueue equals the number of entities under the CFS runqueue and its groups — in other words no RT or other class has runnable entities (rq->nr_running == rq->cfs.h_nr_running) — the result of fair_sched_class.pick_next_task() is returned directly. Otherwise all scheduling classes are walked with for_each_class(class) and the first non-NULL result of class->pick_next_task() is returned.
The interesting part is the for_each_class() walk: it starts from sched_class_highest, which is stop_sched_class. (An abridged pick_next_task() is shown after the declarations below.)
#define sched_class_highest (&stop_sched_class)
#define for_each_class(class) \
for (class = sched_class_highest; class; class = class->next)
extern const struct sched_class stop_sched_class;
extern const struct sched_class dl_sched_class;
extern const struct sched_class rt_sched_class;
extern const struct sched_class fair_sched_class;
extern const struct sched_class idle_sched_class;
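The selection logic described above, abridged from kernel/sched/core.c (Linux 4.4):
static inline struct task_struct *
pick_next_task(struct rq *rq, struct task_struct *prev)
{
	const struct sched_class *class = &fair_sched_class;
	struct task_struct *p;

	/*
	 * Optimization: if every runnable entity belongs to the fair class,
	 * ask CFS directly instead of walking all classes.
	 */
	if (likely(prev->sched_class == class &&
		   rq->nr_running == rq->cfs.h_nr_running)) {
		p = fair_sched_class.pick_next_task(rq, prev);
		if (unlikely(p == RETRY_TASK))
			goto again;

		/* assumes fair_sched_class->next == idle_sched_class */
		if (unlikely(!p))
			p = idle_sched_class.pick_next_task(rq, prev);

		return p;
	}

again:
	for_each_class(class) {
		p = class->pick_next_task(rq, prev);
		if (p) {
			if (unlikely(p == RETRY_TASK))
				goto again;
			return p;
		}
	}

	BUG(); /* the idle class should always have a runnable task */
}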
4.3 How the scheduling classes are linked
In priority order they form a singly linked list:
stop_sched_class -> dl_sched_class -> rt_sched_class -> fair_sched_class -> idle_sched_class -> NULL (each class's ->next pointer points to the next lower-priority class).
4.4 Registering the scheduling classes
The stop machinery is registered via early_initcall(cpu_stop_init). cpu_stop_init() registers cpu_stop_threads; when its create method runs it ends up in cpu_stop_create() -> sched_set_stop_task(), which installs the stop sched_class for the per-CPU stopper task. The call path of create is:
cpu_stop_init->
smpboot_register_percpu_thread->
smpboot_register_percpu_thread_cpumask->
__smpboot_create_thread->
cpu_stop_threads.create(即cpu_stop_create)
Back to pick_next_task(): because stop_sched_class, the highest-priority class, heads the linked list of all scheduling classes, the walk examines every sched_class from the highest priority down until one of them returns a runnable task. If the whole system is idle, the walk ends up scheduling the boot task init_task, which was turned into the idle process to run whenever the CPU has nothing else to do, as explained in the sched_init() notes above.
5. Scheduling entry points
Timer interrupt is responsible for decrementing the running process's timeslice count. When the count reaches zero, need_resched is set and the kernel runs the scheduler as soon as possible.
The tick interrupt updates the running process's time accounting; once the timeslice is exhausted, need_resched is set and the running process is switched out at the next scheduling opportunity.
RTC (Real-Time Clock)
The real-time clock is a non-volatile device that keeps the system time. At boot the kernel reads it over the CMOS interface and uses the value to set the initial system time.
System Timer
The system timer is driven by an electronic clock at a programmable frequency and generates the periodic tick interrupt. Some architectures implement it with a decrementer instead: a counter is loaded with an initial value and decremented at a fixed rate, raising the tick interrupt when it reaches zero.
The timer interrupt is broken into two pieces: an architecture-dependent and an architecture-independent routine.
The architecture-dependent routine is registered as the interrupt handler for the system timer and, thus, runs when the timer interrupt hits. Its exact job depends on the given architecture, of course, but most handlers perform at least the following work:
1. Obtain the xtime_lock lock, which protects access to jiffies_64 and the wall time value, xtime.
2. Acknowledge or reset the system timer as required.
3. Periodically save the updated wall time to the real time clock.
4. Call the architecture-independent timer routine, tick_periodic().
The architecture-independent routine, tick_periodic(), performs much more work:
1. Increment the jiffies_64 count by one. (This is safe, even on 32-bit architectures, because the xtime_lock lock was previously obtained.)
2. Update resource usages, such as consumed system and user time, for the currently running process.
3. Run any dynamic timers that have expired (discussed in the following section).
4. Execute scheduler_tick(), as discussed in Chapter 4.
5. Update the wall time, which is stored in xtime.
6. Calculate the infamous load average.
6. The timer interrupt
The tick interrupt is what drives scheduling and preemption: it updates the running process's time accounting and the rescheduling flag, which together decide whether a context switch should happen. The code below is for the PowerPC FSL BookE ppce500; other architectures differ in the details but follow the same design.
6.1 Registering the timer interrupt
The interrupt handlers are registered in the very first stage of boot, in the assembly code that runs before start_kernel(); once initialization is complete, the registered callback is executed each time a tick interrupt occurs.
The head (boot entry) files for the PowerPC architecture live under arch/powerpc/kernel/; the entry file for e500 is head_fsl_booke.S, which defines the interrupt vector table:
interrupt_base:
/* Critical Input Interrupt */
CRITICAL_EXCEPTION(0x0100, CRITICAL, CriticalInput, unknown_exception)
......
/* Decrementer Interrupt */
DECREMENTER_EXCEPTION
The timer interrupt entry is DECREMENTER_EXCEPTION; its actual expansion lives in the arch/powerpc/kernel/head_booke.h header:
#define EXC_XFER_TEMPLATE(hdlr, trap, msr, copyee, tfer, ret) \
li r10,trap; \
stw r10,_TRAP(r11); \
lis r10,msr@h; \
ori r10,r10,msr@l; \
copyee(r10, r9); \
bl tfer; \
.long hdlr; \
.long ret
#define EXC_XFER_LITE(n, hdlr) \
EXC_XFER_TEMPLATE(hdlr, n+1, MSR_KERNEL, NOCOPY, transfer_to_handler, \
ret_from_except)
#define DECREMENTER_EXCEPTION \
START_EXCEPTION(Decrementer) \
NORMAL_EXCEPTION_PROLOG(DECREMENTER); \
lis r0,TSR_DIS@h; /* Setup the DEC interrupt mask */ \
mtspr SPRN_TSR,r0; /* Clear the DEC interrupt */ \
addi r3,r1,STACK_FRAME_OVERHEAD; \
EXC_XFER_LITE(0x0900, timer_interrupt)
Now look at the timer_interrupt function:
static void __timer_interrupt(void)
{
	struct pt_regs *regs = get_irq_regs();
	u64 *next_tb = this_cpu_ptr(&decrementers_next_tb);
	struct clock_event_device *evt = this_cpu_ptr(&decrementers);
	u64 now;

	trace_timer_interrupt_entry(regs);

	if (test_irq_work_pending()) {
		clear_irq_work_pending();
		irq_work_run();
	}

	now = get_tb_or_rtc();
	if (now >= *next_tb) {
		*next_tb = ~(u64)0;
		if (evt->event_handler)
			evt->event_handler(evt);
		__this_cpu_inc(irq_stat.timer_irqs_event);
	} else {
		now = *next_tb - now;
		if (now <= DECREMENTER_MAX)
			set_dec((int)now);
		/* We may have raced with new irq work */
		if (test_irq_work_pending())
			set_dec(1);
		__this_cpu_inc(irq_stat.timer_irqs_others);
	}

#ifdef CONFIG_PPC64
	/* collect purr register values often, for accurate calculations */
	if (firmware_has_feature(FW_FEATURE_SPLPAR)) {
		struct cpu_usage *cu = this_cpu_ptr(&cpu_usage_array);
		cu->current_tb = mfspr(SPRN_PURR);
	}
#endif

	trace_timer_interrupt_exit(regs);
}

/*
 * timer_interrupt - gets called when the decrementer overflows,
 * with interrupts disabled.
 */
void timer_interrupt(struct pt_regs *regs)
{
	struct pt_regs *old_regs;

	/* Ensure a positive value is written to the decrementer, or else
	 * some CPUs will continue to take decrementer exceptions.
	 */
	set_dec(DECREMENTER_MAX);

	/* Some implementations of hotplug will get timer interrupts while
	 * offline, just ignore these and we also need to set
	 * decrementers_next_tb as MAX to make sure __check_irq_replay
	 * don't replay timer interrupt when return, otherwise we'll trap
	 * here infinitely :(
	 */
	if (!cpu_online(smp_processor_id())) {
		*this_cpu_ptr(&decrementers_next_tb) = ~(u64)0;
		return;
	}

	/* Conditionally hard-enable interrupts now that the DEC has been
	 * bumped to its maximum value
	 */
	may_hard_irq_enable();

#if defined(CONFIG_PPC32) && defined(CONFIG_PPC_PMAC)
	if (atomic_read(&ppc_n_lost_interrupts) != 0)
		do_IRQ(regs);
#endif

	old_regs = set_irq_regs(regs);
	irq_enter();

	__timer_interrupt();
	irq_exit();
	set_irq_regs(old_regs);
}
__timer_interrupt() invokes evt->event_handler. What is this event_handler, and where exactly is it registered?
The answer is tick_handle_periodic. This function does the real work of the timer event: the interrupt handler above only prepares the ground, saving registers and other state to set up the entry path, while the event_handler performs what the interrupt is actually for. It is defined as follows:
/*
 * Event handler for periodic ticks
 */
void tick_handle_periodic(struct clock_event_device *dev)
{
	int cpu = smp_processor_id();
	ktime_t next = dev->next_event;

	tick_periodic(cpu);

#if defined(CONFIG_HIGH_RES_TIMERS) || defined(CONFIG_NO_HZ_COMMON)
	/*
	 * The cpu might have transitioned to HIGHRES or NOHZ mode via
	 * update_process_times() -> run_local_timers() ->
	 * hrtimer_run_queues().
	 */
	if (dev->event_handler != tick_handle_periodic)
		return;
#endif

	if (!clockevent_state_oneshot(dev))
		return;
	for (;;) {
		/*
		 * Setup the next period for devices, which do not have
		 * periodic mode:
		 */
		next = ktime_add(next, tick_period);

		if (!clockevents_program_event(dev, next, false))
			return;
		/*
		 * Have to be careful here. If we're in oneshot mode,
		 * before we call tick_periodic() in a loop, we need
		 * to be sure we're using a real hardware clocksource.
		 * Otherwise we could get trapped in an infinite loop, as
		 * the tick_periodic() increments jiffies, which then will
		 * increment time, possibly causing the loop to trigger
		 * again and again.
		 */
		if (timekeeping_valid_for_hres())
			tick_periodic(cpu);
	}
}
tick_handle_periodic is registered and then executed along the following path:
start_kernel->time_init->init_decrementer_clockevent->register_decrementer_clockevent->clockevents_register_device->tick_check_new_device->tick_setup_periodic->tick_set_periodic_handler->tick_handle_periodic->tick_periodic->update_process_times->scheduler_tick
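scheduler_tick() itself, abridged from kernel/sched/core.c (Linux 4.4), is where the per-class hook is invoked; note that trigger_load_balance() is also what raises SCHED_SOFTIRQ, answering the question left open in sched_init():
void scheduler_tick(void)
{
	int cpu = smp_processor_id();
	struct rq *rq = cpu_rq(cpu);
	struct task_struct *curr = rq->curr;

	sched_clock_tick();

	raw_spin_lock(&rq->lock);
	update_rq_clock(rq);
	curr->sched_class->task_tick(rq, curr, 0);	/* per-class tick work */
	update_cpu_load_active(rq);
	calc_global_load_tick(rq);
	raw_spin_unlock(&rq->lock);

	perf_event_task_tick();

#ifdef CONFIG_SMP
	rq->idle_balance = idle_cpu(cpu);
	trigger_load_balance(rq);	/* may raise SCHED_SOFTIRQ */
#endif
}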
The second half of the chain is the call path taken when tick_handle_periodic runs. Inside scheduler_tick() the scheduling class's task_tick hook is called: under CFS this is fair_sched_class->task_tick; in rt_sched_class the hook is task_tick_rt, implemented as follows:
static void task_tick_rt(struct rq *rq, struct task_struct *p, int queued)
{
	struct sched_rt_entity *rt_se = &p->rt;

	update_curr_rt(rq);
	watchdog(rq, p);

	/*
	 * RR tasks need a special form of timeslice management.
	 * FIFO tasks have no timeslices.
	 */
	if (p->policy != SCHED_RR)
		return;

	if (--p->rt.time_slice)
		return;

	p->rt.time_slice = sched_rr_timeslice;

	/*
	 * Requeue to the end of queue if we (and all of our ancestors) are not
	 * the only element on the queue
	 */
	for_each_sched_rt_entity(rt_se) {
		if (rt_se->run_list.prev != rt_se->run_list.next) {
			requeue_task_rt(rq, p, 0);
			resched_curr(rq);
			return;
		}
	}
}
As the code shows, if the timeslice is not yet used up the function returns immediately; otherwise the real-time timeslice is reset to sched_rr_timeslice, the entity is requeued at the tail of its runqueue, and resched_curr() marks the current task for rescheduling. This is the round-robin (RR) policy of real-time scheduling.
This raises the next question: once TIF_NEED_RESCHED has been set, when does the actual reschedule take place?
There are four scheduling entry points:
return from an interrupt;
return from a system call to user space;
a process voluntarily gives up the CPU and calls the scheduler;
return to kernel space after signal handling.
Rescheduling on return from the tick interrupt is case 1. Using ppce500 again as the example, here is how the reschedule actually happens:
The common exception-return path RET_FROM_EXC_LEVEL calls user_exc_return and from there enters do_work,
and do_work acts as the central dispatch point:
do_work: /* r10 contains MSR_KERNEL here */
andi. r0,r9,_TIF_NEED_RESCHED
beq do_user_signal
As the code shows, if the reschedule flag is not set, execution branches to do_user_signal to handle any pending signal work, and ultimately returns through restore_user to the interrupted context:
do_user_signal: /* r10 contains MSR_KERNEL here */
ori r10,r10,MSR_EE
SYNC
MTMSRD(r10) /* hard-enable interrupts */
/* save r13-r31 in the exception frame, if not already done */
lwz r3,_TRAP(r1)
andi. r0,r3,1
beq 2f
SAVE_NVGPRS(r1)
rlwinm r3,r3,0,0,30
stw r3,_TRAP(r1)
2: addi r3,r1,STACK_FRAME_OVERHEAD
mr r4,r9
bl do_notify_resume
REST_NVGPRS(r1)
b recheck
do_resched is reached from recheck, which is also defined in entry_32.S:
recheck:
/* Note: And we don't tell it we are disabling them again
* neither. Those disable/enable cycles used to peek at
* TI_FLAGS aren't advertised.
*/
LOAD_MSR_KERNEL(r10,MSR_KERNEL)
SYNC
MTMSRD(r10) /* disable interrupts */
CURRENT_THREAD_INFO(r9, r1)
lwz r9,TI_FLAGS(r9)
andi. r0,r9,_TIF_NEED_RESCHED
bne- do_resched
andi. r0,r9,_TIF_USER_WORK_MASK
beq restore_user
Also in entry_32.S, do_resched then calls schedule() to carry out the reschedule:
do_resched: /* r10 contains MSR_KERNEL here */
/* Note: We don't need to inform lockdep that we are enabling
* interrupts here. As far as it knows, they are already enabled
*/
ori r10,r10,MSR_EE
MTMSRD(r10) /* hard-enable interrupts */
bl schedule
6.2 Execution of the timer interrupt
In the vector macros shown earlier there is a bl tfer step. tfer is transfer_to_handler or transfer_to_handler_full (for the timer interrupt it is transfer_to_handler); it does the preparation required before the interrupt handler runs, then jumps to the handler hdlr, and finally goes through ret. ret is ret_from_except or ret_from_except_full (here ret_from_except), which calls resume_kernel and, from there, preempt_schedule_irq to perform the reschedule:
/*
 * this is the entry point to schedule() from kernel preemption
 * off of irq context.
 * Note, that this is called and return with irqs disabled. This will
 * protect us against recursive calling from irq.
 */
asmlinkage __visible void __sched preempt_schedule_irq(void)
{
	enum ctx_state prev_state;

	/* Catch callers which need to be fixed */
	BUG_ON(preempt_count() || !irqs_disabled());

	prev_state = exception_enter();

	do {
		preempt_disable();
		local_irq_enable();
		__schedule(true);
		local_irq_disable();
		sched_preempt_enable_no_resched();
	} while (need_resched());

	exception_exit(prev_state);
}
Next, look at preempt_disable() and local_irq_disable():
static __always_inline volatile int *preempt_count_ptr(void)
{
	return &current_thread_info()->preempt_count;
}
Disabling preemption simply adds 1 to the current thread's preempt_count; the increment is followed by a barrier() so the compiler cannot reorder memory accesses around it, which provides the required ordering.
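The macros themselves, abridged from include/linux/preempt.h (Linux 4.4):
#define preempt_count_inc()	preempt_count_add(1)
#define preempt_count_dec()	preempt_count_sub(1)

#define preempt_disable() \
do { \
	preempt_count_inc(); \
	barrier(); \
} while (0)

#define sched_preempt_enable_no_resched() \
do { \
	barrier(); \
	preempt_count_dec(); \
} while (0)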
/*
 * Wrap the arch provided IRQ routines to provide appropriate checks.
 */
#define raw_local_irq_disable()		arch_local_irq_disable()
#define raw_local_irq_enable()		arch_local_irq_enable()
#define raw_local_irq_save(flags)			\
	do {						\
		typecheck(unsigned long, flags);	\
		flags = arch_local_irq_save();		\
	} while (0)
#define raw_local_irq_restore(flags)			\
	do {						\
		typecheck(unsigned long, flags);	\
		arch_local_irq_restore(flags);		\
	} while (0)
#define raw_local_save_flags(flags)			\
	do {						\
		typecheck(unsigned long, flags);	\
		flags = arch_local_save_flags();	\
	} while (0)
#define raw_irqs_disabled_flags(flags)			\
	({						\
		typecheck(unsigned long, flags);	\
		arch_irqs_disabled_flags(flags);	\
	})
#define raw_irqs_disabled()		(arch_irqs_disabled())
#define raw_safe_halt()			arch_safe_halt()

#define local_irq_enable()	do { raw_local_irq_enable(); } while (0)
#define local_irq_disable()	do { raw_local_irq_disable(); } while (0)
#define local_irq_save(flags)	do { raw_local_irq_save(flags); } while (0)
#define local_irq_restore(flags) do { raw_local_irq_restore(flags); } while (0)
#define safe_halt()		do { raw_safe_halt(); } while (0)
The architecture-specific IRQ operations are defined as follows:
static inline void arch_local_irq_restore(unsigned long flags)
{
#if defined(CONFIG_BOOKE)
	asm volatile("wrtee %0" : : "r" (flags) : "memory");
#else
	mtmsr(flags);
#endif
}

static inline unsigned long arch_local_irq_save(void)
{
	unsigned long flags = arch_local_save_flags();
#ifdef CONFIG_BOOKE
	asm volatile("wrteei 0" : : : "memory");
#else
	SET_MSR_EE(flags & ~MSR_EE);
#endif
	return flags;
}

static inline void arch_local_irq_disable(void)
{
	arch_local_irq_save();
}

static inline void arch_local_irq_enable(void)
{
#ifdef CONFIG_BOOKE
	asm volatile("wrteei 1" : : : "memory");
#else
	unsigned long msr = mfmsr();
	SET_MSR_EE(msr | MSR_EE);
#endif
}

static inline bool arch_irqs_disabled_flags(unsigned long flags)
{
	return (flags & MSR_EE) == 0;
}

static inline bool arch_irqs_disabled(void)
{
	return arch_irqs_disabled_flags(arch_local_save_flags());
}
#define hard_irq_disable()	arch_local_irq_disable()
6.3 IRQs on e500
Here is a closer look at how ppce500 handles IRQs:
e500 is a BookE core and differs from classic PowerPC mainly in how the address of an exception vector is obtained. On classic PowerPC the exception type gives an offset, and the physical address of the vector is:
when MSR[IP] = 0: Vector = offset;
when MSR[IP] = 1: Vector = offset | 0xFFF00000;
where MSR[IP] is the Interrupt Prefix bit of the Machine State Register, which selects the address prefix of the interrupt vectors.
On BookE parts, the offset instead comes from the IVOR (Interrupt Vector Offset Register) of the corresponding exception type (only the low 16 bits are used, with the lowest 4 bits cleared), combined with the upper 16 bits of IVPR (Interrupt Vector Prefix Register) to form the vector address:
Vector = (IVORn & 0xFFF0) | (IVPR & 0xFFFF0000);
Note that, unlike classic PowerPC, BookE interrupt vectors are effective addresses, i.e. Linux kernel virtual addresses. The BookE MMU is always on, so the core never runs in real mode; during early initialization, address translation is provided by hand-built TLB entries, and once the page tables are set up the TLB is refilled from them. The comment in the kernel source puts it this way:
* Interrupt vector entry code
*
* The Book E MMUs are always on so we don't need to handle
* interrupts in real mode as with previous PPC processors. In
* this case we handle interrupts in the kernel virtual address
* space.
* Interrupt vectors are dynamically placed relative to the
* interrupt prefix as determined by the address of interrupt_base.
* The interrupt vectors offsets are programmed using the labels
* for each interrupt vector entry.
* Interrupt vectors must be aligned on a 16 byte boundary.
* We align on a 32 byte cache line boundary for good measure.
Below is the section of the core reference manual describing the Fixed-Interval Timer Interrupt:
Fixed-Interval Timer Interrupt, A fixed-interval timer interrupt occurs when no higher priority exception exists, a fixed-interval timer exception exists (TSR[FIS] = 1), and the interrupt is enabled (TCR[FIE] = 1 and (MSR[EE] = 1 or (MSR[GS] = 1 ))). See Section 9.5, “Fixed-Interval Timer.”
The fixed-interval timer period is determined by TCR[FPEXT] || TCR[FP], which specifies one of 64 bit locations of the time base used to signal a fixed-interval timer exception on a transition from 0 to 1. TCR[FPEXT] || TCR[FP] = 0b0000_00 selects TBU[0]. TCR[FPEXT] || TCR[FP] = 0b1111_11 selects TBL[63].
NOTE: Software Considerations
MSR[EE] also enables other asynchronous interrupts.
TSR[FIS] is set when a fixed-interval timer exception exists.
SRR0, SRR1, MSR, and TSR are updated as shown in this table.
Register  Setting
SRR0      Set to the effective address of the next instruction to be executed.
SRR1      Set to the MSR contents at the time of the interrupt.
MSR       CM is set to EPCR[ICM]; RI, ME, DE, CE are unchanged; all other defined MSR bits are cleared.
TSR       FIS is set when a fixed-interval timer exception exists, not as a result of the interrupt. See Section 4.7.2, "Timer Status Register (TSR)."
Instruction execution resumes at address IVPR[0–47] || IVOR11[48–59] || 0b0000.
To avoid redundant fixed-interval timer interrupts, before reenabling MSR[EE], the interrupt handler must clear TSR[FIS] by writing a word to TSR using mtspr with a 1 in any bit position to be cleared and 0 in all others. Data written to the TSR is not direct data, but a mask. Writing a 1 to this bit causes it to be cleared; writing a 0 has no effect.