Page Fault Handling in Detail

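First, let us be clear about what a page fault is. Through the address bus the CPU can access every device attached to it, including physical memory and I/O devices. However, the address the CPU issues is not the physical address of those devices on the bus but a virtual address; the MMU translates the virtual address into a physical address and only then drives it onto the bus. The virtual-to-physical mappings used by the MMU have to be established first, and the MMU can also record whether a given physical page may be written. When no mapping from a virtual address to a physical address has been established, or when a mapping exists but the access is a write and the physical page is not writable, the MMU notifies the CPU that a page fault has occurred.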
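The two fault conditions described above can be sketched with a toy single-level translation table. This is only an illustration under simplifying assumptions: the names (toy_pte, toy_translate), the 4 KB page size and the 16-entry table are all hypothetical and are not how a real ARM MMU stores its tables.

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT 12u
#define TOY_NPAGES     16u

/* One entry per virtual page; every name here is hypothetical. */
struct toy_pte {
	bool     present;   /* a virtual-to-physical mapping exists */
	bool     writable;  /* the mapped physical page may be written */
	uint32_t pfn;       /* physical frame number */
};

enum toy_result { TOY_OK, TOY_FAULT_NOT_MAPPED, TOY_FAULT_READ_ONLY };

/* Translate vaddr; on success store the physical address in *paddr. */
static enum toy_result toy_translate(const struct toy_pte *table,
				     uint32_t vaddr, bool is_write,
				     uint32_t *paddr)
{
	uint32_t vpn = vaddr >> TOY_PAGE_SHIFT;

	/* Condition 1: no mapping has been established. */
	if (vpn >= TOY_NPAGES || !table[vpn].present)
		return TOY_FAULT_NOT_MAPPED;
	/* Condition 2: mapping exists, but a write hits a read-only page. */
	if (is_write && !table[vpn].writable)
		return TOY_FAULT_READ_ONLY;

	*paddr = (table[vpn].pfn << TOY_PAGE_SHIFT) |
		 (vaddr & ((1u << TOY_PAGE_SHIFT) - 1u));
	return TOY_OK;
}
```

A read through a read-only mapping succeeds, while a write through the same mapping reports the second kind of fault; that second kind is what copy-on-write relies on later in this article.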

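The situations that lead to a page fault can be summarised as follows:

1. No virtual-to-physical mapping exists in the MMU, and there is no VMA (linear region) of the current process at or above that virtual address. This is certainly a programming error, and the process will be killed.

2. No virtual-to-physical mapping exists in the MMU, but a VMA of the current process does exist at or above that virtual address. This is most likely a genuine page fault, and it may be one caused by the stack growing past its current VMA.

3. After malloc/mmap or similar library functions/system calls that request memory, Linux does not actually map any physical pages into the newly created VMA. If the first access is a write, a page fault occurs as in case 2. If the first access is a read, a page fault occurs as well, but the address is mapped to the default zero page (zero_pfn); a later write faults again, and this time a physical page really has to be allocated, which goes through the copy-on-write path.

4. When a child process is created with fork or a similar system call, "its" VMAs, whether or not the child has VMAs of its own, do have mappings to physical pages, but the pages it shares with the parent are mapped read-only. In other words, Linux has not really allocated any physical pages for the child. When either the parent or the child writes to such a page, the resulting page fault triggers copy-on-write.

As far as I can tell these are the four cases, and they are fairly clear-cut. One important pattern stands out: Linux allocates a physical page only when it can no longer avoid doing so. Keeping this principle in mind makes the rest easier to follow. Now let us look at page-fault handling in detail: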

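On ARM the page-fault handler is the function do_page_fault in arch/arm/mm/fault.c. How a page fault makes its way to this function step by step will, as mentioned in the previous article on setting up the process address space, be covered in a dedicated article later. Here we only care about how the fault is handled. The function do_page_fault is shown below: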

static int __kprobes
do_page_fault(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
{
	struct task_struct *tsk;
	struct mm_struct *mm;
	int fault, sig, code;

	/* kprobes hook; compiles to a no-op when kprobes is not configured */
	if (notify_page_fault(regs, fsr))
		return 0;

	/* Get the task descriptor of the faulting process and its memory descriptor */
	tsk = current;
	mm  = tsk->mm;

	/*
	 * If we're in an interrupt or have no user
	 * context, we must not take the fault..
	 */
	/*
	 * 1. in_atomic(): the fault happened in atomic context (interrupt,
	 *    deferrable function, critical section).
	 * 2. !mm: the current task is a kernel thread, whose mm is always NULL.
	 * Either way the fault happened in kernel context: jump to no_context.
	 */
	if (in_atomic() || !mm)
		goto no_context;

	/*
	 * As per x86, we may deadlock here.  However, since the kernel only
	 * validly references user space from well defined areas of the code,
	 * we can bug out early if this is from code which shouldn't.
	 */
	if (!down_read_trylock(&mm->mmap_sem)) {
		if (!user_mode(regs) && !search_exception_tables(regs->ARM_pc))
			goto no_context;
		down_read(&mm->mmap_sem);
	} else {
		/*
		 * The above down_read_trylock() might have succeeded in
		 * which case, we'll have missed the might_sleep() from
		 * down_read()
		 */
		might_sleep();
#ifdef CONFIG_DEBUG_VM
		if (!user_mode(regs) &&
		    !search_exception_tables(regs->ARM_pc))
			goto no_context;
#endif
	}

	fault = __do_page_fault(mm, addr, fsr, tsk);
	up_read(&mm->mmap_sem);

	/*
	 * Handle the "normal" case first - VM_FAULT_MAJOR / VM_FAULT_MINOR
	 */
	/*
	 * If fault has none of the bits tested here, it is VM_FAULT_MAJOR or
	 * VM_FAULT_MINOR: the fault has been resolved, so return. In the
	 * normal case __do_page_fault returns 0 (VM_FAULT_MINOR) or some other
	 * value that is not one of the error bits checked further down.
	 */
	if (likely(!(fault & (VM_FAULT_ERROR | VM_FAULT_BADMAP | VM_FAULT_BADACCESS))))
		return 0;

	/* VM_FAULT_OOM: out of memory, hand over to the OOM killer (which may kill this process) */
	if (fault & VM_FAULT_OOM) {
		/*
		 * We ran out of memory, call the OOM killer, and return to
		 * userspace (which will retry the fault, or kill us if we
		 * got oom-killed)
		 */
		pagefault_out_of_memory();
		return 0;
	}

	/*
	 * If we are in kernel mode at this point, we
	 * have no context to handle this fault with.
	 */
	/* Check again whether the fault came from kernel mode and __do_page_fault could not resolve it; if so jump to no_context */
	if (!user_mode(regs))
		goto no_context;

	/*
	 * The two cases below are explained by the English comments: one is
	 * a fault that could not be fixed up, the other an access to an
	 * illegal address. Both end with a fatal signal for the process.
	 */
	if (fault & VM_FAULT_SIGBUS) {
		/*
		 * We had some memory, but were unable to
		 * successfully fix up this page fault.
		 */
		sig = SIGBUS;
		code = BUS_ADRERR;
	} else {
		/*
		 * Something tried to access memory that
		 * isn't in our memory map..
		 */
		sig = SIGSEGV;
		code = fault == VM_FAULT_BADACCESS ?
			SEGV_ACCERR : SEGV_MAPERR;
	}

	/* Send the corresponding signal to the user process, which normally kills it */
	__do_user_fault(tsk, addr, fsr, sig, code, regs);
	return 0;

no_context:
	/* Fault raised in kernel context; if it cannot be fixed up, the kernel oopses */
	__do_kernel_fault(mm, addr, fsr, regs);
	return 0;
}

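Let us look at the first key point. The source fragment is:

	/*
	 * 1. in_atomic(): the fault happened in atomic context (interrupt,
	 *    deferrable function, critical section).
	 * 2. !mm: the current task is a kernel thread, whose mm is always NULL.
	 * Either way the fault happened in kernel context: jump to no_context.
	 */
	if (in_atomic() || !mm)
		goto no_context;

If execution is currently in kernel context, whether in an atomic section (interrupt / deferred work / critical section) or in a kernel thread (whose mm is NULL), then something went wrong in kernel mode. Control jumps to the label no_context for kernel-mode fault handling, which is done by __do_kernel_fault. This function first does its best to resolve the fault by looking up the exception table for a fix-up that matches the faulting instruction and resuming execution there. I never managed to find where the details of this part live; if you find them, please leave me a comment! If the fault cannot be resolved through the exception table, the kernel prints the page-table entries and other state, raises an oops and terminates the current task. The source is as follows: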

static void
__do_kernel_fault(struct mm_struct *mm, unsigned long addr, unsigned int fsr,
		  struct pt_regs *regs)
{
	/*
	 * Are we prepared to handle this kernel fault?
	 */
	/* fixup_exception() searches the exception table for a fix-up routine matching this fault; the routine runs after fixup_exception() returns */
	if (fixup_exception(regs))
		return;

	/*
	 * No handler, we'll have to terminate things with extreme prejudice.
	 */
	/* Reaching this point means the fault really is a kernel programming bug. The kernel raises an oops: it prints the CPU registers and the kernel stack to the console and terminates the current task */
	bust_spinlocks(1);
	printk(KERN_ALERT
		"Unable to handle kernel %s at virtual address %08lx\n",
		(addr < PAGE_SIZE) ? "NULL pointer dereference" :
		"paging request", addr);

	/* Print the first- and second-level page-table entries for addr */
	show_pte(mm, addr);
	/* Raise the oops and dump state */
	die("Oops", regs, fsr);
	bust_spinlocks(0);
	/* Terminate the current task (not the whole kernel) */
	do_exit(SIGKILL);
}
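The exception-table lookup mentioned above can be sketched as follows. As far as I understand it, the kernel keeps a table of address pairs (an instruction that is allowed to fault, and the address to resume at if it does) and, on a match, redirects the saved program counter to the fix-up address. The sketch below is a toy model of that idea only: the struct and function names are hypothetical, and it uses a linear search where the kernel searches a sorted table.

```c
#include <stddef.h>

/* Hypothetical miniature of an exception-table entry: the address of an
 * instruction that is allowed to fault, and where to resume if it does. */
struct toy_extable_entry {
	unsigned long insn;
	unsigned long fixup;
};

/* Toy register file: only the program counter matters here. */
struct toy_regs {
	unsigned long pc;
};

/* Linear search; returns NULL when pc is not a whitelisted instruction. */
static const struct toy_extable_entry *
toy_search_extable(const struct toy_extable_entry *tbl, size_t n,
		   unsigned long pc)
{
	for (size_t i = 0; i < n; i++)
		if (tbl[i].insn == pc)
			return &tbl[i];
	return NULL;
}

/* Returns 1 and redirects pc to the fix-up code if an entry matches,
 * 0 otherwise (the caller would then go on to report an oops). */
static int toy_fixup_exception(const struct toy_extable_entry *tbl, size_t n,
			       struct toy_regs *regs)
{
	const struct toy_extable_entry *e =
		toy_search_extable(tbl, n, regs->pc);

	if (!e)
		return 0;
	regs->pc = e->fixup;
	return 1;
}
```

The point of the design is that only a small, known set of kernel instructions (typically the user-copy helpers) may touch user memory; a fault anywhere else in the kernel finds no entry and is treated as a bug.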

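Back in do_page_fault: if the fault is not a kernel fault but one raised by a user process, __do_page_fault is called. This is the focus of this article, which is mainly about page faults of user processes. __do_page_fault checks for each of the four user-process cases listed at the beginning. The source is as follows: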

static int __kprobes
__do_page_fault(struct mm_struct *mm, unsigned long addr, unsigned int fsr,
		struct task_struct *tsk)
{
	struct vm_area_struct *vma;
	int fault;

	/* Find the first VMA whose end lies above the faulting address */
	vma = find_vma(mm, addr);
	fault = VM_FAULT_BADMAP;
	/* vma == NULL means there is no VMA at or above addr, so addr is a bad address */
	if (unlikely(!vma))
		goto out;
	/* There is a VMA above addr but it does not contain addr: addr is not necessarily bad, more checks are needed */
	if (unlikely(vma->vm_start > addr))
		goto check_stack;

	/*
	 * Ok, we have a good vm_area for this
	 * memory access, so we can handle it.
	 */
good_area:
	/* Permission errors also return. For example, the fault status (fsr) reports a write or execute fault, but the VMA containing addr is itself not writable/executable. The problem is then not a missing page but the VMA's permissions */
	if (access_error(fsr, vma)) {
		fault = VM_FAULT_BADACCESS;
		goto out;
	}

	/*
	 * If for any reason at all we couldn't handle the fault, make
	 * sure we exit gracefully rather than endlessly redo the fault.
	 */
	/* Allocate a physical page frame for the faulting process. handle_mm_fault first checks whether the page-directory entries at each level for the faulting address exist and allocates any that are missing; the frame itself is allocated by handle_pte_fault */
	fault = handle_mm_fault(mm, vma, addr & PAGE_MASK, (fsr & FSR_WRITE) ? FAULT_FLAG_WRITE : 0);
	if (unlikely(fault & VM_FAULT_ERROR))
		return fault;
	if (fault & VM_FAULT_MAJOR)
		tsk->maj_flt++;
	else
		tsk->min_flt++;
	return fault;

check_stack:
	/* The VMA above addr has VM_GROWSDOWN set, so it is a stack VMA and addr may be a stack address reached because the stack needs to grow. If the stack can be extended to cover addr (expand_stack returns 0), addr is a legitimate stack address and handling continues with demand paging; otherwise VM_FAULT_BADMAP is returned */
	if (vma->vm_flags & VM_GROWSDOWN && !expand_stack(vma, addr))
		goto good_area;
out:
	return fault;
}

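• First, take the faulting virtual address addr and look for the nearest VMA at or above it. If none is found, the address is simply invalid, because it does not lie in any VMA allocated to the process. This is a serious error: the error code (fault) VM_FAULT_BADMAP is returned and the kernel kills the process.

• If there is a VMA above addr but addr does not fall inside it, one possibility remains. The stack grows in the opposite direction to the heap, that is, downwards, so addr may in fact be a stack address just below the current stack VMA. Because addr is not inside any VMA it has no second-level page-table mapping, hence the fault. The code therefore checks whether the VMA above addr grows downwards (VM_GROWSDOWN) and whether it can be extended to cover addr (expand_stack returns 0). If so, addr is treated as a stack address and normal page-fault handling continues; otherwise VM_FAULT_BADMAP is returned here too and the kernel kills the process.

• Permission errors also return early. For example, the fault status (fsr) reports a write fault but the VMA itself is not writable. The problem is then not a missing page but the VMA's permissions, so the error code (fault) VM_FAULT_BADACCESS is returned. This is also a serious error and the kernel kills the process.

• Finally, a genuine missing-page fault is handled by calling handle_mm_fault. In the normal case it returns VM_FAULT_MAJOR or VM_FAULT_MINOR; __do_page_fault returns that code in fault and increments the task's maj_flt or min_flt counter.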
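The decision sequence in the list above can be sketched as a small stand-alone model. Everything in it is hypothetical (toy_vma, toy_check_fault, the stack_limit parameter); in particular the real expand_stack applies resource limits and page alignment that this sketch replaces with a single lower bound.

```c
#include <stdbool.h>
#include <stddef.h>

enum toy_fault { TOY_HANDLE, TOY_BADMAP, TOY_BADACCESS };

struct toy_vma {
	unsigned long start, end;   /* covers [start, end) */
	bool writable;
	bool grows_down;            /* stack-like region */
};

/* First VMA whose end lies above addr; vmas must be sorted by address. */
static struct toy_vma *toy_find_vma(struct toy_vma *vmas, size_t n,
				    unsigned long addr)
{
	for (size_t i = 0; i < n; i++)
		if (addr < vmas[i].end)
			return &vmas[i];
	return NULL;
}

/* Toy stack growth: extend the VMA down to addr unless that would cross
 * the hypothetical limit. Returns 0 on success, non-zero on failure. */
static int toy_expand_stack(struct toy_vma *vma, unsigned long addr,
			    unsigned long limit)
{
	if (addr < limit)
		return -1;
	vma->start = addr;
	return 0;
}

static enum toy_fault toy_check_fault(struct toy_vma *vmas, size_t n,
				      unsigned long addr, bool is_write,
				      unsigned long stack_limit)
{
	struct toy_vma *vma = toy_find_vma(vmas, n, addr);

	if (!vma)
		return TOY_BADMAP;              /* nothing at or above addr */
	if (vma->start > addr) {
		/* Below the VMA: only acceptable if it is a growable stack. */
		if (!vma->grows_down ||
		    toy_expand_stack(vma, addr, stack_limit))
			return TOY_BADMAP;
	}
	if (is_write && !vma->writable)
		return TOY_BADACCESS;           /* VMA permissions forbid it */
	return TOY_HANDLE;                      /* go on to demand paging */
}
```

Note the direction of the stack test: the fault is accepted when the expansion succeeds, which mirrors the `!expand_stack(vma, addr)` condition in the kernel source.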

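The function handle_mm_fault allocates a physical page frame for the faulting process. It first checks whether the page-directory entries at each level for the faulting linear address exist and allocates any that are missing. The actual allocation of the page frame is done by handle_pte_fault(). Note the last parameter, flags: it is derived from fsr and distinguishes write faults from non-write faults, which prepares the ground for deferring physical memory allocation even further. The source is as follows: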

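int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
		unsigned long address, unsigned int flags)
{
	pgd_t *pgd;
	pud_t *pud;
	pmd_t *pmd;
	pte_t *pte;

	__set_current_state(TASK_RUNNING);

	count_vm_event(PGFAULT);

	if (unlikely(is_vm_hugetlb_page(vma)))
		return hugetlb_fault(mm, vma, address, flags);

	/* Return the first-level page-table entry for address */
	pgd = pgd_offset(mm, address);
	/* On ARM the pud is the pgd */
	pud = pud_alloc(mm, pgd, address);
	if (!pud)
		return VM_FAULT_OOM;
	/* On ARM the pmd is the pud, which is the pgd */
	pmd = pmd_alloc(mm, pud, address);
	if (!pmd)
		return VM_FAULT_OOM;
	/* Return the second-level page-table entry for address */
	pte = pte_alloc_map(mm, pmd, address);
	if (!pte)
		return VM_FAULT_OOM;

	/*
	 * Depending on whether the page frame described by pte is in
	 * physical memory, handle_pte_fault distinguishes two broad classes:
	 * Demand paging: the accessed page frame is not in main memory, so a
	 *   frame must be allocated. Sub-cases: linear (anonymous/file)
	 *   mappings, non-linear mappings, and swapped-out pages.
	 * Copy-on-write: the accessed page exists but is read-only and a
	 *   write is attempted. The kernel copies the data of the existing
	 *   read-only page into a new page frame.
	 */
	return handle_pte_fault(mm, vma, address, pte, pmd, flags);
}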
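The "allocate the missing level on the way down" behaviour of handle_mm_fault can be sketched with a toy two-level table. The layout (4 bits of first-level index, 4 bits of second-level index, 12 bits of page offset) and all names are assumptions made for this illustration; they do not match the real ARM page-table format.

```c
#include <stdint.h>
#include <stdlib.h>

#define TOY_L1_ENTRIES 16u
#define TOY_L2_ENTRIES 16u
#define TOY_PAGE_SHIFT 12u

typedef uint32_t toy_pte_t;

struct toy_mm {
	toy_pte_t *l1[TOY_L1_ENTRIES]; /* each slot: an L2 table or NULL */
};

/* Return a pointer to the second-level entry for addr, allocating the
 * second-level table if it does not exist yet. NULL means out of memory
 * (the analogue of returning VM_FAULT_OOM). */
static toy_pte_t *toy_pte_alloc(struct toy_mm *mm, uint32_t addr)
{
	uint32_t i1 = (addr >> (TOY_PAGE_SHIFT + 4)) % TOY_L1_ENTRIES;
	uint32_t i2 = (addr >> TOY_PAGE_SHIFT) % TOY_L2_ENTRIES;

	if (!mm->l1[i1]) {
		/* calloc zeroes the table, so every new entry starts empty */
		mm->l1[i1] = calloc(TOY_L2_ENTRIES, sizeof(toy_pte_t));
		if (!mm->l1[i1])
			return NULL;
	}
	return &mm->l1[i1][i2];
}

/* Release every second-level table that was allocated on demand. */
static void toy_mm_free(struct toy_mm *mm)
{
	for (uint32_t i = 0; i < TOY_L1_ENTRIES; i++) {
		free(mm->l1[i]);
		mm->l1[i] = NULL;
	}
}
```

Only the second-level table touched by a fault is ever allocated; first-level slots for address ranges that are never accessed stay NULL.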

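Note one detail first: if the second-level page table does not exist yet, it is created before anything else happens. Finally handle_pte_fault is called; what it does is already described clearly in the comment above. Its source is as follows: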

static inline int handle_pte_fault(struct mm_struct *mm,
		struct vm_area_struct *vma, unsigned long address,
		pte_t *pte, pmd_t *pmd, unsigned int flags)
{
	pte_t entry;
	spinlock_t *ptl;

	entry = *pte;
	/*
	 * Demand paging: linear (anonymous/file) mappings, non-linear
	 * mappings, and swapped-out pages.
	 * pte_present(entry) == 0 means the physical page that the
	 * second-level entry should map (i.e. *pte) is not present, so this
	 * is most likely demand paging.
	 */
	if (!pte_present(entry)) {
		/* pte_none(entry) == 1 means nothing has ever been written to this second-level entry: no physical page has been allocated yet */
		if (pte_none(entry)) {
			/* If the VMA's operations table implements fault, this is a file mapping rather than an anonymous one; do_linear_fault allocates the page */
			if (vma->vm_ops) {
				if (likely(vma->vm_ops->fault))
					return do_linear_fault(mm, vma, address,
						pte, pmd, flags, entry);
			}
			/* Anonymous mapping: allocate a physical page, eventually through alloc_pages */
			return do_anonymous_page(mm, vma, address,
						 pte, pmd, flags);
		}
		/* pte_file(entry) means a non-linear mapping; do_nonlinear_fault allocates the page */
		if (pte_file(entry))
			return do_nonlinear_fault(mm, vma, address,
					pte, pmd, flags, entry);
		/* The page frame was allocated earlier but has been swapped out to secondary storage; do_swap_page brings it back */
		return do_swap_page(mm, vma, address,
					pte, pmd, flags, entry);
	}

	/*
	 * Copy-on-write.
	 * COW applies when the mapped page is not writable. Two cases:
	 *   1. the VMA was previously mapped to the zero page (zero_pfn);
	 *   2. the address space came from fork (the child shares the
	 *      parent's pages read-only).
	 * What they have in common: the second-level entry forbids writes,
	 * in short the page is not writable.
	 */
	ptl = pte_lockptr(mm, pmd);
	spin_lock(ptl);
	if (unlikely(!pte_same(*pte, entry)))
		goto unlock;
	/* The fault was caused by a write */
	if (flags & FAULT_FLAG_WRITE) {
		/* The second-level entry forbids writes: trigger COW */
		if (!pte_write(entry))
			return do_wp_page(mm, vma, address,
					pte, pmd, ptl, entry);
		/* Mark the page dirty */
		entry = pte_mkdirty(entry);
	}
	entry = pte_mkyoung(entry);
	if (ptep_set_access_flags(vma, address, pte, entry, flags & FAULT_FLAG_WRITE)) {
		update_mmu_cache(vma, address, entry);
	} else {
		/*
		 * This is needed only for protection faults but the arch code
		 * is not yet telling us if this is a protection fault or not.
		 * This still avoids useless tlb flushes for .text page faults
		 * with threads.
		 */
		if (flags & FAULT_FLAG_WRITE)
			flush_tlb_page(vma, address);
	}
unlock:
	pte_unmap_unlock(pte, ptl);
	return 0;
}

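Looking back at the four fault cases, the code above becomes easier to understand. First the value of the second-level page-table entry, entry, is read. In the copy-on-write case the entry for the faulting address still exists (at the very least it carries the L_PTE_PRESENT flag); it is just that the mapped physical page is not writable. So (!pte_present(entry)) identifies the demand-paging case.

In the demand-paging case, if the entry is 0, that is, completely empty, no physical page has ever been mapped at this address in its VMA. If the VMA has a vm_ops member (its operations table) and vm_ops has a fault method, this is a file mapping rather than an anonymous one and do_linear_fault is called; otherwise it is an anonymous mapping and do_anonymous_page is called.

Still in the demand-paging case, if the entry carries the L_PTE_FILE flag, this is a non-linear file mapping and do_nonlinear_fault is called to allocate the physical page. Any other value means a physical page was allocated earlier but has since been swapped out of memory by Linux, and do_swap_page is called to allocate a physical page again.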
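The dispatch just described can be condensed into one small classification function. The bit values and names below are hypothetical and chosen only for this sketch (real ARM PTE encodings differ); the branch order follows the description above.

```c
#include <stdint.h>

/* Hypothetical bit layout for a toy PTE; real ARM encodings differ. */
#define TOY_PTE_PRESENT 0x1u
#define TOY_PTE_FILE    0x2u  /* only meaningful when not present */
#define TOY_PTE_WRITE   0x4u

enum toy_path {
	TOY_LINEAR_FAULT,     /* file mapping, first touch */
	TOY_ANON_PAGE,        /* anonymous mapping, first touch */
	TOY_NONLINEAR_FAULT,  /* non-linear file mapping */
	TOY_SWAP_PAGE,        /* page was swapped out */
	TOY_WP_PAGE,          /* write to a present read-only page: COW */
	TOY_UPDATE_FLAGS,     /* present and allowed: just fix young/dirty */
};

static enum toy_path toy_classify(uint32_t pte, int vma_has_fault_op,
				  int is_write)
{
	if (!(pte & TOY_PTE_PRESENT)) {
		if (pte == 0)   /* entry never filled in */
			return vma_has_fault_op ? TOY_LINEAR_FAULT
						: TOY_ANON_PAGE;
		if (pte & TOY_PTE_FILE)
			return TOY_NONLINEAR_FAULT;
		return TOY_SWAP_PAGE;
	}
	if (is_write && !(pte & TOY_PTE_WRITE))
		return TOY_WP_PAGE;
	return TOY_UPDATE_FLAGS;
}
```

A non-zero entry without the present bit is the interesting case: the entry is being used to remember where the data lives (a file offset or a swap slot) rather than to map a frame.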

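Linear and non-linear file mappings and swap involve, besides demand paging, a lot of file and swap-partition machinery. To keep things simple, the rest of this article uses anonymous mappings as the only example of how a user-space page fault is actually handled; in fact, everyday malloc calls all end up as anonymous mappings.

Anonymous mappings show Linux's basic attitude towards giving physical memory to a process: do not allocate a physical page until there is no other choice. When malloc/mmap is used to request a region, the kernel only creates a VMA for the process and maps no physical pages. If the process then reads from the region, the MMU raises a page fault because no second-level page-table entry exists, and even then the kernel only maps the default zero page zero_pfn (created at initialisation, as described in the earlier article on memory page tables) into the VMA. Only when the application goes on to write to the region is the point of no choice reached, and the kernel maps a real physical page. The source is as follows: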

static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
		unsigned long address, pte_t *page_table, pmd_t *pmd,
		unsigned int flags)
{
	struct page *page;
	spinlock_t *ptl;
	pte_t entry;

	/* If this is not a write (i.e. a read), things are simple: put the second-level entry for zero_pfn into entry. We are handling demand paging and the access is a read, so this must be the process's first access to the page and its contents do not matter; the default all-zero page will do, which defers the allocation of a physical page once more. entry then carries zero_pfn to the label setpte */
	if (!(flags & FAULT_FLAG_WRITE)) {
		entry = pte_mkspecial(pfn_pte(my_zero_pfn(address),
						vma->vm_page_prot));
		ptl = pte_lockptr(mm, pmd);
		spin_lock(ptl);
		/* If the second-level entry for the faulting address has been filled in meanwhile, jump to unlock and return */
		if (!pte_none(*page_table))
			goto unlock;
		/* setpte writes the second-level entry, i.e. the mapping; here it writes entry, which is zero_pfn */
		goto setpte;
	}

	/* For a write, a new physical page must be allocated */
	/* Allocate our own private page. */
	/* A no-op in this configuration */
	pte_unmap(page_table);

	/* Allocate an anon_vma instance; related to reverse mapping, can be ignored for now */
	if (unlikely(anon_vma_prepare(vma)))
		goto oom;
	/* Ends up calling alloc_page; the page is zero-filled */
	page = alloc_zeroed_user_highpage_movable(vma, address);
	if (!page)
		goto oom;
	__SetPageUptodate(page);

	if (mem_cgroup_newpage_charge(page, mm, GFP_KERNEL))
		goto oom_free_page;

	/* entry = physical address of the page plus protection bits; this is the base value of the second-level mapping */
	entry = mk_pte(page, vma->vm_page_prot);
	/* If the VMA is writable, additionally mark the entry dirty and writable */
	if (vma->vm_flags & VM_WRITE)
		entry = pte_mkwrite(pte_mkdirty(entry));

	/* Point page_table at the second-level entry for address, taking the lock */
	page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
	/* If the entry has been filled in meanwhile, release the new page and return */
	if (!pte_none(*page_table))
		goto release;

	/* Increment the mm's anonymous rss counter, which tracks the physical pages given to this process */
	inc_mm_counter(mm, anon_rss);
	/* page_add_new_anon_rmap sets up the reverse mapping between the VMA and the anonymous page; can be ignored for now */
	page_add_new_anon_rmap(page, vma, address);
setpte:
	/* Write the mapping, entry, into the second-level entry page_table */
	set_pte_at(mm, address, page_table, entry);

	/* No need to invalidate - it was non-present before */
	/* Update the MMU */
	update_mmu_cache(vma, address, entry);
unlock:
	pte_unmap_unlock(page_table, ptl);
	return 0;
release:
	mem_cgroup_uncharge_page(page);
	page_cache_release(page);
	goto unlock;
oom_free_page:
	page_cache_release(page);
oom:
	return VM_FAULT_OOM;
}

With the description above and the comments in the source, the principle and flow of demand paging should be fairly easy to follow.
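The "read maps the zero page, write finally allocates" behaviour can be mimicked in a few lines of user-space C. This is a toy simulation with hypothetical names (toy_slot, toy_read, toy_write, toy_pages_allocated); it does not call any kernel interface and only illustrates the order of events.

```c
#include <stdbool.h>
#include <stdlib.h>

#define TOY_PAGE_SIZE 4096u

/* Shared all-zero page; never written, playing the role of zero_pfn. */
static const unsigned char toy_zero_page[TOY_PAGE_SIZE];

struct toy_slot {
	unsigned char *priv;  /* private page, or NULL if none allocated */
	bool zero_mapped;     /* currently backed by the shared zero page */
};

/* Counts how many private pages have really been allocated. */
static int toy_pages_allocated;

/* Read one byte; off must be below TOY_PAGE_SIZE. */
static unsigned char toy_read(struct toy_slot *s, unsigned off)
{
	if (s->priv)
		return s->priv[off];
	s->zero_mapped = true;   /* read fault: map the zero page only */
	return toy_zero_page[off];
}

/* Write one byte; off must be below TOY_PAGE_SIZE. Returns 0 on
 * success, -1 if no memory is available. */
static int toy_write(struct toy_slot *s, unsigned off, unsigned char v)
{
	if (!s->priv) {          /* write fault: now a page is needed */
		s->priv = calloc(1, TOY_PAGE_SIZE);
		if (!s->priv)
			return -1;
		s->zero_mapped = false;
		toy_pages_allocated++;
	}
	s->priv[off] = v;
	return 0;
}
```

Reading any number of untouched slots costs no memory at all in this model; the counter only moves on the first write to each slot.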

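Now for copy-on-write (COW). The first thing to keep in mind is that only a write can trigger copy-on-write, so the code always checks first whether the fault flags contain FAULT_FLAG_WRITE. It then checks whether the second-level page-table entry carries the L_PTE_WRITE flag, which says whether the physical page is writable. If it is not, the copy-on-write path is taken by calling do_wp_page.

So COW applies when the mapped page is not writable, and there are two cases. The first is a region obtained with, say, malloc that was first read, which mapped the zero page zero_pfn, and is now written. The second is caused by fork. What they have in common is that the second-level mapping for the virtual address is present in memory but the page it maps is not writable. do_wp_page handles the two cases in much the same way.

One more thing worth knowing: if only one process is using the page, the page is simply made writable and no COW is done. In short, COW is performed only when it cannot be avoided; the kernel's principle for COW is to use it as little as possible.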
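The "reuse if we are the only user, otherwise copy" rule can be sketched like this. The structures and the function toy_wp_fault are hypothetical; a plain counter stands in for the kernel's map count, and malloc stands in for page allocation.

```c
#include <stdlib.h>
#include <string.h>

#define TOY_PAGE_SIZE 4096u

struct toy_page {
	int mapcount;                       /* how many mappings share it */
	unsigned char data[TOY_PAGE_SIZE];
};

struct toy_mapping {
	struct toy_page *page;
	int writable;
};

/* Handle a write through a read-only mapping.
 * Returns 0 on success, -1 if no memory is available. */
static int toy_wp_fault(struct toy_mapping *m)
{
	struct toy_page *copy;

	if (m->page->mapcount == 1) {       /* sole user: reuse in place */
		m->writable = 1;
		return 0;
	}

	copy = malloc(sizeof(*copy));       /* shared: make a private copy */
	if (!copy)
		return -1;
	memcpy(copy->data, m->page->data, TOY_PAGE_SIZE);
	copy->mapcount = 1;

	m->page->mapcount--;                /* drop our share of the old page */
	m->page = copy;
	m->writable = 1;
	return 0;
}
```

After a fork-like sharing of one page by two mappings, the first writer gets a copy and the second writer, now the sole user of the original page, gets it made writable without any copying.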

The source of do_wp_page is as follows:

static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
		unsigned long address, pte_t *page_table, pmd_t *pmd,
		spinlock_t *ptl, pte_t orig_pte)
{
	struct page *old_page, *new_page;
	pte_t entry;
	int reuse = 0, ret = 0;
	int page_mkwrite = 0;
	struct page *dirty_page = NULL;

	/* Get the page descriptor of the non-writable page. In the first COW case (the read-only zero page, zero_pfn) this returns NULL and the if branch below is taken; in the second case (a page shared by parent and child) the page descriptor is returned normally */
	old_page = vm_normal_page(vma, address, orig_pte);
	if (!old_page) {
		/*
		 * VM_MIXEDMAP !pfn_valid() case
		 *
		 * We should not cow pages in a shared writeable mapping.
		 * Just mark the pages writable as we can't do any dirty
		 * accounting on raw pfn maps.
		 */
		/* If the VMA is both writable and shared, jump to reuse and skip COW; otherwise jump to gotten */
		if ((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
				     (VM_WRITE|VM_SHARED))
			goto reuse;
		goto gotten;
	}

	/*
	 * Take out anonymous pages first, anonymous shared vmas are
	 * not dirty accountable.
	 */
	/* The if and else branches below both try to avoid COW by reaching the reuse label */
	/* If old_page is an anonymous page (judged from the mapping field of the page descriptor) and only one process is using it (reuse_swap_page, judged from whether _mapcount is 0), there is no need for COW: this process can simply use the page */
	if (PageAnon(old_page) && !PageKsm(old_page)) {
		/* Take the page lock (a bit in the page descriptor's flags) so that nobody else is operating on the page */
		if (!trylock_page(old_page)) {
			page_cache_get(old_page);
			pte_unmap_unlock(page_table, ptl);
			lock_page(old_page);
			page_table = pte_offset_map_lock(mm, pmd, address,
							 &ptl);
			if (!pte_same(*page_table, orig_pte)) {
				unlock_page(old_page);
				page_cache_release(old_page);
				goto unlock;
			}
			page_cache_release(old_page);
		}
		/* Check whether the page's _mapcount is 0, i.e. we are the only mapper */
		reuse = reuse_swap_page(old_page);
		unlock_page(old_page);
	}
	/* If the VMA is shared and writable, see whether COW can be avoided here too */
	else if (unlikely((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
					(VM_WRITE|VM_SHARED))) {
		/*
		 * Only catch write-faults on shared writable pages,
		 * read-only shared pages can get COWed by
		 * get_user_pages(.write=1, .force=1).
		 */
		if (vma->vm_ops && vma->vm_ops->page_mkwrite) {
			struct vm_fault vmf;
			int tmp;

			vmf.virtual_address = (void __user *)(address &
								PAGE_MASK);
			vmf.pgoff = old_page->index;
			vmf.flags = FAULT_FLAG_WRITE|FAULT_FLAG_MKWRITE;
			vmf.page = old_page;

			/*
			 * Notify the address space that the page is about to
			 * become writable so that it can prohibit this or wait
			 * for the page to get into an appropriate state.
			 *
			 * We do this without the lock held, so that it can
			 * sleep if it needs to.
			 */
			page_cache_get(old_page);
			pte_unmap_unlock(page_table, ptl);

			tmp = vma->vm_ops->page_mkwrite(vma, &vmf);
			if (unlikely(tmp &
					(VM_FAULT_ERROR | VM_FAULT_NOPAGE))) {
				ret = tmp;
				goto unwritable_page;
			}
			if (unlikely(!(tmp & VM_FAULT_LOCKED))) {
				lock_page(old_page);
				if (!old_page->mapping) {
					ret = 0; /* retry the fault */
					unlock_page(old_page);
					goto unwritable_page;
				}
			} else
				VM_BUG_ON(!PageLocked(old_page));

			/*
			 * Since we dropped the lock we need to revalidate
			 * the PTE as someone else may have changed it.  If
			 * they did, we just return, as we can count on the
			 * MMU to tell us if they didn't also make it writable.
			 */
			page_table = pte_offset_map_lock(mm, pmd, address,
							 &ptl);
			if (!pte_same(*page_table, orig_pte)) {
				unlock_page(old_page);
				page_cache_release(old_page);
				goto unlock;
			}

			page_mkwrite = 1;
		}
		dirty_page = old_page;
		get_page(dirty_page);
		reuse = 1;
	}

	/* reuse: no COW, operate on old_page directly */
	if (reuse) {
reuse:
		flush_cache_page(vma, address, pte_pfn(orig_pte));
		entry = pte_mkyoung(orig_pte);
		/* Update the second-level entry's attributes: writable and dirty */
		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
		if (ptep_set_access_flags(vma, address, page_table, entry,1))
			update_mmu_cache(vma, address, entry);
		ret |= VM_FAULT_WRITE;
		goto unlock;
	}

	/*
	 * Ok, we need to copy. Oh, well..
	 */
	/* The real COW starts here */
	/* First take a reference on the old page (get_page(), page->_count) */
	page_cache_get(old_page);
gotten:
	pte_unmap_unlock(page_table, ptl);

	if (unlikely(anon_vma_prepare(vma)))
		goto oom;

	/* First COW case (zero_pfn): allocate a new page and zero it */
	if (is_zero_pfn(pte_pfn(orig_pte))) {
		new_page = alloc_zeroed_user_highpage_movable(vma, address);
		if (!new_page)
			goto oom;
	}
	/* Second COW case (fork): allocate a page and copy the contents of old_page (one page, 4 KB) into new_page */
	else {
		new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address);
		if (!new_page)
			goto oom;
		cow_user_page(new_page, old_page, address, vma);
	}
	__SetPageUptodate(new_page);

	/*
	 * Don't let another task, with possibly unlocked vma,
	 * keep the mlocked page.
	 */
	/* In the second COW case, if the VMA is mlocked (VM_LOCKED), clear the mlock state of the old page */
	if ((vma->vm_flags & VM_LOCKED) && old_page) {
		lock_page(old_page);	/* for LRU manipulation */
		clear_page_mlock(old_page);
		unlock_page(old_page);
	}

	if (mem_cgroup_newpage_charge(new_page, mm, GFP_KERNEL))
		goto oom_free_new;

	/*
	 * Re-check the pte - we dropped the lock
	 */
	/* Look up the second-level entry for the faulting address again */
	page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
	if (likely(pte_same(*page_table, orig_pte))) {
		if (old_page) {
			if (!PageAnon(old_page)) {
				dec_mm_counter(mm, file_rss);
				inc_mm_counter(mm, anon_rss);
			}
		} else
			inc_mm_counter(mm, anon_rss);
		flush_cache_page(vma, address, pte_pfn(orig_pte));
		/* Build the second-level entry for the new page and mark it dirty */
		entry = mk_pte(new_page, vma->vm_page_prot);
		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
		/*
		 * Clear the pte entry and flush it first, before updating the
		 * pte with the new entry. This will avoid a race condition
		 * seen in the presence of one thread doing SMC and another
		 * thread doing COW.
		 */
		ptep_clear_flush(vma, address, page_table);
		page_add_new_anon_rmap(new_page, vma, address);
		/*
		 * We call the notify macro here because, when using secondary
		 * mmu page tables (such as kvm shadow page tables), we want the
		 * new page to be mapped directly into the secondary page table.
		 */
		set_pte_at_notify(mm, address, page_table, entry);
		update_mmu_cache(vma, address, entry);
		if (old_page) {
			/*
			 * Only after switching the pte to the new page may
			 * we remove the mapcount here. Otherwise another
			 * process may come and find the rmap count decremented
			 * before the pte is switched to the new page, and
			 * "reuse" the old page writing into it while our pte
			 * here still points into it and can be read by other
			 * threads.
			 *
			 * The critical issue is to order this
			 * page_remove_rmap with the ptp_clear_flush above.
			 * Those stores are ordered by (if nothing else,)
			 * the barrier present in the atomic_add_negative
			 * in page_remove_rmap.
			 *
			 * Then the TLB flush in ptep_clear_flush ensures that
			 * no process can access the old page before the
			 * decremented mapcount is visible. And the old page
			 * cannot be reused until after the decremented
			 * mapcount is visible. So transitively, TLBs to
			 * old page will be flushed before it can be reused.
			 */
			page_remove_rmap(old_page);
		}

		/* Free the old page.. */
		new_page = old_page;
		ret |= VM_FAULT_WRITE;
	} else
		mem_cgroup_uncharge_page(new_page);

	if (new_page)
		page_cache_release(new_page);
	if (old_page)
		page_cache_release(old_page);
unlock:
	pte_unmap_unlock(page_table, ptl);
	if (dirty_page) {
		/*
		 * Yes, Virginia, this is actually required to prevent a race
		 * with clear_page_dirty_for_io() from clearing the page dirty
		 * bit after it clear all dirty ptes, but before a racing
		 * do_wp_page installs a dirty pte.
		 *
		 * do_no_page is protected similarly.
		 */
		if (!page_mkwrite) {
			wait_on_page_locked(dirty_page);
			set_page_dirty_balance(dirty_page, page_mkwrite);
		}
		put_page(dirty_page);
		if (page_mkwrite) {
			struct address_space *mapping = dirty_page->mapping;

			set_page_dirty(dirty_page);
			unlock_page(dirty_page);
			page_cache_release(dirty_page);
			if (mapping)	{
				/*
				 * Some device drivers do not set page.mapping
				 * but still dirty their pages
				 */
				balance_dirty_pages_ratelimited(mapping);
			}
		}

		/* file_update_time outside page_lock */
		if (vma->vm_file)
			file_update_time(vma->vm_file);
	}
	return ret;
oom_free_new:
	page_cache_release(new_page);
oom:
	if (old_page) {
		if (page_mkwrite) {
			unlock_page(old_page);
			page_cache_release(old_page);
		}
		page_cache_release(old_page);
	}
	return VM_FAULT_OOM;

unwritable_page:
	page_cache_release(old_page);
	return ret;
}

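Control then returns level by level until it reaches __do_page_fault, which uses the return value fault to increment the task's counter for the corresponding fault type (maj_flt or min_flt) and finally returns fault to do_page_fault. do_page_fault releases the mmap_sem semaphore and, in the normal case, returns 0. The page fault has been handled.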
