[Original] (13) Linux Memory Management: vma/malloc/mmap

Background

  • Read the fucking source code! --By Lu Xun
  • A picture is worth a thousand words. --By Gorky

Notes:

  1. Kernel version: 4.14
  2. ARM64 processor, Cortex-A53, dual-core
  3. Tools used: Source Insight 3.5, Visio

1. Overview

In this article, let's look at the address space of a user-mode process. It mainly covers the following:

  • vma
  • malloc
  • mmap

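The code segment, data segment, bss segment and so on that we commonly see in a process address space are each simply a region of that address space. Linux calls such a region a Virtual Memory Area, VMA for short, and describes it with struct vm_area_struct.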

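Both memory allocation and memory mapping request a range of virtual addresses from the address space, and these operations are closely tied to VMAs. That is why this article analyzes vma, malloc and mmap together.

Let's start exploring.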
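To make the idea of a VMA concrete before diving into the kernel structures, here is a minimal user-space sketch that lists the VMAs of the calling process. It assumes a Linux system with /proc mounted, where each line of /proc/self/maps describes one VMA. The function name dump_vmas is hypothetical and only used for illustration.

```c
#include <stdio.h>

/* Print every VMA of the calling process to `out` (which may be NULL to
 * suppress printing) and return how many were found, or -1 on error.
 * Each line of /proc/self/maps has the form:
 *   start-end perms offset dev inode [path]                              */
int dump_vmas(FILE *out)
{
    FILE *fp = fopen("/proc/self/maps", "r");
    char line[1024];
    int count = 0;

    if (!fp)
        return -1;

    while (fgets(line, sizeof(line), fp)) {
        unsigned long start, end;
        char perms[5];

        /* Only the address range and the permission string are parsed. */
        if (sscanf(line, "%lx-%lx %4s", &start, &end, perms) == 3) {
            if (out)
                fprintf(out, "%016lx-%016lx %s %8lu KiB\n",
                        start, end, perms, (end - start) / 1024);
            count++;
        }
    }
    fclose(fp);
    return count;
}
```

Calling dump_vmas(stdout) from main prints one line per region, with the end address being the first byte after the region.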

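2. Data Structures

Two structures are mainly involved: struct mm_struct and struct vm_area_struct.

  • struct mm_struct

    Describes all the information about a process address space. The process descriptor (task_struct) refers to this structure. The key fields are explained in the comments.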
struct mm_struct {
	struct vm_area_struct *mmap;		/* list of VMAs */                              // head of the linked list of VMA objects
	struct rb_root mm_rb;                                                                     // root of the red-black tree of VMA objects
	u64 vmacache_seqnum;                   /* per-thread vmacache */
#ifdef CONFIG_MMU
	unsigned long (*get_unmapped_area) (struct file *filp,
				unsigned long addr, unsigned long len,
				unsigned long pgoff, unsigned long flags);              // method for finding a free linear address range in the process address space
#endif
	unsigned long mmap_base;		/* base of mmap area */
	unsigned long mmap_legacy_base;         /* base of mmap area in bottom-up allocations */
#ifdef CONFIG_HAVE_ARCH_COMPAT_MMAP_BASES
	/* Base adresses for compatible mmap() */
	unsigned long mmap_compat_base;
	unsigned long mmap_compat_legacy_base;
#endif
	unsigned long task_size;		/* size of task vm space */
	unsigned long highest_vm_end;		/* highest vma end address */
	pgd_t * pgd;        // points to the page global directory

	/**
	 * @mm_users: The number of users including userspace.
	 *
	 * Use mmget()/mmget_not_zero()/mmput() to modify. When this drops
	 * to 0 (i.e. when the task exits and there are no other temporary
	 * reference holders), we also release a reference on @mm_count
	 * (which may then free the &struct mm_struct if @mm_count also
	 * drops to 0).
	 */
	atomic_t mm_users;      // user count of the address space

	/**
	 * @mm_count: The number of references to &struct mm_struct
	 * (@mm_users count as 1).
	 *
	 * Use mmgrab()/mmdrop() to modify. When this drops to 0, the
	 * &struct mm_struct is freed.
	 */
	atomic_t mm_count;      // reference count of the mm_struct itself

	atomic_long_t nr_ptes;			/* PTE page table pages */      // number of page-table pages of the process
#if CONFIG_PGTABLE_LEVELS > 2
	atomic_long_t nr_pmds;			/* PMD page table pages */
#endif
	int map_count;				/* number of VMAs */        // number of VMAs

	spinlock_t page_table_lock;		/* Protects page tables and some counters */
	struct rw_semaphore mmap_sem;

	struct list_head mmlist;		/* List of maybe swapped mm's.	These are globally strung
						 * together off init_mm.mmlist, and are protected
						 * by mmlist_lock
						 */


	unsigned long hiwater_rss;	/* High-watermark of RSS usage */
	unsigned long hiwater_vm;	/* High-water virtual memory usage */

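	unsigned long total_vm;		/* Total pages mapped */    // number of pages in the process address space
	unsigned long locked_vm;	/* Pages that have PG_mlocked set */    // number of locked pages, which cannot be swapped out
	unsigned long pinned_vm;	/* Refcount permanently increased */
	unsigned long data_vm;		/* VM_WRITE & ~VM_SHARED & ~VM_STACK */     // number of pages of data memory
	unsigned long exec_vm;		/* VM_EXEC & ~VM_WRITE & ~VM_STACK */         // number of pages of executable mappings
	unsigned long stack_vm;		/* VM_STACK */                                              // number of pages of the user-mode stack
	unsigned long def_flags;
	unsigned long start_code, end_code, start_data, end_data;       // addresses of the code segment and data segment
	unsigned long start_brk, brk, start_stack;      // heap and stack addresses: start_brk is the start of the heap, brk is the current end of the heap, start_stack is the start of the user-mode stack
	unsigned long arg_start, arg_end, env_start, env_end;  // addresses of the command-line arguments and of the environment variables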

	unsigned long saved_auxv[AT_VECTOR_SIZE]; /* for /proc/PID/auxv */

	/*
	 * Special counters, in some configurations protected by the
	 * page_table_lock, in other configurations by being atomic.
	 */
	struct mm_rss_stat rss_stat;

	struct linux_binfmt *binfmt;

	cpumask_var_t cpu_vm_mask_var;

	/* Architecture-specific MM context */
	mm_context_t context;

	unsigned long flags; /* Must use atomic bitops to access the bits */

	struct core_state *core_state; /* coredumping support */
#ifdef CONFIG_MEMBARRIER
	atomic_t membarrier_state;
#endif
#ifdef CONFIG_AIO
	spinlock_t			ioctx_lock;
	struct kioctx_table __rcu	*ioctx_table;
#endif
#ifdef CONFIG_MEMCG
	/*
	 * "owner" points to a task that is regarded as the canonical
	 * user/owner of this mm. All of the following must be true in
	 * order for it to be changed:
	 *
	 * current == mm->owner
	 * current->mm != mm
	 * new_owner->mm == mm
	 * new_owner->alloc_lock is held
	 */
	struct task_struct __rcu *owner;
#endif
	struct user_namespace *user_ns;

	/* store ref to file /proc/<pid>/exe symlink points to */
	struct file __rcu *exe_file;
#ifdef CONFIG_MMU_NOTIFIER
	struct mmu_notifier_mm *mmu_notifier_mm;
#endif
#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS
	pgtable_t pmd_huge_pte; /* protected by page_table_lock */
#endif
#ifdef CONFIG_CPUMASK_OFFSTACK
	struct cpumask cpumask_allocation;
#endif
#ifdef CONFIG_NUMA_BALANCING
	/*
	 * numa_next_scan is the next time that the PTEs will be marked
	 * pte_numa. NUMA hinting faults will gather statistics and migrate
	 * pages to new nodes if necessary.
	 */
	unsigned long numa_next_scan;

	/* Restart point for scanning and setting pte_numa */
	unsigned long numa_scan_offset;

	/* numa_scan_seq prevents two threads setting pte_numa */
	int numa_scan_seq;
#endif
	/*
	 * An operation with batched TLB flushing is going on. Anything that
	 * can move process memory needs to flush the TLB when moving a
	 * PROT_NONE or PROT_NUMA mapped page.
	 */
	atomic_t tlb_flush_pending;
#ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
	/* See flush_tlb_batched_pending() */
	bool tlb_flush_batched;
#endif
	struct uprobes_state uprobes_state;
#ifdef CONFIG_HUGETLB_PAGE
	atomic_long_t hugetlb_usage;
#endif
	struct work_struct async_put_work;

#if IS_ENABLED(CONFIG_HMM)
	/* HMM needs to track a few things per mm */
	struct hmm *hmm;
#endif
} __randomize_layout;
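
The kernel comments on mm_users and mm_count describe a two-level reference count: all users together hold a single reference in mm_count, and the structure is only freed once mm_count reaches zero. The following is a toy user-space model of that rule, not kernel code. The toy_ names are hypothetical, and the real mmget()/mmput()/mmgrab()/mmdrop() use atomic operations and do far more work.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the two counters; field names mirror the kernel's. */
struct toy_mm {
    int mm_users;   /* users of the address space */
    int mm_count;   /* references to the structure; all mm_users count as 1 */
    bool freed;     /* set where the kernel would free the mm_struct */
};

void toy_mm_init(struct toy_mm *mm)
{
    mm->mm_users = 1;   /* the task that owns the address space */
    mm->mm_count = 1;   /* the single reference held on behalf of all users */
    mm->freed = false;
}

/* Pin the structure itself without using the address space. */
void toy_mmgrab(struct toy_mm *mm) { mm->mm_count++; }

void toy_mmdrop(struct toy_mm *mm)
{
    assert(mm->mm_count > 0);
    if (--mm->mm_count == 0)
        mm->freed = true;
}

/* Become a user of the address space. */
void toy_mmget(struct toy_mm *mm) { mm->mm_users++; }

void toy_mmput(struct toy_mm *mm)
{
    assert(mm->mm_users > 0);
    if (--mm->mm_users == 0) {
        /* The kernel tears down the mappings at this point, then
         * releases the reference that the users held collectively. */
        toy_mmdrop(mm);
    }
}
```

In this model, a holder that called toy_mmgrab keeps the structure alive even after the last user has gone through toy_mmput, which matches the behaviour the comments describe.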
           
  • struct vm_area_struct

    Describes one virtual region in a process address space. Every VMA corresponds to one struct vm_area_struct.

/*
 * This struct defines a memory VMM memory area. There is one of these
 * per VM-area/task.  A VM area is any part of the process virtual memory
 * space that has a special rule for the page-fault handlers (ie a shared
 * library, the executable area etc).
 */
struct vm_area_struct {
	/* The first cache line has the info for VMA tree walking. */

	unsigned long vm_start;		/* Our start address within vm_mm. */       // start address
	unsigned long vm_end;		/* The first byte after our end address
					   within vm_mm. */         // end address; the range does not include the end address itself

	/* linked list of VM areas per task, sorted by address */       // linked list sorted by start address
	struct vm_area_struct *vm_next, *vm_prev;

	struct rb_node vm_rb;       // red-black tree node

	/*
	 * Largest free memory gap in bytes to the left of this VMA.
	 * Either between this VMA and vma->vm_prev, or between one of the
	 * VMAs below us in the VMA rbtree and its ->vm_prev. This helps
	 * get_unmapped_area find a free area of the right size.
	 */
	unsigned long rb_subtree_gap;

	/* Second cache line starts here. */

	struct mm_struct *vm_mm;	/* The address space we belong to. */
	pgprot_t vm_page_prot;		/* Access permissions of this VMA. */
	unsigned long vm_flags;		/* Flags, see mm.h. */

	/*
	 * For areas with an address space and backing store,
	 * linkage into the address_space->i_mmap interval tree.
	 */
	struct {
		struct rb_node rb;
		unsigned long rb_subtree_last;
	} shared;

	/*
	 * A file's MAP_PRIVATE vma can be in both i_mmap tree and anon_vma
	 * list, after a COW of one of the file pages.	A MAP_SHARED vma
	 * can only be in the i_mmap tree.  An anonymous MAP_PRIVATE, stack
	 * or brk vma (with NULL file) can only be in an anon_vma list.
	 */
	struct list_head anon_vma_chain; /* Serialized by mmap_sem &
					  * page_table_lock */
	struct anon_vma *anon_vma;	/* Serialized by page_table_lock */

	/* Function pointers to deal with this struct. */
	const struct vm_operations_struct *vm_ops;

	/* Information about our backing store: */
	unsigned long vm_pgoff;		/* Offset (within vm_file) in PAGE_SIZE
					   units */
	struct file * vm_file;		/* File we map to (can be NULL). */     // points to an open instance of the file
	void * vm_private_data;		/* was vm_pte (shared mem) */

	atomic_long_t swap_readahead_info;
#ifndef CONFIG_MMU
	struct vm_region *vm_region;	/* NOMMU mapping region */
#endif
#ifdef CONFIG_NUMA
	struct mempolicy *vm_policy;	/* NUMA policy for the VMA */
#endif
	struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
} __randomize_layout;
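
The vm_next/vm_prev fields keep the VMAs of a process on a doubly linked list sorted by address, with each range being half-open: vm_start is included, vm_end is not. Here is a small user-space sketch of that list discipline. The toy_vma and vma_list types and the toy_insert_vma function are hypothetical simplifications; the kernel additionally links each VMA into the red-black tree.

```c
#include <assert.h>
#include <stddef.h>

struct toy_vma {
    unsigned long vm_start;            /* first address inside the region */
    unsigned long vm_end;              /* first address after the region */
    struct toy_vma *vm_next, *vm_prev;
};

struct vma_list {
    struct toy_vma *mmap;              /* head of the address-sorted list */
    int map_count;                     /* number of VMAs on the list */
};

/* Insert vma into the list, keeping it sorted by address.
 * Returns 0 on success, -1 if the range is empty or overlaps a VMA
 * that is already on the list. */
int toy_insert_vma(struct vma_list *mm, struct toy_vma *vma)
{
    struct toy_vma *prev = NULL, *cur;

    assert(mm && vma);
    if (vma->vm_start >= vma->vm_end)
        return -1;

    /* Skip every VMA that ends at or before the new start address. */
    for (cur = mm->mmap; cur && cur->vm_end <= vma->vm_start; cur = cur->vm_next)
        prev = cur;

    /* cur is the first VMA ending above the new start; because ranges
     * are half-open, it may begin exactly at the new vm_end. */
    if (cur && cur->vm_start < vma->vm_end)
        return -1;

    vma->vm_prev = prev;
    vma->vm_next = cur;
    if (prev)
        prev->vm_next = vma;
    else
        mm->mmap = vma;
    if (cur)
        cur->vm_prev = vma;
    mm->map_count++;
    return 0;
}
```

Because the ranges are half-open, a VMA covering 0x3000 to 0x5000 can sit directly between one ending at 0x3000 and one starting at 0x5000 without overlapping either.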
           

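Here is the relationship diagram:

[Figure: relationship between struct mm_struct and struct vm_area_struct]

Does it look familiar? It is very similar to the vmap mechanism in the kernel.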

Here is a high-level view of the VMAs in a process address space:

[Figure: the VMAs in a process address space]

The following interfaces operate on VMAs:

/* VMA lookup */
/* Look up the first VMA which satisfies  addr < vm_end,  NULL if none. */
extern struct vm_area_struct * find_vma(struct mm_struct * mm, unsigned long addr); // find the first VMA that satisfies addr < vm_end
extern struct vm_area_struct * find_vma_prev(struct mm_struct * mm, unsigned long addr,
					     struct vm_area_struct **pprev); // like find_vma, but also returns the preceding VMA in the list through pprev
static inline struct vm_area_struct * find_vma_intersection(struct mm_struct * mm, unsigned long start_addr, unsigned long end_addr); // find a VMA that intersects the range start_addr ~ end_addr

/* VMA insertion */
extern int insert_vm_struct(struct mm_struct *, struct vm_area_struct *); // insert the VMA into both the red-black tree and the linked list

/* VMA merging */
extern struct vm_area_struct *vma_merge(struct mm_struct *,
	struct vm_area_struct *prev, unsigned long addr, unsigned long end,
	unsigned long vm_flags, struct anon_vma *, struct file *, pgoff_t,
	struct mempolicy *, struct vm_userfaultfd_ctx); // merge the VMA with neighbouring VMAs

/* VMA splitting */
extern int split_vma(struct mm_struct *, struct vm_area_struct *,
	unsigned long addr, int new_below); // split the VMA into two VMAs at addr
           

The operations above are essentially operations on the red-black tree.
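
The lookup rule "first VMA with addr < vm_end" is easy to misread, because the VMA returned does not have to contain addr. The following user-space sketch illustrates the semantics under a simplifying assumption: a sorted array with binary search stands in for the kernel's red-black tree. The toy_ names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

struct toy_vma {
    unsigned long vm_start;   /* inclusive */
    unsigned long vm_end;     /* exclusive */
};

/* Return the first VMA with addr < vm_end, or NULL if there is none.
 * vmas[] must be sorted by address and must not overlap. */
const struct toy_vma *toy_find_vma(const struct toy_vma *vmas, size_t n,
                                   unsigned long addr)
{
    size_t lo = 0, hi = n;

    assert(vmas || n == 0);
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (addr < vmas[mid].vm_end)
            hi = mid;         /* mid qualifies; look for an earlier match */
        else
            lo = mid + 1;     /* mid ends at or below addr; search higher */
    }
    return lo < n ? &vmas[lo] : NULL;
}

/* Return a VMA overlapping [start, end), or NULL if the range is free. */
const struct toy_vma *toy_find_vma_intersection(const struct toy_vma *vmas,
                                                size_t n,
                                                unsigned long start,
                                                unsigned long end)
{
    const struct toy_vma *vma = toy_find_vma(vmas, n, start);

    /* The candidate may begin at or after end, in which case the
     * requested range falls entirely into a gap. */
    if (vma && end <= vma->vm_start)
        vma = NULL;
    return vma;
}
```

A caller that needs to know whether addr is actually mapped therefore has to check vm_start <= addr on the result as well.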

3. malloc

Everyone is familiar with malloc, but how does it interact with the lower layers to obtain memory?

Here is the picture:

[Figure: how malloc reaches sys_brk and sys_mmap]

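As the figure shows, malloc eventually calls the underlying sys_brk and sys_mmap functions. For small allocations it calls sys_brk, which dynamically adjusts the position of brk in the process address space. For large allocations it calls sys_mmap, which finds a region between the heap and the stack and maps it.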
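
The brk side of this can be observed from user space. The sketch below uses sbrk(), the C library wrapper around the brk system call, to read and move the program break. It assumes Linux with a C library that provides sbrk(); the helper names below_break and grow_break are hypothetical.

```c
#define _DEFAULT_SOURCE
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

/* Returns 1 if ptr lies below the current program break. For a pointer
 * returned by malloc this suggests it came from the brk heap rather
 * than from a separate mmap region. */
int below_break(const void *ptr)
{
    return (uintptr_t)ptr < (uintptr_t)sbrk(0);
}

/* Move the program break up by `bytes` and report how far it actually
 * moved, or -1 if the kernel refused the request. */
long grow_break(long bytes)
{
    char *before = sbrk(0);
    char *after;

    if (sbrk(bytes) == (void *)-1)
        return -1;
    after = sbrk(0);
    return (long)(after - before);
}
```

On a typical glibc system, below_break(malloc(64)) tends to return 1 while below_break(malloc(1 << 20)) tends to return 0, since glibc's default mmap threshold (M_MMAP_THRESHOLD) is 128 KiB. Other allocators, or a tuned glibc, may behave differently.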

Let's look at the sys_brk function first. It is defined through SYSCALL_DEFINE1, and the overall call flow is as follows:

[Figure: sys_brk call flow]

The call flow shows that many of the operations work on vma. Putting it all together, the effect looks like this:

[Figure: effect of sys_brk on the VMAs]

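The whole process now looks fairly clear and simple. Every process uses struct mm_struct to describe its own address space. That space consists of vma regions, which are managed through a red-black tree and a linked list. Handling malloc therefore means dynamically adjusting the position of brk, and the exact size of the region is given by vm_start ~ vm_end in struct vm_area_struct. In practice, the kernel checks whether the requested range overlaps an existing vma and handles it accordingly, or allocates a new vma to describe the range, and finally inserts it into the red-black tree and the linked list.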
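
The decision logic described above can be sketched as a toy model. This is a simplified illustration under stated assumptions, not the kernel's sys_brk: the toy_heap structure is hypothetical, it tracks only the heap VMA's end and the start of the next VMA above it, and it leaves out the resource-limit checks and the guard gap that the real code applies.

```c
#include <assert.h>

#define TOY_PAGE_SIZE 4096UL
#define TOY_PAGE_ALIGN(x) (((x) + TOY_PAGE_SIZE - 1) & ~(TOY_PAGE_SIZE - 1))

struct toy_heap {
    unsigned long start_brk;   /* start of the heap */
    unsigned long brk;         /* current end of the heap */
    unsigned long vm_end;      /* page-aligned end of the heap VMA */
    unsigned long next_start;  /* vm_start of the next VMA above the heap */
};

/* Try to move the break to new_brk and return the resulting break.
 * Failure is reported by returning the unchanged old break. */
unsigned long toy_brk(struct toy_heap *h, unsigned long new_brk)
{
    unsigned long newbrk, oldbrk;

    assert(h->start_brk <= h->brk);
    if (new_brk < h->start_brk)
        return h->brk;              /* cannot move below the heap start */

    newbrk = TOY_PAGE_ALIGN(new_brk);
    oldbrk = TOY_PAGE_ALIGN(h->brk);

    if (newbrk < oldbrk) {
        h->vm_end = newbrk;         /* shrink: the tail pages are unmapped */
    } else if (newbrk > oldbrk) {
        if (newbrk > h->next_start)
            return h->brk;          /* would run into the next VMA */
        h->vm_end = newbrk;         /* grow: the heap VMA is extended */
    }
    /* Same page: only the brk value changes, no VMA work is needed. */
    h->brk = new_brk;
    return h->brk;
}
```

The point of the model is that VMA changes happen at page granularity: moving brk within the same page touches no VMA at all, while crossing a page boundary extends or trims the heap region.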

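Once this request completes, only a range of addresses has been set up; physical memory is usually not allocated right away. It is allocated later, when an access triggers a page fault. A later article will analyze this further.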
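
This deferred allocation can be observed from user space. The sketch below maps some anonymous pages and uses mincore() to count how many are resident before and after each page is written once. It assumes Linux and a mapping of at most 64 pages; the names residency, count_resident and probe_demand_paging are hypothetical.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

#define MAX_PAGES 64

struct residency {
    long before;   /* resident pages right after mmap() */
    long after;    /* resident pages after writing to every page */
};

/* Count how many of the npages pages starting at addr are resident. */
long count_resident(void *addr, size_t npages, size_t page)
{
    unsigned char vec[MAX_PAGES];
    long n = 0;

    if (npages > MAX_PAGES || mincore(addr, npages * page, vec) != 0)
        return -1;
    for (size_t i = 0; i < npages; i++)
        n += vec[i] & 1;   /* bit 0 is set when the page is resident */
    return n;
}

/* Map npages anonymous pages, touch each one, and record residency. */
int probe_demand_paging(size_t npages, struct residency *out)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t len = npages * page;
    volatile char *p;

    if (npages == 0 || npages > MAX_PAGES)
        return -1;
    p = mmap(NULL, len, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    out->before = count_resident((void *)p, npages, page);
    for (size_t i = 0; i < npages; i++)
        p[i * page] = 1;   /* the first write to a page faults it in */
    out->after = count_resident((void *)p, npages, page);

    munmap((void *)p, len);
    return (out->before < 0 || out->after < 0) ? -1 : 0;
}
```

On a typical Linux system the before count is expected to be 0 and the after count to equal npages, although sandboxed or emulated environments may report residency differently.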

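4. mmap

mmap is used for memory mapping, that is, mapping a region into the process's own address space. There are two kinds:

  • File mapping: maps a region of a file into the process space; the file is stored on a storage device;
  • Anonymous mapping: a mapping with no backing file; the contents live in physical memory;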

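In addition, depending on whether the mapping is visible to other processes, there are again two kinds:

  • Private mapping: works on a copy of the data source, so other processes are not affected;
  • Shared mapping: every process sharing the mapping sees the changes;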
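Combining the two dimensions gives the following cases:

  1. Private anonymous mapping: typically used when allocating large blocks of memory; the heap, stack and bss segment are of this kind;
  2. Shared anonymous mapping: commonly used for communication between parent and child processes; a /dev/zero device is created in the in-memory file system;
  3. Private file mapping: commonly used for loading dynamic libraries, code segments, data segments and so on;
  4. Shared file mapping: commonly used for inter-process communication, file reads and writes and so on;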
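
The difference between private and shared mappings can be seen with a small fork experiment on an anonymous mapping. The following is a minimal sketch assuming a POSIX system that supports MAP_ANONYMOUS; the function name child_write_visible is hypothetical.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Map one int anonymously with the given sharing flag (MAP_SHARED or
 * MAP_PRIVATE), let a child process write to it, and report whether
 * the parent sees the write: 1 if visible, 0 if not, -1 on error. */
int child_write_visible(int sharing)
{
    volatile int *cell = mmap(NULL, sizeof(int), PROT_READ | PROT_WRITE,
                              sharing | MAP_ANONYMOUS, -1, 0);
    pid_t pid;
    int seen;

    if (cell == MAP_FAILED)
        return -1;
    *cell = 0;

    pid = fork();
    if (pid < 0) {
        munmap((void *)cell, sizeof(int));
        return -1;
    }
    if (pid == 0) {
        *cell = 42;        /* child writes, then exits immediately */
        _exit(0);
    }

    /* Parent: wait for the child, then check the value. */
    if (waitpid(pid, NULL, 0) < 0)
        seen = -1;
    else
        seen = (*cell == 42);

    munmap((void *)cell, sizeof(int));
    return seen;
}
```

With MAP_SHARED the parent sees the value written by the child, which is what makes shared anonymous mappings usable for parent-child communication. With MAP_PRIVATE the child's write lands on its own copy of the page.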

Common prot permissions and flags are as follows:

#define PROT_READ	0x1		/* page can be read */
#define PROT_WRITE	0x2		/* page can be written */
#define PROT_EXEC	0x4		/* page can be executed */
#define PROT_SEM	0x8		/* page may be used for atomic ops */
#define PROT_NONE	0x0		/* page can not be accessed */
#define PROT_GROWSDOWN	0x01000000	/* mprotect flag: extend change to start of growsdown vma */
#define PROT_GROWSUP	0x02000000	/* mprotect flag: extend change to end of growsup vma */

#define MAP_SHARED	0x01		/* Share changes */
#define MAP_PRIVATE	0x02		/* Changes are private */
#define MAP_TYPE	0x0f		/* Mask for type of mapping */
#define MAP_FIXED	0x10		/* Interpret addr exactly */
#define MAP_ANONYMOUS	0x20		/* don't use a file */

#define MAP_GROWSDOWN	0x0100		/* stack-like segment */
#define MAP_DENYWRITE	0x0800		/* ETXTBSY */
#define MAP_EXECUTABLE	0x1000		/* mark it as an executable */
#define MAP_LOCKED	0x2000		/* pages are locked */
#define MAP_NORESERVE	0x4000		/* don't check for reservations */
#define MAP_POPULATE	0x8000		/* populate (prefault) pagetables */
#define MAP_NONBLOCK	0x10000		/* do not block on IO */
#define MAP_STACK	0x20000		/* give out an address that is best suited for process/thread stacks */
#define MAP_HUGETLB	0x40000		/* create a huge page mapping */
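
To show how the prot and flags values are combined in practice, here is a minimal sketch of a shared file mapping: it writes a message into a temporary file through the mapping and reads it back with pread(). It assumes a POSIX system with a writable /tmp; the function name file_mapping_roundtrip and the temporary file name pattern are hypothetical.

```c
#define _DEFAULT_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Write msg into a temporary file through a MAP_SHARED mapping, then
 * read the file back with pread(). Returns 0 if the contents match,
 * -1 on any error. */
int file_mapping_roundtrip(const char *msg)
{
    char path[] = "/tmp/mmap-demo-XXXXXX";
    size_t len = strlen(msg);
    char buf[256];
    char *map;
    int fd, ok = -1;

    if (len == 0 || len > sizeof(buf))
        return -1;
    fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);                      /* file disappears once fd is closed */

    if (ftruncate(fd, (off_t)len) != 0)
        goto out;                      /* the file must cover the mapped range */

    /* PROT_READ | PROT_WRITE: pages may be read and written.
     * MAP_SHARED: writes are carried through to the file. */
    map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED)
        goto out;

    memcpy(map, msg, len);
    msync(map, len, MS_SYNC);          /* flush the dirty pages to the file */
    munmap(map, len);

    if (pread(fd, buf, len, 0) == (ssize_t)len && memcmp(buf, msg, len) == 0)
        ok = 0;
out:
    close(fd);
    return ok;
}
```

Replacing MAP_SHARED with MAP_PRIVATE in this sketch would keep the write in a private copy of the page, so the file contents read back through pread() would no longer match.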
           

An mmap operation eventually calls the do_mmap function. Finally, here is a call graph:

[Figure: do_mmap call graph]

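Author: LoyenWang

Source: https://www.cnblogs.com/LoyenWang/

WeChat official account: LoyenWang

Copyright: The copyright of this article is held jointly by the author and cnblogs.

Reposting: Reposts are welcome, but unless the author agrees otherwise, this notice must be retained and a link to the original article must be given in the repost; otherwise legal liability will be pursued.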
