
Analysis of CMA technical principles

Author: Kernel Craftsman

Preface

This article introduces the principles of CMA (Contiguous Memory Allocator), analyzes the CMA initialization and allocation flows from the source code, and explains related topics such as page migration, the LRU (Least Recently Used) cache, and the PCP (per-CPU pages) cache.

1. Overview of CMA

What is CMA? Why do we need it?

The Linux buddy system manages memory at page granularity, with a page size of 4 KB. The buddy system links free memory blocks onto free_list lists according to their size; the unit of a free_list is 2^order pages, i.e. 1 page, 2 pages, ..., 2^order pages. The maximum order is usually 10, which corresponds to a free block of 4 MB. So when we request physically contiguous pages from the buddy system, the largest block is 4 MB, and when system memory is badly fragmented it is hard to allocate even high-order pages.


Some peripherals on embedded systems, such as the GPU, camera, or HDMI, need a large amount of contiguous memory reserved to work properly, and in many cases 4 MB of contiguous memory is not enough for the device. Of course, we could reserve a larger contiguous region with memblock, but that memory could then only be used by the device and not by the buddy system, wasting memory. CMA was created for exactly this: it must be able to hand a large contiguous memory region to a device, while giving the memory back to the system when the device is not using it, maximizing memory utilization.

The CMA contiguous memory allocator is mainly used to allocate contiguous chunks of memory. At system initialization it reserves a region of physical memory. While the device driver is not using the region, the memory management system uses it to allocate and manage movable pages, handing them out to applications or to the kernel for movable-page allocations. When the device driver needs the region, the pages that were allocated from it are migrated away and the region is used for contiguous memory allocation.

In the following sections we walk through CMA's initialization, allocation, and page migration processes by reading the source code.

Note: the source code quoted in this article is from kernel 5.4, and the code excerpts omit secondary code and keep only the key code.

2. CMA's main data structures and APIs

2.1 struct cma

Use struct cma to describe a CMA area:

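For reference, an abridged definition of struct cma, based on mm/cma.h in kernel 5.4 (debugfs-only fields omitted):

struct cma {
    unsigned long base_pfn;
    unsigned long count;
    unsigned long *bitmap;
    unsigned int order_per_bit;     /* Order of pages represented by one bit */
    struct mutex lock;
    const char *name;
};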

base_pfn: The starting page frame number of the physical address of the CMA region

count: The number of pages in the CMA area

bitmap: describes the allocation state of the pages in the CMA area; 1 means allocated, 0 means free.

order_per_bit: Indicates the number of pages (2^order_per_bit) represented by a bit in the bitmap.

2.2 cma_init_reserved_mem

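Its prototype in mm/cma.c (kernel 5.4) is:

int __init cma_init_reserved_mem(phys_addr_t base, phys_addr_t size,
                                 unsigned int order_per_bit,
                                 const char *name,
                                 struct cma **res_cma);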

Takes a block of memory with address base and size from the reserved memory and uses it to create and initialize a struct cma.

2.3 cma_init_reserved_areas

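An abridged sketch from mm/cma.c (kernel 5.4): it simply activates every registered CMA area and is run as a core_initcall:

static int __init cma_init_reserved_areas(void)
{
    int i;

    for (i = 0; i < cma_area_count; i++) {
        int ret = cma_activate_area(&cma_areas[i]);

        if (ret)
            return ret;
    }

    return 0;
}
core_initcall(cma_init_reserved_areas);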

To improve memory utilization, this function gives the CMA memory back to the buddy system, so that buddy can hand it out for movable page allocations.

2.4 cma_alloc

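Its prototype in kernel 5.4 is:

struct page *cma_alloc(struct cma *cma, size_t count, unsigned int align,
                       bool no_warn);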

Used to allocate count contiguous pages from the specified CMA area, aligned according to align.

2.5 cma_release

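Its prototype in kernel 5.4 is:

bool cma_release(struct cma *cma, const struct page *pages, unsigned int count);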

Used to release count contiguous pages that were previously allocated.

3. Analysis of CMA's main processes

3.1 CMA initialization process

3.1.1 System initialization:

During system initialization a CMA region is created, either from reserved memory described in the dts or through command-line arguments. Here we look at the commonly used dts reserved-memory method, where the physical memory is described in the dts configuration, for example:

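A minimal illustrative reserved-memory node (the size and alignment values here are made up for the example; only the properties discussed below matter):

reserved-memory {
    #address-cells = <2>;
    #size-cells = <2>;
    ranges;

    linux,cma {
        compatible = "shared-dma-pool";
        reusable;
        size = <0x0 0x10000000>;        /* 256 MiB, example value */
        alignment = <0x0 0x400000>;     /* 4 MiB, example value */
        linux,cma-default;
    };
};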

linux,cma is the name of the CMA region node.

compatible must be "shared-dma-pool".

reusable indicates that the CMA memory can be used by the buddy system.

size is the size of the CMA region in bytes.

alignment specifies the address alignment of the CMA region.

The linux,cma-default property indicates that this CMA region will serve as the default CMA pool for CMA memory allocations.

During the system startup process, the kernel parses the DTB file described above to complete the memory information registration, and the call process is:

setup_arch

arm64_memblock_init

early_init_fdt_scan_reserved_mem

__reserved_mem_init_node

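An abridged sketch of __reserved_mem_init_node from drivers/of/of_reserved_mem.c (kernel 5.4):

static int __init __reserved_mem_init_node(struct reserved_mem *rmem)
{
    const struct of_device_id *i;
    int ret = -ENOENT;

    /* walk every entry that RESERVEDMEM_OF_DECLARE() placed in the
     * __reservedmem_of_table section */
    for (i = __reservedmem_of_table; i < &__rmem_of_table_sentinel; i++) {
        reservedmem_of_init_fn initfn = i->data;
        const char *compat = i->compatible;

        /* match the node's compatible (for CMA: "shared-dma-pool") */
        if (!of_flat_dt_is_compatible(rmem->fdt_node, compat))
            continue;

        /* run the registered init function, e.g. rmem_cma_setup() */
        ret = initfn(rmem);
        if (ret == 0) {
            pr_info("initialized node %s, compatible id %s\n",
                    rmem->name, compat);
            break;
        }
    }
    return ret;
}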

__reserved_mem_init_node iterates over the entries in the __reservedmem_of_table section, checks whether one of them matches the compatible string in the dts (for CMA this is "shared-dma-pool"), and then runs the corresponding initfn. Everything defined with RESERVEDMEM_OF_DECLARE is linked into the __reservedmem_of_table section, so eventually the function registered with RESERVEDMEM_OF_DECLARE is called, namely rmem_cma_setup:

3.1.2 rmem_cma_setup

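An abridged sketch of rmem_cma_setup from kernel/dma/contiguous.c (kernel 5.4); the @1/@2 comments correspond to the notes below:

static int __init rmem_cma_setup(struct reserved_mem *rmem)
{
    phys_addr_t align = PAGE_SIZE << max(MAX_ORDER - 1, pageblock_order);
    phys_addr_t mask = align - 1;
    unsigned long node = rmem->fdt_node;
    struct cma *cma;
    int err;

    /* the node must be "reusable" and must not be "no-map" */
    if (!of_get_flat_dt_prop(node, "reusable", NULL) ||
        of_get_flat_dt_prop(node, "no-map", NULL))
        return -EINVAL;

    if ((rmem->base & mask) || (rmem->size & mask)) {
        pr_err("Reserved memory: incorrect alignment of CMA region\n");
        return -EINVAL;
    }

    err = cma_init_reserved_mem(rmem->base, rmem->size, 0, rmem->name, &cma);  /* @1 */
    if (err) {
        pr_err("Reserved memory: unable to setup CMA region\n");
        return err;
    }

    if (of_get_flat_dt_prop(node, "linux,cma-default", NULL))
        dma_contiguous_set_default(cma);                                        /* @2 */

    rmem->ops = &rmem_cma_ops;
    rmem->priv = cma;
    ...
    return 0;
}
RESERVEDMEM_OF_DECLARE(cma, "shared-dma-pool", rmem_cma_setup);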

@1 cma_init_reserved_mem takes a block of memory with address base and size from the reserved memory — here the address information parsed from the dtb — and uses it to create and initialize a struct cma. The code is simple:

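Abridged from mm/cma.c (kernel 5.4), keeping only the key steps:

int __init cma_init_reserved_mem(phys_addr_t base, phys_addr_t size,
                                 unsigned int order_per_bit,
                                 const char *name, struct cma **res_cma)
{
    struct cma *cma;

    /* sanity checks: a free slot exists, the region really is reserved,
     * and base/size meet the alignment requirements */
    if (cma_area_count == ARRAY_SIZE(cma_areas))
        return -ENOSPC;
    if (!size || !memblock_is_region_reserved(base, size))
        return -EINVAL;
    ...

    /* record the region in the next free struct cma slot */
    cma = &cma_areas[cma_area_count];
    cma->base_pfn = PFN_DOWN(base);
    cma->count = size >> PAGE_SHIFT;
    cma->order_per_bit = order_per_bit;
    *res_cma = cma;
    cma_area_count++;
    totalcma_pages += (size / PAGE_SIZE);

    return 0;
}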

@2 If the dts specifies linux,cma-default, dma_contiguous_set_default points the default CMA area at this region, and dma_alloc_contiguous will then allocate from it by default.

At this point, CMA is just like any other reserved memory: it sits in memblock.reserved, and that reserved memory is not used by the buddy system. As mentioned earlier, to improve memory utilization the CMA memory also has to be marked and handed back to the buddy system, so that buddy can provide it to applications or to the kernel as movable pages. This is implemented by cma_init_reserved_areas.

3.1.3 cma_init_reserved_areas

Initialization functions registered with core_initcall are called later during kernel initialization. cma_init_reserved_areas simply calls cma_activate_area for each CMA area. cma_activate_area allocates the bitmap according to the size of the CMA area and then calls init_cma_reserved_pageblock in a loop to process all the pages in the CMA area. Let's look at the source code:

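An abridged sketch of cma_activate_area from mm/cma.c (kernel 5.4); the @ numbers match the notes below:

static int __init cma_activate_area(struct cma *cma)
{
    int bitmap_size = BITS_TO_LONGS(cma_bitmap_maxno(cma)) * sizeof(long);  /* @1 */
    unsigned long base_pfn = cma->base_pfn, pfn = base_pfn;
    unsigned i = cma->count >> pageblock_order;                             /* @1 */
    struct zone *zone;

    cma->bitmap = kzalloc(bitmap_size, GFP_KERNEL);
    if (!cma->bitmap)
        return -ENOMEM;

    zone = page_zone(pfn_to_page(pfn));

    do {                                                                    /* @2 */
        unsigned j;

        base_pfn = pfn;
        for (j = pageblock_nr_pages; j; --j, pfn++) {
            if (page_zone(pfn_to_page(pfn)) != zone)                        /* @3 */
                goto not_in_zone;
        }
        init_cma_reserved_pageblock(pfn_to_page(base_pfn));                 /* @4 */
    } while (--i);

    mutex_init(&cma->lock);
    return 0;

not_in_zone:
    pr_err("CMA area %s could not be activated\n", cma->name);
    kfree(cma->bitmap);
    cma->count = 0;
    return -EINVAL;
}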

@1 The CMA region uses a bitmap to track the state of each page; cma_bitmap_maxno computes how many bits the bitmap needs, and the variable i holds how many pageblocks (4 MB each) the CMA area contains.

@2 Iterate over all the pageblocks in the CMA area.

@3 Make sure all pages in the CMA region are within the same zone.

@4 Finally call init_cma_reserved_pageblock, which processes one pageblock at a time: it sets the migrate type to MIGRATE_CMA, adds the pages to the buddy system, and updates the total number of pages managed by the zone. As follows:

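An abridged sketch of init_cma_reserved_pageblock from mm/page_alloc.c (kernel 5.4):

void __init init_cma_reserved_pageblock(struct page *page)
{
    unsigned i = pageblock_nr_pages;
    struct page *p = page;

    do {
        __ClearPageReserved(p);                         /* @1 */
        set_page_count(p, 0);
    } while (++p, --i);

    set_pageblock_migratetype(page, MIGRATE_CMA);       /* @2 */

    ...                                                 /* @3 free the pages into buddy */
    set_page_refcounted(page);
    __free_pages(page, pageblock_order);

    adjust_managed_page_count(page, pageblock_nr_pages);/* @4 */
}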

@1 clears the reserved flag that has been set on the page.

@2 Set migratetype to MIGRATE_CMA

@3 Call __free_pages to release the pages of the pageblock into the buddy system.

@4 Update the amount of memory managed by the buddy system.

At this point, this part of the CMA memory can also be allocated through buddy. In the buddy system, when an allocation requests movable pages and carries the CMA allocation flag, memory can be allocated from the CMA area:

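For example, an abridged sketch of __rmqueue from mm/page_alloc.c (upstream kernel 5.4): when a movable allocation cannot be satisfied from the normal free lists, it falls back to the MIGRATE_CMA free lists:

static __always_inline struct page *
__rmqueue(struct zone *zone, unsigned int order, int migratetype,
          unsigned int alloc_flags)
{
    struct page *page;

retry:
    page = __rmqueue_smallest(zone, order, migratetype);
    if (unlikely(!page)) {
        /* movable allocations may fall back to the CMA free lists */
        if (migratetype == MIGRATE_MOVABLE)
            page = __rmqueue_cma_fallback(zone, order);

        if (!page && !__rmqueue_fallback(zone, order, migratetype,
                                         alloc_flags))
            goto retry;
    }
    ...
    return page;
}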

3.2 CMA allocation process

Before reading the CMA allocation code, let's first look at its call chain — cma_alloc() → alloc_contig_range() → __alloc_contig_migrate_range() → migrate_pages() — and then analyze the process and each function through the source code.


3.2.1 cma_alloc

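An abridged sketch of cma_alloc from mm/cma.c (kernel 5.4); the @ numbers match the notes below. The msleep(100)/5-retry behaviour described at @3 is carried by some vendor trees and is only indicated by a comment here:

struct page *cma_alloc(struct cma *cma, size_t count, unsigned int align,
                       bool no_warn)
{
    unsigned long mask, offset, pfn, start = 0;
    unsigned long bitmap_maxno, bitmap_no, bitmap_count;
    struct page *page = NULL;
    int ret = -ENOMEM;

    mask = cma_bitmap_aligned_mask(cma, align);
    offset = cma_bitmap_aligned_offset(cma, align);
    bitmap_maxno = cma_bitmap_maxno(cma);
    bitmap_count = cma_bitmap_pages_to_bits(cma, count);               /* @1 */

    for (;;) {
        mutex_lock(&cma->lock);
        bitmap_no = bitmap_find_next_zero_area_off(cma->bitmap,
                        bitmap_maxno, start, bitmap_count, mask,
                        offset);                                       /* @2 */
        if (bitmap_no >= bitmap_maxno) {
            mutex_unlock(&cma->lock);
            break;
        }
        bitmap_set(cma->bitmap, bitmap_no, bitmap_count);              /* @4 */
        mutex_unlock(&cma->lock);

        pfn = cma->base_pfn + (bitmap_no << cma->order_per_bit);
        mutex_lock(&cma_mutex);
        ret = alloc_contig_range(pfn, pfn + count, MIGRATE_CMA,
                        GFP_KERNEL | (no_warn ? __GFP_NOWARN : 0));    /* @5 */
        mutex_unlock(&cma_mutex);
        if (ret == 0) {
            page = pfn_to_page(pfn);
            break;
        }

        cma_clear_bitmap(cma, pfn, count);                             /* @6 */
        if (ret != -EBUSY)
            break;

        /* @3: on -EBUSY retry from the next position (some trees
         * additionally msleep(100) and cap the number of retries) */
        start = bitmap_no + mask + 1;
    }
    ...
    return page;
}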

@1 Compute the bitmap parameters: the maximum number of bits in the bitmap (bitmap_maxno), how many bits this allocation needs (bitmap_count), and so on.

@2 Using the bitmap information computed above, find a run of free bits in the bitmap.

@3 Some special cases (discussed later) often cause the CMA allocation to fail; when the allocation returns -EBUSY, the code msleep(100)s and retries, by default up to 5 times.

@4 The bits corresponding to the pages about to be allocated are set to 1 in advance, marking them as allocated.

@5 Use alloc_contig_range for the actual memory allocation; it is analyzed in detail in a later section.

@6 If the allocation fails, the corresponding bits in the bitmap are cleared.

3.2.2 "Batching" in the kernel: LRU cache and PCP cache

Before analyzing alloc_contig_range, let's introduce two concepts: the LRU cache and the PCP cache. When reading the kernel source we find that the kernel likes to use "batching" to improve efficiency and reduce lock overhead.

1. LRU cache

The kernel manages pages with the classic LRU (Least Recently Used) lists: an active list and an inactive list.


Note: For a detailed introduction to the LRU algorithm, please refer to the Kernel Craftsman's earlier article introducing kswapd.

Newly allocated pages are constantly added to the active LRU list, and pages are constantly taken off the active list and moved to the inactive LRU list. The list lock (pgdat->lru_lock) is highly contended, and if pages were transferred one at a time the competition for the lock would be severe.

To improve this, the kernel added a per-CPU LRU cache (represented by struct pagevec). Pages about to join an LRU list are first placed into the current CPU's LRU cache; once the cache is full (usually 15 pages), lru_lock is taken once and the pages are moved onto the LRU lists in a single batch.
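The LRU cache is just a small per-CPU array of page pointers; from include/linux/pagevec.h (kernel 5.4):

#define PAGEVEC_SIZE    15

struct pagevec {
    unsigned char nr;                   /* number of pages currently cached */
    bool percpu_pvec_drained;
    struct page *pages[PAGEVEC_SIZE];
};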

2. PCP (per-CPU pages) cache

Since memory pages are a shared resource and the system allocates and frees pages frequently, taking zone->lock and synchronizing between CPUs causes a lot of overhead. Again to improve this, the kernel added a per-CPU page cache (struct per_cpu_pages): each CPU takes a small batch of pages from buddy and keeps them locally. When the system needs to allocate memory it takes pages from the PCP cache first, and only refills from buddy when the cache runs out. When pages are freed they also go back to the PCP cache first, and are returned to the buddy system only when the cache is full.
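The PCP cache is described by struct per_cpu_pages; abridged from include/linux/mmzone.h (kernel 5.4):

struct per_cpu_pages {
    int count;      /* number of pages in the lists */
    int high;       /* high watermark, emptying needed */
    int batch;      /* chunk size for buddy add/remove */

    /* Lists of pages, one per migrate type stored on the pcp-lists */
    struct list_head lists[MIGRATE_PCPTYPES];
};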

The kernel used to support only order-0 pages in the PCP cache; the community has patches extending the PCP cache to order > 0 pages.


3.2.3 alloc_contig_range

Let's continue and see what alloc_contig_range, called from cma_alloc, does.

In short, its purpose is to take a "dirty" block of contiguous memory (already in use by various kinds of pages), reclaim or migrate whatever is in it, and finally hand this clean contiguous memory back to the caller.


Read the following code:

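An abridged sketch of alloc_contig_range from mm/page_alloc.c, based on upstream kernel 5.4 (the exact position of the drain differs between trees). The @ numbers match the notes below:

int alloc_contig_range(unsigned long start, unsigned long end,
                       unsigned migratetype, gfp_t gfp_mask)
{
    unsigned long outer_start;
    int ret;
    struct compact_control cc = {
        .nr_migratepages = 0,
        .order = -1,
        .zone = page_zone(pfn_to_page(start)),
        .mode = MIGRATE_SYNC,
        .ignore_skip_hint = true,
        .gfp_mask = current_gfp_context(gfp_mask),
    };
    INIT_LIST_HEAD(&cc.migratepages);

    ret = start_isolate_page_range(pfn_max_align_down(start),
                                   pfn_max_align_up(end), migratetype, 0);  /* @1 */
    if (ret < 0)
        return ret;

    ret = __alloc_contig_migrate_range(&cc, start, end);                    /* @3 */
    if (ret && ret != -EBUSY)
        goto done;

    lru_add_drain_all();
    drain_all_pages(cc.zone);                                               /* @2 */

    /* ... compute outer_start and verify the range is fully isolated ... */
    outer_start = isolate_freepages_range(&cc, outer_start, end);
    if (!outer_start) {
        ret = -EBUSY;
        goto done;
    }
    ...
done:
    undo_isolate_page_range(pfn_max_align_down(start),
                            pfn_max_align_up(end), migratetype);            /* @4 */
    return ret;
}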

@1 start_isolate_page_range: changes the migrate type of the pageblocks covering the target memory block from MIGRATE_CMA to MIGRATE_ISOLATE. Because the buddy system does not allocate pages from MIGRATE_ISOLATE pageblocks, this prevents the pages from being handed out by buddy during the CMA allocation.

@2 drain_all_pages: drains the per-CPU pages. As introduced earlier, the pages sitting in the PCP caches must be returned to buddy.

@3 __alloc_contig_migrate_range: migrates the pages of the target memory block that are already in use, i.e. copies the page contents to another memory area and updates all references to the pages.

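An abridged sketch of __alloc_contig_migrate_range as found in 5.4-based trees that carry the lru_cache_disable backport (vanilla 5.4 calls migrate_prep() at this point instead). The @ numbers match the notes below:

static int __alloc_contig_migrate_range(struct compact_control *cc,
                                        unsigned long start, unsigned long end)
{
    unsigned long nr_reclaimed;
    unsigned long pfn = start;
    unsigned int tries = 0;
    int ret = 0;

    lru_cache_disable();                                                /* @3.1 */

    while (pfn < end || !list_empty(&cc->migratepages)) {
        if (fatal_signal_pending(current)) {
            ret = -EINTR;
            break;
        }

        if (list_empty(&cc->migratepages)) {
            cc->nr_migratepages = 0;
            pfn = isolate_migratepages_range(cc, pfn, end);             /* @3.2 */
            if (!pfn) {
                ret = -EINTR;
                break;
            }
            tries = 0;
        } else if (++tries == 5) {
            ret = ret < 0 ? ret : -EBUSY;
            break;
        }

        nr_reclaimed = reclaim_clean_pages_from_list(cc->zone,
                                        &cc->migratepages);             /* @3.3 */
        cc->nr_migratepages -= nr_reclaimed;

        ret = migrate_pages(&cc->migratepages, alloc_migrate_target,
                            NULL, 0, cc->mode, MR_CONTIG_RANGE);        /* @3.4 */
    }

    lru_cache_enable();                                                 /* @3.5 */

    if (ret < 0) {
        putback_movable_pages(&cc->migratepages);
        return ret;
    }
    return 0;
}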

@3.1 lru_cache_disable: pages sitting in the LRU cache cannot be migrated, so the pagevec pages — pages that are about to be added to an LRU list but are still in the LRU cache — must first be flushed onto the LRU lists, and the LRU cache is then disabled.

@3.2 isolate_migratepages_range: isolates the in-use pages in the range to be allocated, putting them on the cc->migratepages list, and returns the frame number of the last page scanned. The main purpose of the isolation is to prevent the pages from being freed, or touched by the LRU reclaim path, during the subsequent migration.

@3.3 reclaim_clean_pages_from_list: clean file pages are simply reclaimed directly.

@3.4 migrate_pages: this is the kernel's main interface for page migration; most migration-related paths in the kernel eventually call it. It migrates the movable physical pages onto newly allocated pages. It is described in detail in a later chapter.

@3.5 lru_cache_enable: once the migration is complete, re-enable the LRU pagevec cache.

@4 undo_isolate_page_range: the reverse of @1; the migrate type of the pageblocks is reverted from MIGRATE_ISOLATE back to MIGRATE_CMA.

Finally, these pages are returned to the caller.

3.3 CMA Release Process

The cma_release code that frees CMA memory is simple: the newly freed pages are handed back to buddy and the corresponding bits in CMA's bitmap are cleared. Here is the code:

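Abridged from mm/cma.c (kernel 5.4):

bool cma_release(struct cma *cma, const struct page *pages, unsigned int count)
{
    unsigned long pfn;

    if (!cma || !pages)
        return false;

    pfn = page_to_pfn(pages);

    if (pfn < cma->base_pfn || pfn >= cma->base_pfn + cma->count)
        return false;

    free_contig_range(pfn, count);      /* give the pages back to buddy */
    cma_clear_bitmap(cma, pfn, count);  /* mark them free in the CMA bitmap */

    return true;
}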

4. Page migration

If the system is to use memory in the CMA region, the pages there must be migratable, so that they can be moved out when a device needs the CMA area. So which pages can be migrated? There are two types:

1. Pages on the LRU lists. LRU pages are pages mapped into user process address spaces, such as anonymous pages and file pages; they are allocated from the buddy allocator with the movable migrate type (movable pageblocks).

2. Non-LRU movable pages. Non-LRU pages are usually pages allocated for kernel space; to support migration, the owning driver must implement the relevant methods in page->mapping->a_ops. For example, pages allocated by the common zsmalloc memory allocator support migration.

migrate_pages() is the kernel's main interface for page migration, and most migration-related paths in the kernel call it. In outline, migrate_pages() allocates a new page, breaks the mappings of the old page, re-establishes the mappings to the new page, copies the contents of the old page to the new page, sets the struct page attributes of the new page to match the old page, and finally frees the old page. Let's read its source code.


4.1 migrate_pages:

The migrate_pages function and its parameters:

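Its prototype in mm/migrate.c (kernel 5.4):

int migrate_pages(struct list_head *from, new_page_t get_new_page,
                  free_page_t put_new_page, unsigned long private,
                  enum migrate_mode mode, int reason);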

from: the list of pages to be migrated.

get_new_page: pointer to the function that allocates a new page.

put_new_page: pointer to the function that releases a new page (on failure).

private: argument passed through to get_new_page; CMA does not use it and simply passes NULL/0 here.

mode: the migration mode; CMA uses MIGRATE_SYNC. The following modes exist:

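From include/linux/migrate_mode.h (kernel 5.4), with slightly shortened comments:

enum migrate_mode {
    MIGRATE_ASYNC,          /* never block */
    MIGRATE_SYNC_LIGHT,     /* allow blocking, but avoid expensive waits
                               such as waiting for writeback */
    MIGRATE_SYNC,           /* may block, wait for writeback to finish */
    MIGRATE_SYNC_NO_COPY,   /* like MIGRATE_SYNC, but the page contents are
                               copied by a device (e.g. DMA), not the CPU */
};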

reason: the reason for the migration; it records which feature triggered the migration, because many kernel paths need migrate_pages, such as memory compaction, memory hotplug, and so on. CMA passes MR_CONTIG_RANGE, indicating that the migration was triggered by alloc_contig_range() allocating contiguous memory.

Looking at the migrate_pages code: it iterates over the from list and calls unmap_and_move on each page to carry out the migration.

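An abridged sketch of migrate_pages from mm/migrate.c (kernel 5.4):

int migrate_pages(struct list_head *from, new_page_t get_new_page,
                  free_page_t put_new_page, unsigned long private,
                  enum migrate_mode mode, int reason)
{
    int retry = 1;
    int nr_failed = 0, nr_succeeded = 0;
    int pass, rc;
    struct page *page, *page2;

    for (pass = 0; pass < 10 && retry; pass++) {
        retry = 0;

        list_for_each_entry_safe(page, page2, from, lru) {
            cond_resched();

            if (PageHuge(page))
                rc = unmap_and_move_huge_page(...);
            else
                rc = unmap_and_move(get_new_page, put_new_page,
                                    private, page, pass > 2,
                                    mode, reason);

            switch (rc) {
            case -EAGAIN:
                retry++;            /* try this page again on the next pass */
                break;
            case MIGRATEPAGE_SUCCESS:
                nr_succeeded++;
                break;
            default:
                nr_failed++;
                break;
            }
        }
    }
    ...
    return nr_failed;   /* 0 means every page was migrated */
}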

4.2 unmap_and_move

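An abridged sketch of unmap_and_move (kernel 5.4):

static int unmap_and_move(new_page_t get_new_page, free_page_t put_new_page,
                          unsigned long private, struct page *page,
                          int force, enum migrate_mode mode,
                          enum migrate_reason reason)
{
    int rc = MIGRATEPAGE_SUCCESS;
    struct page *newpage = NULL;

    if (page_count(page) == 1) {
        /* the page was already freed under us, nothing to migrate */
        goto out;
    }

    newpage = get_new_page(page, private);  /* allocate the destination page */
    if (!newpage)
        return -ENOMEM;

    rc = __unmap_and_move(page, newpage, force, mode);
    ...
out:
    ...  /* on success drop the old page; on failure put it back and
          * release the unused newpage via put_new_page */
    return rc;
}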

The unmap_and_move function takes essentially the same parameters as migrate_pages. It calls get_new_page to allocate a new page and then uses __unmap_and_move to migrate the old page onto this newly allocated page. We mainly look at __unmap_and_move below.

4.3 __unmap_and_move:

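An abridged sketch of __unmap_and_move from mm/migrate.c (kernel 5.4); the @ numbers match the notes below:

static int __unmap_and_move(struct page *page, struct page *newpage,
                            int force, enum migrate_mode mode)
{
    int rc = -EAGAIN;
    int page_was_mapped = 0;
    struct anon_vma *anon_vma = NULL;
    bool is_lru = !__PageMovable(page);

    if (!trylock_page(page)) {                          /* @1 */
        if (!force || mode == MIGRATE_ASYNC)
            goto out;
        lock_page(page);
    }

    if (PageWriteback(page)) {                          /* @2 */
        if (mode != MIGRATE_SYNC && mode != MIGRATE_SYNC_NO_COPY) {
            rc = -EBUSY;
            goto out_unlock;
        }
        wait_on_page_writeback(page);
    }

    if (PageAnon(page) && !PageKsm(page))
        anon_vma = page_get_anon_vma(page);             /* @3 */

    if (unlikely(!trylock_page(newpage)))               /* @4 */
        goto out_unlock;

    if (unlikely(!is_lru)) {                            /* @5 */
        rc = move_to_new_page(newpage, page, mode);
        goto out_unlock_both;
    }

    if (page_mapped(page)) {                            /* @6 */
        try_to_unmap(page, TTU_MIGRATION | TTU_IGNORE_MLOCK |
                           TTU_IGNORE_ACCESS);
        page_was_mapped = 1;
    }

    if (!page_mapped(page))
        rc = move_to_new_page(newpage, page, mode);     /* @7 */

    if (page_was_mapped)                                /* @8 */
        remove_migration_ptes(page,
            rc == MIGRATEPAGE_SUCCESS ? newpage : page, false);

out_unlock_both:
    unlock_page(newpage);                               /* @9 */
out_unlock:
    if (anon_vma)
        put_anon_vma(anon_vma);
    unlock_page(page);
out:
    if (rc == MIGRATEPAGE_SUCCESS) {                    /* @10 */
        if (unlikely(!is_lru))
            put_page(newpage);
        else
            putback_lru_page(newpage);
    }
    return rc;
}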

@1 Try to take the old page's PG_locked page lock. If the page has already been locked by another process, the trylock fails; for asynchronous migration (MIGRATE_ASYNC), the page is simply skipped when the lock cannot be obtained. CMA's migration mode is MIGRATE_SYNC, so lock_page must be used here to wait until the lock is acquired.

@2 Handle pages that are being written back: whether to wait for writeback to complete depends on the migration mode. MIGRATE_SYNC_LIGHT and MIGRATE_ASYNC do not wait; CMA's migration mode is MIGRATE_SYNC, so wait_on_page_writeback() is called to wait for the writeback to finish.

@3 For anonymous pages, to prevent the anon_vma data structure from being freed during migration, the anon_vma->refcount reference count is increased with page_get_anon_vma.

@4 Take the new page's PG_locked page lock; under normal circumstances this succeeds.

@5 Determine whether this page is a non-LRU page:

If the page is a non-LRU page, it is handled by move_to_new_page, which calls back into the driver's migratepage method to perform the page migration.

If it is an LRU page, proceed to @6


@6 Use page_mapped() to determine whether any user PTE maps the page. If so, call try_to_unmap() to remove all relevant PTEs of the old page through the reverse mapping mechanism.

@7 Call move_to_new_page to copy the contents of the old page, together with its struct page attributes, to the new page. For LRU pages, move_to_new_page calls migrate_page, which does two things: it copies the struct page attributes and the page contents.

@8 Migrate the page tables: remove_migration_ptes re-establishes the mappings from the process to the new page through the reverse mapping mechanism.

@9 When the migration is complete, the PG_locked locks of the old and new pages are released; for anonymous pages we also need put_anon_vma to drop the anon_vma->refcount reference count.

@10 For non-LRU pages, put_page is called to drop the extra reference on the new page (_refcount minus 1).

For traditional LRU pages, putback_lru_page adds the new page to the LRU list.

4.4 move_to_new_page

In @5 and @7, both non-LRU and LRU pages are copied through move_to_new_page; let's look at its implementation:

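An abridged sketch of move_to_new_page from mm/migrate.c (kernel 5.4):

static int move_to_new_page(struct page *newpage, struct page *page,
                            enum migrate_mode mode)
{
    struct address_space *mapping;
    int rc = -EAGAIN;
    bool is_lru = !__PageMovable(page);

    mapping = page_mapping(page);

    if (likely(is_lru)) {
        if (!mapping)
            rc = migrate_page(mapping, newpage, page, mode);
        else if (mapping->a_ops->migratepage)
            /* filesystems may provide their own migratepage method */
            rc = mapping->a_ops->migratepage(mapping, newpage, page, mode);
        else
            rc = fallback_migrate_page(mapping, newpage, page, mode);
    } else {
        /* non-LRU movable page: call the owning driver's migratepage hook */
        rc = mapping->a_ops->migratepage(mapping, newpage, page, mode);
        ...
    }
    ...
    return rc;
}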
1. For non-LRU pages, this function calls back into the driver's migratepage method to perform the page migration.

For example, the zsmalloc memory allocator registers a migration callback, and during migration zsmalloc's zs_page_migrate is called to migrate the pages it allocated. zsmalloc itself is not discussed here; interested readers can read its source code.


2. For LRU pages, migrate_page is called, which does two things: it copies the struct page attributes and the page contents.

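migrate_page from mm/migrate.c (kernel 5.4), slightly abridged; the @7.1/@7.2 comments match the notes below:

int migrate_page(struct address_space *mapping,
                 struct page *newpage, struct page *page,
                 enum migrate_mode mode)
{
    int rc;

    BUG_ON(PageWriteback(page));    /* writeback must be complete */

    rc = migrate_page_move_mapping(mapping, newpage, page, 0);     /* @7.1 */
    if (rc != MIGRATEPAGE_SUCCESS)
        return rc;

    if (mode != MIGRATE_SYNC_NO_COPY)
        migrate_page_copy(newpage, page);   /* @7.2 contents + @7.3 states */
    else
        migrate_page_states(newpage, page);

    return MIGRATEPAGE_SUCCESS;
}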

@7.1 Copying the struct page attributes:

migrate_page_move_mapping first checks whether the page's refcount is as expected, and then copies the page's mapping data, such as page->index, page->mapping, and the PG_swapbacked flag.


Incidentally, _refcount is the reference count in struct page, indicating how many references the kernel holds on the page. When _refcount is 0, the page is free or about to be freed; when _refcount > 0, the page has been allocated and is in use by the kernel, and will not be freed for the time being.

Functions such as get_page, pin_user_pages and get_user_pages increase _refcount; this prevents the page from being freed by other paths during certain operations (such as while it is being added to the LRU), but it also makes the refcount differ from what is expected here, in which case the page cannot be migrated.

@7.2 Copying the page contents:

copy_highpage is simple: it kmaps the two pages and copies the contents of the old page into the new page.

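copy_highpage from include/linux/highmem.h (kernel 5.4; the non-HIGHMEM variant boils down to a plain copy_page):

static inline void copy_highpage(struct page *to, struct page *from)
{
    char *vfrom, *vto;

    vfrom = kmap_atomic(from);
    vto = kmap_atomic(to);
    copy_page(vto, vfrom);      /* copy the old page's contents to the new page */
    kunmap_atomic(vto);
    kunmap_atomic(vfrom);
}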

@7.3 migrate_page_states copies the page flags, such as PG_dirty and the other PG_xxx bits; this is also part of copying the struct page attributes.

4.5 Summary:

The whole migration process has now been analyzed. To recap: migrate_pages() walks the list of pages to be migrated; for each page, unmap_and_move() allocates a new page, unmaps the old page, copies its contents and attributes to the new page via move_to_new_page(), re-establishes the page-table mappings, and finally puts back the old page.


5. Summary

From the analysis above, we can see that the design of CMA revolves around two points:

1. When the device driver is not using it, the CMA memory is managed by buddy; this is implemented by cma_init_reserved_areas() during initialization and by cma_release() when CMA memory is freed.

2. When the device driver needs it, a physically contiguous block of CMA memory is requested through cma_alloc. The pages in that range that buddy has already handed out to applications or to the kernel as movable pages are cleaned up by reclaim or migration, and this physically contiguous "clean" memory is finally handed to the device driver. The core implementation is in alloc_contig_range() and migrate_pages().

References

1. The code referenced and interpreted in this article is from kernel 5.4: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/?h=v5.4.234

2. "Run Linux Kernel"

3. Song Baohua: On the full version of Linux Page Migration:

https://blog.csdn.net/21cnbao/article/details/108067917