
kvm: virtual x86 mmu setup

One of the initialization steps that KVM performs when a virtual machine (VM) is started is setting up the vCPU's memory management unit (MMU) to translate virtual (linear) addresses into physical ones within the guest's domain. For x86, which is what will be covered here, most of the corresponding code lives in <kernel>/arch/x86/kvm/mmu.c.

Disclaimer: Although this document requires at least some basic knowledge of x86 paging and traditional virtual memory, I hope it can be useful for people who are interested in low-level virtualization, the Linux kernel and/or KVM internals in general.

The code path that gets us there, starting from the KVM_CREATE_VCPU ioctl, is:

kvm_vm_ioctl (KVM_CREATE_VCPU branch) -> kvm_vm_ioctl_create_vcpu -> kvm_arch_vcpu_setup -> kvm_mmu_setup -> init_kvm_mmu

The first step calls kvm_mmu_setup(), which simply does some trivial asserting and then calls init_kvm_mmu():

static int init_kvm_mmu(struct kvm_vcpu *vcpu)
{
        vcpu->arch.update_pte.pfn = bad_pfn;

        if (mmu_is_nested(vcpu))
                return init_kvm_nested_mmu(vcpu);
        else if (tdp_enabled)
                return init_kvm_tdp_mmu(vcpu);
        else
                return init_kvm_softmmu(vcpu);
}
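
For reference, kvm_mmu_setup() itself really is trivial; in kernels of this vintage it looks roughly like the sketch below, just a couple of sanity assertions before delegating:

int kvm_mmu_setup(struct kvm_vcpu *vcpu)
{
        ASSERT(vcpu);
        ASSERT(!VALID_PAGE(vcpu->arch.mmu.root_hpa));

        return init_kvm_mmu(vcpu);
}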
           

The first check in init_kvm_mmu() is regarding nested MMUs, which are used to run VMMs within guests, adding yet another layer of indirection. This is part of the Turtles project and won't be covered in this document, but it is well documented elsewhere.

The tdp_enabled (two-dimensional paging) boolean variable determines whether or not hardware assisted paging (EPT or RVI/NPT) is enabled. If true, KVM will use 2D paging; otherwise it falls back to the default option, shadow paging, implemented purely in software. Since KVM can be built as a kernel module, the vendor modules set the variable according to the module parameters and hardware capabilities, via kvm_enable_tdp() and kvm_disable_tdp(). For example, users can check /sys/module/kvm_intel/parameters/ept to verify whether EPT is enabled or not. Most distributions will load the module with it enabled, anyway:

#> modprobe kvm_intel ept=1
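
Under the hood the flag lives in mmu.c and is flipped by the vendor modules (vmx.c/svm.c) from their hardware setup code; a minimal sketch of that plumbing, abridged (the exact type and initializer vary between kernel versions):

/* arch/x86/kvm/mmu.c (abridged sketch) */
static bool tdp_enabled = false;

void kvm_enable_tdp(void)
{
        tdp_enabled = true;     /* vmx.c/svm.c call this when EPT/NPT can be used */
}
EXPORT_SYMBOL_GPL(kvm_enable_tdp);

void kvm_disable_tdp(void)
{
        tdp_enabled = false;    /* otherwise KVM falls back to shadow paging */
}
EXPORT_SYMBOL_GPL(kvm_disable_tdp);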
           

Both init_kvm_tdp_mmu() and init_kvm_softmmu() are responsible for setting up how the guest's page walking will be handled, by populating the walk_mmu (kvm_mmu) structure. This structure abstracts the details of the architecture-specific paging modes behind function pointers for common operations: loading a new CR3 (the upper page-level base pointer), translating guest virtual addresses, flushing TLB entries (invlpg) and page fault handling, among others.
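
To give an idea of the callbacks involved, here is an abridged sketch of struct kvm_mmu (from <kernel>/arch/x86/include/asm/kvm_host.h); only the fields touched in this article are shown, and the exact signatures vary between kernel versions:

struct kvm_mmu {
        void (*new_cr3)(struct kvm_vcpu *vcpu);         /* react to a guest CR3 load */
        int (*page_fault)(struct kvm_vcpu *vcpu, gva_t gva, u32 error_code);
        void (*free)(struct kvm_vcpu *vcpu);
        gpa_t (*gva_to_gpa)(struct kvm_vcpu *vcpu, gva_t gva);
        void (*prefetch_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *page);
        int (*sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp);
        void (*invlpg)(struct kvm_vcpu *vcpu, gva_t gva);
        hpa_t root_hpa;                 /* host physical address of the root table */
        int root_level;                 /* guest paging levels */
        int shadow_root_level;          /* levels used by EPT/NPT or the shadow tables */
        union kvm_mmu_page_role base_role;
        bool nx;                        /* guest has EFER.NX enabled */
        u64 rsvd_bits_mask[2][4];       /* reserved-bit masks, per level */
        /* ... */
};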

Just like in traditional, non-virtualized environments, the guest's MMU must be capable of handling paging in 32-bit, PAE and 64-bit modes; optionally, paging can be disabled entirely, in which case guest virtual addresses (gva) are the actual guest physical addresses (gpa), mapped 1:1. This is quite natural, since the guest does not know that the MMU it sees is the one KVM presents to it rather than the real, physical one - everything is transparent - which is not the case for paravirtualization, as in Xen.
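
The non-paging case really is that simple: the translation callback just hands the address back. A sketch is shown below; the exact parameter list has changed across kernel versions, so take it as illustrative:

static gpa_t nonpaging_gva_to_gpa(struct kvm_vcpu *vcpu, gva_t vaddr)
{
        /* no guest paging: the "virtual" address already is the guest
         * physical address, so return it unchanged (1:1 mapping) */
        return vaddr;
}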

Hardware support initialization

Most of the logic for the TDP case is done in this single function:

static int init_kvm_tdp_mmu(struct kvm_vcpu *vcpu)
{       
        struct kvm_mmu *context = &vcpu->arch.mmu;
        
        context->new_cr3 = nonpaging_new_cr3;
        context->page_fault = tdp_page_fault;
        context->free = nonpaging_free;
        context->prefetch_page = nonpaging_prefetch_page;
        context->sync_page = nonpaging_sync_page;
        context->invlpg = nonpaging_invlpg;
        context->shadow_root_level = kvm_x86_ops->get_tdp_level();
        context->root_hpa = INVALID_PAGE;
        
        if (!is_paging(vcpu)) {
                context->gva_to_gpa = nonpaging_gva_to_gpa;
                context->root_level = 0;
        } else if (is_long_mode(vcpu)) {
                reset_rsvds_bits_mask(vcpu, PT64_ROOT_LEVEL);
                context->gva_to_gpa = paging64_gva_to_gpa;
                context->root_level = PT64_ROOT_LEVEL;
        } else if (is_pae(vcpu)) {
                reset_rsvds_bits_mask(vcpu, PT32E_ROOT_LEVEL);
                context->gva_to_gpa = paging64_gva_to_gpa;
                context->root_level = PT32E_ROOT_LEVEL;
        } else {
                reset_rsvds_bits_mask(vcpu, PT32_ROOT_LEVEL);
                context->gva_to_gpa = paging32_gva_to_gpa;
                context->root_level = PT32_ROOT_LEVEL;
        }

        return 0;
}
           
  1. The is_paging() function simply checks the vCPU's CR0.PG flag to see if paging is enabled or not - this will most likely be enabled!
  2. The is_long_mode() check tells whether the guest has a 64-bit vCPU by reading the EFER.LMA (long mode active) bit - assuming, of course, that CONFIG_X86_64 is set, since 64-bit guests cannot run on 32-bit hosts.
  3. If PAE is enabled, then is_pae()'s CR4.PAE check will succeed, indicating that the Physical Address Extension is active and that the 32-bit guest can address more than 4 GB of physical memory.
  4. Finally, if the above three checks fail, it is assumed that the guest runs in standard 32-bit mode. (These helpers are sketched right below.)
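
The three helpers boil down to reading the corresponding control-register/EFER bits from the vCPU state; roughly (abridged sketch, the CR0/CR4 accessors go through KVM's register cache and field names differ slightly between versions):

static inline int is_paging(struct kvm_vcpu *vcpu)
{
        return kvm_read_cr0_bits(vcpu, X86_CR0_PG);     /* CR0.PG set? */
}

static inline int is_long_mode(struct kvm_vcpu *vcpu)
{
#ifdef CONFIG_X86_64
        return vcpu->arch.efer & EFER_LMA;              /* long mode active */
#else
        return 0;                                       /* 32-bit host: never */
#endif
}

static inline int is_pae(struct kvm_vcpu *vcpu)
{
        return kvm_read_cr4_bits(vcpu, X86_CR4_PAE);    /* CR4.PAE set? */
}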

No matter which mode is detected, the no-execute (NX) setting, the reserved-bits masks, the function that will handle gva to gpa translation, and the paging root level are all set up:

The ->nx flag refers to the No-eXecute bit, which allows marking areas of memory as non-executable, mitigating buffer-overflow style attacks. It is obtained by checking the vCPU's EFER.NX flag.
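
As with the other mode checks, this is a one-liner (sketch; older kernels call the field shadow_efer rather than efer):

static inline bool is_nx(struct kvm_vcpu *vcpu)
{
        return vcpu->arch.efer & EFER_NX;       /* EFER.NX: no-execute enabled */
}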

The ->gva_to_gpa callback is the function that will handle the guest's virtual to physical translations. When paging is disabled, the gva is simply returned as the gpa; for the other modes the walker is the same template function (defined in paging_tmpl.h), instantiated differently according to the root level and paging mode.
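
This is done with a small preprocessor trick in mmu.c: paging_tmpl.h is included twice with a different PTTYPE, generating the paging64_* family (which also serves PAE, since PAE entries are 64 bits wide) and the paging32_* family:

/* arch/x86/kvm/mmu.c: the page-table walker is compiled twice */
#define PTTYPE 64
#include "paging_tmpl.h"        /* generates paging64_gva_to_gpa(), ... */
#undef PTTYPE

#define PTTYPE 32
#include "paging_tmpl.h"        /* generates paging32_gva_to_gpa(), ... */
#undef PTTYPE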

The reset_rsvds_bits_mask() function sets up the masks of bits that are reserved in guest page-table entries for the given paging mode (taking the guest's MAXPHYADDR and NX setting into account), so that guest entries with reserved bits set can be recognized and the appropriate fault injected.
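
The masks themselves are built with a tiny helper in mmu.c (shown here from memory) that sets a contiguous run of bits:

/* build a mask with bits s..e (inclusive) set, e.g. rsvd_bits(52, 62) */
static u64 rsvd_bits(int s, int e)
{
        return ((1ULL << (e - s + 1)) - 1) << s;
}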

Finally, the page walker's ->root_level refers to the number of hierarchical levels in the guest's paging. With the standard 4k page size, 64-bit mode has four (PML4, PDPT, PD, PT), 32-bit mode has two (PD, PT) and PAE has three (PDPT, PD, PT). If paging is disabled, there obviously aren't any levels to walk.
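
These map directly onto the constants used in the dispatch above (values as defined in the KVM MMU headers):

#define PT64_ROOT_LEVEL         4       /* PML4 -> PDPT -> PD -> PT */
#define PT32E_ROOT_LEVEL        3       /* PDPT -> PD -> PT (PAE)   */
#define PT32_ROOT_LEVEL         2       /* PD -> PT                 */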

Software support initialization

Unlike the hardware-assisted case, most of the work for setting up the software MMU and shadow paging is done by kvm_init_shadow_mmu(), while init_kvm_softmmu() simply calls it and then sets how CR3 is loaded and read, how the page directory pointers are fetched, and how page faults are injected back into the guest. Note that in the older kernel listing below the paging-mode dispatch still lives directly in init_kvm_softmmu(); a sketch of the newer split follows the listing.

static int init_kvm_softmmu(struct kvm_vcpu *vcpu)
{
        int r;

        ASSERT(vcpu);
        ASSERT(!VALID_PAGE(vcpu->arch.mmu.root_hpa));

        if (!is_paging(vcpu))
                r = nonpaging_init_context(vcpu);
        else if (is_long_mode(vcpu))
                r = paging64_init_context(vcpu);
        else if (is_pae(vcpu))
                r = paging32E_init_context(vcpu);
        else
                r = paging32_init_context(vcpu);

        vcpu->arch.mmu.base_role.glevels = vcpu->arch.mmu.root_level;

        return r;
}
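
In newer kernels that same dispatch moved into kvm_init_shadow_mmu(), and init_kvm_softmmu() shrank to roughly the following (a sketch from memory; field and helper names may differ slightly between versions):

static int init_kvm_softmmu(struct kvm_vcpu *vcpu)
{
        int r = kvm_init_shadow_mmu(vcpu, vcpu->arch.walk_mmu);

        /* shadow paging: CR3 accesses and page faults are emulated by KVM */
        vcpu->arch.walk_mmu->set_cr3           = kvm_x86_ops->set_cr3;
        vcpu->arch.walk_mmu->get_cr3           = get_cr3;
        vcpu->arch.walk_mmu->inject_page_fault = kvm_inject_page_fault;

        return r;
}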
           

The kvm_init_shadow_mmu() path is quite similar to what was discussed above for the TDP case: based on the paging mode, it sets up how the walker will work via paging64_init_context_common() (used for both 64-bit and PAE guests, the latter through paging32E_init_context()) and paging32_init_context_common() (used for plain 32-bit guests).
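
For completeness, here is an abridged sketch of one of these common initializers, reconstructed from an older kernel, so treat the exact fields as approximate. Note how it mirrors init_kvm_tdp_mmu(), except that the callbacks now point at the shadow-paging implementations generated from paging_tmpl.h:

static int paging64_init_context_common(struct kvm_vcpu *vcpu, int level)
{
        struct kvm_mmu *context = &vcpu->arch.mmu;

        ASSERT(is_pae(vcpu));
        context->new_cr3 = paging_new_cr3;
        context->page_fault = paging64_page_fault;      /* shadow page fault handler */
        context->gva_to_gpa = paging64_gva_to_gpa;      /* software page walker */
        context->prefetch_page = paging64_prefetch_page;
        context->sync_page = paging64_sync_page;
        context->invlpg = paging64_invlpg;
        context->free = paging_free;
        context->root_level = level;
        context->shadow_root_level = level;
        context->root_hpa = INVALID_PAGE;
        return 0;
}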
