KVM EPT implementation

Discussion:

Tony Roberts

2013-03-28 16:06:28 UTC

Hello list,

(Apologies if this appears twice!)

I'm currently doing some research into guest memory allocation,
specifically trying to determine when guests write data into certain
memory locations, and I'm trying to get my head around how KVM updates
the extended page tables, and where within the KVM code the actual
updates occur. I'm working on an Intel box with VT extensions and a
Debian 3.6.6 kernel.

After going through the code, I can see that a lot of the existing
shadow page table code is reused, but I'm a little confused about
exactly how.

As an example, I can see the function vmx_set_cr3 (vmx.c) being
called, which is setting the host CR3 to the base of the PML4 table.

Then from that address, the EPTP is created, essentially setting the
bottom 12 bits to various flags.

Then handle_ept_violation is called, which has the GPA that generated
the page fault. I've looked into the function kvm_mmu_page_fault,
which is passed a value named cr2. I'm assuming this to be the guest's
CR2 value, which I think is the guest physical address that caused the
page fault.

However this is where I lose the chase slightly. I know from studying
the Intel developers manuals that the top level of the 4 level
hierarchy for the EPTs is the PML4 table, which can contain a maximum
of 512 64-bit entries, with each entry in turn pointing to the base
address of a PDPT.

The first address that the function pte_list_add sees is the base
address of the PML4 table, so I was expecting to be able to read 512
64-bit entries from that base address and see at least one 64-bit
entry written into that page. However, after a number of different
attempts, I'm unable to determine the function that is actually
responsible for updating the EPTs.

I was hoping somebody might be able to point me to the correct
location within the KVM source code to track when EPT entries are
actually written to the various tables in the 4 level hierarchy. The
function pte_list_add seems to do nothing more than change the value
of a pointer, but only the first address passed to it is page aligned
(the PML4 base) and the rest of the addresses appear to be pointers
into existing pages, often seeming to be outside of the PML4 page
range.

I might be completely misunderstanding something, but any advice on
how to effectively monitor EPT entries within KVM would be greatly
appreciated.

Thanks muchly.

Tony
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Zhang, Yang Z

2013-03-29 02:03:02 UTC

Post by Tony Roberts
The first address that the function pte_list_add sees is the base
address of the PML4 table, so I was expecting to be able to read 512
64-bit entries from that base address and see at least one 64-bit
entry written into that page. However, after a number of different
attempts, I'm unable to determine the function that is actually
responsible for updating the EPTs.

Are you trying to dump the guest's PML4 table or the EPT PML4 table? For the EPT one, just look up the EPTP (the root address is in vcpu->arch.mmu.root_hpa). For the guest's table, you first need to translate the GPA to an HPA.

Post by Tony Roberts
I was hoping somebody might be able to point me to the correct location
within the KVM source code to track when EPT entries are actually
written to the various tables in the 4 level hierarchy. The function
pte_list_add seems to do nothing more than change the value of a
pointer, but only the first address passed to it is page aligned (the
PML4 base) and the rest of the addresses appear to be pointers into
existing pages, often seeming to be outside of the PML4 page range.
I might be completely misunderstanding something, but any advice on how
to effectively monitor EPT entries within KVM would be greatly
appreciated.

You could start with mmu_alloc_direct_roots(). It allocates the EPT root page and stores its address in root_hpa, from which the EPTP is then built.

Best regards,
Yang


Paolo Bonzini

2013-03-29 09:20:28 UTC

Post by Tony Roberts
I was hoping somebody might be able to point me to the correct
location within the KVM source code to track when EPT entries are
actually written to the various tables in the 4 level hierarchy. The
function pte_list_add seems to do nothing more than change the value
of a pointer, but only the first address passed to it is page aligned
(the PML4 base) and the rest of the addresses appear to be pointers
into existing pages, often seeming to be outside of the PML4 page
range.

The EPT tables are built lazily, as if they were shadow page tables.
There is "only" one difference in how they're built; namely, the tables
do not include the gva->gpa (guest virtual address->guest physical
address) translation.

This is very similar to how KVM builds shadow page tables when the guest
is running without paging. In both cases we have to build a page table
for gpa->hpa translation ("standard" OS page tables are hva->hpa). In
fact, many callbacks are shared between the two cases, and even when
they're not there are similarities. You can see that both
nonpaging_page_fault and tdp_page_fault invoke __direct_map, for example.

What makes the difference, of course, is that EPT tables are hardly ever
invalidated:

1) nonpaging_invlpg and nonpaging_sync_page are no-ops; this is also the
case with shadow page tables in nonpaging mode, as the names suggest.

2) neither invlpg nor CR3 loads and stores cause a vmexit in EPT mode.
kvm_set_cr3, and hence (via nonpaging_new_cr3) mmu_free_roots, are
hardly ever called. You could get a call from the emulator in rare
cases, which would cause a spurious flush, but that never happens in
practice.

(This doesn't include nested virtualization, which uses the MMU in
another mode with its own callbacks---see init_kvm_nested_mmu. I think
this is beyond what you currently care about, though).

Paolo