Discussion:
VT-x and Performance counter interrupt in KVM mode
Stephane Eranian
2007-03-27 16:35:15 UTC
Hi Avi,
I am trying to capture in vmx.c the hardware
performance counter (PMU) interrupt of an i386 Linux
kernel running with perfmon on a Core 2 Duo machine
running with kvm-15. The host is running kvm with VT-x in
x86-64 mode.
The PMU interrupt is programmed in the APIC LVT entry
(set to 0xee) by the guest OS.
On stock kvm, the guest os programs a virtual apic that lives in qemu,
not the real apic, so it would never cause any interrupt. Are you
running with a modified kvm that allows the guest to touch the real apic?
The Performance counters (PMU) cannot be fully virtualized, they need to
run on the actual MSR registers. The PMU interrupt is controlled by the
local APIC. To get overflow-based sampling to work in a guest, we need to
allow the PMU to interrupt. Supposing we have allowed wrmsr and rdmsr to the
PMU registers, the guest perfmon will set up the virtual APIC and virtual
IDT as it normally would on real HW. VT-x takes care of the IDT but not
of the APIC. The guest never touches the real APIC, qemu handles this.
However if the host kernel is running perfmon, it does already have the
actual APIC programmed for the PMU.

In this configuration, the host perfmon interrupt driver catches the PMU
interrupt generated while running in non-root VMX mode. At that point, there
is a VM-exit. I have now been able to track down the type of exit in this
case. You have a VM-exit for an external interrupt, which is fine, however
the intr_info (VM_EXIT_INTR_INFO) is 0x0, in other words, VT-x does not give
you any good info as to why you exited. As soon as you leave the VM_RESUME code,
you branch to the host perfmon interrupt handler.
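
As an illustration of why 0x0 carries no information, here is a minimal sketch of decoding the VM-exit interruption-information field, assuming the layout documented in the Intel SDM (vector in bits 7:0, type in bits 10:8, valid flag in bit 31). The struct and function names are hypothetical and are not taken from vmx.c.

```c
#include <stdbool.h>
#include <stdint.h>

/* Layout of the VM-exit interruption-information field (Intel SDM). */
#define INTR_INFO_VECTOR_MASK 0x000000ffu /* bits 7:0  */
#define INTR_INFO_TYPE_MASK   0x00000700u /* bits 10:8 */
#define INTR_INFO_TYPE_SHIFT  8
#define INTR_INFO_VALID       0x80000000u /* bit 31    */

/* Hypothetical helper type, for illustration only. */
struct exit_intr {
    bool    valid;  /* false means vector and type are meaningless */
    uint8_t vector; /* interrupt or exception vector */
    uint8_t type;   /* 0 = external interrupt, 2 = NMI, 3 = hardware exception */
};

/* Split the raw 32-bit field into its parts. */
static struct exit_intr decode_exit_intr_info(uint32_t intr_info)
{
    struct exit_intr d;

    d.valid  = (intr_info & INTR_INFO_VALID) != 0;
    d.vector = (uint8_t)(intr_info & INTR_INFO_VECTOR_MASK);
    d.type   = (uint8_t)((intr_info & INTR_INFO_TYPE_MASK) >> INTR_INFO_TYPE_SHIFT);
    return d;
}
```

A raw value of 0x0 decodes with the valid flag clear, so the vector bits cannot be trusted; a value such as 0x800000ee would decode as a valid external interrupt on vector 0xee.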

In any case, the current solution I have for this is sort of hybrid because
you rely on the host APIC to be programmed correctly, and then you need
communication between the host perfmon code and the KVM kernel code to be
able to inject the PMU interrupt back into the guest. Another solution I have
experimented with is for the host perfmon to notify the user level qemu APIC code
(SIGIO) which then issues the right KVM_INTERRUPT ioctl(), but that is slow
and has some race condition with the guest.

The timer interrupt, also normally controlled by the APIC, is managed differently
and can be fully virtualized by qemu using Linux timers. The PMU cannot be
virtualized that way.

At this point, even if you had APIC emulation in KVM (kernel), I am not sure
this would solve this issue. I think I can live with having back communication
between the host perfmon and KVM.

Any better ideas?
Similarly, an IDT entry
connects the interrupt vector to the interrupt
handler.
I am not able to catch, in kvm, the PMU interrupt
happening in VMX non-root mode. It does not seem to
appear in the VM-exit interruption information nor in
the IDT-vectoring information. It does not seem to
be caught by any of the exit handlers yet the host PMU
interrupt handler catches it which is not what we
want.
Any idea on what is going on with this interrupt?
It looks completely normal, assuming the host also programmed the timer
to the same vector. Look in qemu/hw/apic.c to find your missing interrupt.
--
-Stephane

-------------------------------------------------------------------------
Take Surveys. Earn Cash. Influence the Future of IT
Join SourceForge.net's Techsay panel and you'll get the chance to share your
opinions on IT & business topics through brief surveys-and earn cash
http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV
Avi Kivity
2007-03-27 17:10:58 UTC
Post by Stephane Eranian
Hi Avi,
I am trying to capture in vmx.c the hardware
performance counter (PMU) interrupt of an i386 Linux
kernel running with perfmon on a Core 2 Duo machine
running with kvm-15. The host is running kvm with VT-x in
x86-64 mode.
The PMU interrupt is programmed in the APIC LVT entry
(set to 0xee) by the guest OS.
On stock kvm, the guest os programs a virtual apic that lives in qemu,
not the real apic, so it would never cause any interrupt. Are you
running with a modified kvm that allows the guest to touch the real apic?
The Performance counters (PMU) cannot be fully virtualized, they need to
run on the actual MSR registers. The PMU interrupt is controlled by the
local APIC. To get overflow-based sampling to work in a guest, we need to
allow the PMU to interrupt. Supposing we have allowed wrmsr and rdmsr to the
PMU registers, the guest perfmon will set up the virtual APIC and virtual
IDT as it normally would on real HW. VT-x takes care of the IDT but not
of the APIC. The guest never touches the real APIC, qemu handles this.
However if the host kernel is running perfmon, it does already have the
actual APIC programmed for the PMU.
In this configuration, the host perfmon interrupt driver catches the PMU
interrupt generated while running in non-root VMX mode. At that point, there
is a VM-exit. I have now been able to track down the type of exit in this
case. You have a VM-exit for an external interrupt, which is fine, however
the intr_info (VM_EXIT_INTR_INFO) is 0x0, in other words, VT-x does not give
you any good info as to why you exited. As soon as you leave the VM_RESUME code,
you branch to the host perfmon interrupt handler.
Actually it can be convinced to give the interrupt number. Right now,
we program VT not to ack interrupts, so we don't know their number, and
they are dispatched by the processor as soon as we enable interrupts on
the host.

An alternative mechanism exists. We can tell VT to ack the interrupt,
in which case the vector number becomes valid, but we need to dispatch
the interrupt ourselves using the 'int' instruction.
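
A minimal sketch of the two modes, assuming the "acknowledge interrupt on exit" control is bit 15 of the VM-exit controls and the valid flag is bit 31 of the interruption-information field, as in the Intel SDM. The function names are hypothetical, and the actual dispatch step (the 'int' instruction) is deliberately left out.

```c
#include <stdbool.h>
#include <stdint.h>

#define VM_EXIT_ACK_INTR_ON_EXIT (1u << 15) /* VM-exit controls, bit 15 */
#define INTR_INFO_VALID          (1u << 31) /* interruption info, bit 31 */
#define INTR_INFO_VECTOR_MASK    0xffu      /* interruption info, bits 7:0 */

/* Return the VM-exit controls with "acknowledge interrupt on exit" set. */
static uint32_t enable_ack_on_exit(uint32_t exit_controls)
{
    return exit_controls | VM_EXIT_ACK_INTR_ON_EXIT;
}

/*
 * Returns true and stores the vector when the hypervisor has to dispatch
 * the interrupt itself. Without the ack control the interrupt stays
 * pending and the processor delivers it once host interrupts are enabled,
 * so nothing needs to be dispatched by hand.
 */
static bool needs_manual_dispatch(uint32_t exit_controls, uint32_t intr_info,
                                  uint8_t *vector)
{
    if (!(exit_controls & VM_EXIT_ACK_INTR_ON_EXIT))
        return false;
    if (!(intr_info & INTR_INFO_VALID))
        return false;
    *vector = (uint8_t)(intr_info & INTR_INFO_VECTOR_MASK);
    return true;
}
```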

As I'd rather not do that, perhaps we can program the apic to issue an
nmi instead of an interrupt while in guest mode. On receipt of nmi, we
can call the host perfmon handler directly to interpret the performance
counters.
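
To make the NMI idea concrete, here is a rough sketch of the value that would go into the local APIC's performance counter LVT entry in each mode. It assumes the standard LVT layout (vector in bits 7:0, delivery mode in bits 10:8 with 100b meaning NMI, mask in bit 16); the helper names are made up for illustration and no real APIC register is touched.

```c
#include <stdbool.h>
#include <stdint.h>

#define APIC_LVTPC_OFFSET  0x340u    /* perf counter LVT register offset */
#define LVT_DELIVERY_FIXED (0u << 8) /* deliver on the programmed vector */
#define LVT_DELIVERY_NMI   (4u << 8) /* deliver as NMI, vector is ignored */
#define LVT_MASKED         (1u << 16)

/* LVT value for normal host operation: fixed delivery on the given vector. */
static uint32_t lvtpc_fixed(uint8_t vector)
{
    return LVT_DELIVERY_FIXED | vector;
}

/* LVT value for guest mode: PMU overflow arrives as an NMI. */
static uint32_t lvtpc_nmi(void)
{
    return LVT_DELIVERY_NMI;
}

/* Pick the LVT value to program before entering or after leaving the guest. */
static uint32_t lvtpc_for_mode(bool in_guest, uint8_t host_vector)
{
    return in_guest ? lvtpc_nmi() : lvtpc_fixed(host_vector);
}
```

With the host vector 0xee from this thread, the host-mode value would be 0x000000ee and the guest-mode value 0x00000400.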
Post by Stephane Eranian
In any case, the current solution I have for this is sort of hybrid because
you rely on the host APIC to be programmed correctly, and then you need
communication between the host perfmon code and the KVM kernel code to be
able to inject the PMU interrupt back into the guest. Another solution I have
experimented with is for the host perfmon to notify the user level qemu APIC code
(SIGIO) which then issues the right KVM_INTERRUPT ioctl(), but that is slow
and has some race condition with the guest.
That looks promising. The slowness can be addressed by (first) moving
to queued signals instead of delivered signals and (later) pushing the
apic emulation into the kernel.
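
One possible reading of "queued signals", sketched with plain POSIX calls: keep SIGIO blocked so it stays pending instead of interrupting the thread, then collect it synchronously with sigtimedwait() at a point where the caller is ready to inject. The function names are hypothetical, and the KVM_INTERRUPT call that would follow a successful poll is not shown.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <signal.h>
#include <stddef.h>
#include <time.h>

/* Block SIGIO so it is held pending rather than delivered to a handler. */
static int block_sigio(void)
{
    sigset_t set;

    sigemptyset(&set);
    sigaddset(&set, SIGIO);
    return sigprocmask(SIG_BLOCK, &set, NULL);
}

/*
 * Non-blocking check for a pending SIGIO.
 * Returns 1 if one was pending and has been consumed, 0 if none was
 * pending, and -1 on any other error.
 */
static int poll_sigio(void)
{
    sigset_t set;
    siginfo_t info;
    struct timespec zero = { 0, 0 };

    sigemptyset(&set);
    sigaddset(&set, SIGIO);
    if (sigtimedwait(&set, &info, &zero) == SIGIO)
        return 1;
    return (errno == EAGAIN) ? 0 : -1;
}
```

The caller decides when to look for the notification, for example right before re-entering the guest, rather than being interrupted at an arbitrary point.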

VT also has a facility to swap msrs on entry to the guest and back.
Post by Stephane Eranian
The timer interrupt, also normally controlled by the APIC, is managed differently
and can be fully virtualized by qemu using Linux timers. The PMU cannot be
virtualized that way.
At this point, even if you had APIC emulation in KVM (kernel), I am not sure
this would solve this issue. I think I can live with having back communication
between the host perfmon and KVM.
Any better ideas?
It really depends on what one wants to do with the performance monitor
on the guest:

- if it's just to shut up the nmi watchdog, we can report a cpu model
that does not have the performance monitor (which would be a classic
Pentium? or maybe a 486?)
- if we want something like the nmi watchdog to run, we can emulate all
counters based on cpu cycles, even if they count branches or something
else. That gives an inaccurate but sort-of-working counter, which we
can emulate using host timers.
- if we want real performance monitoring, we need to do the msr swap.
That has the disadvantage of disabling perfmon on the host, and of being
depressingly complex.
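
For the third option, a rough sketch of what an MSR swap list could look like. It assumes the 16-byte entry format the Intel SDM describes for the VM-entry/VM-exit MSR areas (32-bit index, 32 reserved bits, 64-bit value) and covers only the two general-purpose counter/event-select pairs (IA32_PMC0/1, IA32_PERFEVTSEL0/1). The helper names are hypothetical; fixed counters and global control MSRs are ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MSR_IA32_PMC0        0x0c1u
#define MSR_IA32_PMC1        0x0c2u
#define MSR_IA32_PERFEVTSEL0 0x186u
#define MSR_IA32_PERFEVTSEL1 0x187u

/* One entry of a VM-entry MSR-load or VM-exit MSR-store/load area. */
struct msr_entry {
    uint32_t index;    /* MSR number */
    uint32_t reserved; /* must be zero */
    uint64_t value;    /* value to load, or slot for the stored value */
};

/* Append one entry and return the new entry count. */
static size_t add_msr(struct msr_entry *list, size_t count, size_t cap,
                      uint32_t index, uint64_t value)
{
    assert(count < cap);
    list[count].index = index;
    list[count].reserved = 0;
    list[count].value = value;
    return count + 1;
}

/*
 * Build the list loaded on VM entry: counters first, event selects last,
 * so a counter holds the guest value before its event select enables it.
 */
static size_t build_guest_pmu_list(struct msr_entry *list, size_t cap,
                                   const uint64_t pmc[2],
                                   const uint64_t evtsel[2])
{
    size_t n = 0;

    n = add_msr(list, n, cap, MSR_IA32_PMC0, pmc[0]);
    n = add_msr(list, n, cap, MSR_IA32_PMC1, pmc[1]);
    n = add_msr(list, n, cap, MSR_IA32_PERFEVTSEL0, evtsel[0]);
    n = add_msr(list, n, cap, MSR_IA32_PERFEVTSEL1, evtsel[1]);
    return n;
}
```

A matching list holding the host values would be loaded on VM exit, which is where the "disabling perfmon on the host" cost comes from: while the guest runs, the counters hold guest state only.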

What application do you have in mind for kvm guest performance monitoring?
Post by Stephane Eranian
Similarly, an IDT entry
connects the interrupt vector to the interrupt
handler.
I am not able to catch, in kvm, the PMU interrupt
happening in VMX non-root mode. It does not seem to
appear in the VM-exit interruption information nor in
the IDT-vectoring information. It does not seem to
be caught by any of the exit handlers yet the host PMU
interrupt handler catches it which is not what we
want.
Any idea on what is going on with this interrupt?
It looks completely normal, assuming the host also programmed the timer
to the same vector. Look in qemu/hw/apic.c to find your missing interrupt.
--
error compiling committee.c: too many arguments to function


Stephane Eranian
2007-03-28 15:41:55 UTC
Avi,
Post by Avi Kivity
Post by Stephane Eranian
The Performance counters (PMU) cannot be fully virtualized, they need to
run on the actual MSR registers. The PMU interrupt is controlled by the
local APIC. To get overflow-based sampling to work in a guest, we need to
allow the PMU to interrupt. Supposing we have allowed wrmsr and rdmsr to the
PMU registers, the guest perfmon will set up the virtual APIC and virtual
IDT as it normally would on real HW. VT-x takes care of the IDT but not
of the APIC. The guest never touches the real APIC, qemu handles this.
However if the host kernel is running perfmon, it does already have the
actual APIC programmed for the PMU.
In this configuration, the host perfmon interrupt driver catches the PMU
interrupt generated while running in non-root VMX mode. At that point, there
is a VM-exit. I have now been able to track down the type of exit in this
case. You have a VM-exit for an external interrupt, which is fine, however
the intr_info (VM_EXIT_INTR_INFO) is 0x0, in other words, VT-x does not give
you any good info as to why you exited. As soon as you leave the VM_RESUME code,
you branch to the host perfmon interrupt handler.
Actually it can be convinced to give the interrupt number. Right now,
we program VT not to ack interrupts, so we don't know their number, and
they are dispatched by the processor as soon as we enable interrupts on
the host.
An alternative mechanism exists. We can tell VT to ack the interrupt,
in which case the vector number becomes valid, but we need to dispatch
the interrupt ourselves using the 'int' instruction.
Ok, I missed that control but I see it now (bit 15).
Post by Avi Kivity
As I'd rather not do that, perhaps we can program the apic to issue an
nmi instead of an interrupt while in guest mode. On receipt of nmi, we
can call the host perfmon handler directly to interpret the performance
counters.
Yes, but that would be no different from what I have now without the ack-intr.
What you'd like is to catch the PMU intr right away and re-inject it without
using the host perfmon interrupt handler. It seems the only way to do this
is by acking intr. Unfortunately, it is an all or nothing control.

The other worry in this scheme is that the injection would be done without
qemu intervening. Thus you would not be able to check whether the virtual APIC
LVT vector is currently masked. Its configuration may be different from the
actual APIC. But that is probably ok for now. Is there a plan to move the
APIC emulation into KVM?
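
The masking concern can be illustrated with a small sketch of the check an injection path would need to make against the guest's virtual LVT entry. It assumes the usual LVT layout (vector in bits 7:0, delivery mode in bits 10:8, mask in bit 16) and handles only fixed delivery; the function name is hypothetical and does not correspond to qemu/hw/apic.c.

```c
#include <stdbool.h>
#include <stdint.h>

#define LVT_VECTOR_MASK        0xffu
#define LVT_DELIVERY_MODE_MASK (7u << 8)
#define LVT_DELIVERY_FIXED     (0u << 8)
#define LVT_MASKED             (1u << 16)

/*
 * Decide whether a PMU overflow should be injected into the guest, based
 * on the guest's own (virtual) perf counter LVT entry rather than on the
 * real APIC. Stores the guest vector and returns true when injection is
 * allowed. Masked entries and non-fixed delivery modes are refused.
 */
static bool pmu_vector_to_inject(uint32_t virt_lvtpc, uint8_t *vector)
{
    if (virt_lvtpc & LVT_MASKED)
        return false;
    if ((virt_lvtpc & LVT_DELIVERY_MODE_MASK) != LVT_DELIVERY_FIXED)
        return false;
    *vector = (uint8_t)(virt_lvtpc & LVT_VECTOR_MASK);
    return true;
}
```

With the guest's LVT entry set to 0xee as described at the start of the thread, the check would allow injection on vector 0xee; with bit 16 set (0x100ee) it would refuse.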
Post by Avi Kivity
Post by Stephane Eranian
In any case, the current solution I have for this is sort of hybrid because
you rely on the host APIC to be programmed correctly, and then you need
communication between the host perfmon code and the KVM kernel code to be
able to inject the PMU interrupt back into the guest. Another solution I have
experimented with is for the host perfmon to notify the user level qemu APIC code
(SIGIO) which then issues the right KVM_INTERRUPT ioctl(), but that is slow
and has some race condition with the guest.
That looks promising. The slowness can be addressed by (first) moving
to queued signals instead of delivered signals and (later) pushing the
apic emulation into the kernel.
VT also has a facility to swap msrs on entry to the guest and back.
Yes, I am using some of that to stop monitoring when entering KVM.
Post by Avi Kivity
It really depends on what one wants to do with the performance monitor
- if it's just to shut up the nmi watchdog, we can report a cpu model
that does not have the performance monitor (which would be a classic
Pentium? or maybe a 486?)
No, the goal is to provide full access to the PMU for performance monitoring,
just as you would have on bare HW.
Post by Avi Kivity
- if we want something like the nmi watchdog to run, we can emulate all
counters based on cpu cycles, even if they count branches or something
else. That gives an inaccurate but sort-of-working counter, which we
can emulate using host timers.
No. My goal is to allow monitoring tools to run in a guest.
I think people would want to assess performance of their applications when
running in a guest. You can get the outside view using the host perfmon,
but you also want the inside view.
Post by Avi Kivity
- if we want real performance monitoring, we need to do the msr swap.
You mean if you do not want to conflict with the host using the PMU
for itself? Well, the host perfmon can take care of this.
--
-Stephane

Avi Kivity
2007-03-28 16:03:34 UTC
Post by Stephane Eranian
Post by Avi Kivity
As I'd rather not do that, perhaps we can program the apic to issue an
nmi instead of an interrupt while in guest mode. On receipt of nmi, we
can call the host perfmon handler directly to interpret the performance
counters.
Yes, but that would be no different from what I have now without the ack-intr.
What you'd like is to catch the PMU intr right away and re-inject it without
using the host perfmon interrupt handler. It seems the only way to do this
is by acking intr. Unfortunately, it is an all or nothing control.
It is a little different, but perhaps not enough. If perfmon is the
only nmi source, or if you can find out the source of the nmi, then you
don't need to take the nmi but can instead call the perfmon handler.
Otherwise we'd need to dispatch interrupts manually.
Post by Stephane Eranian
The other worry in this scheme is that the injection would be done without
qemu intervening. Thus you would not be able to check whether the virtual APIC
LVT vector is currently masked. Its configuration may be different from the
actual APIC. But that is probably ok for now.
You certainly need to go through the apic for correctness, using a
signal like you outlined before might be a good interim solution.
Post by Stephane Eranian
Is there a plan to move the
APIC emulation into KVM?
Yes. It's needed for smp and kernel-only paravirt devices.
Post by Stephane Eranian
Post by Avi Kivity
It really depends on what one wants to do with the performance monitor
- if it's just to shut up the nmi watchdog, we can report a cpu model
that does not have the performance monitor (which would be a classic
Pentium? or maybe a 486?)
No, the goal is to provide full access to the PMU for performance monitoring,
just as you would have on bare HW.
Ok. I'm just glad I don't have to do it ;-)
Post by Stephane Eranian
Post by Avi Kivity
- if we want real performance monitoring, we need to do the msr swap.
You mean if you do not want to conflict with the host using the PMU
for itself? Well, the host perfmon can take care of this.
If the host wants system-wide monitoring (% cpu / tlb miss / whatever in
each process, including vms) and a vm wants monitoring too, then you
don't have enough resources to go round. There's a similar problem with
the debug registers; if the host wants to debug a guest, which is itself
debugging a process, something has to give.
--
error compiling committee.c: too many arguments to function

