Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree

From: Daniel Arai
Date: Wed Mar 07 2007 - 20:02:02 EST

Next message: Eric W. Biederman: "Re: [PATCH] Fix get_unmapped_area and fsync for hugetlb shm segments"
Previous message: Paul Menage: "Re: [ckrm-tech] [PATCH 0/2] resource control file system - aka containers on top of nsproxy!"
In reply to: Thomas Gleixner: "Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree"
Next in thread: Jeremy Fitzhardinge: "Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

Thomas Gleixner wrote:

You managed to avoid the usage of other code (i.e. PIT / HPET) already,
so why is it sooo desireable to emulate apics instead of substituting it
by a small and sane replacement ? Just because you happen to have an
LAPIC emulator ? That's no reason to wire yourself into the kernel code
and make it harder to change and maintain.

There are several reasons why it's desirable to emulate the APIC. As you mentioned, we already have APIC emulation, and APIC emulation isn't a huge bottleneck on most workloads. Our code works, the Linux code works, and replacing both pieces of code with something "small and sane" isn't going to improve performance very much, so why bother? Any hypervisor implementation is going to be a tradeoff between what's easy to implement in the hypervisor, what's easy to implement in the guest operating system, and what's performance critical.

Secondly, not all (para-)virtualized operating systems will want to use abstracted devices. Some virtual operating systems will be given direct access to hardware devices, and will need to run the actual driver for that device and not some abstracted device driver. So I don't buy your argument that every piece of the kernel that interacts with a paravirtualized driver should have a "small and sane replacement."

But more importantly, we want a kernel that can run both on native hardware and in a paravirtualized environment. Linux doesn't really provide abstractions for replacing the appropriate code. We tried to hook into the source code at a level that seemed possible.

For example, take smp_call_function(). What this essentially does is call send_IPI_allbutself().

void fastcall send_IPI_self(int vector)
{
__send_IPI_shortcut(APIC_DEST_SELF, vector);
}

void __send_IPI_shortcut(unsigned int shortcut, int vector)
{
/*
* Subtle. In the case of the 'never do double writes' workaround
* we have to lock out interrupts to be safe. As we don't care
* of the value read we use an atomic rmw access to avoid costly
* cli/sti. Otherwise we use an even cheaper single atomic write
* to the APIC.
*/
unsigned int cfg;

/*
* Wait for idle.
*/
apic_wait_icr_idle();

/*
* No need to touch the target chip field
*/
cfg = __prepare_ICR(shortcut, vector);

/*
* Send the IPI. The write to APIC_ICR fires this off.
*/
apic_write_around(APIC_ICR, cfg);
}

There's no good way to override __send_IPI_shortcut. I suppose we could add paravirt ops for __send_IPI_shortcut and every other op that touches the APIC. But there are dozens of functions in apic.c that would need to be included in paravirt ops. And for our implementation, we really just want to override apic_read and apic_write, since we can make these faster when done through hypercalls than through memory accesses. If we were to make these paravirt ops, their implementations would be the same, except with a different apic_read and apic_write. This is a whole lot of useless code duplication.

Most of the interrupt system is not written in such a way that multiple APICs implementations can be selected from at boot time. This is an absolute requirement so that the same kernel can boot on native and in a paravirtualized environment. While this could be implemented, it seems like a waste of time, since we can just emulate something similar to a real interrupt system and not change things very much.

Dan Arai
VMware, Inc.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Next message: Eric W. Biederman: "Re: [PATCH] Fix get_unmapped_area and fsync for hugetlb shm segments"
Previous message: Paul Menage: "Re: [ckrm-tech] [PATCH 0/2] resource control file system - aka containers on top of nsproxy!"
In reply to: Thomas Gleixner: "Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree"
Next in thread: Jeremy Fitzhardinge: "Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]