Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree

From: Zachary Amsden
Date: Thu Mar 08 2007 - 03:01:31 EST


Thomas Gleixner wrote:
On Wed, 2007-03-07 at 17:01 -0800, Daniel Arai wrote:
But more importantly, we want a kernel that can run both on native hardware and in a paravirtualized environment. Linux doesn't really provide abstractions for replacing the appropriate code. We tried to hook into the source code at a level that seemed possible.

Again. You just refuse to change your implementation and you want to
keep it by arguing how hard it is because there are no abstractions.

It is no longer possible to change our _hypervisor_ implementation. The Linux side of our code is entirely flexible, and we are trying to change it, but it hasn't always been clear what you want us to do.

Your prayer wheel argument of missing abstractions and easiness of
emulating things is annoying. If you think it is better to emulate APIC,
please emulate it without paravirt ops. If you want the speed
improvement, work with us to create the interfaces and abstractions
which are necessary to have a sane, maintainable and useful for all
hypervisors implementation.

That's what we are doing. Our prayer wheel would be easier appeased if you actually told us which parts of the VMI timer you objected to. As I understand it now:

1) We should not call into external functions in other time sources; any common code should be merged up
2) We should not be using global_clock_event; it is a horrible hack which you want to remove
3) We should not use the smp_apic_timer_interrupt assembly code which calls up to the lapic timer handlers
4) We should not add our own assembly code to call out to a local timer handler (from Ingo)

These last two points create a conflict which is a little tricky to solve. We can't add our own custom timer handler, and we can't re-use the APIC timer handler. But there is no timer handler available on i386 that works, since the handlers will fall back to either PIC or IO-APIC edge handling. Using either of those for the local timer interrupt on SMP does not work because they assume traditional IRQ semantics - an IRQ raised from the bus should be serviced by one processor. Re-raises of the same IRQ on remote processors are locked out by the handler, and dropped. Thus simultaneous local timers firing on multiple CPUs cause only one to be serviced.

This does not work for local timer interrupts in NO_HZ mode, because they must always be serviced so that they can reschedule the next local timer. I have a proposed solution to this issue, but it fails to work when the IO-APIC assumes control of all IRQs based on ACPI results (which we control, but can't change because of compatibility issues with other operating systems).

My proposal is to keep IRQ-0 as the timer interrupt, on all CPUs, but fire it from the LAPIC after local apic timers get initialized. We would do this by converting the irq handler using set_irq_handler(0, handle_percpu_irq). The only problem is the IO-APIC code will want to take over IRQ0 and convert it to an edge triggered IO-APIC interrupt. But for the local irq handlers to work, we have to keep them using the handle_percpu_irq handler, and can't let the IO-APIC steal these vectors. There is no way to do conditionally for just a specific set of IRQs in tree today, so we would need to add a special case to io_apic.c to allow early boot code to reserve specific vectors so they are not subsumed by the IO-APIC. This seems reasonable, but is a special case.

If, on the other hand, we are allowed to use our own assembly code to call out to our local timer handler (dropping constraint #4 above), we can simply rewire LOCAL_TIMER_VECTOR to point to this code, but now we must emulate the semantics of irq_enter / leave / etc inside our code, which is also not the cleanest solution. We used to do this, and it caught flak I believe from Ingo.

The basic problem is that a local IRQ doesn't behave like a global IRQ, and the i386 backend is unaware of how to set up any local IRQs except in the case of local APIC, but you have told us we should not re-use the APIC handlers by overloading global_clock_event. The patches we sent out recently did just this, but seemed to meet even more violence than our previous way of doing things.

So the question is, which approach do you prefer?

Zach
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/