Re: RT-Linux and SMP

Michael Callahan (mjc@stelias.com)
Fri, 25 Apr 1997 16:04:23 -0400 (EDT)


On Thu, 24 Apr 1997, Victor Yodaiken wrote:
>Of course. I'm obviously not making my question sufficiently clear.
>What you seem to be telling me is that some IRQ code modifies critical
>data outside of a cli/sti pair. If this were not true, then simply
>making sure that at most one processor was within a cli/sti pair
>would suffice.

I spent an afternoon last weekend trying to understand the current SMP
code, so I'll venture an answer here. I hope someone more
knowledgeable will step in if I get it wrong.

My understanding of Linux's UP concurrency model is this:
- mainline kernel code can be interrupted by interrupts, except of
course during cli/sti pairs
- slow interrupts can be interrupted by other interrupt handlers,
except during cli/sti pairs--however, a given IRQ is blocked until
its handler(s) finish, so individual handlers for individual IRQ
lines are nonreentrant
- fast interrupts generally run with interrupts off
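
For concreteness, the fast/slow choice is made by the driver when it
registers its handler. The device names below are invented, and I'm
going from memory on the 2.1 interface, but I believe passing
SA_INTERRUPT in the flags argument to request_irq() is what asks for
a fast handler:

/* Sketch only: MYDEV_IRQ_A/B and the handler names are made up. */
static void my_slow_handler(int irq, void *dev_id, struct pt_regs *regs);
static void my_fast_handler(int irq, void *dev_id, struct pt_regs *regs);

int mydev_init(void)
{
        /* slow handler: other interrupts may nest while it runs */
        if (request_irq(MYDEV_IRQ_A, my_slow_handler, 0, "mydev-a", NULL))
                return -EBUSY;

        /* fast handler: runs with interrupts disabled */
        if (request_irq(MYDEV_IRQ_B, my_fast_handler, SA_INTERRUPT,
                        "mydev-b", NULL))
                return -EBUSY;

        return 0;
}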

If a driver's upper half and its interrupt handler need to share data
in this scheme, it suffices for the upper half to block interrupts
while accessing the data. This works as long as only one IRQ's
handler could modify the data, and that situation--data shared
between a driver upper half and a single IRQ handler--is pretty
common. (The baz example further down shows exactly this pattern.)

So, yes, if I understand things properly, IRQ code can modify critical
data safely without doing cli/sti. (Of course, for data that multiple
IRQ handlers might try to modify, the handlers themselves must block
interrupts.)
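
To make that parenthetical concrete, here's the sort of thing I mean
(a sketch, names invented): two handlers on different IRQ lines share
a counter, so each handler has to block interrupts around its own
access, since the other handler could interrupt it:

/* Sketch: shared_count is touched by two different IRQ handlers, */
/* so each handler must block interrupts around the update.       */
static int shared_count;

static void dev_a_handler(int irq, void *dev_id, struct pt_regs *regs)
{
        unsigned long flags;

        save_flags(flags);
        cli();
        shared_count++;
        restore_flags(flags);
}

static void dev_b_handler(int irq, void *dev_id, struct pt_regs *regs)
{
        unsigned long flags;

        save_flags(flags);
        cli();
        shared_count--;
        restore_flags(flags);
}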

The MP implementation in recent 2.1 kernels appears to be aiming to
provide the same model to device driver writers. Namely, mainline
code can assume that during a cli/sti pair no interrupt routine will
be executing (and so cli() might have to wait for some interrupt
handlers to finish before proceeding). An interrupt handler can
assume that no other interrupt on the same IRQ line will execute
concurrently, and that no mainline code is inside a cli/sti pair
while the handler is running.

>On a related issue. In this SMP design, it looks like cli has no
>effect at all on non-irq kernel mode. That is, if one processor
>does a cli while in a system call, nothing prevents a second processor
>from also doing a cli and entering the same code. Did I miss something?

I think this code from arch/i386/kernel/irq.c (similar code is in
arch/sparc/kernel/irq.c) shows that only one cpu can be inside a
cli/sti pair at once. The excerpt is from get_irqlock(), which is
called from __global_cli() (asm-i386/system.h #defines cli() to
__global_cli()):

if (set_bit(0,&global_irq_lock)) {
        /* do we already hold the lock? */
        if ((unsigned char) cpu == global_irq_holder)
                return;
        /* Uhhuh.. Somebody else got it. Wait.. */
        do {
                do {
                        STUCK;
                        check_smp_invalidate(cpu);
                } while (test_bit(0,&global_irq_lock));
        } while (set_bit(0,&global_irq_lock));
}

So, when a cpu exits this loop, it has acquired the global_irq_lock. In
fact, wait_on_irq (called just after this excerpt) can do some pretty
complicated stuff (including dropping the lock and then getting it back),
but all the code seems to honor the invariant that only one processor can
be in a cli()/sti() block at once, and that processor holds the
global_irq_lock and stuffs its number into global_irq_holder.
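
Put another way, and ignoring wait_on_irq and the interrupt-count
bookkeeping entirely, the SMP cli()/sti() pair seems to boil down to
something like the sketch below. This is my paraphrase, not the real
source (and I'm trusting my memory that NO_PROC_ID is the "nobody
holds it" value):

/* Paraphrase of the cli()/sti() lock invariant, not the real code. */
void sketch_global_cli(void)
{
        int cpu = smp_processor_id();

        if ((unsigned char) cpu != global_irq_holder) {
                while (set_bit(0, &global_irq_lock))
                        /* spin until we own the lock bit */ ;
                global_irq_holder = cpu;
        }
}

void sketch_global_sti(void)
{
        if ((unsigned char) smp_processor_id() == global_irq_holder) {
                global_irq_holder = NO_PROC_ID;
                clear_bit(0, &global_irq_lock);
        }
}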

----

By the way, I have a question about the current scheme. Consider the
following code:

int foo;

baz_upper_half() {
        unsigned long flags;
        ...
        save_flags(flags);
        cli();                          /* =1= */
        foo = whatever;                 /* OK to modify: baz_irq_handler
                                           is not running */
        restore_flags(flags);
        ...
}

baz_irq_handler() {
        int cached_foo;
        ...
        foo++;
        cached_foo = foo;               /* OK to read and modify, because
                                           baz_upper_half always does cli() */
        ...
        cli();                          /* =2= */
        ... mutate something unrelated to foo here ...
        sti();

        if (cached_foo != foo)
                panic("Can't happen in UP land!");
}

The panic never happens on a UP kernel, but as I see it, it can happen
on an MP kernel. As I read the code, it's possible for two cpus to
enter cli() at =1= and =2= simultaneously and, furthermore, it's
possible that =1= will win the race. The result is the panic(), where
the interrupt handler realizes it's very confused.

Here's the sequence, by the way:

0) cpu 0 enters cli() at =1=, cpu 1 enters cli() at =2=,
   cpu 2 is in some unrelated interrupt handler;
      global_irq_count = 2
      local_irq_count[] = 0, 1, 1
1) cpu 1 wins the race in get_irqlock and enters wait_on_irq
2) cpu 1, in wait_on_irq, notices that its local_count (1)
   is not equal to global_irq_count (2), so it decrements
   global_irq_count by local_count, drops the lock, and
   starts spinning
3) review: now,
      cpu 0 is spinning in get_irqlock on global_irq_lock
      cpu 1 is spinning in wait_on_irq on global_irq_count != 0
      cpu 2 is still slogging along in its interrupt handler;
      global_irq_count = 1
      local_irq_count[] = 0, 1, 1
4) since only one cpu is contending for global_irq_lock,
   cpu 0 grabs it and enters wait_on_irq; it notices that
   its local_count (0) != global_irq_count (1), so it
   decrements global_irq_count by 0 and drops the
   global_irq_lock
5) now both cpu 0 and cpu 1 are spinning in the wait_on_irq
   loop, waiting for global_irq_count to go to zero
6) cpu 2 finally finishes its interrupt handler; now:
      global_irq_count = 0 !!
      local_irq_count[] = 0, 1, 0
7) cpu 0 wins the wait_on_irq race, grabs the global_irq_lock,
   and returns to get_irqlock, which returns to __global_cli,
   which returns to baz_upper_half

Here baz_upper_half wins the race and gets to modify foo, even though
a UP device driver writer would never expect that upper-half code
would get a chance to run in the middle of an interrupt handler.

Admittedly, I don't know if this gotcha is likely to happen, but if my
understanding is correct, it does seem important for device driver writers
to know that mainline kernel code can execute cli/sti blocks in the middle
of an IRQ handler if that handler itself does cli(). Or perhaps
wait_on_irq should be fixed to make sure that callers with local_irq_count
> 0 beat callers with local_irq_count == 0.
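
Here's roughly what I have in mind for that last suggestion, sketched
outside the real get_irqlock/wait_on_irq structure. The irq_waiters
counter is invented for illustration (and I'm going from memory on the
atomic_* names), and this doesn't close every window, but it shows the
idea: mainline callers back off while any irq-context caller is still
waiting for the lock.

/* Sketch only: irq_waiters is an invented global; a real change */
/* would have to live inside get_irqlock()/wait_on_irq().        */
atomic_t irq_waiters;           /* cpus waiting for cli() from irq context */

void get_irqlock_sketch(int cpu)
{
        int in_irq = (local_irq_count[cpu] != 0);

        if (in_irq)
                atomic_inc(&irq_waiters);

        for (;;) {
                /* mainline callers yield to irq-context callers */
                if (!in_irq && atomic_read(&irq_waiters) != 0)
                        continue;
                if (!set_bit(0, &global_irq_lock))
                        break;          /* got the lock bit */
        }

        if (in_irq)
                atomic_dec(&irq_waiters);
        global_irq_holder = cpu;
}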

Michael