new proposal for doing SMP [long]

Ingo Molnar (mingo@pc5829.hil.siemens.at)
Tue, 21 Jan 1997 14:54:21 +0100 (MET)


On Sun, 19 Jan 1997, David S. Miller wrote:

> [...] So I'll brain dump what Linus and I
> have come up with so far [...]

the following stuff will sound redundant, but i think the list of
implementation possibilities becomes more apparent when we categorize
what we have so far, from the hardware side, through our current
assumptions in the kernel source code, down to the 'maximum possible
implementation'. Sometimes i will repeat things in a painfully slow way,
but this is just to be sure that everybody is on the same track, and that
i don't make false assumptions. [but anyway, you can delete this mail
right now :)]

on the hardware side we have:

- N CPUs, each with its own context

- a pool of IRQ contexts, which are spontaneously started by an
external event. We can mask off any context, and once started we
have to explicitly re-enable the next start for the same context.

on the software side

- there are many more uniprocessor systems around than SMP systems,
so the most we can expect from our fearless leader for now is to let us
change the irq, entry and scheduling code =B-)

- we have nearly 1 million lines of code, of which ~300 thousand lines
(the irq handlers) are prepared to run in parallel. The rest of the code
can be interrupted, but it can run on only one CPU at a time.

so we have .. 1 'main kernel' context, and M irq contexts on N CPUs. The
current Linux implementation for fast irq handlers assumes that the same
irq context is started again only after it has finished completely. [when
we talk about irq contexts ... in the Intel case it's the hardware
interrupt. For example the IDE driver can have two interrupt contexts and
it is well prepared for them: one context on irq 14, one on irq 15]

Thus the maximum parallel implementation we can aim for currently is to
have the main context running on 1 arbitrary CPU, and arbitrary (but
exclusive to themselves) IRQ contexts running on all CPUs. [omitting
user-space for this mail]

All contexts define 'atomic blocks', starting with a cli() and ending
with a sti(). The code relies heavily on this. If a cli() atomic block is
active, other contexts might run, but only in a nonatomic state. Once
they hit a cli() block, they have to wait.

Such atomic blocks can be nested; this is achieved through save_flags()
and restore_flags(). Whenever such an atomic block is active, no other
context is allowed to run freely, on any CPU. Thus it is sufficient to
define the nesting on the same CPU, like the current method does, by
using a stack variable (flags). This way save_flags() and restore_flags()
can be redefined in terms of cli() and sti() only, so we can forget about
them.
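
just as a sketch of what i mean (all names are made up, this is not
the current code):

/* cli_lock_is_ours() is a hypothetical test of the 'we hold the
 * global cli lock' state, cli()/sti() are the usual primitives */
extern int  cli_lock_is_ours(void);
extern void cli(void), sti(void);

#define save_flags(flags)    do { (flags) = cli_lock_is_ours(); } while (0)
#define restore_flags(flags) do { if (!(flags)) sti(); } while (0)

/* an inner atomic block sees flags != 0, so its restore_flags()
 * keeps the lock; only the outermost restore_flags() does the
 * real sti() */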

the irq contexts themselves both listen to cli()/sti() and use them too.
Thus we have 'one main locking service', which syncs the main kernel and
the irq contexts. And we have the current assumption that the same irq
context can only run once at a time.

what does this mean ... generally, we do not want to restrict the external
(in hardware) irq scheduler (the IO APIC in the Intel SMP case), thus we
want to be prepared to get a new irq context started on any CPU, if:

- the same irq context is not already running
- and there is no atomic block running

From the above i think it's apparent that we have three conceptual locking
restrictions: 'main context only once', 'listen to cli', 'same irq only
once'.
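
in irq-entry terms, the two irq conditions above could be checked with
something as dumb as this (all names are made up, and nothing here is
atomic yet, that comes later):

/* one flag per conceptual restriction */
extern int main_ctx_running;    /* 'main context only once', used by the
                                   scheduling/entry code, not here      */
extern int cli_lock;            /* 'listen to cli'                      */
extern int irq_running[16];     /* 'same irq only once' (16 on Intel)   */

int may_start_irq_context(int irq)
{
        return !irq_running[irq] && !cli_lock;
}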

[ important sidenote: software contexts with no implicit cli/sti locking
are quite possible, that's why we do not want to have a 'global hardware
cli' ]

Part of the above locking semantics can be implemented in hardware: the
'same irq only once' can be done on the Intel platform by masking off one
specific irq on the PIC. This might or might not be possible on other
platforms.

implementation problem: at least on the Intel architecture, the hardware
cannot provide the other two locking requirements in an atomic way. We
cannot atomically 'lock off one irq and set a global lock'. We cannot
atomically 'enable other irq contexts, and clear the global lock'. I think
it's not possible on many other platforms either, so let's discuss this as
a main kernel implementation issue :) This is why we have to be prepared
to soft handle irqs that might have arrived during an atomic block.

[ as a sidenote, on Intel we could do this atomically, i think we could
mask off all IRQs on the PIC/APIC in a 'cli()', but this is so painfully
slow {on my system} that i think we have to go the soft way. But it's a
theoretical possibility :) ]
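
just to illustrate why it is so slow: masking everything on the two
8259 PICs means several ISA port accesses (a rough sketch of the idea,
not something i propose):

#include <sys/io.h>     /* inb()/outb(), only for the sketch */

static unsigned char saved_master, saved_slave;

/* 'hard' global cli: mask all 16 irqs in the 8259 interrupt mask
 * registers (ports 0x21 and 0xA1); every port access here is painfully
 * slow compared to one cached memory test */
static void hard_global_cli(void)
{
        saved_master = inb(0x21);
        saved_slave  = inb(0xA1);
        outb(0xff, 0x21);
        outb(0xff, 0xA1);
}

static void hard_global_sti(void)
{
        outb(saved_master, 0x21);
        outb(saved_slave, 0xA1);
}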

As with any soft lock implementation, we have to be prepared to handle any
intermediate state of the locking system. irq contexts might get started
at any time during the manipulation of the locking system, and generally
they might see any intermediate state while the state is being changed on
another CPU. Or they might interrupt a context on the same CPU. Two irq
contexts might get started >at once< on two CPUs (in the Intel case, if
there are several IO APICs and the hardware irq routing is set up
cleverly).

I think the simplest but still functional implementation would be to
protect our complex locking system with a generic one-bit lock. Thus both
the cli/sti and the irq entry code have to spin for this one-bit lock,
to be able to muck with our irq scheduling and locking state.
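
ie. something like this (written with portable atomics just for the
sketch; in the kernel it would be an asm test-and-set loop):

#include <stdatomic.h>

/* the generic one-bit lock that protects the irq scheduling and
 * locking state */
static atomic_flag lock_state_lock = ATOMIC_FLAG_INIT;

static void grab_lock_state(void)
{
        while (atomic_flag_test_and_set(&lock_state_lock))
                ;       /* spin until the bit is ours */
}

static void release_lock_state(void)
{
        atomic_flag_clear(&lock_state_lock);
}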

We generally have to split up our 'irq context' into 'hardware irq
context' and 'software irq context'. The above abstract locking semantics
are all true for the software irq context (or 'abstract irq context').

To sum it up, we have the following hardware possibilities:

- M hardware irq contexts
- all hw contexts can be masked off on one CPU (<--- hw_cli())
- one hw context can be masked off from all CPUs (<--- PIC masking)

In the locking system we have to implement the locking semantics defined
on the software irq contexts and the main context.

Whenever we manipulate the locking system, it seems natural to first mask
off all hw contexts on the CPU doing the manipulation. This does not
prevent other hw contexts from getting started on other CPUs. Then we
should spin for the cli flag.
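
so the sequence for touching the locking state would be roughly (names
made up again):

extern void hw_cli(void);       /* mask all hw contexts on this CPU */
extern void grab_lock_state(void), release_lock_state(void);

static void modify_locking_state(void (*change)(void))
{
        hw_cli();               /* no hw context can interrupt us here */
        grab_lock_state();      /* other CPUs keep running, so spin    */
        change();               /* e.g. take or drop the cli flag      */
        release_lock_state();
        /* whether hw irqs get unmasked again afterwards depends on
           what was changed */
}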

We have two types of software irq contexts currently: those with an
implicit cli() at their beginning [fast irq handlers], and those with
no implicit cli() [slow irq handlers]. This difference can be used for
optimization: slow sw irq contexts can be started without competing
for the global cli/sti lock. [they still have to listen to the 'only
one sw irq context at once' rule]
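
in entry-code terms the difference would look something like this
(again only a sketch with made-up helpers):

extern int  mark_irq_running(int irq);  /* 0 if this irq already runs  */
extern void get_cli_lock(void);
extern void run_handler(int irq);

void sw_irq_entry(int irq, int fast)
{
        if (!mark_irq_running(irq))
                return;                 /* 'same irq only once'        */
        if (fast)
                get_cli_lock();         /* the implicit cli() of fast
                                           handlers                    */
        run_handler(irq);
        /* the exit path clears the running flag (and the cli lock for
           fast handlers) again */
}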

after the spinning is done and the context has acquired the cli lock,
it can continue execution until a real sti().

There might be other cli()s along its way, but in the generic code,
the 'irqs off' flag means that we have the cli() lock! Thus instead of
the current way of doing a blind hw_cli(), we just have to test the
cli flag in a lightweight nonatomic way {no one is supposed to take
away our cli flag}. Right from the cache ... 2 cycles or so =P

So the cli() code would look something like this:

if (!we_have_the_cli_flag) {
        hw_cli();                  /* mask hw irq contexts on this CPU */
        run_for_the_cli_flag();    /* spin until the cli lock is ours  */
}

the point is to take away the complexity from the cli code, and put
all the soft locking trouble into the irq scheduling and irq entry
code (and the normal scheduling and entry code).

( because the frequency of cli/sti is much higher than the frequency
of irq and scheduling events. It's an implementation decision. )

[ and it might even make the uniprocessor case faster ... as this
makes 'nested cli()s' slightly faster ... 7 clocks for a hw_cli(),
against a cached memory test operation which takes some 2 or 3
cycles, when the cli lock is already ours.]

AND, very importantly, we have not talked about the main context so far.
For the cli/sti atomic block semantics, there is no difference between
the sw irq contexts and the main context. I think all the required
main context semantics can be implemented in the scheduling and entry
code. [plus at the end of a slow sw irq context we have some
scheduling semantics driven through the global need_resched flag]

Back to the cli/sti block issues. If somebody holds the cli/sti flag,
hw irq contexts might get started on another CPU. If that context
is a slow irq context, it won't race for the cli/sti flag, so it will
happily start execution. If the new hw irq context belongs to a fast
irq handler, the most trivial implementation is to do a simple cli()
in C, as part of the handler routine. But we have other implementation
possibilities too: for example we could implement an 'irq scheduler',
which schedules 'blocking sw fast irq contexts'. Thus we could continue
executing useful code on the same CPU. Generally we cannot count on
a cli/sti block being short, that's why this might be a viable
possibility.

If a cli/sti block gets to a sti(), the easiest implementation is to
give up the lock and continue execution. As hw irq contexts are blocked
on the same CPU, it cannot happen that the cli/sti block gets
interrupted by a hw irq context. [but another CPU might see one]
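
in this simple (no irq scheduler) case sti() is just (made-up names):

extern void drop_cli_lock(void);
extern void hw_sti(void);

void sti(void)
{
        drop_cli_lock();   /* other CPUs may now enter cli blocks      */
        hw_sti();          /* and hw irq contexts may hit this CPU too */
}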

but if the irq scheduler is implemented, sti() has to 'wake up' blocked
hw irqs. [most probably by sending a special irq to the other CPU].
The irq scheduler implementation is complex, and i don't know whether it
makes sense to 'block on the cli flag'. It could be made fast,
since the 16 Intel interrupts fit well into a 32 bit bitmask.
[i don't know how this would work out on other architectures]
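
one possible shape of this, with the blocked fast irqs kept in a
bitmask and replayed by the sti()-ing CPU itself instead of an IPI
(all names made up, and i'm ignoring the re-check races a real version
would have to close):

#include <stdatomic.h>

extern int  try_get_cli_lock(void);     /* nonblocking, 1 on success */
extern void drop_cli_lock(void);
extern void hw_sti(void);
extern void run_fast_handler(int irq);

/* one bit per irq; the 16 Intel interrupts fit easily */
static atomic_uint pending_fast_irqs;

void fast_irq_entry(int irq)
{
        if (!try_get_cli_lock()) {
                /* block this sw context, let the CPU do useful work */
                atomic_fetch_or(&pending_fast_irqs, 1u << irq);
                return;
        }
        run_fast_handler(irq);
        drop_cli_lock();
}

/* sti() with the 'irq scheduler': replay the blocked contexts while
 * we still own the cli lock, then give it up */
void sti(void)
{
        unsigned int pending = atomic_exchange(&pending_fast_irqs, 0);

        while (pending) {
                int irq = __builtin_ctz(pending);   /* lowest set bit */
                pending &= pending - 1;
                run_fast_handler(irq);
        }
        drop_cli_lock();
        hw_sti();
}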

about the 'main context'

The main context can only run once. Theoretically, a single flag
would be enough to achieve this. Other CPUs do not have to know
which CPU is running the main context. They just have to know
that it's not them who has the flag :)
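
ie. something as small as this (portable atomics only for the sketch):

#include <stdatomic.h>

static atomic_flag main_context_flag = ATOMIC_FLAG_INIT;

/* returns 1 if we may enter the main kernel context, 0 if some other
 * CPU is already running it (we do not care which one) */
static int try_enter_main_context(void)
{
        return !atomic_flag_test_and_set(&main_context_flag);
}

static void leave_main_context(void)
{
        atomic_flag_clear(&main_context_flag);
}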

the only reason i'm aware of why we have the active_processor[]
array currently is the TLB shootdown semantics we have to
implement ... [more about this later]

So far it's enough to have 1 flag for cli/sti blocks, and one
flag for the main context. Both are spinlocks. Should the irq
scheduler be implemented? If cli/sti blocks are very small
then there is no need for it. If they are big, then we might
win a bit more user-context execution time.

About nested interrupts.

'nested interrupts' are easy in the uniprocessor case. The 'intr_count'
variable shows the nesting level of active irq handlers. It changes
some kernel semantics slightly: we get warnings when we try to
sleep, and it changes the way we return from a slow interrupt
handler.

[i hope i haven't missed any important semantics here]

Now, the real meaning of 'intr_count' only makes sense for the
>caller<, not for the other running contexts. Thus we have to make
this variable per-CPU. In the Intel case we could sacrifice one bit
from %gs or %fs to show that we are in an interrupt handler. Or one
bit in current->.

Do we need to know how deep the nesting is? I see only one reason
for it: to separate the return path for syscalls and for irq
handlers, and i guess this problem can be solved by simply
separating them ... that makes the code simpler/faster, at the
expense of some cache memory.

so ... nested irq semantics have brought us a per CPU flag,
which imho should be put into a CPU-local variable.
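
as a sketch (smp_processor_id()-style indexing shown here; a %fs/%gs
bit or a bit in current-> would do the same job):

#define NR_CPUS 32

extern int smp_processor_id(void);

/* per-CPU 'we are in an interrupt handler' flag, replacing the global
 * intr_count */
static char in_irq_handler[NR_CPUS];

#define enter_irq_handler()  (in_irq_handler[smp_processor_id()] = 1)
#define leave_irq_handler()  (in_irq_handler[smp_processor_id()] = 0)
#define in_interrupt()       (in_irq_handler[smp_processor_id()])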

another issue: do we need to know about the 'current process' in
an interrupt handler? If we get rid of the 'i know which CPU i'm
running on' stuff, we should not use current either.

about TLB shootdown and cross-CPU nonexternal hw irq contexts:

software generated hw irq contexts are possible on the Intel
platform; currently they are used for implementing TLB shootdown
and reschedule irqs. They should be separated from the 'normal'
external hw contexts, as they can be totally controlled.
Additionally, in Alan's IRQ forwarding patches they are used for
'deferring' external hw irqs.

They should have different entry code, different exit code, thus
they can be handled separately from this cli/sti issue.

[Dave wrote:]

> The things to keep in mind for any proposed implementation of all of
> this seems to be:

> 2) The current interface c-code uses must work as is no matter
> how it is implemented, thus save_flags(), cli(), sti(),
> and restore_flags() must "appear" to the code which uses
> them now to do the same thing.
>
> 3) Things should work just as expected from within an
> interrupt handler.

My scheme breaks one thing. It assumes that other hw contexts can run
in noncli state while one context runs in cli state.

This should not break IMHO, but maybe there is too much code that doesn't
expect this, i don't know ... but this would be a big performance win.
If we want to stop ALL processors for doing a cli block, then we first
have to send them IRQs (or worse, have to let them notice our intention),
then we have to do the cli stuff, and have to restart those CPUs ... not
good.

but if my assumption is acceptable, then we can do all the basic stuff
with two one-bit flags.

at the price of separated exit code, and separated TLB shootdown/reschedule
code. [which will admittedly have more complexity than the main irq
code ... ]

Ingo