Re: [BUG] "sched: Remove rq->lock from the first half of ttwu()"locks up on ARM

From: Russell King - ARM Linux
Date: Thu May 26 2011 - 09:37:33 EST

Next message: Pavel Machek: "Re: [GIT PULL] omap changes for v2.6.39 merge window"
Previous message: Ted Ts'o: "Re: UNIX Compatibility"
In reply to: Ingo Molnar: "Re: [BUG] "sched: Remove rq->lock from the first half of ttwu()"locks up on ARM"
Next in thread: Catalin Marinas: "Re: [BUG] "sched: Remove rq->lock from the first half of ttwu()"locks up on ARM"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Thu, May 26, 2011 at 02:50:07PM +0200, Ingo Molnar wrote:
>
> * Russell King - ARM Linux <linux@xxxxxxxxxxxxxxxx> wrote:
>
> > On Thu, May 26, 2011 at 02:26:23PM +0200, Ingo Molnar wrote:
> > >
> > > * Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
> > >
> > > > Sort this by reverting to the old behaviour for this situation
> > > > and perform a full remote wake-up.
> > >
> > > Btw., ARM should consider switching most of its subarchitectures
> > > to !__ARCH_WANT_INTERRUPTS_ON_CTXSW - enabling irqs during
> > > context switches is silly and now expensive as well.
> >
> > Not going to happen. The reason we do it is because most of the
> > CPUs have to (slowly) flush their caches during switch_mm(), and to
> > have IRQs off over the cache flush means that we lose IRQs.
>
> How much time does that take on contemporary ARM hardware, typically
> (and worst-case)?

I can't give you precise figures because it really depends on the hardware
and how it is set up. All I can do is give you examples from platforms
I have running here which rely upon this.

Some ARM CPUs have to read 32K of data into the data cache in order to
ensure that any dirty data is flushed out. Others have to loop over the
cache segments/entries, cleaning and invalidating each one (that's 8 x 64
for ARM920, so 512 iterations).

If my userspace program is correct, then it looks like StrongARM takes
about 700us to read 32K of data into the cache.

Measuring the context switches per second on the same machine (using an
old version of the Byte Benchmarks) gives about 904 context switches per
second (equating to 1.1ms per switch), so this figure looks about right.

Same CPU but different hardware gives 698 context switches per second -
about 1.4ms per switch. With IRQs enabled, it's possible to make this
work but you have to read 64K of data instead, which would double the
context switch latency here.

On an ARM920 machine, running the same program gives around 2476 per
second, which is around 400us per switch.
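
As a rough cross-check of the per-switch figures above, here is a minimal
sketch that converts the measured rates into average time per switch. The
helper name switch_time_us is made up for illustration, and the result is
the whole benchmark cost per switch, not just the cache flush.

```python
def switch_time_us(switches_per_second):
    """Average time per context switch, in microseconds."""
    return 1_000_000 / switches_per_second


# Context switch rates quoted in this message (Byte Benchmarks).
measurements = [
    ("StrongARM, first machine", 904),
    ("StrongARM, different hardware", 698),
    ("ARM920", 2476),
]

for label, rate in measurements:
    print(f"{label}: {rate}/s -> about {switch_time_us(rate):.0f}us per switch")
```

This gives roughly 1106us, 1433us and 404us, which matches the rounded
1.1ms, 1.4ms and 400us above.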

Your typical 16550A with a 16-byte FIFO running at 115200 baud will fill
from completely empty to overrun in 1.1ms. Realistically, you'll start
getting overruns well below that because of the FIFO thresholds - which
may trigger an IRQ at half-full. So 600us.

This would mean 16550A's would be entirely unusable with StrongARM, with
an overrun guaranteed at every context switch.
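
The FIFO arithmetic can be sketched as below. This is an illustration
under stated assumptions, not a model of a real 16550A: chars_time_us is a
hypothetical helper, the receiver shift register (which holds one extra
character) is ignored, and the half-full threshold of 8 bytes is taken
from the text above. Counting 8 bits per character reproduces the 1.1ms
figure; counting 10 bits per character (8N1 framing with start and stop
bits) gives a slightly longer time.

```python
def chars_time_us(n_chars, baud, bits_per_char):
    """Time to receive n_chars back-to-back characters, in microseconds."""
    return n_chars * bits_per_char * 1_000_000 / baud


FIFO_DEPTH = 16      # bytes in the receive FIFO
HALF_FULL = 8        # assumed IRQ trigger level (half-full)
BAUD = 115200
IRQS_OFF_US = 700    # StrongARM cache flush estimate from this message

for bits in (8, 10):  # 8 = data bits only, 10 = 8N1 with start/stop bits
    fill = chars_time_us(FIFO_DEPTH, BAUD, bits)
    # Slack left once the half-full IRQ has been raised but not yet serviced.
    slack = chars_time_us(FIFO_DEPTH - HALF_FULL, BAUD, bits)
    verdict = "overrun" if slack < IRQS_OFF_US else "ok"
    print(f"{bits} bits/char: empty to full {fill:.0f}us, "
          f"slack after half-full IRQ {slack:.0f}us -> {verdict}")
```

Either way, the slack left after a half-full IRQ is below 700us, so the
conclusion does not depend on which bit count is used.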

This is not the whole story: if you have timing sensitive peripherals
like UARTs, then 1.1ms - 700us doesn't sound that bad, until you start
considering other IRQ load which can lock out servicing those peripherals
while other interrupt handlers are running.

So all in all, having IRQs off for on the order of 700us over a context
switch is a complete non-starter of an idea.