Re: [PATCH] sched_clock: Prevent 64bit inatomicity on 32bit systems

From: Steven Rostedt
Date: Mon Apr 08 2013 - 20:31:17 EST


On Sat, 2013-04-06 at 10:10 +0200, Thomas Gleixner wrote:
> The sched_clock_remote() implementation has the following inatomicity
> problem on 32bit systems when accessing the remote scd->clock, which
> is a 64bit value.
>
> CPU0 CPU1
>
> sched_clock_local() sched_clock_remote(CPU0)
> ...
> remote_clock = scd[CPU0]->clock
> read_low32bit(scd[CPU0]->clock)
> cmpxchg64(scd->clock,...)
> read_high32bit(scd[CPU0]->clock)
>
> While the update of scd->clock is using an atomic64 mechanism, the
> readout on the remote cpu is not, which can cause completely bogus
> readouts.
>
> It is a quite rare problem, because it requires the update to hit the
> narrow race window between the low/high readout and the update must go
> across the 32bit boundary.
>
> The resulting misbehaviour is, that CPU1 will see the sched_clock on
> CPU1 ~4 seconds ahead of it's own and update CPU1s sched_clock value
> to this bogus timestamp. This stays that way due to the clamping
> implementation for about 4 seconds until the synchronization with
> CLOCK_MONOTONIC undoes the problem.
>
> The issue is hard to observe, because it might only result in a less
> accurate SCHED_OTHER timeslicing behaviour. To create observable
> damage on realtime scheduling classes, it is necessary that the bogus
> update of CPU1 sched_clock happens in the context of an realtime
> thread, which then gets charged 4 seconds of RT runtime, which results
> in the RT throttler mechanism to trigger and prevent scheduling of RT
> tasks for a little less than 4 seconds. So this is quite unlikely as
> well.
>
> The issue was quite hard to decode as the reproduction time is between
> 2 days and 3 weeks and intrusive tracing makes it less likely, but the
> following trace recorded with trace_clock=global, which uses
> sched_clock_local(), gave the final hint:
>
> <idle>-0 0d..30 400269.477150: hrtimer_cancel: hrtimer=0xf7061e80
> <idle>-0 0d..30 400269.477151: hrtimer_start: hrtimer=0xf7061e80 ...
> irq/20-S-587 1d..32 400273.772118: sched_wakeup: comm= ... target_cpu=0
> <idle>-0 0dN.30 400273.772118: hrtimer_cancel: hrtimer=0xf7061e80
>
> What happens is that CPU0 goes idle and invokes
> sched_clock_idle_sleep_event() which invokes sched_clock_local() and
> CPU1 runs a remote wakeup for CPU0 at the same time, which invokes
> sched_remote_clock(). The time jump gets propagated to CPU0 via
> sched_remote_clock() and stays stale on both cores for ~4 seconds.
>
> There are only two other possibilities, which could cause a stale
> sched clock:
>
> 1) ktime_get() which reads out CLOCK_MONOTONIC returns a sporadic
> wrong value.
>
> 2) sched_clock() which reads the TSC returns a sporadic wrong value.
>
> #1 can be excluded because sched_clock would continue to increase for
> one jiffy and then go stale.
>
> #2 can be excluded because it would not make the clock jump
> forward. It would just result in a stale sched_clock for one jiffy.
>
> After quite some brain twisting and finding the same pattern on other
> traces, sched_clock_remote() remained the only place which could cause
> such a problem and as explained above it's indeed racy on 32bit
> systems.
>
> So while on 64bit systems the readout is atomic, we need to verify the
> remote readout on 32bit machines. We need to protect the local->clock
> readout in sched_clock_remote() on 32bit as well because an NMI could
> hit between the low and the high readout, call sched_clock_local() and
> modify local->clock.
>
> Thanks to Siegfried Wulsch for bearing with my debug requests and
> going through the tedious tasks of running a bunch of reproducer
> systems to generate the debug information which let me decode the
> issue.

Ug. That looks painful.

Nice catch!

-- Steve


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/