Re: [PATCH 08/37] cputime: Convert task/group cputime to nsecs
From: Stanislaw Gruszka
Date: Mon Jan 30 2017 - 08:57:58 EST
On Sat, Jan 28, 2017 at 04:28:13PM +0100, Frederic Weisbecker wrote:
> On Sat, Jan 28, 2017 at 12:57:40PM +0100, Stanislaw Gruszka wrote:
> > On 32 bit architectures 64bit store/load is not atomic and if not
> > protected - 64bit variables can be mangled. I do not see any protection
> > (lock) between utime/stime store and load in the patch and seems that
> > {u/s}time store/load can be performed at the same time. Though problem
> > is very very improbable it still can happen at least theoretically when
> > lower and upper 32 bits are changed at the same time i.e. process
> > {u,s}time become near to multiple of 2**32 nsec (aprox: 4sec) and
> > 64bit {u,s}time is stored and loaded at the same time on different
> > cpus. As said this is very improbable situation, but eventually could
> > be possible on long lived processes.
>
> "Improbable situation" doesn't appply to Linux. With millions (billion?)
> of machines using it, a rare issue in the core turns into likely to happen
> somewhere in the planet every second.
>
> So it's definetly a race we want to consider. Note it goes beyond the scope
> of this patchset as the issue was already there before since cputime_t can already
> map to u64 on 32 bits systems upstream. But this patchset definetly extends
> the issue on all 32 bits configs.
>
> kcpustat has the same issue upstream. It's is made of u64 on all configs.
I would like to add what are possible consequences if value will be
mangled. For sum_exec_runtime, utime and stime we could get wrong values
on cpu-clock related syscalls like clock_gettime() or clock_nanosleep()
and cpu-clock timers like timer_create(CLOCK_PROCESS_CPUTIME_ID) can be
triggered before or long after expected. For kcpustat this seems to be
wrong values read by procfs and 3 drivers (cpufreq, appldata, macintosh).
> > I considering fixing problem of sum_exec_runtime possible mangling
> > by using prev_sum_exec_runtime:
> >
> > u64 read_sum_exec_runtime(struct task_struct *t)
> > {
> > u64 ns, prev_ns;
> >
> > do {
> > prev_ns = READ_ONCE(t->se.prev_sum_exec_runtime);
> > ns = READ_ONCE(t->se.sum_exec_runtime);
> > } while (ns < prev_ns || ns > (prev_ns + U32_MAX));
> >
> > return ns;
> > }
> >
> > This should work based on fact that prev_sum_exec_runtime and
> > sum_exec_runtime are not modified and stored at the same time, so only
> > one of those variabled can be mangled. Though I need to think about
> > correctnes of that a bit more.
>
> I'm not sure that would be enough. READ_ONCE prevents from reordering by the
> compiler but not by the CPU. You'd need memory barriers between reads and
> writes of prev_ns and ns.
It will not be enough, this _suppose_ to work based on that sum_exec_runtime
and prev_sum_exec_runtime are not written at the same time. i.e. only
one variable can be mangled as another one sits already in the memory.
However "not written at the same time" is weak part of reasoning. Even
if those variables are stored at different part of code (sum_exec_runtime
on update_curr() and prev_sum_exec_runtime on set_next_entity()) we can
not assume store of one variable is finished before another one starts.
> WRITE ns READ prev_ns
> smp_wmb() smp_rmb()
> WRITE prev_ns READ ns
> smp_wmb() smp_rmb()
>
> It seems to be the only way to make sure that at least one of the reads
> (prev_ns or ns) is correct.
I think you have right, but seems on much code paths we have scenario:
WRITE ns READ prev_ns
smp_wmb() smp_rmb()
WRITE prev_ns READ ns
and we have already smp_wmb() after write of sum_exec_runtime on
update_min_vruntime().
Stanislaw