Re: [RFC patch 00/15] Tracer Timestamping

From: Mathieu Desnoyers
Date: Mon Oct 20 2008 - 21:32:39 EST

* Nicolas Pitre (nico@xxxxxxx) wrote:
> On Mon, 20 Oct 2008, Mathieu Desnoyers wrote:
> > * Peter Zijlstra (a.p.zijlstra@xxxxxxxxx) wrote:
> > > Have you looked at the existing 32->63 extention code in
> > > include/linux/cnt32_to_63.h and considered unifying it?
> > >
> >
> > Yep, I felt this code was dangerous on SMP given it could suffer from
> > the following type of race due to lack of proper barriers :
> You are wrong.
> > CPU A B
> > read hw cnt low
> > read __m_cnt_hi
> > read hw cnt low
> > (wrap detected)
> > write __m_cnt_hi (incremented)
> > read __m_cnt_hi
> > (wrap detected)
> > write __m_cnt_hi (incremented)
> >
> > we therefore increment the high bits twice in the given race.
> No. Either you do the _same_ incrementation twice effectively writing
> the _same_ high value twice, or you don't have simultaneous wrap
> detections. Simulate it on paper with real numbers if you are not
> convinced.

Hi Nicolas,

Yup, you are right. However, the case where one CPU sees the clock source
a little bit off-sync (late) still poses a problem. Example follows :

read __m_cnt_hi (0x80000000)
read hw cnt low (0x00000001)
(wrap detected :
(s32)(0x80000000 ^ 0x1) < 0)
write __m_cnt_hi = 0x00000001
return 0x0000000100000001
read __m_cnt_hi (0x00000001)
(late) read hw cnt low (0xFFFFFFFA)
(wrap detected :
(s32)(0x00000001 ^ 0xFFFFFFFA) < 0)
write __m_cnt_hi = 0x80000001
return 0x80000001FFFFFFFA
(time jumps)
read __m_cnt_hi (0x80000001)
read hw cnt low (0x00000020)
(wrap detected :
(s32)(0x80000001 ^ 0x20) < 0)
write __m_cnt_hi = 0x00000002
return 0x0000000200000020
(time jumps again)

A similar situation can be generated by out-of-order hi/low bits reads.

> > On UP, the same race could happen if the code is called with preemption
> > enabled.
> Wrong again.
> > I don't think the "volatile" statement would necessarily make sure the
> > compiler and CPU would do the __m_cnt_hi read before the hw cnt low
> > read. A real memory barrier to order mmio reads wrt memory reads (or
> > instruction sync barrier if the value is taken from the cpu registers)
> > would be required to insure such order.
> That's easy enough to fix, right?

For memory read vs cycle counter, something like the "data_barrier(x)"
on powerpc would probably be enough when available. We would need
something that orders memory reads with the rest of instruction
execution. On other architectures, smp_rmb() + get_cycles_barrier() (the
latter being an instruction sync) would be required.

For memory read vs mmap io read, a smp_rmb() should be enough.

> > I also felt it would be more solid to have per-cpu structures to keep
> > track of 32->64 bits TSC updates, given the TSCs can always be slightly
> > out-of-sync :
> If the low part is a per CPU value then the high part has to be a per
> CPU value too. There only needs to be a per-CPU variant of the same
> algorithm where the only difference is that __m_cnt_hi would be per CPU.
> If the low part is global then __m_cnt_hi has to be global, even on SMP.

Hrm, given the performance impact of barriers that would need to be
added to insure correct read order of hi and low bits, and also
considering the risk that a "synchronized" timestamp counter off by a
few cycles poses in terms of time jumping forward, I don't see why we
would ever want to use a global __m_cnt_hi ? I would therefore recommend
to always use a per-cpu __m_cnt_hi.

Also, a small nit, to make sure cnt32_to_63 never gets scheduled out
with a stale old __m_cnt_hi value, its comments should probably say that
it has to be always used with preemption off.


> Nicolas

Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at
Please read the FAQ at