Re: [RFC patch 00/15] Tracer Timestamping

From: Nicolas Pitre
Date: Mon Oct 20 2008 - 22:33:00 EST

On Mon, 20 Oct 2008, Mathieu Desnoyers wrote:

> * Nicolas Pitre (nico@xxxxxxx) wrote:
> > On Mon, 20 Oct 2008, Mathieu Desnoyers wrote:
> >
> > > * Peter Zijlstra (a.p.zijlstra@xxxxxxxxx) wrote:
> > > > Have you looked at the existing 32->63 extension code in
> > > > include/linux/cnt32_to_63.h and considered unifying it?
> > > >
> > >
> > > Yep, I felt this code was dangerous on SMP, given that it could suffer
> > > from the following type of race due to the lack of proper barriers:
> >
> > You are wrong.
> >
> > > CPU A                              CPU B
> > > read hw cnt low
> > > read __m_cnt_hi
> > >                                    read hw cnt low
> > > (wrap detected)
> > > write __m_cnt_hi (incremented)
> > >                                    read __m_cnt_hi
> > >                                    (wrap detected)
> > >                                    write __m_cnt_hi (incremented)
> > >
> > > we therefore increment the high bits twice in the given race.
> >
> > No. Either you do the _same_ incrementation twice effectively writing
> > the _same_ high value twice, or you don't have simultaneous wrap
> > detections. Simulate it on paper with real numbers if you are not
> > convinced.
> >
> Hi Nicolas,
> Yup, you are right. However, the case where one CPU sees the clock source
> a little bit off-sync (late) still poses a problem. Example follows:
>
> CPU A:
>   read __m_cnt_hi (0x80000000)
>   read hw cnt low (0x00000001)
>   (wrap detected :
>    (s32)(0x80000000 ^ 0x1) < 0)
>   write __m_cnt_hi = 0x00000001
>   return 0x0000000100000001
>
> CPU B:
>   read __m_cnt_hi (0x00000001)
>   (late) read hw cnt low (0xFFFFFFFA)
>   (wrap detected :
>    (s32)(0x00000001 ^ 0xFFFFFFFA) < 0)
>   write __m_cnt_hi = 0x80000001
>   return 0x80000001FFFFFFFA
>   (time jumps)
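
For reference, the algorithm in question can be sketched in plain C, modeled
on include/linux/cnt32_to_63.h (reconstructed and simplified here, not a
verbatim copy: the union, volatile qualifier and barrier are omitted). Bit 31
of __m_cnt_hi caches the expected MSB of the low word; bits 30..0 hold the
extended count; bit 63 of the result is a flag bit the caller masks off.
Feeding this sketch the numbers from the example above reproduces the jump:

```c
#include <stdint.h>

/* High-part state: bit 31 caches the expected MSB of the low word,
 * bits 30..0 hold the extended count. */
static uint32_t __m_cnt_hi;

/* Sketch: extend a 32-bit counter reading to 63 bits.  Bit 63 of the
 * returned value is a flag bit, not part of the count. */
static uint64_t cnt32_to_63(uint32_t cnt_lo)
{
	uint32_t hi = __m_cnt_hi;

	/* MSBs differ: the low word has crossed 0x80000000 (or wrapped)
	 * since __m_cnt_hi was last updated, so advance the high part. */
	if ((int32_t)(hi ^ cnt_lo) < 0)
		__m_cnt_hi = hi = (hi ^ 0x80000000) + (hi >> 31);

	return ((uint64_t)hi << 32) | cnt_lo;
}
```

With __m_cnt_hi at 0x80000000, a read of 0x00000001 stores 0x00000001 and
returns 0x0000000100000001; a subsequent late read of 0xFFFFFFFA then falsely
detects a wrap and returns 0x80000001FFFFFFFA, exactly the jump shown.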

If this is through a global counter then I have a big problem believing
that scenario can happen at all (more on that below).

If this is a per-CPU counter then you just need a per-CPU version of
__m_cnt_hi and the problem disappears.
> A similar situation can be generated by out-of-order hi/low bits reads.

This, of course, should and can be prevented. No big deal.
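
The prevention amounts to an ordering constraint between the two reads: the
high word must be read before the hardware counter. A user-space sketch,
using a compiler acquire fence as a stand-in for the kernel's smp_rmb(), and
a plain variable hw_cnt as a hypothetical stand-in for the hardware counter
register:

```c
#include <stdint.h>

static uint32_t __m_cnt_hi;
static uint32_t hw_cnt;		/* hypothetical hw counter register */

static uint64_t cnt32_to_63_ordered(void)
{
	uint32_t hi, lo;

	hi = __m_cnt_hi;
	/* Stand-in for smp_rmb(): the __m_cnt_hi read must not be
	 * reordered after the counter read, otherwise a fresh low word
	 * can be paired with a stale high word. */
	__atomic_thread_fence(__ATOMIC_ACQUIRE);
	lo = hw_cnt;

	if ((int32_t)(hi ^ lo) < 0)
		__m_cnt_hi = hi = (hi ^ 0x80000000) + (hi >> 31);

	return ((uint64_t)hi << 32) | lo;
}
```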

> > If the low part is a per CPU value then the high part has to be a per
> > CPU value too. There only needs to be a per-CPU variant of the same
> > algorithm where the only difference is that __m_cnt_hi would be per CPU.
> >
> > If the low part is global then __m_cnt_hi has to be global, even on SMP.
> >
> Hrm, given the performance impact of the barriers that would need to be
> added to ensure a correct read order of the hi and low bits, and also
> considering the risk that a "synchronized" timestamp counter off by a
> few cycles poses in terms of time jumping forward, I don't see why we
> would ever want to use a global __m_cnt_hi?

I still would like to know just _how_ one CPU could perform two reads
and one write, and another CPU then see that just-written value while
reading an even older value from the global counter than what the first
CPU saw.

> I would therefore recommend to always use a per-cpu __m_cnt_hi.

No, not until you convince me the above scenario is actually possible.
Sure it will work with a per-CPU __m_cnt_hi of course, but then the
timing constraint to keep the algorithm warm applies to each CPU which,
if the source counter is global, shouldn't be necessary.

If the 63-bit counter is queried often enough on each CPU then it is of
course advantageous to go per-CPU to avoid excessive cache bouncing, and
if you start worrying about cache bouncing then you certainly call this
code often enough to keep everything in sync. But that need not always
be the case.
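
The per-CPU variant differs only in where the high word lives. An
illustrative sketch (NR_CPUS and the explicit cpu argument are placeholders
for the kernel's config value and per-cpu accessors):

```c
#include <stdint.h>

#define NR_CPUS 4	/* placeholder for the kernel's config value */

/* One high word per CPU, for the case where each CPU reads its own
 * (non-synchronized) counter. */
static uint32_t __m_cnt_hi_percpu[NR_CPUS];

static uint64_t percpu_cnt32_to_63(unsigned int cpu, uint32_t cnt_lo)
{
	uint32_t hi = __m_cnt_hi_percpu[cpu];

	if ((int32_t)(hi ^ cnt_lo) < 0)
		__m_cnt_hi_percpu[cpu] = hi = (hi ^ 0x80000000) + (hi >> 31);

	return ((uint64_t)hi << 32) | cnt_lo;
}
```

The timing constraint then applies per CPU: each CPU must call this often
enough (at least once per half counter period) to keep its own high word
warm.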

> Also, a small nit: to make sure cnt32_to_63 never gets scheduled out
> with a stale __m_cnt_hi value, its comments should probably say that
> it always has to be used with preemption off.

There is no need to disable preemption (nor for any locking). The only
requirement is that you must not be preempted away for longer than half
(or a quarter if you want to play it really safe) the period of the
base counter. If you cannot guarantee that, then either your base
counter is so fast that it wraps too often to be usable, or you have
another, much bigger problem if you can be preempted for that long.
