Re: [GIT pull] printk updates for 4.15

From: Thomas Gleixner
Date: Tue Nov 14 2017 - 15:22:07 EST


On Tue, 14 Nov 2017, Linus Torvalds wrote:
>
> But that really means that you can't just change what "ts_nsec" means
> - it's exported as "microseconds since boot".

That's true if sched clock is stable and high resolution and does not stop
during suspend.

So in reality this clock has random meanings depending on the hardware
properties. Really well "designed".

> Now, the important thing to keep in mind is that the kernel is not the
> _only_ source of these "_SOURCE_MONOTONIC_TIMESTAMP" things.
>
> In other places, systemd generates them itself, with
>
> ts->monotonic = now(CLOCK_MONOTONIC);
>
> where "now()" ends up using clock_gettime().
>
> End result? The kernel /dev/kmsg interface has to be at least
> "similar" to returning usec's comparable to CLOCK_MONOTONIC.

First of all sched_clock() and CLOCK_MONOTONIC can drift apart frequency
wise pretty bad.

Secondly, on a machine with non stop TSC sched_clock() advances by the time
the system spent in suspend, while CLOCK_MONOTONIC does not for historical
reasons. CLOCK_BOOTTIME does because that accounts for the time spent in
suspend, but then again using BOOTTIME for the user space time stamps does
not work on systems where sched_clock() stops in suspend.

So in terms of local machine correlation the only reasonable choice is
CLOCK_MONOTONIC especially when existing user space already tries to
correlate time via clock_gettime(CLOCK_MONOTONIC) based time stamps.

This works on all systems no matter which underlying hardware is used to
retrieve the timestamp. dmesg is not really a performance critical path, so
the few extra cycles which this might take more than using
local/sched_clock() are not an issue.

> This is why I would suggest just keeping the existing clock, and then
> adding clock events at _significant_ points. Somebody brought up the
> complete red herring of "every five minutes will fiill up the system
> log". That's complete BS. You do it when people change the system
> clock (and perhaps when NTP synchronizes), or after a suspend event,
> or maybe regularly every few _hours_ (and make that last part
> configurable).

I'm not worried about the system log spam, I'm more worried that we lose
exactly that last synchronization line due to buffer wrap around.

One possible solution which was discussed before is to have that extra
information in every dmesg timestamp. That means storing CLOCK_MONOTONIC as
the default timestamp and auxillary offsets for CLOCK_BOOTTIME and
CLOCK_REALTIME. Thats simple enough to achieve as we can access that
information via the NMI safe time accessors. Needs a new function, but
thats trivial.

The interesting question is how to print out that information. We can do it
unconditionally, but that might confuse existing tools. A command line
option and a sysfs knob to enable the printout of these extra bits might
make this work.

Thanks,

tglx