Rationale for wall clocks

From: Tomasz Buchert
Date: Fri Jul 30 2010 - 10:58:36 EST


To begin with, there are two main things that my patches concern:
A) Limited access to POSIX CPU clocks
B) Access to wall time information of a process/thread.

By CPU time I understand "user time" + "system time".

The scenario I have is (to make a long story short) that my process
supervises a set of tasks. It "freezes" them (using the freezer cgroup)
when their (CPU time) / (wall time) ratio reaches a certain threshold.
Now, to make this decision as precise as possible, I need a good
measurement of the CPU/wall time of a task (identified by TID). If the
kernel's internal time-keeping is in nanoseconds (well, at least on my
x86 machine), why shouldn't I expect to have access to it?

Let's agree at the very beginning that procfs is not a feasible way
to achieve this with acceptable quality. /proc/[pid]/stat
and /proc/[pid]/task/[tid]/stat expose CPU time in clock ticks
(on my machine sysconf(_SC_CLK_TCK) = 100, so the precision is 10ms).
The start time of a process is given as a number of ticks
after system boot, and the boot time itself is given in /proc/stat as ...
a number of seconds since the beginning of the Unix epoch. That's not good enough.

Ad. A)
clock_gettime is a very nice interface with nanosecond precision
(again, on my x86 machine). You can ask for the CPU time of a thread
or a process, and you can even clock_nanosleep on it.
When asking for the CPU time of a task, however, you can only query
tasks from your own thread group. I see no reason why this
couldn't be extended to all tasks of the same user (extending
it further could introduce potential security risks). I also think
that the root user could have access to all clocks in the system.

This kind of information can be retrieved via taskstats anyway
(for EVERY task in the system), but with only millisecond precision
(because of the security concerns mentioned above?).

Ad. B)
As far as I can tell, the only good way to obtain the elapsed time
of a process/thread is the taskstats interface. It's not THAT bad
(I agree with Stanislaw on that); it gives you some valuable pieces of information.
The precision is 2ms for CPU time and 1us for elapsed time. In fact,
with CONFIG_TASK_DELAY_ACCT enabled you can get CPU time with nanosecond precision
(it's not compiled into my Ubuntu 9.10 kernel, but it is on one Debian machine
I have somewhere). Another exotic way to get CPU time is to enable CONFIG_SCHEDSTATS
and read the first number in /proc/[tid]/schedstat. Interestingly, this is available
by default on my Ubuntu box but not on the previously mentioned Debian one :).

The most portable way would be to use taskstats (it's in both kernels... :) ).
I didn't like the CPU time precision, though, nor the messy code needed to use
the netlink interface. Moreover, to get the best available precision
I would have to use POSIX clocks for CPU time (assuming change A were accepted!)
and taskstats for WALL time (whose precision would still be only 1us). I didn't
like this idea at all.

That's why I started to dig into the kernel a little bit. After some time I found
an unused slot in clockid_t which would perfectly fit an additional clock. What I like
about this interface:
1) clean and simple
2) nanosecond precision
3) cheap, compared to taskstats
4) unified access to the 2 important clocks of a process: the CPU clock and the WALL clock
Another nice thing is that you can clock_nanosleep on such a clock. I have this kind
of scenario in mind: I control a process and, say, want to kill it after 1 sec
(because it is only allowed to run for that amount of time). This is easily and robustly
done with this interface: you just sleep on the WALL time clock of that process until
the absolute time of 1s. Sadly, right now you can't do it precisely and correctly at the same time.

I agree that these problems could be addressed by giving access to the start_time
field, as Stanislaw suggested. Adding new fields to taskstats with the same meaning
but higher precision is of course a terrible idea.
I simply felt that adding a new clock type is a nice and consistent approach.

That's it.
