CPU POSIX timers livelock

From: Petr Tesarik
Date: Fri May 02 2008 - 11:05:55 EST


Hello,

there seems to be a possible livelock in the CPU timers (actually, we've
experienced it once already). The analysis lead to the conclusion that
fixing this properly will be a bit harder.

Problem Statement

1. Any process can set ITIMER_PROF as short as 1 timer tick
2. If the process is CPU-bound (e.g. a number-crunching application),
the timer will expire with every timer tick
3. If the process has N threads and each thread runs on its own
CPU, the timer will expire N times per timer tick

Now, this can become quite interesting on a larger system. For instance,
with 128 CPUs and a (conservative) setting of HZ 300, any process can
effectively ask to be sent a SIGPROF every 26 usec (1/300/128 s). Pretty
fast, but it can get worse if you consider the pitfalls of a NUMA
architecture.

Whenever the timer expires, a new expiration time must be computed for
each thread in the thread group. The values are stored in task_struct of
each thread, and IIRC those are kept local to the CPU where the thread
is running (quite logically). Put in another way, walking all threads
means touching remote memory except for the few task_struct which are
local to the recomputing CPU. In the 128-CPU example, if each node has 2
CPUs (think Altix), only 2 accesses are to the local node and 126
accesses are to remote memory. These are already quite expensive, but it
gets even worse. The timer interrupt is generated locally for each CPU,
so if the process is spread over all of them (see above), all CPUs
recompute the timer (hint: write access), thus constantly invalidating
local caches of the remote memory.

And those computations cannot run in parallel, of course, because
everything is serialized on sighand->siglock (see run_posix_cpu_timers).
So, each time one CPU finishes walking the threads, there's already a
queue of all the other CPUs waiting to do the same again. For this kind
of load, those 26 usec (approx. 41600 CPU cycles on a 1.6 GHz CPU) may
be well insufficient. Even more so, if somebody else occasionally takes
write lock on tasklist_lock. It may so happen that the system spends all
CPU cycles re-computing the profiling timers over and over again and
never make any progress. Note this is partly caused by the fact that the
time needed to handle the profiling timer is itself also counted to the
profiling time.

This simply does not scale for large systems (and I was not even
considering those really large 1024p ones).

Ideas

Any ideas for improvement? Obviously, storing per-process expiration
times in one place instead of dividing the interval by the number of
currently running threads would help here. But it would spoil the fast
path (when there are no timers at all). Avoiding that multiple threads
re-compute the same values might also help (although I've got no idea
how to implement this).

Any more comments (before I start hacking something you'll NAK)?

BTW did POSIX really mean to allow an indefinitely small interval for
the SIGPROF?

Petr Tesarik
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/