Re: [PATCH v4] posix-timers: Prefer delivery of signals to the current thread

From: Dmitry Vyukov
Date: Fri Jan 27 2023 - 01:59:10 EST


On Thu, 26 Jan 2023 at 20:57, Thomas Gleixner <tglx@xxxxxxxxxxxxx> wrote:
>
> On Thu, Jan 26 2023 at 18:51, Marco Elver wrote:
> > On Thu, 26 Jan 2023 at 16:41, Dmitry Vyukov <dvyukov@xxxxxxxxxx> wrote:
> >>
> >> Prefer to deliver signals to the current thread if SIGEV_THREAD_ID
> >> is not set. We used to prefer the main thread, but delivering to
> >> the current thread is faster, allows sampling the actual thread
> >> activity for CLOCK_PROCESS_CPUTIME_ID timers, and does not change
> >> the semantics (since we queue into shared_pending, any thread may
> >> receive the signal in both cases).
> >
> > Reviewed-by: Marco Elver <elver@xxxxxxxxxx>
> >
> > Nice - and given the test, hopefully this behaviour won't regress in future.
>
> The test does not tell us much. It just waits until each thread has
> received a signal once, which can take quite a while. It says nothing
> about the distribution of the signals, which can be randomly skewed
> towards a few threads.
>
> Also, for real-world use cases where you have multiple threads with
> different periods and runtimes per period, I have a hard time
> understanding how that signal measures anything useful.
>
> The most time-consuming thread might actually trigger rarely, while
> other short threads end up being the ones which cause the timer to fire.
>
> What's the usefulness of this information?
>
> Thanks,
>
> tglx

Hi Thomas,

Our goal is to sample what threads are doing in production, at low
frequency and with low overhead. We did not find any reasonable
existing way of doing this on Linux today, as outlined in the RFC
version of the patch (other solutions are either much more complex
and/or incur higher memory and/or CPU overhead):
https://lore.kernel.org/all/20221216171807.760147-1-dvyukov@xxxxxxxxxx/
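
For concreteness, this is roughly the setup we have in mind. It is
only a sketch, not our actual production code: the choice of SIGPROF,
the 1-second period, and last_sampled_tid are illustrative, and error
handling is omitted (build with -lrt on older glibc):

#include <signal.h>
#include <time.h>
#include <unistd.h>
#include <sys/syscall.h>

static volatile pid_t last_sampled_tid;

static void handler(int sig, siginfo_t *info, void *ctx)
{
        /* The interrupted thread is the sample. With this patch it is
         * the thread that was actually consuming the CPU time; a real
         * profiler would record a stack trace here. */
        last_sampled_tid = syscall(SYS_gettid);
}

int main(void)
{
        struct sigaction sa = {
                .sa_sigaction = handler,
                .sa_flags = SA_SIGINFO | SA_RESTART,
        };
        sigaction(SIGPROF, &sa, NULL);

        /* No SIGEV_THREAD_ID: the signal is queued into the
         * process-wide shared_pending, and the kernel picks the
         * delivery thread. */
        struct sigevent sev = {
                .sigev_notify = SIGEV_SIGNAL,
                .sigev_signo = SIGPROF,
        };
        timer_t timer;
        timer_create(CLOCK_PROCESS_CPUTIME_ID, &sev, &timer);

        /* O(seconds) interval: one sample per second of process CPU
         * time consumed by all threads combined. */
        struct itimerspec its = {
                .it_value.tv_sec = 1,
                .it_interval.tv_sec = 1,
        };
        timer_settime(timer, 0, &its, NULL);

        for (;;)        /* stand-in for real application work */
                ;
}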

This sampling does not need to be as precise as a CPU profiler would
require; high precision generally comes with more complexity and
overhead. The emphasis is on production use and low overhead. Consider
that we sample at O(seconds) intervals, so some activities (those
taking less than a second) can go unsampled whatever we do here. But
on the other hand, the intention is to use this over billions of CPU
hours, so on a global scale we still observe more or less everything.

Currently practically all signals are delivered to the main thread,
and the added test does not pass (at least I couldn't wait long enough
for it to finish). After this change the test passes quickly (within a
second for me). Testing the actual distribution without flaky failures
is very hard in unit tests; after rounds of complaints and deflaking,
such tests usually degrade into roughly what this test is doing --
checking that all threads get at least something.
If we wanted to test ultimate fairness, we would need to start with
the scheduler itself: if threads don't get fair fractions of CPU time,
then signals won't be evenly distributed either. I am not sure there
are unit tests for the scheduler that ensure this in all
configurations (e.g. an uneven ratio of runnable threads to CPUs,
running in VMs, etc.). I agree this test is not perfect, but as I
said, it does not pass now, so it is useful and will detect a future
regression in this logic. It ensures that running threads eventually
get signals.
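
An illustrative reconstruction of what the test checks (a sketch, not
the actual selftest from the patch; NTHREADS, the 10ms period, my_idx
and all_seen are made up here): every busy-spinning thread must
eventually observe the signal, so under the old main-thread-only
delivery the loop below never terminates:

#include <pthread.h>
#include <signal.h>
#include <stdatomic.h>
#include <time.h>

#define NTHREADS 8

static atomic_int seen[NTHREADS];
static __thread int my_idx = -1;        /* stays -1 in the main thread */

static void handler(int sig)
{
        /* Mark the thread the kernel chose for delivery. */
        if (my_idx >= 0)
                atomic_store(&seen[my_idx], 1);
}

static int all_seen(void)
{
        for (int i = 0; i < NTHREADS; i++)
                if (!atomic_load(&seen[i]))
                        return 0;
        return 1;
}

static void *spin(void *arg)
{
        my_idx = (int)(long)arg;
        while (!all_seen())     /* burn CPU until every thread got a signal */
                ;
        return NULL;
}

int main(void)
{
        signal(SIGPROF, handler);

        struct sigevent sev = {
                .sigev_notify = SIGEV_SIGNAL,
                .sigev_signo = SIGPROF,
        };
        timer_t timer;
        timer_create(CLOCK_PROCESS_CPUTIME_ID, &sev, &timer);
        struct itimerspec its = {
                .it_value.tv_nsec = 10 * 1000 * 1000,   /* 10ms period */
                .it_interval.tv_nsec = 10 * 1000 * 1000,
        };
        timer_settime(timer, 0, &its, NULL);

        pthread_t th[NTHREADS];
        for (long i = 0; i < NTHREADS; i++)
                pthread_create(&th[i], NULL, spin, (void *)i);
        for (int i = 0; i < NTHREADS; i++)
                pthread_join(th[i], NULL);
        /* Reaching here means every thread received at least one signal. */
        return 0;
}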

But regardless of our motivation, this change looks like an
improvement in general. Consider performance alone: why would we wake
another thread, possibly send an IPI and evict caches? Sending the
signal to the thread that overflowed the counter also looks
reasonable, and for some programs it may actually give a good picture.
Say thread A runs for a prolonged time, then thread B runs. The
program will first get signals in thread A and then in thread B
(instead of getting them on an unrelated thread).