Re: [PATCH 0/5] improve handling of errors returned by kthread_park()
From: Ulrich Obergfell
Date: Wed Sep 30 2015 - 06:54:25 EST
Andrew,
> ... what inspired this patchset?
> Are you experiencing kthread_park() failures in practice?
I did not experience kthread_park() failures in practice. Looking at
watchdog_park_threads() from 81a4beef91ba4a9e8ad6054ca9933dff7e25ff28
I realized that there is a theoretical corner case which would not be
handled well. Let's assume that kthread_park() would return an error
in the following flow of execution (the user changes watchdog_thresh).
  proc_watchdog_thresh
    set_sample_period()
    //
    // The watchdog_thresh and sample_period variables are now set to
    // the new value.
    //
    proc_watchdog_update
      watchdog_enable_all_cpus
        update_watchdog_all_cpus
          watchdog_park_threads
Let's say the system has eight CPUs and that kthread_park() failed to
park watchdog/4. In this example watchdog/0 .. watchdog/3 are already
parked and watchdog/5 .. watchdog/7 are not parked yet (we don't know
exactly what happened to watchdog/4). watchdog_park_threads() unparks
the threads if kthread_park() of one thread fails.
    for_each_watchdog_cpu(cpu) {
        ret = kthread_park(per_cpu(softlockup_watchdog, cpu));
        if (ret)
            break;
    }
    if (ret) {
        for_each_watchdog_cpu(cpu)
            kthread_unpark(per_cpu(softlockup_watchdog, cpu));
    }
watchdog/0 .. watchdog/3 will pick up the new watchdog_thresh value
when they are unparked (please see the watchdog_enable() function),
whereas watchdog/5 .. watchdog/7 will continue to use the old value
for the hard lockup detector and begin using the new value for the
soft lockup detector (kthread_unpark() sees watchdog/5 .. watchdog/7
in the unparked state, so it skips these threads). The inconsistency
which results from using different watchdog_thresh values can cause
unexpected behaviour of the lockup detectors (e.g. false positives).
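To make the corner case concrete, here is a minimal user-space sketch of
the rollback loop quoted above. It is not kernel code: the model_* names,
the fail_cpu knob and the single thresh_in_use field are hypothetical
simplifications. In particular, the sketch assumes that the thread whose
park attempt fails simply stays unparked (in reality we don't know its
state), and it collapses the hard/soft lockup detector distinction into
one per-thread threshold.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 8

/* Hypothetical user-space model of one watchdog thread; not kernel code. */
struct model_thread {
    bool parked;
    int thresh_in_use;  /* threshold the thread last configured itself with */
};

static int watchdog_thresh;                 /* models the global sysctl value */
static struct model_thread threads[NR_CPUS];
static int fail_cpu = -1;                   /* CPU whose park fails, -1 = none */

/* Start with every thread unparked and using the same threshold. */
static void model_init(int thresh)
{
    watchdog_thresh = thresh;
    for (int cpu = 0; cpu < NR_CPUS; cpu++) {
        threads[cpu].parked = false;
        threads[cpu].thresh_in_use = thresh;
    }
}

/* Models kthread_park(): nonzero return on the injected failure. */
static int model_park(int cpu)
{
    assert(cpu >= 0 && cpu < NR_CPUS);
    if (cpu == fail_cpu)
        return -1;      /* assumption: the thread is left unparked */
    threads[cpu].parked = true;
    return 0;
}

/* Models kthread_unpark(): only a parked thread re-reads the threshold. */
static void model_unpark(int cpu)
{
    assert(cpu >= 0 && cpu < NR_CPUS);
    if (!threads[cpu].parked)
        return;
    threads[cpu].parked = false;
    threads[cpu].thresh_in_use = watchdog_thresh;
}

/* Same shape as the rollback loop in watchdog_park_threads(). */
static int model_park_threads(void)
{
    int cpu, ret = 0;

    for (cpu = 0; cpu < NR_CPUS; cpu++) {
        ret = model_park(cpu);
        if (ret)
            break;
    }
    if (ret) {
        for (cpu = 0; cpu < NR_CPUS; cpu++)
            model_unpark(cpu);
    }
    return ret;
}

/* Counts threads whose threshold differs from the global value. */
static int model_count_stale(void)
{
    int stale = 0;

    for (int cpu = 0; cpu < NR_CPUS; cpu++)
        if (threads[cpu].thresh_in_use != watchdog_thresh)
            stale++;
    return stale;
}
```

Calling model_init() with the old threshold, then setting watchdog_thresh
to the new value and fail_cpu to 4 before calling model_park_threads(),
leaves threads 0 .. 3 on the new value and threads 4 .. 7 on the old one
in this model, which is the inconsistency described above.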
The new error handling that is introduced by this patch set aims to
handle the above corner case in a better way (this was my original
motivation to come up with a patch set). However, I also think that
_if_ kthread_park() would ever be changed in the future so that it
could return errors under various (other) conditions, the patch set
should prepare the watchdog code for this possibility.
Since I did not experience kthread_park() failures in practice, I
used some instrumentation to fake error returns from kthread_park()
in order to test the patches.
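For illustration only, one way such a fake error return could be wired
up is a thin wrapper around the park call. This is a user-space sketch
under my own assumptions, not the instrumentation that was actually
used: fake_fail_cpu, park_with_fault_injection() and stub_park() are
hypothetical names, and the error code is an arbitrary nonzero value.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical knob: CPU whose park call should fail, -1 disables it. */
static int fake_fail_cpu = -1;

/* Signature of the real park routine that the wrapper forwards to. */
typedef int (*park_fn)(void *thread);

/*
 * Returns a fake error for the selected CPU without calling the real
 * park routine; forwards to the real routine for every other CPU.
 */
static int park_with_fault_injection(park_fn real_park, void *thread, int cpu)
{
    assert(real_park != NULL);
    if (cpu == fake_fail_cpu)
        return -ENOSYS; /* arbitrary nonzero error chosen for the sketch */
    return real_park(thread);
}

/* Stand-in for the real park routine: counts calls and succeeds. */
static int stub_park(void *thread)
{
    int *calls = thread;

    (*calls)++;
    return 0;
}
```

With fake_fail_cpu set to 4, a call for CPU 3 reaches stub_park() and
returns 0, while a call for CPU 4 returns the fake error without
touching the thread, so the caller's error path can be exercised on a
system where the park call never fails by itself.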
Regards,
Uli
----- Original Message -----
From: "Andrew Morton" <akpm@xxxxxxxxxxxxxxxxxxxx>
To: "Ulrich Obergfell" <uobergfe@xxxxxxxxxx>
Cc: linux-kernel@xxxxxxxxxxxxxxx, dzickus@xxxxxxxxxx, atomlin@xxxxxxxxxx
Sent: Wednesday, September 30, 2015 1:30:36 AM
Subject: Re: [PATCH 0/5] improve handling of errors returned by kthread_park()
On Mon, 28 Sep 2015 22:44:07 +0200 Ulrich Obergfell <uobergfe@xxxxxxxxxx> wrote:
> The original watchdog_park_threads() function that was introduced by
> commit 81a4beef91ba4a9e8ad6054ca9933dff7e25ff28 takes a very simple
> approach to handle errors returned by kthread_park(): It attempts to
> roll back all watchdog threads to the unparked state. However, this
> may be undesired behaviour from the perspective of the caller which
> may want to handle errors as appropriate in its specific context.
> Currently, there are two possible call chains:
>
> - watchdog suspend/resume interface
>
>     lockup_detector_suspend
>       watchdog_park_threads
>
> - write to parameters in /proc/sys/kernel
>
>     proc_watchdog_update
>       watchdog_enable_all_cpus
>         update_watchdog_all_cpus
>           watchdog_park_threads
>
> Instead of 'blindly' attempting to unpark the watchdog threads if a
> kthread_park() call fails, the new approach is to disable the lockup
> detectors in the above call chains. Failure becomes visible to the
> user as follows:
>
> - error messages from lockup_detector_suspend()
>   or watchdog_enable_all_cpus()
>
> - the state that can be read from /proc/sys/kernel/watchdog_enabled
>
> - the 'write' system call in the latter call chain returns an error
>
hm, you made me look at kthread parking. Why does it exist? What is a
"parked" thread anyway, and how does it differ from, say, a sleeping
one? The 2a1d446019f9a5983ec5a335b changelog is pretty useless and the
patch added no useful documentation, sigh.
Anyway... what inspired this patchset? Are you experiencing
kthread_park() failures in practice? If so, what is causing them? And
what is the user-visible effect of these failures? This is all pretty
important context for such a patchset.