Re: [PATCH] kernel: count warnings and make count accessible to userspace

From: Petr Mladek
Date: Wed Jan 19 2022 - 07:01:19 EST


Adding Andrew into Cc. Most changes in panic.c go via his tree.

On Tue 2022-01-18 06:04:31, John Sperbeck wrote:
> When testing, it's common to consider a warning to be a test failure,
> but it's currently awkward to determine which of multiple sequential
> tests is responsible for triggering a warning. Scraping dmesg or
> /var/log/messages is somewhat expensive and error-prone. Setting
> panic_on_warn is reliable, but spoils test runs for minor issues.
> Looking at the taint bit is also reliable, but only works for a single
> warning.
>
> We can track the warning count and expose it as a sysfs file. Test
> infrastructures can snapshot the value before and after a test. If
> the value changes, they can do more expensive things like extracting
> logs.

The counter makes sense. It might be useful even for normal debugging.
It would be nice to show the value in the log.

> Signed-off-by: John Sperbeck <jsperbeck@xxxxxxxxxx>
> ---
> kernel/panic.c | 5 +++++
> 1 file changed, 5 insertions(+)
>
> diff --git a/kernel/panic.c b/kernel/panic.c
> index cefd7d82366f..5262c2a0ebf4 100644
> --- a/kernel/panic.c
> +++ b/kernel/panic.c
> @@ -571,6 +571,8 @@ struct warn_args {
> va_list args;
> };
>
> +static atomic_t __maybe_unused warn_counter;
> +
> void __warn(const char *file, int line, void *caller, unsigned taint,
> struct pt_regs *regs, struct warn_args *args)
> {
> @@ -612,6 +614,8 @@ void __warn(const char *file, int line, void *caller, unsigned taint,
>
> /* Just a warning, don't kill lockdep. */
> add_taint(taint, LOCKDEP_STILL_OK);
> +
> + atomic_inc(&warn_counter);
> }
>
> #ifndef __WARN_FLAGS
> @@ -667,6 +671,7 @@ static __init int register_warn_debugfs(void)
> /* Don't care about failure */
> debugfs_create_file_unsafe("clear_warn_once", 0200, NULL, NULL,
> &clear_warn_once_fops);
> + debugfs_create_atomic_t("warn_count", 0444, NULL, &warn_counter);

Is the sysfs interface really important for you use case, please?
Would the value in the log be enough?

Anyway, we already count the number WARN() reports. It is quite hidden
and hashed in init_oops_id()/print_oops_end_marker().

A solution would be to make this hidden counter more explicit.
Something like:

diff --git a/kernel/panic.c b/kernel/panic.c
index cefd7d82366f..8ac19124ceb4 100644
--- a/kernel/panic.c
+++ b/kernel/panic.c
@@ -537,13 +537,12 @@ void oops_enter(void)
* 64-bit random ID for oopses:
*/
static u64 oops_id;
+static int oops_cnt;

static int init_oops_id(void)
{
if (!oops_id)
get_random_bytes(&oops_id, sizeof(oops_id));
- else
- oops_id++;

return 0;
}
@@ -552,7 +551,9 @@ late_initcall(init_oops_id);
static void print_oops_end_marker(void)
{
init_oops_id();
- pr_warn("---[ end trace %016llx ]---\n", (unsigned long long)oops_id);
+ oops_cnt++;
+ pr_warn("---[ end trace %016llx:%d ]---\n",
+ (unsigned long long)oops_id, oops_cnt);
}

/*


The report would like like:

[ 1871.476204] WARNING: CPU: 2 PID: 2003 at samples/livepatch/livepatch-sample.c:60 livepatch_init+0x11/0x20 [livepatch_sample]
[ 1871.476905] Modules linked in: livepatch_sample(EK+) [last unloaded: livepatch_sample]
[ 1871.477509] CPU: 2 PID: 2003 Comm: modprobe Tainted: G W E K 5.16.0-default+ #312
[ 1871.478175] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014
[ 1871.478858] RIP: 0010:livepatch_init+0x11/0x20 [livepatch_sample]
[...]
[ 1871.489801] hardirqs last disabled at (9188): [<ffffffffb217aa4e>] vprintk_emit+0x21e/0x2b0
[ 1871.489803] softirqs last enabled at (9096): [<ffffffffb3000364>] __do_softirq+0x364/0x4ab
[ 1871.489805] softirqs last disabled at (9083): [<ffffffffb20efb5d>] irq_exit_rcu+0x10d/0x120
[ 1871.489807] ---[ end trace a19f0f55262cfcc8:2 ]---

Best Regards,
Petr