Re: [PATCH v1] mm/gup: remove (VM_)BUG_ONs

From: Vlastimil Babka
Date: Mon Jun 09 2025 - 05:57:59 EST


On 6/7/25 8:00 PM, John Hubbard wrote:
> On 6/7/25 6:53 AM, Lorenzo Stoakes wrote:
>>
>> Well that is simpler :)
>>
>> I have encountered situations where I've had more than one and needed
>> 2nd+ but it is rare as you say.
>>
>> My late night incoherent babbling yesterday was perhaps because I
>> misunderstood David/John as to what they encountered in the past... maybe
>> they can clarify...
>
> I've debugged lots of production systems, often these were large HPC
> clusters and supercomputers. I've seen:
>
> a) Long up-times, with (of course!) relatively small dmesg buffer sizes,
> so that early logs are long gone. This means that WARN_ON_ONCE() is
> quite often gone (overwritten). This is common.

Isn't there something like journald storing them permanently? I think
trying too hard in the kernel to provide this "recall the first warning"
capability is suboptimal if userspace can handle it. I think there are
two main scenarios:

- the warning is indeed not fatal - userspace can likely save it
- it's (part of) something fatal - the system will crash before it
  disappears from the ring buffer

> The worst part is that if you go to reproduce a problem, you don't
> see the next warning in the logs!! This is devastating, especially if
> the site makes it hard to ask for a system reboot. (If you have
> ~20,000 nodes in the cluster, a reboot is not a small affair.)

Assuming you know how to reproduce the problem... I wonder if it would
help to have a way (a sysctl?) to re-arm all the _ONCE warnings.
Hopefully that wouldn't be too hard to do?
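
To make the re-arm idea concrete, here is a rough userspace sketch, not
kernel code. All names in it (once_site, warn_on_once(),
rearm_all_once_sites()) are made up for illustration. It assumes every
_ONCE call site has a "fired" flag that can be reached from one place,
so a single knob can clear them all:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical model: one entry per _ONCE call site. */
#define MAX_ONCE_SITES 16

struct once_site {
	const char *name;	/* what to print when the site fires */
	bool fired;		/* set after the first report */
};

static struct once_site once_sites[MAX_ONCE_SITES];
static size_t nr_once_sites;
static unsigned int nr_printed;	/* how many warnings were actually printed */

/* Register a call site so the re-arm knob can find its flag later. */
static struct once_site *once_site_register(const char *name)
{
	struct once_site *site;

	assert(nr_once_sites < MAX_ONCE_SITES);
	site = &once_sites[nr_once_sites++];
	site->name = name;
	site->fired = false;
	return site;
}

/*
 * Report only the first time cond is true for this site.
 * Returns cond, so it can still be used inside an if ().
 */
static bool warn_on_once(struct once_site *site, bool cond)
{
	if (cond && !site->fired) {
		site->fired = true;
		nr_printed++;
		fprintf(stderr, "WARNING: %s\n", site->name);
	}
	return cond;
}

/* The "sysctl" part: clear every flag so each site can report again. */
static void rearm_all_once_sites(void)
{
	for (size_t i = 0; i < nr_once_sites; i++)
		once_sites[i].fired = false;
}
```

With this, a site that has already fired stays silent until
rearm_all_once_sites() is called, after which the next hit is reported
again. If I remember correctly, mainline has a debugfs file called
clear_warn_once that works along similar lines, so this might be mostly
a matter of making it more widely known.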