Re: [PATCH v3 0/2] Reduce CPU consumption after panic

From: Carlos Bilbao
Date: Wed Apr 30 2025 - 16:13:04 EST

Next message: Zi Yan: "Re: [PATCH v5 2/4] mm: document (m)THP defer usage"
Previous message: Roman Kisel: "Re: [PATCH hyperv-next v2] arch/x86: Provide the CPU number in the wakeup AP callback"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

Hello,

On 4/30/25 03:48, Peter Zijlstra wrote:
> On Tue, Apr 29, 2025 at 03:52:05PM -0500, Carlos Bilbao wrote:
>> Hello,
>>
>> On 4/29/25 17:10, Peter Zijlstra wrote:
>>> On Tue, Apr 29, 2025 at 03:32:56PM -0500, Carlos Bilbao wrote:
>>>
>>>> Yes, the machine is effectively dead, but as things stand today,
>>>> it's still drawing resources unnecessarily.
>>>>
>>>> Who cares? An example, as mentioned in the cover letter, is Linux running
>>>
>>> Ah, see, I didn't have no cover letter, only akpm's reply.
>>>
>>>> in VMs. Imagine a scenario where customers are billed based on CPU usage --
>>>> having panicked VMs spinning in useless loops wastes their money. In shared
>>>> envs, those wasted cycles could be used by other processes/VMs. But this
>>>> is as much about the cloud as it is for laptops/embedded/anywhere -- Linux
>>>> should avoid wasting resources wherever possible.
>>>
>>> So I don't really buy the laptop and embedded case, people tend to look
>>> at laptops when open, and get very impatient when they don't respond.
>>> Embedded things really should have a watchdog.
>>>
>>> Also, should you not be using panic_timeout to auto reboot your machine
>>> in all these cases?
>>>
>>> In any case, the VM nonsense, do they not have a virtual watchdog to
>>> 'reap' crashed VMs or something?
>>
>> The key word here is "should." Should embedded systems have a watchdog?
>> Maybe. Should I've auto reboot set? Maybe. Perhaps I don’t want to reboot
>> until I’ve root-caused the crash.
>
> Install a kdump kernel, or log your serial line :-)
>
>> But my patch set isn’t about “shoulds.”
>> What I’m discussing here is (1) the default Linux behavior,
>
> Well, the default behaviour works for the 'your own physical machine'
> thing just fine -- and that has always been the default use-case.
>
> Nobody is going to be sitting there staring at a panic screen for ages.
>
> All the other weirdo cases like embedded and VMs, they're just that,
> weirdos and they can keep their pieces :-)
>
>> and (2)
>> providing people with the flexibility to do what THEY think they should do,
>> not what you think they should do.
>
> Well, there are a ton of options already. Like said, we have watchdogs,
> reboots, crash kernels and all sorts. Why do we need more?
>
> All that said... the default more or less does for(;;) { mdelay(100) },
> if you have a modern chip that should not end up using much power at
> all. That should end up in delay_halt_tpause() or delay_halt_mwaitx()
> (depending on you being on Intel or AMD). And spend most its time in
> deep idle states.
>
> Is something not working?

Well, in my experiments, that’s not what happened -- halting the CPU in VMs
reduced CPU usage by around 70%.

How would folks feel about adding something like
/proc/sys/kernel/halt_after_panic, disabled by default? It would help in
the Linux use cases I care about (e.g., virtualized environments), without
affecting others.
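
A minimal userspace sketch of how such an opt-in knob could select the
post-panic behaviour. This is an illustration only, not kernel code and not
what the patch set implements. The names choose_panic_wait() and
parse_halt_after_panic() and the enum values are hypothetical; the only
detail taken from this thread is that the knob would be disabled by default.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical model of the two post-panic behaviours discussed in this
 * thread: the existing delay loop, or halting the CPU.
 */
enum panic_wait {
	PANIC_WAIT_DELAY_LOOP,	/* default: roughly for (;;) { mdelay(100) } */
	PANIC_WAIT_HALT,	/* opt-in: halt the CPU after panic */
};

/*
 * Parse the text a sysctl file would contain (e.g. "1\n").
 * Anything unparsable falls back to 0, i.e. the knob stays disabled.
 */
static int parse_halt_after_panic(const char *text)
{
	int value = 0;

	if (text == NULL || sscanf(text, "%d", &value) != 1)
		return 0;
	return value != 0;
}

/* Map the (assumed) knob value to the behaviour it would select. */
static enum panic_wait choose_panic_wait(int halt_after_panic)
{
	return halt_after_panic ? PANIC_WAIT_HALT : PANIC_WAIT_DELAY_LOOP;
}
```

With the knob left at 0, choose_panic_wait() keeps the current delay loop,
so existing setups would see no change.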

Thanks,
Carlos
