Re: [PATCH v17 3/6] crash: add generic infrastructure for crash hotplug support

From: Eric DeVolder
Date: Fri Jan 20 2023 - 16:25:18 EST



On 1/19/23 15:31, Thomas Gleixner wrote:
Eric!

On Wed, Jan 18 2023 at 16:35, Eric DeVolder wrote:
CPU and memory change notifications are received in order to
regenerate the elfcorehdr.

To support cpu hotplug, a callback is registered to capture the
CPUHP_AP_ONLINE_DYN online and offline events via
cpuhp_setup_state_nocalls().

This sentence does not make sense. The callback is not registered to
capture CPUHP_AP_ONLINE_DYN events >
What this does is: It installs a dynamic CPU hotplug state with
callbacks for online and offline. These callbacks store information
about a CPU coming up and going down. Right?

I agree, the wording is wrong; this code taps into that state, as you suggest, in order to handle the online and offline events.


But why are they required and what's the value?

This changelog tells WHAT it does and not WHY. I can see the WHAT from
the patch itself.

Don't tell me the WHY is in the cover letter. The cover letter is not
part of the commits and changelogs have to be self contained.

Now let me cite from your cover letter:

When the kdump service is loaded, if a CPU or memory is hot
un/plugged, the crash elfcorehdr, which describes the CPUs
and memory in the system, must also be updated, else the resulting
vmcore is inaccurate (eg. missing either CPU context or memory
regions).
I'll work to improve the wording and why for the next iteration.


The CPU hotplug state you are using for this is patently inaccurate
too. With your approach the CPU is tracked as online very late in the
hotplug process and tracked as offline very early on unplug.

So if the kernel crashes before/after the plug/unplug tracking event
then your recorded state is bogus and given the amount of callbacks
between the real online/offline and the recording point there is a
pretty large window.

You can argue that this is better than the current state and considered
good enough for whatever reason, but such information wants to be in the
changelog, no?
I agree! I admit that CPUHP_AP_ONLINE_DYN may (is) not the best choice. I did spend time looking at the cpu hotplug infrastructure, but did not learn a better/correct way. Fwiw: https://lore.kernel.org/lkml/20211118174948.37435-1-eric.devolder@xxxxxxxxxx/:

"The second problem is the use of CPUHP_AP_ONLINE_DYN. The
cpuhp_setup_state_nocalls() is invoked with parameter
CPUHP_AP_ONLINE_DYN. While this works, when a CPU is being unplugged,
the CPU still shows up in foreach_present_cpu() during the
regeneration of the elfcorehdr, thus the need to explicitly check and
exclude the soon-to-be offlined CPU in crash_prepare_elf64_headers().
Perhaps if value(s) new/different than CPUHP_AP_ONLINE_DYN to
cpuhp_setup_state() was utilized, then the offline cpu would no longer
be in foreach_present_cpu(), and this change could be eliminated. I do
not understand cpuhp_setup_state() well enough to choose, or create,
appropriate value(s)."

The problem described (and worked around in this patch series) is the behavior/window you point out. I'd prefer to narrow the window, if possible. The states/values I tried did not work; any suggestions for a more appropriate state/value would be most welcomed!


Thanks,

tglx

Hint: The requirements for changelogs are well documented in Documentation/process/


Thomas, thank you for looking at this!
eric