Re: [PATCH v2] cpu/hotplug: Do not bail-out in DYING/STARTING sections

From: Thorsten Leemhuis
Date: Mon Jul 04 2022 - 06:04:28 EST


On 13.06.22 15:37, Vincent Donnefort wrote:
> On Mon, Jun 13, 2022 at 02:36:18PM +0200, Thomas Gleixner wrote:
>> Vincent,
>>
>> On Mon, May 23 2022 at 17:05, Vincent Donnefort wrote:
>>> +static int _cpuhp_invoke_callback_range(bool bringup,
>>> +                                        unsigned int cpu,
>>> +                                        struct cpuhp_cpu_state *st,
>>> +                                        enum cpuhp_state target,
>>> +                                        bool nofail)
>>>  {
>>>      enum cpuhp_state state;
>>> -    int err = 0;
>>> +    int ret = 0;
>>>
>>>      while (cpuhp_next_state(bringup, &state, st, target)) {
>>> +        int err;
>>> +
>>>          err = cpuhp_invoke_callback(cpu, state, bringup, NULL, NULL);
>>> -        if (err)
>>> +        if (!err)
>>> +            continue;
>>> +
>>> +        if (nofail) {
>>> +            pr_warn("CPU %u %s state %s (%d) failed (%d)\n",
>>> +                    cpu, bringup ? "UP" : "DOWN",
>>> +                    cpuhp_get_step(st->state)->name,
>>> +                    st->state, err);
>>> +            ret = -1;
>>
>> I have a hard time mapping this to the changelog:
>>
>>> those sections. In that case, there's nothing the hotplug machinery can do,
>>> so let's just proceed and log the failures.
>>
>> That's still returning an error code at the end. Confused.
>
> It is, but after returning from this function, only a warning will be raised
> (cpuhp_invoke_callback_range_nofail()) instead of stopping the HP machinery
> (cpuhp_invoke_callback_range()). How about this changelog?
>
> The DYING/STARTING callbacks are not expected to fail. However, as reported by
> Derek, drivers such as tboot are still free to return errors within those
> sections, which halts the hot(un)plug and leaves the CPU in an unrecoverable
> state.
>
> No rollback being possible there, let's only log the failures and proceed
> with the following steps. This restores the hotplug behaviour prior to
> commit 453e41085183 ("cpu/hotplug: Add cpuhp_invoke_callback_range()").
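
To make the distinction above concrete, here is a minimal user-space sketch
of the two behaviours: a regular range that stops at the first failing step,
and a "nofail" range that logs the failure and keeps going. This is only an
illustration under assumptions, not the kernel code: step_fn, invoke_range(),
invoke_range_failing(), invoke_range_nofail() and the two sample steps are
hypothetical names, and fprintf() stands in for the kernel's logging.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for a hotplug step callback; not a kernel type. */
typedef int (*step_fn)(unsigned int cpu);

/*
 * Invoke steps[0..nsteps) in order and count the invocations in *ran.
 * nofail == false: stop at the first failure and return its error code.
 * nofail == true:  log each failure, keep going, return -1 if any failed.
 */
static int invoke_range(const step_fn *steps, int nsteps, unsigned int cpu,
                        bool nofail, int *ran)
{
    int ret = 0;

    assert(steps && ran);
    *ran = 0;

    for (int i = 0; i < nsteps; i++) {
        int err = steps[i](cpu);

        (*ran)++;
        if (!err)
            continue;

        if (nofail) {
            fprintf(stderr, "CPU %u step %d failed (%d)\n", cpu, i, err);
            ret = -1;
        } else {
            ret = err;
            break;
        }
    }

    return ret;
}

/* Regular sections: the caller gets the error and can stop or roll back. */
static int invoke_range_failing(const step_fn *steps, int nsteps,
                                unsigned int cpu, int *ran)
{
    return invoke_range(steps, nsteps, cpu, false, ran);
}

/* DYING/STARTING-like sections: the error is only reported, never acted on. */
static void invoke_range_nofail(const step_fn *steps, int nsteps,
                                unsigned int cpu, int *ran)
{
    if (invoke_range(steps, nsteps, cpu, true, ran))
        fprintf(stderr, "warning: CPU %u: some steps failed\n", cpu);
}

/* Sample steps for experimenting; -16 is an arbitrary error value. */
static int step_ok(unsigned int cpu)   { (void)cpu; return 0; }
static int step_fail(unsigned int cpu) { (void)cpu; return -16; }
```

With the steps { step_ok, step_fail, step_ok }, the failing variant returns
the error after two invocations, while the nofail variant runs all three and
only prints warnings. That matches the "returns an error code, but nobody
stops because of it" behaviour discussed above.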

Vincent, what's up here? Did that patch make any progress? It looks to me
like things stalled here, but maybe I'm missing something. I'm asking
because that patch was supposed to fix a regression I'm tracking.

BTW, if you respin this patch, could you please add proper 'Link:' tags
pointing to all reports about this issue? For example:

Link: https://bugzilla.kernel.org/show_bug.cgi?id=215867

These tags are important, as they allow others to look into the
backstory now and years from now. That is why they should be added in
cases like this, as Documentation/process/submitting-patches.rst and
Documentation/process/5.Posting.rst explain in more detail.
Additionally, my regression tracking bot 'regzbot' relies on these tags
to automatically connect reports with patches that are posted or
committed to fix the reported issue.

Ciao, Thorsten