Re: Bug 218665 - nohz_full=0 prevents kernel from booting

From: Linux regression tracking (Thorsten Leemhuis)
Date: Tue Apr 16 2024 - 02:08:26 EST


On 12.04.24 04:57, Bjorn Andersson wrote:
> On Wed, Apr 10, 2024 at 11:18:04AM +0200, Linux regression tracking (Thorsten Leemhuis) wrote:
>> On 08.04.24 00:52, Bjorn Andersson wrote:
>>> On Tue, Apr 02, 2024 at 10:17:16AM +0200, Linux regression tracking (Thorsten Leemhuis) wrote:
>>>>
>>>> Tejun, apparently it's caused by a change of yours.
>>>> Quoting from https://bugzilla.kernel.org/show_bug.cgi?id=218665 :
>>>>
>>>>> booting the current kernel (6.9.0-rc1, master/712e1425) on x86_64
>>>>> with nohz_full=0 causes a page fault and prevents the kernel from
>>>>> booting.
>>> [...]

Tejun, I got a bit lost here. Can you help me out please?

I'm currently assuming that these two reports have the same cause:
https://lore.kernel.org/all/20240402105847.GA24832@xxxxxxxxxx/T/#u
https://bugzilla.kernel.org/show_bug.cgi?id=218665

And that both will be fixed by this patch from Oleg Nesterov:
https://lore.kernel.org/lkml/20240411143905.GA19288@xxxxxxxxxx/

But to me it looks like the issue below from Bjorn is different, even
if it is caused by the same change. Nevertheless, it looks like nobody
has looked into it since it was reported about two weeks ago. Or was
progress made and I just missed it?

>>> In addition to this report, I have finally bisected another regression
>>> to the same commit:
>>>
>>> I start neovim, suspend it with ^Z (i.e. SIGTSTP), start another
>>> neovim instance, and upon suspending that instance the same way all
>>> of userspace locks up - 100% reproducible.
>>>
>>> The kernel seems to continue to operate, and tapping the power button
>>> dislodges the lockup and I get a clean shutdown.
>>>
>>> This is seen on multiple Arm64 (Qualcomm) machines with upstream
>>> defconfig since commit '5797b1c18919 ("workqueue: Implement system-wide
>>> nr_active enforcement for unbound workqueues")'.
>>
>> Hmmm, I had hoped Tejun would reply and share an opinion on whether
>> these problems are related. But that didn't happen. :-/ So let me at
>> least ask one thing that might help to answer that question: is the
>> machine using CPU isolation, like the two other reports about problems
>> caused by this commit do (see
>> https://bugzilla.kernel.org/show_bug.cgi?id=218665 and
>> https://lore.kernel.org/all/20240402105847.GA24832@xxxxxxxxxx/ for
>> details)?
>
> No, this is a clean SMP system running stock arch/arm64/defconfig,
> booted with "clk_ignore_unused pd_ignore_unused audit=0" as the command
> line.
>
> Regards,
> Bjorn

Ciao, Thorsten