Re: [BUG] Deadlock due to interactions of block, RCU, and cpu offline

From: Jeffrey Hugo
Date: Sun Aug 20 2017 - 15:31:42 EST


On 6/29/2017 6:18 PM, Paul E. McKenney wrote:
On Thu, Jun 29, 2017 at 10:29:12AM -0600, Jeffrey Hugo wrote:
On 6/27/2017 6:11 PM, Paul E. McKenney wrote:
On Tue, Jun 27, 2017 at 04:32:09PM -0600, Jeffrey Hugo wrote:
On 6/22/2017 9:34 PM, Paul E. McKenney wrote:
On Wed, Jun 21, 2017 at 09:18:53AM -0700, Paul E. McKenney wrote:
No worries, and I am very much looking forward to seeing the results of
your testing.

And please see below for an updated patch based on LKML review and
more intensive testing.


I spent some time on this today. It didn't go as I expected. I
validated that the issue is reproducible as before on 4.11 and on
4.12-rc1 through 4.12-rc4. However, the version of stress-ng that I was
using ran into constant errors starting with rc5, making it nearly
impossible to make progress toward reproduction. Upgrading stress-ng to
tip fixes those errors; however, I've still been unable to repro the
lockup.

It's my unfounded suspicion that something went in between rc4 and
rc5 which changed the timing, and didn't actually fix the issue. I
will run the test overnight for 5 hours to try to repro.

The patch you sent appears to be based on linux-next, and appears to
have a number of dependencies which prevent it from cleanly applying
on anything current that I'm able to repro on at this time. Do you
want to provide a rebased version of the patch which applies to, say,
4.11? I could easily test that and report back.

Here is a very lightly tested backport to v4.11.


Works for me. Always reproduced the lockup within 2 minutes on stock
4.11. With the change applied, I was able to test for 2 hours in
the same conditions, and 4 hours with the full system, without
encountering an issue.

Feel free to add:
Tested-by: Jeffrey Hugo <jhugo@xxxxxxxxxxxxxx>

Applied, thank you!

I'm going to go back to 4.12-rc5 and see if I can either repro
the issue, or identify what changed. Hopefully I can get to
linux-next and double check the original version of the change as
well.

Looking forward to hearing what you find!

Thanx, Paul


According to git bisect, the following is what "changed":

commit 9d0eb4624601ac978b9e89be4aeadbd51ab2c830
Merge: 5faab9e 9bc1f09
Author: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
Date: Sun Jun 11 11:07:25 2017 -0700

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM fixes from Paolo Bonzini:
"Bug fixes (ARM, s390, x86)"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: async_pf: avoid async pf injection when in guest mode
KVM: cpuid: Fix read/write out-of-bounds vulnerability in cpuid emulation
arm: KVM: Allow unaligned accesses at HYP
arm64: KVM: Allow unaligned accesses at EL2
arm64: KVM: Preserve RES1 bits in SCTLR_EL2
KVM: arm/arm64: Handle possible NULL stage2 pud when ageing pages
KVM: nVMX: Fix exception injection
kvm: async_pf: fix rcu_irq_enter() with irqs enabled
KVM: arm/arm64: vgic-v3: Fix nr_pre_bits bitfield extraction
KVM: s390: fix ais handling vs cpu model
KVM: arm/arm64: Fix isues with GICv2 on GICv3 migration

Nothing really stands out to me which would "fix" the issue.

--
Jeffrey Hugo
Qualcomm Datacenter Technologies as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the
Code Aurora Forum, a Linux Foundation Collaborative Project.