Re: [PATCH] KVM: arm64: vgic: Fix soft lockup during VM teardown

From: Shanker Donthineni
Date: Wed Jan 18 2023 - 14:24:20 EST

On 1/18/23 05:54, Marc Zyngier wrote:
> Shanker,
>
> Please Cc all the KVM/arm64 reviewers when sending KVM/arm64 patches.
>
> On Wed, 18 Jan 2023 02:23:48 +0000,
> Shanker Donthineni <sdonthineni@xxxxxxxxxx> wrote:

>> Getting intermittent CPU soft lockups during virtual machine
>> teardown on a system with GICv4 features enabled. With the current
>> implementation, __synchronize_hardirq() ends up waiting forever for
>> IRQD_IRQ_INPROGRESS to be cleared.
>>
>> The CPU is stuck here for a long time, leading to the soft lockup:
>>
>>         while (irqd_irq_inprogress(&desc->irq_data))
>>                 cpu_relax();
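
For context, that busy-wait is the poll loop in __synchronize_hardirq()
in kernel/irq/manage.c. Abridged from mainline, so surrounding details
may differ:

        do {
                /* Poll without the lock; may race, hence the re-check. */
                while (irqd_irq_inprogress(&desc->irq_data))
                        cpu_relax();

                /* Ok, that indicated we're done: double-check carefully. */
                raw_spin_lock_irqsave(&desc->lock, flags);
                inprogress = irqd_irq_inprogress(&desc->irq_data);
                raw_spin_unlock_irqrestore(&desc->lock, flags);
        } while (inprogress);

A stale IRQD_IRQ_INPROGRESS bit defeats both the unlocked poll and the
locked re-check, so the loop never terminates.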

> Is it a soft-lockup from which the system recovers? Or a livelock
> which leaves the system dead?

The system does not recover; I had to do a power cycle. The full log:

[ 211.018214] watchdog: BUG: soft lockup - CPU#8 stuck for 45s! [qemu-system-aar:1261]
[ 211.026164] Modules linked in:
[ 211.029283] CPU: 8 PID: 1261 Comm: qemu-system-aar Kdump: loaded Not tainted 6.0.0 #7
[ 211.037338] Hardware name: /th500_silicon, BIOS v1.1.1-4638c6c9 01/17/2023
[ 211.044453] pstate: 62400009 (nZCv daif +PAN -UAO +TCO -DIT -SSBS BTYPE=--)
[ 211.051586] pc : __synchronize_hardirq+0x54/0x140
[ 211.056421] lr : free_irq+0x138/0x424
[ 211.060160] sp : ffff80001504f990
[ 211.063551] x29: ffff80001504f990 x28: 0000000000000008 x27: ffff00036c880000
[ 211.070850] x26: ffff00036f888c00 x25: 0000000000000000 x24: 0000000000000001
[ 211.078219] x23: ffff0000835659d0 x22: 0000000000000001 x21: 0000000000000000
[ 211.085513] x20: ffff00008356591c x19: ffff0000835658a0 x18: 000000000000003f
[ 211.092807] x17: ffff001feafabf28 x16: ffffffc0012c3388 x15: ffff80001504fab0
[ 211.100103] x14: 0000000000000008 x13: ffff800008b41718 x12: 00000000f5257d14
[ 211.107489] x11: 0000000000000040 x10: ffff8000091f1680 x9 : ffff8000091f1678
[ 211.114783] x8 : 0000000000000001 x7 : 0000000000000000 x6 : 0000000000000001
[ 211.122115] x5 : ffff8000083524e4 x4 : ffffffc000db5e20 x3 : 0000000081550146
[ 211.129409] x2 : 0000000000000000 x1 : 0000000000000000 x0 : 0000000010071200
[ 211.136736] Call trace:
[ 211.139232] __synchronize_hardirq+0x54/0x140
[ 211.143746] free_irq+0x138/0x424
[ 211.147139] vgic_v4_teardown+0xa4/0xe0
[ 211.151072] __kvm_vgic_destroy+0x18c/0x194
[ 211.155351] kvm_vgic_destroy+0x28/0x3c
[ 211.159310] kvm_arch_destroy_vm+0x24/0x44
[ 211.163579] kvm_destroy_vm+0x158/0x2c4
[ 211.167495] kvm_vm_release+0x6c/0x98
[ 211.171257] __fput+0x70/0x220
[ 211.174373] ____fput+0x10/0x20
[ 211.177588] task_work_run+0xb4/0x23c
[ 211.181327] do_exit+0x2bc/0x8a4
[ 211.184622] do_group_exit+0x34/0xb0
[ 211.188271] get_signal+0x878/0x8a0
[ 211.191856] do_notify_resume+0x138/0x1530
[ 211.196075] el0_svc+0xdc/0xf0
[ 211.199211] el0t_64_sync_handler+0xf0/0x11c
[ 211.203635] el0t_64_sync+0x18c/0x190
[ 211.207374] Kernel panic - not syncing: softlockup: hung tasks
[ 211.213335] CPU: 8 PID: 1261 Comm: qemu-system-aar Kdump: loaded Tainted: G L 6.0.0 #7
[ 211.222851] Hardware name: /th500_silicon, BIOS v1.1.1-4638c6c9 01/17/2023
[ 211.229993] Call trace:
[ 211.232498] dump_backtrace.part.0+0xc8/0xe0
[ 211.236890] show_stack+0x18/0x54
[ 211.240330] dump_stack_lvl+0x64/0x7c
[ 211.244097] dump_stack+0x18/0x30
[ 211.247479] panic+0x160/0x328
[ 211.250622] watchdog_nmi_enable+0x0/0x14
[ 211.254818] __hrtimer_run_queues+0x118/0x244
[ 211.259288] hrtimer_interrupt+0x114/0x300
[ 211.263471] arch_timer_handler_phys+0x2c/0x40
[ 211.268034] handle_percpu_devid_irq+0x84/0x138
[ 211.272671] generic_handle_domain_irq+0x2c/0x40
[ 211.277394] gic_handle_irq+0x50/0x128
[ 211.281221] call_on_irq_stack+0x2c/0x38
[ 211.285226] do_interrupt_handler+0x7c/0x88
[ 211.289552] el1_interrupt+0x34/0x54
[ 211.293239] el1h_64_irq_handler+0x18/0x20
[ 211.297433] el1h_64_irq+0x64/0x68
[ 211.300935] __synchronize_hardirq+0x54/0x140
[ 211.305385] free_irq+0x138/0x424
[ 211.308767] vgic_v4_teardown+0xa4/0xe0
[ 211.312709] __kvm_vgic_destroy+0x18c/0x194
[ 211.317007] kvm_vgic_destroy+0x28/0x3c
[ 211.320927] kvm_arch_destroy_vm+0x24/0x44
[ 211.325204] kvm_destroy_vm+0x158/0x2c4
[ 211.329125] kvm_vm_release+0x6c/0x98
[ 211.332872] __fput+0x70/0x220
[ 211.335988] ____fput+0x10/0x20
[ 211.339192] task_work_run+0xb4/0x23c
[ 211.342980] do_exit+0x2bc/0x8a4
[ 211.346310] do_group_exit+0x34/0xb0
[ 211.349960] get_signal+0x878/0x8a0
[ 211.353520] do_notify_resume+0x138/0x1530
[ 211.357703] el0_svc+0xdc/0xf0
[ 211.360818] el0t_64_sync_handler+0xf0/0x11c
[ 211.365197] el0t_64_sync+0x18c/0x190
[ 211.368947] SMP: stopping secondary CPUs


> What kernel version is that? Please provide all the relevant context
> that could help analysing the issue.

Tested with v6.0 and v6.2-rc4.


>> Call trace from the lockup CPU:
>> [ 87.238866] watchdog: BUG: soft lockup - CPU#37 stuck for 23s!
>> [ 87.250025] CPU: 37 PID: 1031 Comm: qemu-system-aarch64
>> [ 87.358397] Call trace:
>> [ 87.360891] __synchronize_hardirq+0x48/0x140
>> [ 87.365343] free_irq+0x138/0x424
>> [ 87.368727] vgic_v4_teardown+0xa4/0xe0
>> [ 87.372649] __kvm_vgic_destroy+0x18c/0x194
>> [ 87.376922] kvm_vgic_destroy+0x28/0x3c
>> [ 87.380839] kvm_arch_destroy_vm+0x24/0x44
>> [ 87.385024] kvm_destroy_vm+0x158/0x2c4
>> [ 87.388943] kvm_vm_release+0x6c/0x98
>> [ 87.392681] __fput+0x70/0x220
>> [ 87.395800] ____fput+0x10/0x20
>> [ 87.399005] task_work_run+0xb4/0x23c
>> [ 87.402746] do_exit+0x2bc/0x8a4
>> [ 87.406042] do_group_exit+0x34/0xb0
>> [ 87.409693] get_signal+0x878/0x8a0
>> [ 87.413254] do_notify_resume+0x138/0x1530
>> [ 87.417440] el0_svc+0xdc/0xf0
>> [ 87.420559] el0t_64_sync_handler+0xf0/0x11c
>> [ 87.424919] el0t_64_sync+0x18c/0x190

>> The IRQD_IRQ_INPROGRESS state is lost inside irq_domain_activate_irq(),
>> which runs before free_irq() is called. Instrumenting the code confirmed
>> that the IRQD state changes from 0x10401400 to 0x10441600 instead of
>> 0x10401600, causing the problem.
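
Decoding those values with the state bits from include/linux/irq.h
(IRQD_ACTIVATED is bit 9, IRQD_IRQ_INPROGRESS is bit 18), the observed
value is exactly the expected one with a stale in-progress flag on top.
A standalone sketch of the arithmetic, not kernel code:

#include <stdio.h>

/* irqd state bits as defined in include/linux/irq.h */
#define IRQD_ACTIVATED      (1U <<  9)   /* 0x00000200 */
#define IRQD_IRQ_INPROGRESS (1U << 18)   /* 0x00040000 */

int main(void)
{
        unsigned int old = 0x10401400;

        /* What irqd_set_activated() should leave behind: */
        printf("expected 0x%x\n", old | IRQD_ACTIVATED);   /* 0x10401600 */

        /* What was observed: the expected value plus a stale
           IRQD_IRQ_INPROGRESS bit: */
        printf("observed 0x%x\n",
               old | IRQD_ACTIVATED | IRQD_IRQ_INPROGRESS); /* 0x10441600 */
        return 0;
}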

>> Call trace from irqd_set_activated():
>> [ 78.983544] irqd_set_activated: lost IRQD_IRQ_INPROGRESS old=0x10401400, new=0x10441600
>> [ 78.992093] CPU: 19 PID: 1511 Comm: qemu-system-aarch64
>> [ 79.008461] Call trace:
>> [ 79.010956] dump_backtrace.part.0+0xc8/0xe0
>> [ 79.015328] show_stack+0x18/0x54
>> [ 79.018713] dump_stack_lvl+0x64/0x7c
>> [ 79.022459] dump_stack+0x18/0x30
>> [ 79.025842] irq_domain_activate_irq+0x88/0x94
>> [ 79.030385] vgic_v3_save_pending_tables+0x260/0x29c
>> [ 79.035463] vgic_set_common_attr+0xac/0x23c
>> [ 79.039826] vgic_v3_set_attr+0x48/0x60
>> [ 79.043742] kvm_device_ioctl+0x120/0x19c
>> [ 79.047840] __arm64_sys_ioctl+0x42c/0xe00
>> [ 79.052027] invoke_syscall.constprop.0+0x50/0xe0
>> [ 79.056835] do_el0_svc+0x58/0x180
>> [ 79.060308] el0_svc+0x38/0xf0
>> [ 79.063425] el0t_64_sync_handler+0xf0/0x11c
>> [ 79.067785] el0t_64_sync+0x18c/0x190

> Are these two traces an indication of concurrent events? Or are they
> far apart?

Concurrent.


>> irqreturn_t handle_irq_event(struct irq_desc *desc)
>> {
>>         irqreturn_t ret;
>>
>>         irqd_set(&desc->irq_data, IRQD_IRQ_INPROGRESS);
>>         raw_spin_unlock(&desc->lock);
>>
>>         ret = handle_irq_event_percpu(desc);
>>
>>         raw_spin_lock(&desc->lock);
>>         irqd_clear(&desc->irq_data, IRQD_IRQ_INPROGRESS);
>>         return ret;
>> }

> How is that relevant to this trace? Do you see this function running
> concurrently with the teardown? If it matters here, it must be a VPE
> doorbell, right? But you claim that this is on a GICv4 platform, while
> this would only affect GICv4.1... Or are you using GICv4.1?

handle_irq_event() runs concurrently with irq_domain_activate_irq(),
which happens before free_irq() is called. The corruption happened at
[78.983544] and the teardown started at [87.360891]:

[ 78.983544] irqd_set_activated: lost IRQD_IRQ_INPROGRESS old=0x10401400, new=0x10441600

[ 87.360891] __synchronize_hardirq+0x48/0x140

Yes, I'm using GICv4.1. I used these two functions to trace the issue:

static inline void irqd_set_activated(struct irq_data *d)
{
        atomic_t *state = (atomic_t *)(&__irqd_to_state(d));
        unsigned int new, old = __irqd_to_state(d);

        new = atomic_xchg(state, old | IRQD_ACTIVATED);

        if ((old ^ new) & IRQD_IRQ_INPROGRESS) {
                pr_err("%s: lost IRQD_IRQ_INPROGRESS old=0x%x new=0x%x\n",
                       __func__, old, new);
                dump_stack();
        }
}

static inline void irqd_clr_activated(struct irq_data *d)
{
        atomic_t *state = (atomic_t *)(&__irqd_to_state(d));
        unsigned int new, old = __irqd_to_state(d);

        new = atomic_xchg(state, old & ~IRQD_ACTIVATED);

        if ((old ^ new) & IRQD_IRQ_INPROGRESS) {
                pr_err("%s: lost IRQD_IRQ_INPROGRESS old=0x%x new=0x%x\n",
                       __func__, old, new);
                dump_stack();
        }
}
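
For comparison, the stock helpers in include/linux/irq.h are a plain,
unlocked read-modify-write of the state word, which is what allows the
update to be lost in the first place:

static inline void irqd_set_activated(struct irq_data *d)
{
        __irqd_to_state(d) |= IRQD_ACTIVATED;
}

static inline void irqd_clr_activated(struct irq_data *d)
{
        __irqd_to_state(d) &= ~IRQD_ACTIVATED;
}

One interleaving that leaves the flag stuck forever, sketched against
the unlocked helper above:

/*
 * CPU A: irqd_set_activated()      CPU B: handle_irq_event()
 * ---------------------------      -------------------------
 * old = state;
 *   (reads IRQD_IRQ_INPROGRESS
 *    as still set)
 *                                  irqd_clear(&desc->irq_data,
 *                                             IRQD_IRQ_INPROGRESS);
 * state = old | IRQD_ACTIVATED;
 *   (stale IRQD_IRQ_INPROGRESS is
 *    written back and nothing will
 *    ever clear it, so
 *    __synchronize_hardirq() in
 *    free_irq() spins forever)
 */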


>> In this particular failure, and based on the traces, the two functions
>> irqd_set_activated() and handle_irq_event() modify the IRQD state
>> concurrently without both holding desc->lock. The irqd_set_activated()
>> path reads the 'state_use_accessors' word between the set and the clear
>> of IRQD_IRQ_INPROGRESS, and writes the stale value back after
>> irqd_clear(&desc->irq_data, IRQD_IRQ_INPROGRESS) has run.
>>
>> To fix the lockup, hold desc->lock when calling
>> irq_domain_activate_irq() and irq_domain_deactivate_irq().

> For that particular issue, the main problem is that we are abusing the
> interrupt startup/teardown code. The locking is only a consequence of
> this.


>> Signed-off-by: Shanker Donthineni <sdonthineni@xxxxxxxxxx>
>> ---
>>  arch/arm64/kvm/vgic/vgic-v3.c | 6 ++++++
>>  arch/arm64/kvm/vgic/vgic-v4.c | 4 ++++
>>  2 files changed, 10 insertions(+)
>>
>> diff --git a/arch/arm64/kvm/vgic/vgic-v3.c b/arch/arm64/kvm/vgic/vgic-v3.c
>> index 2074521d4a8c..e6aa909fcbe2 100644
>> --- a/arch/arm64/kvm/vgic/vgic-v3.c
>> +++ b/arch/arm64/kvm/vgic/vgic-v3.c
>> @@ -353,22 +353,28 @@ int vgic_v3_lpi_sync_pending_status(struct kvm *kvm, struct vgic_irq *irq)
>>  static void unmap_all_vpes(struct vgic_dist *dist)
>>  {
>>          struct irq_desc *desc;
>> +        unsigned long flags;
>>          int i;
>>
>>          for (i = 0; i < dist->its_vm.nr_vpes; i++) {
>>                  desc = irq_to_desc(dist->its_vm.vpes[i]->irq);
>> +                raw_spin_lock_irqsave(&desc->lock, flags);
>>                  irq_domain_deactivate_irq(irq_desc_get_irq_data(desc));
>> +                raw_spin_unlock_irqrestore(&desc->lock, flags);

> I guess this is the guilty one, based on your analysis, and assuming
> this is a v4.1 issue, not v4.

>>          }
>>  }
>>
>>  static void map_all_vpes(struct vgic_dist *dist)
>>  {
>>          struct irq_desc *desc;
>> +        unsigned long flags;
>>          int i;
>>
>>          for (i = 0; i < dist->its_vm.nr_vpes; i++) {
>>                  desc = irq_to_desc(dist->its_vm.vpes[i]->irq);
>> +                raw_spin_lock_irqsave(&desc->lock, flags);
>>                  irq_domain_activate_irq(irq_desc_get_irq_data(desc), false);
>> +                raw_spin_unlock_irqrestore(&desc->lock, flags);
>>          }
>>  }
>>
>> diff --git a/arch/arm64/kvm/vgic/vgic-v4.c b/arch/arm64/kvm/vgic/vgic-v4.c
>> index ad06ba6c9b00..a01b8313e82c 100644
>> --- a/arch/arm64/kvm/vgic/vgic-v4.c
>> +++ b/arch/arm64/kvm/vgic/vgic-v4.c
>> @@ -139,8 +139,10 @@ static void vgic_v4_enable_vsgis(struct kvm_vcpu *vcpu)
>>                  /* Transfer the full irq state to the vPE */
>>                  vgic_v4_sync_sgi_config(vpe, irq);
>>                  desc = irq_to_desc(irq->host_irq);
>> +                raw_spin_lock(&desc->lock);
>>                  ret = irq_domain_activate_irq(irq_desc_get_irq_data(desc),
>>                                                false);
>> +                raw_spin_unlock(&desc->lock);

> This one looks wrong. The interrupt never fires on the host (that's
> the whole point of this stuff). There isn't even a handler attached to
> it. How can that result in a problem?

Thanks for catching the mistake.


> My take on the whole thing is that we should stop playing games with
> the underlying interrupt infrastructure. How about the following hack,
> which is only compile-tested. Let me know if that helps.
>
>         M.

> diff --git a/arch/arm64/kvm/vgic/vgic-v3.c b/arch/arm64/kvm/vgic/vgic-v3.c
> index 2074521d4a8c..2624963cb95b 100644
> --- a/arch/arm64/kvm/vgic/vgic-v3.c
> +++ b/arch/arm64/kvm/vgic/vgic-v3.c
> @@ -350,26 +350,23 @@ int vgic_v3_lpi_sync_pending_status(struct kvm *kvm, struct vgic_irq *irq)
>   * The deactivation of the doorbell interrupt will trigger the
>   * unmapping of the associated vPE.
>   */
> -static void unmap_all_vpes(struct vgic_dist *dist)
> +static void unmap_all_vpes(struct kvm *kvm)
>  {
> -        struct irq_desc *desc;
> +        struct vgic_dist *dist = &kvm->arch.vgic;
>          int i;
>
> -        for (i = 0; i < dist->its_vm.nr_vpes; i++) {
> -                desc = irq_to_desc(dist->its_vm.vpes[i]->irq);
> -                irq_domain_deactivate_irq(irq_desc_get_irq_data(desc));
> -        }
> +        for (i = 0; i < dist->its_vm.nr_vpes; i++)
> +                free_irq(dist->its_vm.vpes[i]->irq, kvm_get_vcpu(kvm, i));
>  }
>
> -static void map_all_vpes(struct vgic_dist *dist)
> +static void map_all_vpes(struct kvm *kvm)
>  {
> -        struct irq_desc *desc;
> +        struct vgic_dist *dist = &kvm->arch.vgic;
>          int i;
>
> -        for (i = 0; i < dist->its_vm.nr_vpes; i++) {
> -                desc = irq_to_desc(dist->its_vm.vpes[i]->irq);
> -                irq_domain_activate_irq(irq_desc_get_irq_data(desc), false);
> -        }
> +        for (i = 0; i < dist->its_vm.nr_vpes; i++)
> +                WARN_ON(vgic_v4_request_vpe_irq(kvm_get_vcpu(kvm, i),
> +                                                dist->its_vm.vpes[i]->irq));
>  }
>
>  /**
> @@ -394,7 +391,7 @@ int vgic_v3_save_pending_tables(struct kvm *kvm)
>           * and enabling of the doorbells have already been done.
>           */
>          if (kvm_vgic_global_state.has_gicv4_1) {
> -                unmap_all_vpes(dist);
> +                unmap_all_vpes(kvm);
>                  vlpi_avail = true;
>          }
>
> @@ -444,7 +441,7 @@ int vgic_v3_save_pending_tables(struct kvm *kvm)
>
>  out:
>          if (vlpi_avail)
> -                map_all_vpes(dist);
> +                map_all_vpes(kvm);
>
>          return ret;
>  }
> diff --git a/arch/arm64/kvm/vgic/vgic-v4.c b/arch/arm64/kvm/vgic/vgic-v4.c
> index ad06ba6c9b00..a413718be92b 100644
> --- a/arch/arm64/kvm/vgic/vgic-v4.c
> +++ b/arch/arm64/kvm/vgic/vgic-v4.c
> @@ -222,6 +222,11 @@ void vgic_v4_get_vlpi_state(struct vgic_irq *irq, bool *val)
>          *val = !!(*ptr & mask);
>  }
>
> +int vgic_v4_request_vpe_irq(struct kvm_vcpu *vcpu, int irq)
> +{
> +        return request_irq(irq, vgic_v4_doorbell_handler, 0, "vcpu", vcpu);
> +}
> +
>  /**
>   * vgic_v4_init - Initialize the GICv4 data structures
>   * @kvm: Pointer to the VM being initialized
> @@ -283,8 +288,7 @@ int vgic_v4_init(struct kvm *kvm)
>                  irq_flags &= ~IRQ_NOAUTOEN;
>                  irq_set_status_flags(irq, irq_flags);
>
> -                ret = request_irq(irq, vgic_v4_doorbell_handler,
> -                                  0, "vcpu", vcpu);
> +                ret = vgic_v4_request_vpe_irq(vcpu, irq);
>                  if (ret) {
>                          kvm_err("failed to allocate vcpu IRQ%d\n", irq);
>                          /*
> diff --git a/arch/arm64/kvm/vgic/vgic.h b/arch/arm64/kvm/vgic/vgic.h
> index 0c8da72953f0..23e280fa0a16 100644
> --- a/arch/arm64/kvm/vgic/vgic.h
> +++ b/arch/arm64/kvm/vgic/vgic.h
> @@ -331,5 +331,6 @@ int vgic_v4_init(struct kvm *kvm);
>  void vgic_v4_teardown(struct kvm *kvm);
>  void vgic_v4_configure_vsgis(struct kvm *kvm);
>  void vgic_v4_get_vlpi_state(struct vgic_irq *irq, bool *val);
> +int vgic_v4_request_vpe_irq(struct kvm_vcpu *vcpu, int irq);
>
>  #endif


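For what it's worth, my understanding of why this approach avoids the
race: free_irq() shuts the interrupt down under desc->lock, waits in
__synchronize_hardirq() for any in-flight handler, and only then
deactivates the domain hierarchy, again with desc->lock held, while
request_irq() likewise starts the interrupt up under the descriptor
lock. So there is no longer an unlocked activate/deactivate racing
with handle_irq_event().
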
I ran a stress test launching and tearing down multiple VMs for 3hrs.
The issue is no longer reproducible. The same test fails within
10-30min without the code changes.

> --
> Without deviation from the norm, progress is not possible.