External email: Use caution opening links or attachmentsThe system is not recovering, did a power cycle to recover.
Shanker,
Please Cc all the KVM/arm64 reviewers when sending KVM/arm64 patches.
On Wed, 18 Jan 2023 02:23:48 +0000,
Shanker Donthineni <sdonthineni@xxxxxxxxxx> wrote:
Getting intermittent CPU soft lockups during the virtual machines
teardown on a system with GICv4 features enabled. The function
__synchronize_hardirq() has been waiting for IRQD_IRQ_INPROGRESS
to be cleared forever as per the current implementation.
CPU stuck here for a long time leads to soft lockup:
while (irqd_irq_inprogress(&desc->irq_data))
cpu_relax();
Is it a soft-lockup from which the system recovers? or a livelock
which leaves the system dead?
What kernel version is that? Please provide all the relevant contextTested with v6.0+ and v6.2.rc4.
that could help analysing the issue.
Concurrent.
Call trace from the lockup CPU:
[ 87.238866] watchdog: BUG: soft lockup - CPU#37 stuck for 23s!
[ 87.250025] CPU: 37 PID: 1031 Comm: qemu-system-aarch64
[ 87.358397] Call trace:
[ 87.360891] __synchronize_hardirq+0x48/0x140
[ 87.365343] free_irq+0x138/0x424
[ 87.368727] vgic_v4_teardown+0xa4/0xe0
[ 87.372649] __kvm_vgic_destroy+0x18c/0x194
[ 87.376922] kvm_vgic_destroy+0x28/0x3c
[ 87.380839] kvm_arch_destroy_vm+0x24/0x44
[ 87.385024] kvm_destroy_vm+0x158/0x2c4
[ 87.388943] kvm_vm_release+0x6c/0x98
[ 87.392681] __fput+0x70/0x220
[ 87.395800] ____fput+0x10/0x20
[ 87.399005] task_work_run+0xb4/0x23c
[ 87.402746] do_exit+0x2bc/0x8a4
[ 87.406042] do_group_exit+0x34/0xb0
[ 87.409693] get_signal+0x878/0x8a0
[ 87.413254] do_notify_resume+0x138/0x1530
[ 87.417440] el0_svc+0xdc/0xf0
[ 87.420559] el0t_64_sync_handler+0xf0/0x11c
[ 87.424919] el0t_64_sync+0x18c/0x190
The state of the IRQD_IRQ_INPROGRESS information is lost inside
irq_domain_activate_irq() which happens before calling free_irq().
Instrumented the code and confirmed, the IRQD state is changed from
0x10401400 to 0x10441600 instead of 0x10401600 causing problem.
Call trace from irqd_set_activated():
[ 78.983544] irqd_set_activated: lost IRQD_IRQ_INPROGRESS
old=0x10401400, new=0x10441600
[ 78.992093] CPU: 19 PID: 1511 Comm: qemu-system-aarch64
[ 79.008461] Call trace:
[ 79.010956] dump_backtrace.part.0+0xc8/0xe0
[ 79.015328] show_stack+0x18/0x54
[ 79.018713] dump_stack_lvl+0x64/0x7c
[ 79.022459] dump_stack+0x18/0x30
[ 79.025842] irq_domain_activate_irq+0x88/0x94
[ 79.030385] vgic_v3_save_pending_tables+0x260/0x29c
[ 79.035463] vgic_set_common_attr+0xac/0x23c
[ 79.039826] vgic_v3_set_attr+0x48/0x60
[ 79.043742] kvm_device_ioctl+0x120/0x19c
[ 79.047840] __arm64_sys_ioctl+0x42c/0xe00
[ 79.052027] invoke_syscall.constprop.0+0x50/0xe0
[ 79.056835] do_el0_svc+0x58/0x180
[ 79.060308] el0_svc+0x38/0xf0
[ 79.063425] el0t_64_sync_handler+0xf0/0x11c
[ 79.067785] el0t_64_sync+0x18c/0x190
Are these two traces an indication of concurrent events? Or are they
far apart?
handle_irq_event() is running concurrently with irq_domain_activate_irq()
irqreturn_t handle_irq_event(struct irq_desc *desc)
{
irqd_set(&desc->irq_data, IRQD_IRQ_INPROGRESS);
raw_spin_unlock(&desc->lock);
ret = handle_irq_event_percpu(desc);
raw_spin_lock(&desc->lock);
irqd_clear(&desc->irq_data, IRQD_IRQ_INPROGRESS);
}
How is that relevant to this trace? Do you see this function running
concurrently with the teardown? If it matters here, it must be a VPE
doorbell, right? But you claim that this is on a GICv4 platform, while
this would only affect GICv4.1... Or are you using GICv4.1?
In this particular failed case and based on traces, the two functions
irqd_set_activated() and handle_irq_event() are concurrently modifying
IRQD state without both holding desc->lock. The irqd_set_activated()
execution path is reading memory 'state_use_accessors' in between set
& clear of IRQD_IRQ_INPROGRESS state change and writing the modified
data after executing 'irqd_clear(desc->irq_data, IRQD_IRQ_INPROGRESS)'.
To fix the lockup issue, hold desc->lock when calling functions
irq_domain_activate_irq() and irq_domain_deactivate_irq).
For that particular issue, the main problem is that we are abusing the
interrupt startup/teardown code. The locking is only a consequence of
this.
Signed-off-by: Shanker Donthineni <sdonthineni@xxxxxxxxxx>
---
arch/arm64/kvm/vgic/vgic-v3.c | 6 ++++++
arch/arm64/kvm/vgic/vgic-v4.c | 4 ++++
2 files changed, 10 insertions(+)
diff --git a/arch/arm64/kvm/vgic/vgic-v3.c b/arch/arm64/kvm/vgic/vgic-v3.c
index 2074521d4a8c..e6aa909fcbe2 100644
--- a/arch/arm64/kvm/vgic/vgic-v3.c
+++ b/arch/arm64/kvm/vgic/vgic-v3.c
@@ -353,22 +353,28 @@ int vgic_v3_lpi_sync_pending_status(struct kvm *kvm, struct vgic_irq *irq)
static void unmap_all_vpes(struct vgic_dist *dist)
{
struct irq_desc *desc;
+ unsigned long flags;
int i;
for (i = 0; i < dist->its_vm.nr_vpes; i++) {
desc = irq_to_desc(dist->its_vm.vpes[i]->irq);
+ raw_spin_lock_irqsave(&desc->lock, flags);
irq_domain_deactivate_irq(irq_desc_get_irq_data(desc));
+ raw_spin_unlock_irqrestore(&desc->lock, flags);
I guess this is the guilty one, based on your analysis, and assuming
this is a v4.1 issue, not v4.
}
}
static void map_all_vpes(struct vgic_dist *dist)
{
struct irq_desc *desc;
+ unsigned long flags;
int i;
for (i = 0; i < dist->its_vm.nr_vpes; i++) {
desc = irq_to_desc(dist->its_vm.vpes[i]->irq);
+ raw_spin_lock_irqsave(&desc->lock, flags);
irq_domain_activate_irq(irq_desc_get_irq_data(desc), false);
+ raw_spin_unlock_irqrestore(&desc->lock, flags);
}
}
diff --git a/arch/arm64/kvm/vgic/vgic-v4.c b/arch/arm64/kvm/vgic/vgic-v4.c
index ad06ba6c9b00..a01b8313e82c 100644
--- a/arch/arm64/kvm/vgic/vgic-v4.c
+++ b/arch/arm64/kvm/vgic/vgic-v4.c
@@ -139,8 +139,10 @@ static void vgic_v4_enable_vsgis(struct kvm_vcpu *vcpu)
/* Transfer the full irq state to the vPE */
vgic_v4_sync_sgi_config(vpe, irq);
desc = irq_to_desc(irq->host_irq);
+ raw_spin_lock(&desc->lock);
ret = irq_domain_activate_irq(irq_desc_get_irq_data(desc),
false);
+ raw_spin_unlock(&desc->lock);
This one looks wrong. The interrupt never fires on the host (that's
the whole point of this stuff). There isn't even a handler attached to
it. How can that result in a problem?
My take on the whole thing is that we should stop playing games with
the underlying interrupt infrastructure. How about the following hack,
which is only compile-tested. Let me know if that helps.
M.
diff --git a/arch/arm64/kvm/vgic/vgic-v3.c b/arch/arm64/kvm/vgic/vgic-v3.c
index 2074521d4a8c..2624963cb95b 100644
--- a/arch/arm64/kvm/vgic/vgic-v3.c
+++ b/arch/arm64/kvm/vgic/vgic-v3.c
@@ -350,26 +350,23 @@ int vgic_v3_lpi_sync_pending_status(struct kvm *kvm, struct vgic_irq *irq)
* The deactivation of the doorbell interrupt will trigger the
* unmapping of the associated vPE.
*/
-static void unmap_all_vpes(struct vgic_dist *dist)
+static void unmap_all_vpes(struct kvm *kvm)
{
- struct irq_desc *desc;
+ struct vgic_dist *dist = &kvm->arch.vgic;
int i;
- for (i = 0; i < dist->its_vm.nr_vpes; i++) {
- desc = irq_to_desc(dist->its_vm.vpes[i]->irq);
- irq_domain_deactivate_irq(irq_desc_get_irq_data(desc));
- }
+ for (i = 0; i < dist->its_vm.nr_vpes; i++)
+ free_irq(dist->its_vm.vpes[i]->irq, kvm_get_vcpu(kvm, i));
}
-static void map_all_vpes(struct vgic_dist *dist)
+static void map_all_vpes(struct kvm *kvm)
{
- struct irq_desc *desc;
+ struct vgic_dist *dist = &kvm->arch.vgic;
int i;
- for (i = 0; i < dist->its_vm.nr_vpes; i++) {
- desc = irq_to_desc(dist->its_vm.vpes[i]->irq);
- irq_domain_activate_irq(irq_desc_get_irq_data(desc), false);
- }
+ for (i = 0; i < dist->its_vm.nr_vpes; i++)
+ WARN_ON(vgic_v4_request_vpe_irq(kvm_get_vcpu(kvm, i),
+ dist->its_vm.vpes[i]->irq));
}
/**
@@ -394,7 +391,7 @@ int vgic_v3_save_pending_tables(struct kvm *kvm)
* and enabling of the doorbells have already been done.
*/
if (kvm_vgic_global_state.has_gicv4_1) {
- unmap_all_vpes(dist);
+ unmap_all_vpes(kvm);
vlpi_avail = true;
}
@@ -444,7 +441,7 @@ int vgic_v3_save_pending_tables(struct kvm *kvm)
out:
if (vlpi_avail)
- map_all_vpes(dist);
+ map_all_vpes(kvm);
return ret;
}
diff --git a/arch/arm64/kvm/vgic/vgic-v4.c b/arch/arm64/kvm/vgic/vgic-v4.c
index ad06ba6c9b00..a413718be92b 100644
--- a/arch/arm64/kvm/vgic/vgic-v4.c
+++ b/arch/arm64/kvm/vgic/vgic-v4.c
@@ -222,6 +222,11 @@ void vgic_v4_get_vlpi_state(struct vgic_irq *irq, bool *val)
*val = !!(*ptr & mask);
}
+int vgic_v4_request_vpe_irq(struct kvm_vcpu *vcpu, int irq)
+{
+ return request_irq(irq, vgic_v4_doorbell_handler, 0, "vcpu", vcpu);
+}
+
/**
* vgic_v4_init - Initialize the GICv4 data structures
* @kvm: Pointer to the VM being initialized
@@ -283,8 +288,7 @@ int vgic_v4_init(struct kvm *kvm)
irq_flags &= ~IRQ_NOAUTOEN;
irq_set_status_flags(irq, irq_flags);
- ret = request_irq(irq, vgic_v4_doorbell_handler,
- 0, "vcpu", vcpu);
+ ret = vgic_v4_request_vpe_irq(vcpu, irq);
if (ret) {
kvm_err("failed to allocate vcpu IRQ%d\n", irq);
/*
diff --git a/arch/arm64/kvm/vgic/vgic.h b/arch/arm64/kvm/vgic/vgic.h
index 0c8da72953f0..23e280fa0a16 100644
--- a/arch/arm64/kvm/vgic/vgic.h
+++ b/arch/arm64/kvm/vgic/vgic.h
@@ -331,5 +331,6 @@ int vgic_v4_init(struct kvm *kvm);
void vgic_v4_teardown(struct kvm *kvm);
void vgic_v4_configure_vsgis(struct kvm *kvm);
void vgic_v4_get_vlpi_state(struct vgic_irq *irq, bool *val);
+int vgic_v4_request_vpe_irq(struct kvm_vcpu *vcpu, int irq);
#endif
--
Without deviation from the norm, progress is not possible.