Re: [PATCH] KVM: arm64: vgic: Fix soft lockup during VM teardown

From: Shanker Donthineni
Date: Thu Jan 19 2023 - 08:01:34 EST




On 1/19/23 01:11, Marc Zyngier wrote:


[dropping the now dead old kvmarm list]

On Wed, 18 Jan 2023 19:24:01 +0000,
Shanker Donthineni <sdonthineni@xxxxxxxxxx> wrote:



On 1/18/23 05:54, Marc Zyngier wrote:


Shanker,

Please Cc all the KVM/arm64 reviewers when sending KVM/arm64 patches.

On Wed, 18 Jan 2023 02:23:48 +0000,
Shanker Donthineni <sdonthineni@xxxxxxxxxx> wrote:

Getting intermittent CPU soft lockups during virtual machine
teardown on a system with GICv4 features enabled. In the current
implementation, __synchronize_hardirq() waits forever for
IRQD_IRQ_INPROGRESS to be cleared.

The CPU gets stuck here for a long time, which leads to the soft lockup:

	while (irqd_irq_inprogress(&desc->irq_data))
		cpu_relax();
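
For context, that wait sits inside __synchronize_hardirq() in
kernel/irq/manage.c; abridged (the sync_chip handling is omitted), it
looks roughly like this:

static void __synchronize_hardirq(struct irq_desc *desc, bool sync_chip)
{
	bool inprogress;

	do {
		unsigned long flags;

		/* Spin until the in-flight handler is flagged as done */
		while (irqd_irq_inprogress(&desc->irq_data))
			cpu_relax();

		/* Re-check under the descriptor lock */
		raw_spin_lock_irqsave(&desc->lock, flags);
		inprogress = irqd_irq_inprogress(&desc->irq_data);
		raw_spin_unlock_irqrestore(&desc->lock, flags);
	} while (inprogress);
}

If IRQD_IRQ_INPROGRESS is set and never cleared, this loop never exits.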

Is it a soft-lockup from which the system recovers? or a livelock
which leaves the system dead?

The system does not recover; I had to power-cycle it.

This isn't a soft-lockup then. This is at best a livelock.

Are these two traces an indication of concurrent events? Or are they
far apart?

Concurrent.

So you can see the VM being torn down while the vgic save sequence is
still in progress?

If you can actually see that, then this is a much bigger bug than the
simple race you are describing, and we're missing a reference on the
kvm structure. This would be a *MAJOR* bug.

How do we know the vGIC save sequence is in progress while the VM is
being torn down? I'm launching/terminating ~32 VMs in a loop to
reproduce the issue.
Please post the full traces, not snippets. The absolutely full kernel
log, the configuration, what you run, how you run it, *EVERYTHING*. I
need to be able to reproduce this.
Sure, I'll share the complete host kernel boot log from the next run.



irqreturn_t handle_irq_event(struct irq_desc *desc)
{
	irqreturn_t ret;

	irqd_set(&desc->irq_data, IRQD_IRQ_INPROGRESS);
	raw_spin_unlock(&desc->lock);

	ret = handle_irq_event_percpu(desc);

	raw_spin_lock(&desc->lock);
	irqd_clear(&desc->irq_data, IRQD_IRQ_INPROGRESS);
	return ret;
}
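
The irqd_set()/irqd_clear() helpers behind those calls are plain,
non-atomic read-modify-write operations on a single state word shared
by all the IRQD_* bits; from kernel/irq/internals.h, approximately:

#define __irqd_to_state(d) ACCESS_PRIVATE((d)->common, state_use_accessors)

static inline void irqd_set(struct irq_data *d, unsigned int mask)
{
	/* Plain |=: a concurrent update from another CPU can be lost */
	__irqd_to_state(d) |= mask;
}

static inline void irqd_clear(struct irq_data *d, unsigned int mask)
{
	__irqd_to_state(d) &= ~mask;
}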

How is that relevant to this trace? Do you see this function running
concurrently with the teardown? If it matters here, it must be a VPE
doorbell, right? But you claim that this is on a GICv4 platform, while
this would only affect GICv4.1... Or are you using GICv4.1?

handle_irq_event() is running concurrently with irq_domain_activate_irq(),
which happens before free_irq() is called. The corruption is at
[78.983544] and the teardown started at [87.360891].
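
One lost-update interleaving that would explain a permanently stuck
bit, given the plain read-modify-write helpers above (CPU numbering
illustrative; this is my reading of the trace, not a confirmed
sequence):

/*
 * CPU0: handle_irq_event()           CPU1: irq_domain_activate_irq()
 * -------------------------          -------------------------------
 *                                    old1 = state;  // INPROGRESS set
 * old0 = state;
 * state = old0 & ~INPROGRESS;        // handler done, clear the bit
 *                                    state = old1 | ACTIVATED;
 *                                    // stale write resurrects
 *                                    // IRQD_IRQ_INPROGRESS
 *
 * Nothing will ever clear the bit again, so __synchronize_hardirq()
 * spins forever.
 */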

But that doesn't match the description you made of concurrent
events. Does it take more than 9 seconds for the vgic state to be
saved to memory?

Are there any other ways the IRQD_IRQ_INPROGRESS state bit could be
corrupted, other than concurrent accesses?

I did another experiment: after changing the IRQD state bit
manipulation to atomic operations, the issue is no longer reproducible.

struct irq_common_data {
-	unsigned int		__private state_use_accessors;
+	atomic_t		__private state_use_accessors;

+#define __irqd_to_state(d)		ACCESS_PRIVATE((d)->common, state_use_accessors)
+#define __irqd_get_state(d)		atomic_read(&__irqd_to_state(d))
+#define __irqd_set_state(d, m)		atomic_or((m), &__irqd_to_state(d))
+#define __irqd_clear_state(d, m)	atomic_andnot((m), &__irqd_to_state(d))
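
For completeness, the set/clear helpers on top of those macros would
then become (my sketch of the experiment, not part of the diff above):

static inline void irqd_set(struct irq_data *d, unsigned int mask)
{
	__irqd_set_state(d, mask);	/* atomic_or(): updates can't be lost */
}

static inline void irqd_clear(struct irq_data *d, unsigned int mask)
{
	__irqd_clear_state(d, mask);	/* atomic_andnot() */
}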


[ 78.983544] irqd_set_activated: lost IRQD_IRQ_INPROGRESS old=0x10401400, new=0x10441600

[ 87.360891] __synchronize_hardirq+0x48/0x140

Yes, I'm using GICv4.1, and I used these two functions to trace the issue.

Then *please* be precise in your descriptions. You send people in the
wrong direction.

It was my mistake not to mention the exact GIC version.

[...]

I ran a stress test launching/tearing down multiple VMs for 3 hours,
and the issue is not reproducible. Without the code changes, the same
test fails in 10-30 minutes.

That doesn't add up with the earlier description of concurrent events,
with the VM teardown and the vgic saving running in parallel. My patch
doesn't prevent this.

So either your testing is insufficient, or your description of
concurrent events is inaccurate.


I'm using the unmodified v6.2-rc3/v6.0 kernels from kernel.org and
QEMU to reproduce the issue.

I'm using the steps below to launch/terminate 32 VMs in a loop. The
failure is intermittent. The same issue is also reproducible with
KVMTOOL.

numvms=32
socketcnt=1
iterations=100

while [ ${iterations} -ne 0 ]; do
    iterations=$(( iterations - 1 ))

    # Launch the VMs, preferring the NUMA node each VM is assigned to
    for idx in `seq 1 ${numvms}`; do
        vmname=vm${idx}
        cpunode=$((idx % socketcnt))
        cpuset=`lscpu | grep "NUMA node${cpunode} CPU(s)" | awk -F' ' '{ print $4 }'`
        qemu-img create -f qcow2 -F raw -b /var/lib/libvirt/images/cloudimg.raw /var/lib/libvirt/images/${vmname}.qcow2
        virt-sysprep -a /var/lib/libvirt/images/${vmname}.qcow2 --hostname ${vmname}
        virt-install \
            --name ${vmname} \
            --numatune ${cpunode},mode=preferred \
            --vcpus vcpus=16,maxvcpus=16,cpuset=${cpuset} \
            --memory 4096 \
            --import \
            --noautoconsole \
            --os-variant name=ubuntu22.04 \
            --disk path=/var/lib/libvirt/images/${vmname}.qcow2 \
            --boot loader=/usr/share/AAVMF/AAVMF_CODE.fd,loader_ro=yes,loader_type=pflash
        if [ $? -ne 0 ]; then
            echo "Failed to create vm${idx}"
            # Park here so the failure can be inspected
            while true; do sleep 1; done
        fi
        sleep 5
    done

    # Wait until every guest reports its hostname through the agent
    for idx in `seq 1 ${numvms}`; do
        echo -n "Checking vm${idx} boot: "
        while true; do
            ghostname=`virsh guestinfo vm${idx} --hostname 2>/dev/null | awk '{ print $3 }'`
            [ "${ghostname}" == "vm${idx}" ] && echo "done" && break
            sleep 1
        done
    done

    # Shut down and undefine all the VMs
    for idx in `seq 1 ${numvms}`; do
        virsh domstate vm${idx} >/dev/null 2>&1 || continue
        echo "Started shutdown & undefine vm${idx}..."
        virsh shutdown "vm${idx}" --mode agent >/dev/null 2>&1
        while true; do
            virsh domstate vm${idx} 2>/dev/null | grep -w "^shut off" >/dev/null 2>&1
            [ $? -eq 0 ] && break
            sleep 1
            virsh shutdown "vm${idx}" --mode agent >/dev/null 2>&1
        done
        virsh undefine --nvram --remove-all-storage vm${idx} 2>/dev/null
    done
done


M.

--
Without deviation from the norm, progress is not possible.