Re: [PATCH v2 4/5] KVM: selftests: Add a test for KVM_RUN+rseq to detect task migration bugs

From: Sean Christopherson
Date: Wed Aug 25 2021 - 20:52:04 EST


On Mon, Aug 23, 2021, Mathieu Desnoyers wrote:
> [ re-send to Darren Hart ]
>
> ----- On Aug 23, 2021, at 11:18 AM, Mathieu Desnoyers mathieu.desnoyers@xxxxxxxxxxxx wrote:
>
> > ----- On Aug 20, 2021, at 6:50 PM, Sean Christopherson seanjc@xxxxxxxxxx wrote:
> >
> >> Add a test to verify an rseq's CPU ID is updated correctly if the task is
> >> migrated while the kernel is handling KVM_RUN. This is a regression test
> >> for a bug introduced by commit 72c3c0fe54a3 ("x86/kvm: Use generic xfer
> >> to guest work function"), where TIF_NOTIFY_RESUME would be cleared by KVM
> >> without updating rseq, leading to a stale CPU ID and other badness.
> >>
> >
> > [...]
> >
> > +#define RSEQ_SIG 0xdeadbeef
> >
> > Is there any reason for defining a custom signature rather than including
> > tools/testing/selftests/rseq/rseq.h ? This should take care of including
> > the proper architecture header which will define the appropriate signature.
> >
> > Arguably you don't define rseq critical sections in this test per se, but
> > I'm wondering why the custom signature here.

Partly to avoid taking a dependency on rseq.h, and partly to try to call out that
the test doesn't actually do any rseq critical sections.

> > [...]
> >
> >> +
> >> +static void *migration_worker(void *ign)
> >> +{
> >> + cpu_set_t allowed_mask;
> >> + int r, i, nr_cpus, cpu;
> >> +
> >> + CPU_ZERO(&allowed_mask);
> >> +
> >> + nr_cpus = CPU_COUNT(&possible_mask);
> >> +
> >> + for (i = 0; i < 20000; i++) {
> >> + cpu = i % nr_cpus;
> >> + if (!CPU_ISSET(cpu, &possible_mask))
> >> + continue;
> >> +
> >> + CPU_SET(cpu, &allowed_mask);
> >> +
> >> + /*
> >> + * Bump the sequence count twice to allow the reader to detect
> >> + * that a migration may have occurred in between rseq and sched
> >> + * CPU ID reads. An odd sequence count indicates a migration
> >> + * is in-progress, while a completely different count indicates
> >> + * a migration occurred since the count was last read.
> >> + */
> >> + atomic_inc(&seq_cnt);
> >
> > So technically this atomic_inc contains the required barriers because the
> > selftests implementation uses "__sync_add_and_fetch(&addr->val, 1)". But
> > it's rather odd that the semantic differs from the kernel implementation in
> > terms of memory barriers: the kernel implementation of atomic_inc
> > guarantees no memory barriers, but this one happens to provide full
> > barriers pretty much by accident (selftests futex/include/atomic.h
> > documents no such guarantee).

Yeah, I got quite lost trying to figure out what atomics the test would actually
end up with.

> > If this full barrier guarantee is indeed provided by the selftests atomic.h
> > header, I would really like a comment stating that in the atomic.h header
> > so the carpet is not pulled from under our feet by a future optimization.
> >
> >
> >> + r = sched_setaffinity(0, sizeof(allowed_mask), &allowed_mask);
> >> + TEST_ASSERT(!r, "sched_setaffinity failed, errno = %d (%s)",
> >> + errno, strerror(errno));
> >> + atomic_inc(&seq_cnt);
> >> +
> >> + CPU_CLR(cpu, &allowed_mask);
> >> +
> >> + /*
> >> + * Let the read-side get back into KVM_RUN to improve the odds
> >> + * of task migration coinciding with KVM's run loop.
> >
> > This comment should be about increasing the odds of letting the seqlock
> > read-side complete. Otherwise, the delay between the two back-to-back
> > atomic_inc is so small that the seqlock read-side may never have time to
> > complete the reading the rseq cpu id and the sched_getcpu() call, and can
> > retry forever.

Hmm, but that's not why there's a delay. I'm not arguing that a livelock isn't
possible (though that syscall would have to be screaming fast), but the primary
motivation is very much to allow the read-side enough time to get back into KVM
proper.

To encounter the bug, TIF_NOTIFY_RESUME has to be recognized by KVM in its run
loop, i.e. sched_setaffinity() must induce task migration after the read-side has
invoked ioctl(KVM_RUN).

> > I'm wondering if 1 microsecond is sufficient on other architectures as
> > well.

I'm definitely wondering that as well :-)

> > One alternative way to make this depend less on the architecture's
> > implementation of sched_getcpu (whether it's a vDSO, or goes through a
> > syscall) would be to read the rseq cpu id and call sched_getcpu a few times
> > (e.g. 3 times) in the migration thread rather than use usleep, and throw
> > away the value read. This would ensure the delay is appropriate on all
> > architectures.

As above, I think an arbitrary delay is required regardless of how fast
sched_getcpu() can execute. One thought would be to do sched_getcpu() _and_
usleep() to account for sched_getcpu() overhead and to satisfy the KVM_RUN part,
but I don't know that that adds meaningful value.

The real test is if someone could see if the bug repros on non-x86 hardware...