Re: [PATCH v2 4/5] KVM: selftests: Add a test for KVM_RUN+rseq to detect task migration bugs

From: Mathieu Desnoyers
Date: Thu Aug 26 2021 - 14:42:21 EST


----- On Aug 25, 2021, at 8:51 PM, Sean Christopherson seanjc@xxxxxxxxxx wrote:

> On Mon, Aug 23, 2021, Mathieu Desnoyers wrote:
>> [ re-send to Darren Hart ]
>>
>> ----- On Aug 23, 2021, at 11:18 AM, Mathieu Desnoyers
>> mathieu.desnoyers@xxxxxxxxxxxx wrote:
>>
>> > ----- On Aug 20, 2021, at 6:50 PM, Sean Christopherson seanjc@xxxxxxxxxx wrote:
>> >
>> >> Add a test to verify an rseq's CPU ID is updated correctly if the task is
>> >> migrated while the kernel is handling KVM_RUN. This is a regression test
>> >> for a bug introduced by commit 72c3c0fe54a3 ("x86/kvm: Use generic xfer
>> >> to guest work function"), where TIF_NOTIFY_RESUME would be cleared by KVM
>> >> without updating rseq, leading to a stale CPU ID and other badness.
>> >>
>> >
>> > [...]
>> >
>> > +#define RSEQ_SIG 0xdeadbeef
>> >
>> > Is there any reason for defining a custom signature rather than including
>> > tools/testing/selftests/rseq/rseq.h ? This should take care of including
>> > the proper architecture header which will define the appropriate signature.
>> >
>> > Arguably you don't define rseq critical sections in this test per se, but
>> > I'm wondering why the custom signature here.
>
> Partly to avoid taking a dependency on rseq.h, and partly to try to call out that
> the test doesn't actually do any rseq critical sections.

It might then be good to add a comment near this define stating as much, so nobody
attempts to copy it as an example.
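
Something like the following, for instance (just a rough sketch of the wording I
have in mind, not text from the patch):

/*
 * This test never defines rseq critical sections, so any signature value works
 * here: it is never compared against an abort handler signature. Do not copy
 * this as an example for code that does use critical sections; such code should
 * include tools/testing/selftests/rseq/rseq.h and use the architecture-defined
 * RSEQ_SIG from there.
 */
#define RSEQ_SIG 0xdeadbeef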

>
>> > [...]
>> >
>> >> +
>> >> +static void *migration_worker(void *ign)
>> >> +{
>> >> +	cpu_set_t allowed_mask;
>> >> +	int r, i, nr_cpus, cpu;
>> >> +
>> >> +	CPU_ZERO(&allowed_mask);
>> >> +
>> >> +	nr_cpus = CPU_COUNT(&possible_mask);
>> >> +
>> >> +	for (i = 0; i < 20000; i++) {
>> >> +		cpu = i % nr_cpus;
>> >> +		if (!CPU_ISSET(cpu, &possible_mask))
>> >> +			continue;
>> >> +
>> >> +		CPU_SET(cpu, &allowed_mask);
>> >> +
>> >> +		/*
>> >> +		 * Bump the sequence count twice to allow the reader to detect
>> >> +		 * that a migration may have occurred in between rseq and sched
>> >> +		 * CPU ID reads. An odd sequence count indicates a migration
>> >> +		 * is in-progress, while a completely different count indicates
>> >> +		 * a migration occurred since the count was last read.
>> >> +		 */
>> >> +		atomic_inc(&seq_cnt);
>> >
>> > So technically this atomic_inc contains the required barriers because the
>> > selftests implementation uses "__sync_add_and_fetch(&addr->val, 1)". But
>> > it's rather odd that the semantic differs from the kernel implementation in
>> > terms of memory barriers: the kernel implementation of atomic_inc
>> > guarantees no memory barriers, but this one happens to provide full
>> > barriers pretty much by accident (selftests futex/include/atomic.h
>> > documents no such guarantee).
>
> Yeah, I got quite lost trying to figure out what atomics the test would actually
> end up with.

At the very least, until things are clarified in the selftests atomic header, I would
recommend adding a comment alongside each use of atomic_inc here stating which memory
barriers it is expected to provide. I would even prefer that we add extra (currently
unneeded) write barriers so this requirement stays explicit. Performance of the write-side
does not matter much here.
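
For example, something along these lines in migration_worker() (a rough sketch; the
__sync_synchronize() full barriers are the extra, currently unneeded ones I mention
above):

		/*
		 * The transition to an odd seq_cnt must be visible before the
		 * affinity change, and the transition back to an even count
		 * only after it, so the read-side cannot observe the new CPU
		 * together with an unchanged, even sequence count. This
		 * currently relies on the selftests atomic_inc() mapping to
		 * __sync_add_and_fetch(), which implies a full barrier.
		 */
		atomic_inc(&seq_cnt);
		__sync_synchronize();	/* currently redundant, documents the requirement */

		r = sched_setaffinity(0, sizeof(allowed_mask), &allowed_mask);
		TEST_ASSERT(!r, "sched_setaffinity failed, errno = %d (%s)",
			    errno, strerror(errno));

		__sync_synchronize();	/* currently redundant, documents the requirement */
		atomic_inc(&seq_cnt);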

>
>> > If this full barrier guarantee is indeed provided by the selftests atomic.h
>> > header, I would really like a comment stating that in the atomic.h header
>> > so the carpet is not pulled from under our feet by a future optimization.
>> >
>> >
>> >> +		r = sched_setaffinity(0, sizeof(allowed_mask), &allowed_mask);
>> >> +		TEST_ASSERT(!r, "sched_setaffinity failed, errno = %d (%s)",
>> >> +			    errno, strerror(errno));
>> >> +		atomic_inc(&seq_cnt);
>> >> +
>> >> +		CPU_CLR(cpu, &allowed_mask);
>> >> +
>> >> +		/*
>> >> +		 * Let the read-side get back into KVM_RUN to improve the odds
>> >> +		 * of task migration coinciding with KVM's run loop.
>> >
>> > This comment should be about increasing the odds of letting the seqlock
>> > read-side complete. Otherwise, the delay between the two back-to-back
>> > atomic_inc is so small that the seqlock read-side may never have time to
>> > complete reading the rseq cpu id and the sched_getcpu() call, and can
>> > retry forever.
>
> Hmm, but that's not why there's a delay. I'm not arguing that a livelock isn't
> possible (though that syscall would have to be screaming fast),

I don't think we have the same understanding of the livelock scenario. AFAIU, the livelock
would be caused by a too-small delay between the two consecutive atomic_inc() that span from
one loop iteration to the next, compared to the time it takes to perform a read-side critical
section of the seqlock. Back-to-back atomic_inc can be performed very quickly, so I doubt
the sched_getcpu implementation has good odds of being fast enough to complete in that narrow
window, leading to lots of seqlock read-side retries.
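
To make that window concrete, here is the read-side critical section as I understand it
from the patch (a simplified sketch, not the literal test code; names follow the diagram
further below):

	int snapshot, rseq_cpu, cpu;

	do {
		/* Only an even count is a valid snapshot. */
		snapshot = atomic_read(&seq_cnt) & ~1;
		smp_rmb();

		rseq_cpu = READ_ONCE(__rseq_abi->cpu_id);
		cpu = sched_getcpu();	/* must complete before the next atomic_inc... */

		smp_rmb();
	} while (snapshot != atomic_read(&seq_cnt));	/* ...otherwise retry */

	TEST_ASSERT(rseq_cpu == cpu,
		    "rseq CPU = %d, sched CPU = %d", rseq_cpu, cpu);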

> but the primary
> motivation is very much to allow the read-side enough time to get back into KVM
> proper.

I'm puzzled by your statement. AFAIU, if we leave out the delay, then we have:

migration thread                          KVM_RUN/read-side thread
-----------------------------------------------------------------------------------
                                          - ioctl(KVM_RUN)
- atomic_inc_seq_cst(&seqcnt)
- sched_setaffinity
- atomic_inc_seq_cst(&seqcnt)
                                          - a = atomic_load(&seqcnt) & ~1
                                          - smp_rmb()
                                          - b = LOAD_ONCE(__rseq_abi->cpu_id);
                                          - sched_getcpu()
                                          - smp_rmb()
                                          - re-load seqcnt/compare (succeeds)
                                            - Can only succeed if entire read-side happens while the seqcnt
                                              is in an even numbered state.
                                          - if (a != b) abort()
/* no delay. Even counter state is very
   short. */
- atomic_inc_seq_cst(&seqcnt)
/* Let's suppose the lack of delay causes the
   setaffinity to complete too early compared
   with KVM_RUN ioctl */
- sched_setaffinity
- atomic_inc_seq_cst(&seqcnt)

/* no delay. Even counter state is very
   short. */
- atomic_inc_seq_cst(&seqcnt)
/* Then a setaffinity from a following
   migration thread loop will run
   concurrently with KVM_RUN */
                                          - ioctl(KVM_RUN)
- sched_setaffinity
- atomic_inc_seq_cst(&seqcnt)

As pointed out here, if the first setaffinity runs too early compared with KVM_RUN,
a following setaffinity will run concurrently with it. However, the fact that the even
counter state is so short-lived may very well hurt the progress of the seqlock read-side.

>
> To encounter the bug, TIF_NOTIFY_RESUME has to be recognized by KVM in its run
> loop, i.e. sched_setaffinity() must induce task migration after the read-side has
> invoked ioctl(KVM_RUN).

No argument here.

>
>> > I'm wondering if 1 microsecond is sufficient on other architectures as
>> > well.
>
> I'm definitely wondering that as well :-)
>
>> > One alternative way to make this depend less on the architecture's
>> > implementation of sched_getcpu (whether it's a vDSO, or goes through a
>> > syscall) would be to read the rseq cpu id and call sched_getcpu a few times
>> > (e.g. 3 times) in the migration thread rather than use usleep, and throw
>> > away the value read. This would ensure the delay is appropriate on all
>> > architectures.
>
> As above, I think an arbitrary delay is required regardless of how fast
> sched_getcpu() can execute. One thought would be to do sched_getcpu() _and_
> usleep() to account for sched_getcpu() overhead and to satisfy the KVM_RUN part,
> but I don't know that that adds meaningful value.
>
> The real test is if someone could see if the bug repros on non-x86 hardware...

As I explain in the scenario above, I don't think we agree on the reason why the
delay is required. One way to confirm this would be to run the code without the delay
and count how many seqcnt read-side attempts succeed vs. fail, then compare that with
a run that has a 1us delay between the migration thread loops. This data should help us
come to a common understanding of the delay's role.
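
Concretely, I mean something like this in the read-side (the same loop as my earlier
sketch, with two made-up counters added for illustration):

	static unsigned long reads_ok, reads_retried;
	int snapshot, rseq_cpu, cpu;

	for (;;) {
		snapshot = atomic_read(&seq_cnt) & ~1;
		smp_rmb();
		rseq_cpu = READ_ONCE(__rseq_abi->cpu_id);
		cpu = sched_getcpu();
		smp_rmb();
		if (snapshot == atomic_read(&seq_cnt))
			break;		/* even, unchanged count: valid read */
		reads_retried++;	/* raced with an increment: retry */
	}
	reads_ok++;

At the end of the run we would print reads_ok vs reads_retried, once with the 1us delay
in the migration thread and once without.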

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com