Re: [RFC PATCH v7 1/7] Restartable sequences system call

From: Andy Lutomirski
Date: Wed Aug 10 2016 - 15:07:43 EST


On Tue, Aug 9, 2016 at 9:13 AM, Boqun Feng <boqun.feng@xxxxxxxxx> wrote:
> On Wed, Aug 03, 2016 at 10:03:32PM -0700, Andy Lutomirski wrote:
>> On Wed, Aug 3, 2016 at 9:27 PM, Boqun Feng <boqun.feng@xxxxxxxxx> wrote:
>> > On Wed, Aug 03, 2016 at 09:37:57AM -0700, Andy Lutomirski wrote:
>> >> On Wed, Aug 3, 2016 at 5:27 AM, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>> >> > On Tue, Jul 26, 2016 at 03:02:19AM +0000, Mathieu Desnoyers wrote:
>> >> >> We really care about preemption here. Every migration implies a
>> >> >> preemption from a user-space perspective. If we would only care
>> >> >> about keeping the CPU id up-to-date, hooking into migration would be
>> >> >> enough. But since we want atomicity guarantees for restartable
>> >> >> sequences, we need to hook into preemption.
>> >> >
>> >> >> It allows user-space to perform update operations on per-cpu data without
>> >> >> requiring heavy-weight atomic operations.
>> >> >
>> >> > Well, a CMPXCHG without LOCK prefix isn't all that expensive on x86.
>> >> >
>> >> > It is however on PPC and possibly other architectures, so in name of
>> >> > simplicity supporting only the one variant makes sense.
>> >> >
>> >>
>> >> I wouldn't want to depend on CMPXCHG. But imagine we had primitives
>> >> that were narrower than the full abort-on-preemption primitive.
>> >> Specifically, suppose we had abort if (actual cpu != expected_cpu ||
>> >> *aptr != aval). We could do things like:
>> >>
>> >> expected_cpu = cpu;
>> >> aval = NULL; // disarm for now
>> >> begin();
>> >> aval = event_count[cpu] + 1;
>> >> event_count[cpu] = aval;
>> >> event_count[cpu]++;
>> >
>> > This line is redundant, right? Because it will guarantee a failure even
>> > in no-contention cases.
>> >
>> >>
>> >> ... compute something ...
>> >>
>> >> // arm the rest of it
>> >> aptr = &event_count[cpu];
>> >> if (*aptr != aval)
>> >> goto fail;
>> >>
>> >> *thing_im_writing = value_i_computed;
>> >> end();
>> >>
>> >> The idea here is that we don't rely on the scheduler to increment the
>> >> event count at all, which means that we get to determine the scope of
>> >> what kinds of access conflicts we care about ourselves.
>> >>
>> >
>> > If we increase the event count in userspace, how could we prevent two
>> > userspace threads from racing on the event_count[cpu] field? For
>> > example:
>> >
>> > CPU 0
>> > ================
>> > {event_count[0] is initially 0}
>> >
>> > [Thread 1]
>> > begin();
>> > aval = event_count[cpu] + 1; // 1
>> >
>> > (preempted)
>> > [Thread 2]
>> > begin();
>> > aval = event_count[cpu] + 1; // 1, too
>> > event_count[cpu] = aval; // event_count[0] is 1
>> >
>>
>> You're right :( This would work with an xadd instruction, but that's
>> very slow and doesn't exist on most architectures. It could also work
>> if we did:
>>
>> aval = some_tls_value++;
>>
>> where some_tls_value is set up such that no two threads could ever end
>> up with the same values (using high bits as thread ids, perhaps), but
>> that's messy. Maybe my idea is no good.
>
> This is a little more complex, plus I failed to find a way to do an
> atomic "if (*aptr == aval) *b = c" in userspace ;-(
>

But the kernel might be able to help using something similar to this patchset.

> However, I'm thinking maybe we can use some tricks to avoid unnecessary
> aborts-on-preemption.
>
> First of all, I notice we haven't make any constraint on what kind of
> memory objects could be "protected" by rseq critical sections yet. And I
> think this is something we should decide before adding this feature into
> kernel.
>
> We can do some optimization if we have some constraints. For example, if
> the memory objects inside the rseq critical sections could only be
> modified by userspace programs, we therefore don't need to abort
> immediately when userspace task -> kernel task context switch.

True, although trying to do a syscall in an rseq critical section
seems like a bad idea in general.

>
> Further more, if the memory objects inside the rseq critical sections
> could only be modified by userspace programs that have registered their
> rseq structures, we don't need to abort immediately between the context
> switches between two rseq-unregistered tasks or one rseq-registered
> task and one rseq-unregistered task.
>
> Instead, we do tricks as follow:
>
> defining a percpu pointer in kernel:
>
> DEFINE_PER_CPU(struct task_struct *, rseq_owner);
>
> and a cpu field in struct task_struct:
>
> struct task_struct {
> ...
> #ifdef CONFIG_RSEQ
> struct rseq __user *rseq;
> uint32_t rseq_event_counter;
> int rseq_cpu;
> #endif
> ...
> };
>
> (task_struct::rseq_cpu should be initialized as -1.)
>
> each time at sched out(in rseq_sched_out()), we do something like:
>
> if (prev->rseq) {
> raw_cpu_write(rseq_owner, prev);
> prev->rseq_cpu = smp_processor_id();
> }
>
> each time sched in(in rseq_handle_notify_resume()), we do something
> like:
>
> if (current->rseq &&
> (this_cpu_read(rseq_owner) != current ||
> current->rseq_cpu != smp_processor_id()))
> __rseq_handle_notify_resume(regs);
>
> (Also need to modify rseq_signal_deliver() to call
> __rseq_handle_notify_resume() directly).
>
>
> I think this could save some unnecessary aborts-on-preemption, however,
> TBH, I'm too sleepy to verify every corner case. Will recheck this
> tomorrow.

Interesting. That could help a bit, although it would help less if
everyone started using rseq.

But do we need to protect MAP_SHARED objects? If not, maybe we could
only track context switches between different tasks sharing the same
mm.

--Andy