Re: [RFC PATCH v7 1/7] Restartable sequences system call

From: Boqun Feng
Date: Tue Aug 09 2016 - 12:13:54 EST


On Wed, Aug 03, 2016 at 10:03:32PM -0700, Andy Lutomirski wrote:
> On Wed, Aug 3, 2016 at 9:27 PM, Boqun Feng <boqun.feng@xxxxxxxxx> wrote:
> > On Wed, Aug 03, 2016 at 09:37:57AM -0700, Andy Lutomirski wrote:
> >> On Wed, Aug 3, 2016 at 5:27 AM, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
> >> > On Tue, Jul 26, 2016 at 03:02:19AM +0000, Mathieu Desnoyers wrote:
> >> >> We really care about preemption here. Every migration implies a
> >> >> preemption from a user-space perspective. If we would only care
> >> >> about keeping the CPU id up-to-date, hooking into migration would be
> >> >> enough. But since we want atomicity guarantees for restartable
> >> >> sequences, we need to hook into preemption.
> >> >
> >> >> It allows user-space to perform update operations on per-cpu data without
> >> >> requiring heavy-weight atomic operations.
> >> >
> >> > Well, a CMPXCHG without LOCK prefix isn't all that expensive on x86.
> >> >
> >> > It is however on PPC and possibly other architectures, so in name of
> >> > simplicity supporting only the one variant makes sense.
> >> >
> >>
> >> I wouldn't want to depend on CMPXCHG. But imagine we had primitives
> >> that were narrower than the full abort-on-preemption primitive.
> >> Specifically, suppose we had abort if (actual cpu != expected_cpu ||
> >> *aptr != aval). We could do things like:
> >>
> >> expected_cpu = cpu;
> >> aval = NULL; // disarm for now
> >> begin();
> >> aval = event_count[cpu] + 1;
> >> event_count[cpu] = aval;
> >> event_count[cpu]++;
> >
> > This line is redundant, right? Because it will guarantee a failure even
> > in no-contention cases.
> >
> >>
> >> ... compute something ...
> >>
> >> // arm the rest of it
> >> aptr = &event_count[cpu];
> >> if (*aptr != aval)
> >> goto fail;
> >>
> >> *thing_im_writing = value_i_computed;
> >> end();
> >>
> >> The idea here is that we don't rely on the scheduler to increment the
> >> event count at all, which means that we get to determine the scope of
> >> what kinds of access conflicts we care about ourselves.
> >>
> >
> > If we increase the event count in userspace, how could we prevent two
> > userspace threads from racing on the event_count[cpu] field? For
> > example:
> >
> > CPU 0
> > ================
> > {event_count[0] is initially 0}
> >
> > [Thread 1]
> > begin();
> > aval = event_count[cpu] + 1; // 1
> >
> > (preempted)
> > [Thread 2]
> > begin();
> > aval = event_count[cpu] + 1; // 1, too
> > event_count[cpu] = aval; // event_count[0] is 1
> >
>
> You're right :( This would work with an xadd instruction, but that's
> very slow and doesn't exist on most architectures. It could also work
> if we did:
>
> aval = some_tls_value++;
>
> where some_tls_value is set up such that no two threads could ever end
> up with the same values (using high bits as thread ids, perhaps), but
> that's messy. Maybe my idea is no good.

This is a little more complex, and besides, I failed to find a way to do
an atomic "if (*aptr == aval) *b = c" in userspace ;-(

However, I'm thinking maybe we can use some tricks to avoid unnecessary
aborts-on-preemption.

First of all, I notice we haven't put any constraints yet on what kinds
of memory objects may be "protected" by rseq critical sections. I think
this is something we should decide before adding this feature to the
kernel.

Given some constraints, we can optimize. For example, if the memory
objects accessed inside rseq critical sections may only be modified by
userspace programs (e.g. the free lists of a per-cpu userspace memory
allocator, which the kernel never writes), we don't need to abort
immediately on a context switch from a userspace task to a kernel task.

Furthermore, if those memory objects may only be modified by userspace
programs that have registered their rseq structures, we don't need to
abort immediately on a context switch between two rseq-unregistered
tasks, or between an rseq-registered task and an rseq-unregistered one.

Instead, we can do the following:

Define a percpu pointer in the kernel:

    DEFINE_PER_CPU(struct task_struct *, rseq_owner);

and a cpu field in struct task_struct:

    struct task_struct {
            ...
    #ifdef CONFIG_RSEQ
            struct rseq __user *rseq;
            uint32_t rseq_event_counter;
            int rseq_cpu;
    #endif
            ...
    };

(task_struct::rseq_cpu should be initialized to -1.)

Each time a task is scheduled out (in rseq_sched_out()), do something
like:

    if (prev->rseq) {
            /* Remember the last rseq user that ran on this CPU. */
            raw_cpu_write(rseq_owner, prev);
            prev->rseq_cpu = smp_processor_id();
    }

Each time a task is scheduled in (in rseq_handle_notify_resume()), do
something like:

    /*
     * Only abort if another rseq user may have run on this CPU since
     * we last did, or if we have migrated to another CPU.
     */
    if (current->rseq &&
        (this_cpu_read(rseq_owner) != current ||
         current->rseq_cpu != smp_processor_id()))
            __rseq_handle_notify_resume(regs);

(rseq_signal_deliver() would also need to be modified to call
__rseq_handle_notify_resume() directly.)
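
Putting the pieces together, the two hooks might end up looking roughly
like the sketch below (untested; the set_tsk_thread_flag() call is
assumed to mirror what the RFC patch already does, so that the
notify-resume path actually runs on return to userspace):

    static inline void rseq_sched_out(struct task_struct *prev)
    {
            if (prev->rseq) {
                    /* Record @prev as the last rseq user on this CPU. */
                    raw_cpu_write(rseq_owner, prev);
                    prev->rseq_cpu = smp_processor_id();
                    set_tsk_thread_flag(prev, TIF_NOTIFY_RESUME);
            }
    }

    static inline void rseq_handle_notify_resume(struct pt_regs *regs)
    {
            /*
             * Only abort if another rseq user may have run on this
             * CPU since we last did, or if we have migrated.
             */
            if (current->rseq &&
                (this_cpu_read(rseq_owner) != current ||
                 current->rseq_cpu != smp_processor_id()))
                    __rseq_handle_notify_resume(regs);
    }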


I think this could save some unnecessary aborts-on-preemption. However,
TBH, I'm too sleepy to verify every corner case right now; I will
recheck this tomorrow.

Regards,
Boqun
