Re: [RFC PATCH 2/3] restartable sequences: x86 ABI

From: Andy Lutomirski
Date: Fri Jun 26 2015 - 15:32:07 EST


On Fri, Jun 26, 2015 at 11:09 AM, Mathieu Desnoyers
<mathieu.desnoyers@xxxxxxxxxxxx> wrote:
> ----- On Jun 24, 2015, at 6:26 PM, Paul Turner pjt@xxxxxxxxxx wrote:
>
>> Implements the x86 (i386 & x86-64) ABIs for interrupting and restarting
>> execution within restartable sequence sections.
>>
>> With respect to the x86-specific ABI:
>> On 32-bit: Upon restart, the interrupted rip is placed in %ecx
>> On 64-bit (or x32): Upon restart, the interrupted rip is placed in %r10
>>
>> While potentially surprising at first glance, this choice is strongly motivated
>> by the fact that the available scratch registers under the i386 function call
>> ABI overlap with those used as argument registers under x86_64.
>>
>> Given that sequences are already personality specific and that we always want
>> the arguments to be available for sequence restart, it's much more natural to
>> ultimately differentiate the ABI in these two cases.
>>
>> Signed-off-by: Paul Turner <pjt@xxxxxxxxxx>
>> ---
>> arch/x86/include/asm/restartable_sequences.h | 50 +++++++++++++++++++
>> arch/x86/kernel/Makefile | 2 +
>> arch/x86/kernel/restartable_sequences.c | 69 ++++++++++++++++++++++++++
>> arch/x86/kernel/signal.c | 12 +++++
>> kernel/restartable_sequences.c | 11 +++-
>> 5 files changed, 141 insertions(+), 3 deletions(-)
>> create mode 100644 arch/x86/include/asm/restartable_sequences.h
>> create mode 100644 arch/x86/kernel/restartable_sequences.c
>>
>> diff --git a/arch/x86/include/asm/restartable_sequences.h
>> b/arch/x86/include/asm/restartable_sequences.h
>> new file mode 100644
>> index 0000000..0ceb024
>> --- /dev/null
>> +++ b/arch/x86/include/asm/restartable_sequences.h
>> @@ -0,0 +1,50 @@
>> +#ifndef _ASM_X86_RESTARTABLE_SEQUENCES_H
>> +#define _ASM_X86_RESTARTABLE_SEQUENCES_H
>> +
>> +#include <asm/processor.h>
>> +#include <asm/ptrace.h>
>> +#include <linux/sched.h>
>> +
>> +#ifdef CONFIG_RESTARTABLE_SEQUENCES
>> +
>> +static inline bool arch_rseq_in_crit_section(struct task_struct *p,
>> + struct pt_regs *regs)
>> +{
>> + struct task_struct *leader = p->group_leader;
>> + struct restartable_sequence_state *rseq_state = &leader->rseq_state;
>> +
>> + unsigned long ip = (unsigned long)regs->ip;
>> + if (unlikely(ip < (unsigned long)rseq_state->crit_end &&
>> + ip >= (unsigned long)rseq_state->crit_start))
>> + return true;
>> +
>> + return false;
>> +}
>> +
>> +static inline bool arch_rseq_needs_notify_resume(struct task_struct *p)
>> +{
>> +#ifdef CONFIG_PREEMPT
>> + /*
>> + * Under CONFIG_PREEMPT it's possible for regs to be incoherent in the
>> + * case that we took an interrupt during syscall entry. Avoid this by
>> + * always deferring to our notify-resume handler.
>> + */
>> + return true;
>
> I'm a bit puzzled about this. If I look at perf_get_regs_user() in the perf
> code, task_pt_regs() seems to return the user-space pt_regs for a task with
> a current->mm set (iow, not a kernel thread), even if an interrupt nests on
> top of a system call. The only corner-case is NMIs, where an NMI may interrupt
> in the middle of setting up the task pt_regs, but scheduling should never happen
> there, right ?

Careful, here! task_pt_regs returns a pointer to the place where regs
would be if they were fully initialized. We can certainly take an
interrupt in the middle of pt_regs setup (entry_SYSCALL_64 enables
interrupts very early, for example). To me, the question is whether
we can ever be preemptable at such a time.

It's a bit worse, though: we can certainly be preemptible when other
code is accessing pt_regs. clone, execve, sigreturn, and signal
delivery come to mind.

Why don't we give up on poking at user state from the scheduler and do
it on exit to user mode instead? Starting in 4.3 (hopefully landing
in -tip in a week or two), we should have a nice function
prepare_exit_to_usermode that runs with well-defined state,
non-reentrantly, that can do whatever you want here, *including user
memory access*.

The remaining question would be what the ABI should be.

Could we get away with a vDSO function along the lines of "set *A=B
and *X=Y if we're on cpu N and *X=Z"? Straight-up cmpxchg would be
even simpler.

--Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/