Re: [PATCH 6/7] x86/vdso: Add vDSO functions for user wait instructions

From: Andy Lutomirski
Date: Tue Jul 24 2018 - 11:15:17 EST


On Mon, Jul 23, 2018 at 7:11 PM, Andy Lutomirski <luto@xxxxxxxxxx> wrote:
> On 07/23/2018 05:55 AM, Fenghua Yu wrote:
>>
>> User wants to query if user wait instructions (umonitor, umwait, and
>> tpause) are supported and use the instructions. The vDSO functions
>> provides fast interface for user to check the support and use the
>> instructions.
>>
>> waitpkg_supported and its alias __vdso_waitpkg_supported check if
>> user wait instructions (a.k.a. wait package feature) are supported
>>
>> umonitor and its alias __vdso_umonitor provide user APIs for calling
>> umonitor instruction.
>>
>> umwait and its alias __vdso_umwait provide user APIs for calling
>> umwait instruction.
>>
>> tpause and its alias __vdso_tpause provide user APIs for calling
>> tpause instruction.
>>
>> nsec_to_tsc and its alias __vdso_nsec_to_tsc converts nanoseconds
>> to TSC counter if TSC frequency is known. It will fail if TSC frequency
>> is unknown.
>>
>> The instructions can be implemented in intrinsic functions in future
>> GCC. But the vDSO interfaces are available to user without the
>> intrinsic functions support in GCC and the API waitpkg_supported and
>> nsec_to_tsc cannot be implemented as GCC functions.
>>
>> Signed-off-by: Fenghua Yu <fenghua.yu@xxxxxxxxx>
>> ---
>> arch/x86/entry/vdso/Makefile | 2 +-
>> arch/x86/entry/vdso/vdso.lds.S | 10 ++
>> arch/x86/entry/vdso/vma.c | 9 ++
>> arch/x86/entry/vdso/vuserwait.c | 233
>> +++++++++++++++++++++++++++++++++
>> arch/x86/include/asm/vdso_funcs_data.h | 3 +
>> 5 files changed, 256 insertions(+), 1 deletion(-)
>> create mode 100644 arch/x86/entry/vdso/vuserwait.c
>>
>> diff --git a/arch/x86/entry/vdso/Makefile b/arch/x86/entry/vdso/Makefile
>> index af4fcae5de83..fb0062b09b3c 100644
>> --- a/arch/x86/entry/vdso/Makefile
>> +++ b/arch/x86/entry/vdso/Makefile
>> @@ -17,7 +17,7 @@ VDSO32-$(CONFIG_X86_32) := y
>> VDSO32-$(CONFIG_IA32_EMULATION) := y
>> # files to link into the vdso
>> -vobjs-y := vdso-note.o vclock_gettime.o vgetcpu.o vdirectstore.o
>> +vobjs-y := vdso-note.o vclock_gettime.o vgetcpu.o vdirectstore.o
>> vuserwait.o
>> # files to link into kernel
>> obj-y += vma.o
>> diff --git a/arch/x86/entry/vdso/vdso.lds.S
>> b/arch/x86/entry/vdso/vdso.lds.S
>> index 097cdcda43a5..0942710608bf 100644
>> --- a/arch/x86/entry/vdso/vdso.lds.S
>> +++ b/arch/x86/entry/vdso/vdso.lds.S
>> @@ -35,6 +35,16 @@ VERSION {
>> __vdso_movdir64b_supported;
>> movdir64b;
>> __vdso_movdir64b;
>> + waitpkg_supported;
>> + __vdso_waitpkg_supported;
>> + umonitor;
>> + __vdso_umonitor;
>> + umwait;
>> + __vdso_umwait;
>> + tpause;
>> + __vdso_tpause;
>> + nsec_to_tsc;
>> + __vdso_nsec_to_tsc;
>> local: *;
>> };
>> }
>> diff --git a/arch/x86/entry/vdso/vma.c b/arch/x86/entry/vdso/vma.c
>> index edbe5e63e5c2..006dfb5e5003 100644
>> --- a/arch/x86/entry/vdso/vma.c
>> +++ b/arch/x86/entry/vdso/vma.c
>> @@ -372,10 +372,19 @@ static int vgetcpu_online(unsigned int cpu)
>> static void __init init_vdso_funcs_data(void)
>> {
>> + struct system_counterval_t sys_counterval;
>> +
>> if (static_cpu_has(X86_FEATURE_MOVDIRI))
>> vdso_funcs_data.movdiri_supported = true;
>> if (static_cpu_has(X86_FEATURE_MOVDIR64B))
>> vdso_funcs_data.movdir64b_supported = true;
>> + if (static_cpu_has(X86_FEATURE_WAITPKG))
>> + vdso_funcs_data.waitpkg_supported = true;
>> + if (static_cpu_has(X86_FEATURE_TSC_KNOWN_FREQ)) {
>> + vdso_funcs_data.tsc_known_freq = true;
>> + sys_counterval = convert_art_ns_to_tsc(1);
>> + vdso_funcs_data.tsc_per_nsec = sys_counterval.cycles;
>> + }
>
>
> You're losing a ton of precision here. You might even be losing *all* of
> the precision and malfunctioning rather badly.
>
> The correct way to do this is:
>
> tsc_counts = ns * mul >> shift;
>
> and the vclock code illustrates it. convert_art_ns_to_tsc() is a bad
> example because it uses an expensive division operation for no good reason
> except that no one bothered to optimize it.
>
>> +notrace int __vdso_nsec_to_tsc(unsigned long nsec, unsigned long *tsc)
>> +{
>> + if (!_vdso_funcs_data->tsc_known_freq)
>> + return -ENODEV;
>> +
>> + *tsc = _vdso_funcs_data->tsc_per_nsec * nsec;
>> +
>> + return 0;
>> +}
>
>
> Please don't expose this one at all. It would be nice for programs that use
> waitpkg to be migratable using CRIU-like tools, and this export actively
> harms any such effort. If you omit this function, then the kernel could
> learn to abort an in-progress __vdso_umwait if preempted (rseq-style) and
> CRIU would just work. It would be a bit of a hack, but it solves a real
> problem.
>
>> +notrace int __vdso_umwait(int state, unsigned long nsec)
>
>
> __vdso_umwait_relative(), please. Because some day (possibly soon) someone
> will want __vdso_umwait_absolute() and its friend __vdso_read_art_ns() so
> they can do:
>
> u64 start = __vdso_read_art_ns();
> __vdso_umonitor(...);
> ... do something potentially slow or that might fault ...
> __vdso_umwait_absolute(start + timeout);
>
> Also, this patch appears to have a subtle but show-stopping race. Consider:
>
> 1. Task A does UMONITOR on CPU 1
> 2. Task A is preempted.
> 3. Task B does UMONITOR on CPU 1 at a different address
> 4. Task A resumes
> 5. Task A does UMWAIT
>
> Now task A hangs, at least until the next external wakeup happens.
>
> It's not entirely clear to me how you're supposed to fix this without some
> abomination that's so bad that it torpedoes the entire feature. Except that
> there is no chicken bit to turn this thing off. Sigh.

The UMWAIT mechanism also looks like it will work incorrectly under a
VM. How do you (or, more generally, Intel) plan to handle that?