Re: [RFC PATCH v4 1/5] glibc: Perform rseq(2) registration at nptl init and thread creation

From: Mathieu Desnoyers
Date: Thu Nov 22 2018 - 12:29:47 EST




----- On Nov 22, 2018, at 11:59 AM, Florian Weimer fweimer@xxxxxxxxxx wrote:

> * Mathieu Desnoyers:
>
>> ----- On Nov 22, 2018, at 11:28 AM, Florian Weimer fweimer@xxxxxxxxxx wrote:
>>
>>> * Mathieu Desnoyers:
>>>
>>>> Here is one scenario: we have 2 early adopter libraries using rseq which
>>>> are deployed in an environment with an older glibc (which does not
>>>> support rseq).
>>>>
>>>> Of course, none of those libraries can be dlclose'd unless they somehow
>>>> track all registered threads.
>>>
>>> Well, you can always make them NODELETE so that dlclose is not an issue.
>>> If the library is small enough, that shouldn't be a problem.
>>
>> That's indeed what I do with lttng-ust, mainly due to use of pthread_key.
>>
>>>
>>>> But let's focus on how exactly those libraries can handle lazily
>>>> registering rseq. They can use pthread_key, and pthread_setspecific on
>>>> first use by the thread to setup a destructor function to be invoked
>>>> at thread exit. But each early adopter library is unaware of the
>>>> other, so if we just use a "is_initialized" flag, the first destructor
>>>> to run will unregister rseq while the second library may still be
>>>> using it.
>>>
>>> I don't think you need unregistering if the memory is initial-exec TLS
>>> memory. Initial-exec TLS memory is tied directly to the TCB and cannot
>>> be freed while the thread is running, so it should be safe to put the
>>> rseq area there even if glibc knows nothing about it.
>>
>> Is it true for user-supplied stacks as well?
>
> I'm not entirely sure because the glibc terminology is confusing, but I
> think it places initial-exec TLS into the static TLS area (so that it has
> a fixed offset from the TCB). The static TLS area is placed on the
> user-supplied stack.

You said earlier in the email thread that a user-supplied stack can be
reclaimed by __free_tcb () while the thread is still running. Is that correct?
If so, then we really want to unregister the rseq TLS before that.

I notice that __free_tcb () calls __deallocate_stack (), which invokes
_dl_deallocate_tls (). Accessing the TLS from the kernel upon preemption
would appear fragile after this call.
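
For reference, the lazy per-thread scheme discussed earlier in this thread
(pthread_key plus a destructor that runs at thread exit) can be sketched
roughly as follows. This is only an illustration under stated assumptions:
all names (lib_use_rseq, lib_thread_exit, run_worker_thread and the two
counters) are hypothetical, and the actual rseq(2) register/unregister
calls are replaced by counters so that the sketch runs on any system.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-ins for the rseq(2) register/unregister calls:
   they only count invocations. */
static atomic_int registrations, unregistrations;

static pthread_key_t lib_key;
static pthread_once_t lib_once = PTHREAD_ONCE_INIT;

/* Destructor: runs at thread exit for each thread that armed the key. */
static void lib_thread_exit(void *arg)
{
    (void) arg;
    atomic_fetch_add(&unregistrations, 1);   /* placeholder: unregister */
}

static void lib_key_init(void)
{
    int ret = pthread_key_create(&lib_key, lib_thread_exit);
    assert(ret == 0);
    (void) ret;
}

/* Called on each entry into the library; registers on first use per thread. */
void lib_use_rseq(void)
{
    pthread_once(&lib_once, lib_key_init);
    if (pthread_getspecific(lib_key) == NULL) {
        atomic_fetch_add(&registrations, 1); /* placeholder: register */
        /* Any non-NULL value arms the destructor for this thread. */
        pthread_setspecific(lib_key, &lib_key);
    }
}

static void *worker(void *arg)
{
    (void) arg;
    lib_use_rseq();
    lib_use_rseq();   /* already registered: no second registration */
    return NULL;
}

/* Spawns one thread, waits for it to exit, returns 0 on success. */
int run_worker_thread(void)
{
    pthread_t tid;
    if (pthread_create(&tid, NULL, worker, NULL) != 0)
        return -1;
    return pthread_join(tid, NULL);
}

int registration_count(void)   { return atomic_load(&registrations); }
int unregistration_count(void) { return atomic_load(&unregistrations); }
```

With two independent libraries each doing this on their own, the first
destructor to run would unregister while the other library may still be
using rseq, which is the problem described in the quoted scenario.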

[...]

>> One issue here is that early adopter libraries cannot always use
>> the IE model. I tried using it for other TLS variables in lttng-ust, and
>> it ended up hanging our CI tests when tracing a sample application with
>> lttng-ust under a Java virtual machine: being dlopen'd in a process that
>> possibly already exhausts the number of available backup TLS IE entries
>> seems to have odd effects. This is why I'm worried about using the IE model
>> within lttng-ust.
>
> You can work around this by preloading the library. I'm not sure if
> this is a compelling reason not to use initial-exec TLS memory.

LTTng-UST is meant to be used as a dependency for e.g. a Java logger
or a Python logger. Those rely on dlopen, and it would be very painful
to ask all users to preload lttng-ust within their environment, which is
sometimes already complex. It works today through dlopen, and I consider
this a user-facing behavior which I am very reluctant to break.

>
>>>> The same problem arises if we have an application early adopter which
>>>> explicitly deals with rseq, with a library early adopter. The issue is
>>>> similar, except that the application will explicitly want to unregister
>>>> rseq before exiting the thread, which leaves a race window where rseq
>>>> is unregistered, but the library may still need to use it.
>>>>
>>>> The reference counter solves this: only the last rseq user for a thread
>>>> performs unregistration.
>>>
>>> If you do explicit unregistration, you will run into issues related to
>>> destructor ordering. You should really find a way to avoid that.
>>
>> The per-thread reference counter is a way to avoid issues that arise from
>> lack of destructor ordering. Is it an acceptable approach for you, or
>> do you have something else in mind?
>
> Only for the involved libraries. It will not help if other TLS
> destructors run and use these libraries.

You bring up an interesting point. The reference counter suffices to ensure
that the kernel will not try to reference the TLS area beyond its registration
scope, but it does not guarantee that another destructor (or a signal
handler) won't try to use the rseq TLS area after it has been unregistered.

Unregistration of the TLS before freeing its memory is required for correctness.
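
To make the reference-counting idea concrete, here is a minimal sketch,
assuming a per-thread counter shared by all rseq users in the process. The
names (rseq_users, rseq_user_get, rseq_user_put, fake_rseq_register,
fake_rseq_unregister) are hypothetical, and the two fake_* functions merely
stand in for the real rseq(2) registration and unregistration calls.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-thread state, shared by every rseq user in the thread. */
static __thread unsigned int rseq_users;   /* active users in this thread */
static __thread bool rseq_registered;      /* models the kernel-side state */

/* Placeholders for the real rseq(2) register/unregister system calls. */
static void fake_rseq_register(void)   { rseq_registered = true; }
static void fake_rseq_unregister(void) { rseq_registered = false; }

/* Called by each user (library or application) before its first use. */
void rseq_user_get(void)
{
    if (rseq_users++ == 0)
        fake_rseq_register();      /* only the first user registers */
}

/* Called by each user when done, e.g. from its thread-exit destructor. */
void rseq_user_put(void)
{
    assert(rseq_users > 0);
    if (--rseq_users == 0)
        fake_rseq_unregister();    /* only the last user unregisters */
}

bool rseq_is_registered(void)
{
    return rseq_registered;
}
```

With two users in the same thread, the first rseq_user_put () leaves the
registration in place and only the second one removes it, whatever order
the destructors run in. As noted above, this does nothing for code that
touches the rseq area after the final put.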

However, a use-after-unregistration can be dealt with by other means. This
is one of the reasons why I want to upstream the "cpu_opv" system call into
Linux: this is a fallback mechanism to use when rseq cannot make forward
progress (e.g. debugger single-stepping), or to use in those scenarios
where rseq is not registered (early at thread creation, or late at thread
exit). Moreover, it allows handling use-cases of migration of data between
per-cpu data structures, which is pretty much impossible to do right if we
only have rseq available.

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com