Re: [PATCH v2 bpf-next 5/5] bpf: trampoline: support FTRACE_OPS_FL_SHARE_IPMODIFY

From: Song Liu
Date: Wed Jul 06 2022 - 22:11:56 EST




> On Jul 6, 2022, at 6:18 PM, Steven Rostedt <rostedt@xxxxxxxxxxx> wrote:
>
> On Thu, 7 Jul 2022 00:19:07 +0000
> Song Liu <songliubraving@xxxxxx> wrote:
>
>>>> In this specific race condition, register_bpf() will succeed, as it already
>>>> got tr->mutex. But the IPMODIFY (livepatch) side will fail and retry.
>>>
>>> What else takes the tr->mutex ?
>>
>> tr->mutex is the local mutex for a single BPF trampoline, we only need to take
>> it when we make changes to the trampoline (add/remove fentry/fexit programs).
>>
>>>
>>> If it preempts anything else taking that mutex, when this runs, then it
>>> needs to be careful.
>>>
>>> You said this can happen when the live patch came first. This isn't racing
>>> against live patch, it's racing against anything that takes the tr->mutex
>>> and then adds a bpf trampoline to a location that has a live patch.
>>
>> There are a few scenarios here:
>> 1. Live patch is already applied, then a BPF trampoline is being registered
>> to the same function. In bpf_trampoline_update(), register_fentry returns
>> -EAGAIN, and this will be resolved.
>
> Where will it be resolved?

bpf_trampoline_update() will set tr->indirect_call and goto again. Then the
trampoline is re-prepared to be able to share with the IPMODIFY functions
and register_fentry will succeed.

>
>>
>> 2. BPF trampoline is already registered, then a live patch is being applied
>> to the same function. The live patch code need to ask the bpf trampoline to
>> prepare the trampoline before live patch. This is done by calling
>> bpf_tramp_ftrace_ops_func.
>>
>> 2.1 If nothing else is modifying the trampoline at the same time,
>> bpf_tramp_ftrace_ops_func will succeed.
>>
>> 2.2 In rare cases, if something else is holding tr->mutex to make changes to
>> the trampoline (add/remove fentry functions, etc.), mutex_trylock in
>> bpf_tramp_ftrace_ops_func will fail, and live patch will fail. However, the
>> change to BPF trampoline will still succeed. It is common for live patch to
>> retry, so we just need to try live patch again when no one is making changes
>> to the BPF trampoline in parallel.
>
> If the live patch is going to try again, and the task doing the live
> patch is SCHED_FIFO, and the task holding the tr->mutex is SCHED_OTHER
> (or just a lower priority), then there is a chance that the live patch
> task preempted the tr->mutex owner, and let's say the tr->mutex owner
> is pinned to the CPU (by the user or whatever), then because the live
> patch task is in a loop trying to take that mutex, it will never let
> the owner continue.

Yeah, I got this scenario. I just don't think we should run live patch
with high priority. Well, maybe we shouldn't make such assumptions.

>
> Yes, this is a real scenario with trylock on mutexes. We hit it all the
> time in RT.
>
>>
>>>
>>>>
>>>> Since both livepatch and bpf trampoline changes are rare operations, I think
>>>> the chance of the race condition is low enough.
>
>
> A low race condition in a world that does this a billion times a day,
> ends up being not so rare.
>
> I like to say, "I live in a world where the unlikely is very much likely!"
>
>
>>>>
>>>> Does this make sense?
>>>>
>>>
>>> It's low, and if it is also a privileged operation then there's less to be
>>> concern about.
>>
>> Both live patch and BPF trampoline are privileged operations.
>
> This makes the issue less of an issue, but if there's an application
> that does this with setuid or something, there's a chance that it can
> be used by an attacker as well.
>
>>
>>> As if it is not, then we could have a way to deadlock the
>>> system. I'm more concerned that this will lead to a CVE than it just
>>> happening randomly. In other words, it only takes something that can run at
>>> a real-time priority to connect to a live patch location, and something
>>> that runs at a low priority to take a tr->mutex. If an attacker has both,
>>> then it can pin both to a CPU and then cause the deadlock to the system.
>>>
>>> One hack to fix this is to add a msleep(1) in the failed case of the
>>> trylock. This will at least give the owner of the lock a millisecond to
>>> release it. This was what the RT patch use to do with spin_trylock() that
>>> was converted to a mutex (and we worked hard to remove all of them).
>>
>> The fix is really simple. But I still think we don't need it. We only hit
>> the trylock case for something with IPMODIFY. The non-privileged user
>> should not be able to do that, right?
>
> For now, perhaps. But what useful applications are there going to be in
> the future that performs these privileged operations on behalf of a
> non-privileged user?
>
> In other words, if we can fix it now, we should, and avoid debugging
> this issue 5 years from now where it may take months to figure out.

Fair enough.. I guess we will just add the msleep(1) in the -EAGAIN
path. If this sounds good, I will send v3 with this change and more
comments.

Thanks,
Song