Re: [PATCH bpf-next 2/2] bpf/stackmap: fix A-A deadlock in bpf_get_stack()

From: Alexei Starovoitov
Date: Thu Oct 10 2019 - 14:06:39 EST


On 10/10/19 10:46 AM, Peter Zijlstra wrote:
> On Thu, Oct 10, 2019 at 05:19:01PM +0000, Alexei Starovoitov wrote:
>> On 10/10/19 12:36 AM, Peter Zijlstra wrote:
>>> On Wed, Oct 09, 2019 at 11:19:16PM -0700, Song Liu wrote:
>>>> bpf stackmap with build-id lookup (BPF_F_STACK_BUILD_ID) can trigger A-A
>>>> deadlock on rq_lock():
>>>>
>>>> rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
>>>> [...]
>>>> Call Trace:
>>>> try_to_wake_up+0x1ad/0x590
>>>> wake_up_q+0x54/0x80
>>>> rwsem_wake+0x8a/0xb0
>>>> bpf_get_stack+0x13c/0x150
>>>> bpf_prog_fbdaf42eded9fe46_on_event+0x5e3/0x1000
>>>> bpf_overflow_handler+0x60/0x100
>>>> __perf_event_overflow+0x4f/0xf0
>>>> perf_swevent_overflow+0x99/0xc0
>>>> ___perf_sw_event+0xe7/0x120
>>>> __schedule+0x47d/0x620
>>>> schedule+0x29/0x90
>>>> futex_wait_queue_me+0xb9/0x110
>>>> futex_wait+0x139/0x230
>>>> do_futex+0x2ac/0xa50
>>>> __x64_sys_futex+0x13c/0x180
>>>> do_syscall_64+0x42/0x100
>>>> entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>>>
>>>
>>>> diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
>>>> index 052580c33d26..3b278f6b0c3e 100644
>>>> --- a/kernel/bpf/stackmap.c
>>>> +++ b/kernel/bpf/stackmap.c
>>>> @@ -287,7 +287,7 @@ static void stack_map_get_build_id_offset(struct bpf_stack_build_id *id_offs,
>>>> bool irq_work_busy = false;
>>>> struct stack_map_irq_work *work = NULL;
>>>>
>>>> - if (in_nmi()) {
>>>> + if (in_nmi() || this_rq_is_locked()) {
>>>> work = this_cpu_ptr(&up_read_work);
>>>> if (work->irq_work.flags & IRQ_WORK_BUSY)
>>>> /* cannot queue more up_read, fallback */
>>>
>>> This is horrific crap. Just say no to that get_build_id_offset()
>>> trainwreck.
>>
>> this is not a helpful comment.
>> What issues do you see with this approach?
>
> It will still generate deadlocks if I place a tracepoint inside a lock
> that nests inside rq->lock, and it won't ever be able to detect that.
> Say do the very same thing on trace_hrtimer_start(), which is under
> cpu_base->lock, which nests inside rq->lock. That should give you an
> AB-BA.
>
> tracepoints / perf-overflow should _never_ take locks.
>
> All of stack_map_get_build_id_offset() is just disgusting games; I did
> tell you guys how to do lockless vma lookups a few years ago -- and yes,
> that is invasive core mm surgery. But this is just disgusting hacks for
> not wanting to do it right.

You mean the speculative page fault stuff?
That was my hope as well, and I offered Laurent all the help to land it.
Yet a year after we talked, the patches are no closer to landing.
Any other 'invasive mm surgery' you have in mind?

> Basically the only semi-sane thing to do with that trainwreck is
> s/in_nmi()/true/ and pray.
>
> On top of that I just hate buildids in general.

Emotions aside... build_id is useful and used in production.
It's used widely because it solves real problems.
This deadlock is from real servers, not from some sanitizer wannabe.
Hence we need to fix it as cleanly as possible and quickly.
s/in_nmi/true/ is certainly an option.
I'm worried about the overhead of doing irq_work_queue() all the time,
but I'm not familiar enough with the mechanism to justify the concern.
Would it make sense to do s/in_nmi/irqs_disabled/ instead?
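
For concreteness, that would just change the check in the quoted hunk,
something like (untested sketch):

-	if (in_nmi()) {
+	if (irqs_disabled()) {
 		work = this_cpu_ptr(&up_read_work);
 		if (work->irq_work.flags & IRQ_WORK_BUSY)
 			/* cannot queue more up_read, fallback */

Since rq->lock is only ever taken with irqs disabled, and NMI context
runs with irqs disabled as well, that single check should cover both
cases; the cost is deferring up_read() to irq_work in more situations
than strictly necessary.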