[PATCH AUTOSEL 6.3 06/22] bpf: tcp: Avoid taking fast sock lock in iterator

From: Sasha Levin
Date: Sun Jul 09 2023 - 11:18:06 EST

Next message: Sasha Levin: "[PATCH AUTOSEL 6.1 18/18] net: hns3: fix strncpy() not using dest-buf length as length issue"
Previous message: Sasha Levin: "[PATCH AUTOSEL 6.1 16/18] wifi: iwlwifi: pcie: add device id 51F1 for killer 1675"
In reply to: Sasha Levin: "[PATCH AUTOSEL 6.3 08/22] devlink: make health report on unregistered instance warn just once"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

From: Aditi Ghag <aditi.ghag@xxxxxxxxxxxxx>

[ Upstream commit 9378096e8a656fb5c4099b26b1370c56f056eab9 ]

This is a preparatory commit to replace `lock_sock_fast` with
`lock_sock`,and facilitate BPF programs executed from the TCP sockets
iterator to be able to destroy TCP sockets using the bpf_sock_destroy
kfunc (implemented in follow-up commits).

Previously, BPF TCP iterator was acquiring the sock lock with BH
disabled. This led to scenarios where the sockets hash table bucket lock
can be acquired with BH enabled in some path versus disabled in other.
In such situation, kernel issued a warning since it thinks that in the
BH enabled path the same bucket lock *might* be acquired again in the
softirq context (BH disabled), which will lead to a potential dead lock.
Since bpf_sock_destroy also happens in a process context, the potential
deadlock warning is likely a false alarm.

Here is a snippet of annotated stack trace that motivated this change:

```

Possible interrupt unsafe locking scenario:

CPU0 CPU1
---- ----
lock(&h->lhash2[i].lock);
local_bh_disable();
lock(&h->lhash2[i].lock);
kernel imagined possible scenario:
local_bh_disable(); /* Possible softirq */
lock(&h->lhash2[i].lock);
*** Potential Deadlock ***

process context:

lock_acquire+0xcd/0x330
_raw_spin_lock+0x33/0x40
------> Acquire (bucket) lhash2.lock with BH enabled
__inet_hash+0x4b/0x210
inet_csk_listen_start+0xe6/0x100
inet_listen+0x95/0x1d0
__sys_listen+0x69/0xb0
__x64_sys_listen+0x14/0x20
do_syscall_64+0x3c/0x90
entry_SYSCALL_64_after_hwframe+0x72/0xdc

bpf_sock_destroy run from iterator:

lock_acquire+0xcd/0x330
_raw_spin_lock+0x33/0x40
------> Acquire (bucket) lhash2.lock with BH disabled
inet_unhash+0x9a/0x110
tcp_set_state+0x6a/0x210
tcp_abort+0x10d/0x200
bpf_prog_6793c5ca50c43c0d_iter_tcp6_server+0xa4/0xa9
bpf_iter_run_prog+0x1ff/0x340
------> lock_sock_fast that acquires sock lock with BH disabled
bpf_iter_tcp_seq_show+0xca/0x190
bpf_seq_read+0x177/0x450

```

Also, Yonghong reported a deadlock for non-listening TCP sockets that
this change resolves. Previously, `lock_sock_fast` held the sock spin
lock with BH which was again being acquired in `tcp_abort`:

```
watchdog: BUG: soft lockup - CPU#0 stuck for 86s! [test_progs:2331]
RIP: 0010:queued_spin_lock_slowpath+0xd8/0x500
Call Trace:
<TASK>
_raw_spin_lock+0x84/0x90
tcp_abort+0x13c/0x1f0
bpf_prog_88539c5453a9dd47_iter_tcp6_client+0x82/0x89
bpf_iter_run_prog+0x1aa/0x2c0
? preempt_count_sub+0x1c/0xd0
? from_kuid_munged+0x1c8/0x210
bpf_iter_tcp_seq_show+0x14e/0x1b0
bpf_seq_read+0x36c/0x6a0

bpf_iter_tcp_seq_show
lock_sock_fast
__lock_sock_fast
spin_lock_bh(&sk->sk_lock.slock);
/* * Fast path return with bottom halves disabled and * sock::sk_lock.slock held.* */

...
tcp_abort
local_bh_disable();
spin_lock(&((sk)->sk_lock.slock)); // from bh_lock_sock(sk)

```

With the switch to `lock_sock`, it calls `spin_unlock_bh` before returning:

```
lock_sock
lock_sock_nested
spin_lock_bh(&sk->sk_lock.slock);
:
spin_unlock_bh(&sk->sk_lock.slock);
```

Acked-by: Yonghong Song <yhs@xxxxxxxx>
Acked-by: Stanislav Fomichev <sdf@xxxxxxxxxx>
Signed-off-by: Aditi Ghag <aditi.ghag@xxxxxxxxxxxxx>
Link: https://lore.kernel.org/r/20230519225157.760788-2-aditi.ghag@xxxxxxxxxxxxx
Signed-off-by: Martin KaFai Lau <martin.lau@xxxxxxxxxx>
Signed-off-by: Sasha Levin <sashal@xxxxxxxxxx>
---
net/ipv4/tcp_ipv4.c | 5 ++---
1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index c87958f979f0a..470c8427a48c2 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -2963,7 +2963,6 @@ static int bpf_iter_tcp_seq_show(struct seq_file *seq, void *v)
struct bpf_iter_meta meta;
struct bpf_prog *prog;
struct sock *sk = v;
- bool slow;
uid_t uid;
int ret;

@@ -2971,7 +2970,7 @@ static int bpf_iter_tcp_seq_show(struct seq_file *seq, void *v)
return 0;

if (sk_fullsock(sk))
- slow = lock_sock_fast(sk);
+ lock_sock(sk);

if (unlikely(sk_unhashed(sk))) {
ret = SEQ_SKIP;
@@ -2995,7 +2994,7 @@ static int bpf_iter_tcp_seq_show(struct seq_file *seq, void *v)

unlock:
if (sk_fullsock(sk))
- unlock_sock_fast(sk, slow);
+ release_sock(sk);
return ret;

}
--
2.39.2

Next message: Sasha Levin: "[PATCH AUTOSEL 6.1 18/18] net: hns3: fix strncpy() not using dest-buf length as length issue"
Previous message: Sasha Levin: "[PATCH AUTOSEL 6.1 16/18] wifi: iwlwifi: pcie: add device id 51F1 for killer 1675"
In reply to: Sasha Levin: "[PATCH AUTOSEL 6.3 08/22] devlink: make health report on unregistered instance warn just once"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]