Re: [PATCH 4.14 04/33] tcp: be more careful in tcp_fragment()

From: Matthieu Baerts
Date: Tue Aug 20 2019 - 12:45:46 EST


Hi Eric,

On 08/08/2019 21:05, Greg Kroah-Hartman wrote:
> commit b617158dc096709d8600c53b6052144d12b89fab upstream.
>
> Some applications set tiny SO_SNDBUF values and expect
> TCP to just work. Recent patches to address CVE-2019-11478
> broke them in case of losses, since retransmits might
> be prevented.
>
> We should allow these flows to make progress.
>
> This patch allows the first and last skb in retransmit queue
> to be split even if memory limits are hit.
>
> It also adds the some room due to the fact that tcp_sendmsg()
> and tcp_sendpage() might overshoot sk_wmem_queued by about one full
> TSO skb (64KB size). Note this allowance was already present
> in stable backports for kernels < 4.15
>
> Note for < 4.15 backports :
> tcp_rtx_queue_tail() will probably look like :
>
> static inline struct sk_buff *tcp_rtx_queue_tail(const struct sock *sk)
> {
> struct sk_buff *skb = tcp_send_head(sk);
>
> return skb ? tcp_write_queue_prev(sk, skb) : tcp_write_queue_tail(sk);
> }
>
> Fixes: f070ef2ac667 ("tcp: tcp_fragment() should apply sane memory limits")
> Signed-off-by: Eric Dumazet <edumazet@xxxxxxxxxx>
> Reported-by: Andrew Prout <aprout@xxxxxxxxxx>
> Tested-by: Andrew Prout <aprout@xxxxxxxxxx>
> Tested-by: Jonathan Lemon <jonathan.lemon@xxxxxxxxx>
> Tested-by: Michal Kubecek <mkubecek@xxxxxxx>
> Acked-by: Neal Cardwell <ncardwell@xxxxxxxxxx>
> Acked-by: Yuchung Cheng <ycheng@xxxxxxxxxx>
> Acked-by: Christoph Paasch <cpaasch@xxxxxxxxx>
> Cc: Jonathan Looney <jtl@xxxxxxxxxxx>
> Signed-off-by: David S. Miller <davem@xxxxxxxxxxxxx>
> Signed-off-by: Matthieu Baerts <matthieu.baerts@xxxxxxxxxxxx>
> Signed-off-by: Sasha Levin <sashal@xxxxxxxxxx>
> ---
> include/net/tcp.h | 17 +++++++++++++++++
> net/ipv4/tcp_output.c | 11 ++++++++++-
> 2 files changed, 27 insertions(+), 1 deletion(-)

I am sorry to bother you again with the recent modifications in
tcp_fragment() but it seems we have a new kernel BUG with this patch in
v4.14.

Here is the call trace.


[26665.934461] ------------[ cut here ]------------
[26665.936152] kernel BUG at ./include/linux/skbuff.h:1406!
[26665.937941] invalid opcode: 0000 [#1] SMP PTI
[26665.977252] Call Trace:
[26665.978267] <IRQ>
[26665.979163] tcp_fragment+0x9c/0x2cf
[26665.980562] tcp_write_xmit+0x68f/0x988
[26665.982031] __tcp_push_pending_frames+0x3b/0xa0
[26665.983684] tcp_data_snd_check+0x2a/0xc8
[26665.985196] tcp_rcv_established+0x2a8/0x30d
[26665.986736] tcp_v4_do_rcv+0xb2/0x158
[26665.988140] tcp_v4_rcv+0x692/0x956
[26665.989533] ip_local_deliver_finish+0xeb/0x169
[26665.991250] __netif_receive_skb_core+0x51c/0x582
[26665.993028] ? inet_gro_receive+0x239/0x247
[26665.994581] netif_receive_skb_internal+0xab/0xc6
[26665.996340] napi_gro_receive+0x8a/0xc0
[26665.997790] receive_buf+0x9a1/0x9cd
[26665.999232] ? load_balance+0x17a/0x7b7
[26666.000711] ? vring_unmap_one+0x18/0x61
[26666.002196] ? detach_buf+0x60/0xfa
[26666.003526] virtnet_poll+0x128/0x1e1
[26666.004860] net_rx_action+0x12a/0x2b1
[26666.006309] __do_softirq+0x11c/0x26b
[26666.007734] ? handle_irq_event+0x44/0x56
[26666.009275] irq_exit+0x61/0xa0
[26666.010511] do_IRQ+0x9d/0xbb
[26666.011685] common_interrupt+0x85/0x85


We are doing the tests with the MPTCP stack[1], the error might come
from there but the call trace is free of MPTCP functions. We are still
working on having a reproducible setup with MPTCP before doing the same
without MPTCP but please see below the analysis we did so far with some
questions.

[1] https://github.com/multipath-tcp/mptcp/tree/mptcp_v0.94

> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index 0b477a1e11770..7994e569644e0 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -1688,6 +1688,23 @@ static inline void tcp_check_send_head(struct sock *sk, struct sk_buff *skb_unli
> tcp_sk(sk)->highest_sack = NULL;
> }
>
> +static inline struct sk_buff *tcp_rtx_queue_head(const struct sock *sk)
> +{
> + struct sk_buff *skb = tcp_write_queue_head(sk);
> +
> + if (skb == tcp_send_head(sk))
> + skb = NULL;
> +
> + return skb;
> +}
> +
> +static inline struct sk_buff *tcp_rtx_queue_tail(const struct sock *sk)
> +{
> + struct sk_buff *skb = tcp_send_head(sk);
> +
> + return skb ? tcp_write_queue_prev(sk, skb) : tcp_write_queue_tail(sk);
> +}
> +
> static inline void __tcp_add_write_queue_tail(struct sock *sk, struct sk_buff *skb)
> {
> __skb_queue_tail(&sk->sk_write_queue, skb);
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index a5960b9b6741c..a99086bf26eaf 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -1264,6 +1264,7 @@ int tcp_fragment(struct sock *sk, struct sk_buff *skb, u32 len,
> struct tcp_sock *tp = tcp_sk(sk);
> struct sk_buff *buff;
> int nsize, old_factor;
> + long limit;
> int nlen;
> u8 flags;
>
> @@ -1274,7 +1275,15 @@ int tcp_fragment(struct sock *sk, struct sk_buff *skb, u32 len,
> if (nsize < 0)
> nsize = 0;
>
> - if (unlikely((sk->sk_wmem_queued >> 1) > sk->sk_sndbuf + 0x20000)) {
> + /* tcp_sendmsg() can overshoot sk_wmem_queued by one full size skb.
> + * We need some allowance to not penalize applications setting small
> + * SO_SNDBUF values.
> + * Also allow first and last skb in retransmit queue to be split.
> + */
> + limit = sk->sk_sndbuf + 2 * SKB_TRUESIZE(GSO_MAX_SIZE);
> + if (unlikely((sk->sk_wmem_queued >> 1) > limit &&
> + skb != tcp_rtx_queue_head(sk) &&

If the skb returned by tcp_rtx_queue_head() is null -- because skb ==
tcp_send_head(sk), please see above -- should we not do something else
than going to the next condition?

> + skb != tcp_rtx_queue_tail(sk))) {

It seems that behind, tcp_write_queue_prev() can be called -- please see
above -- which will directly called skb_queue_prev() which does this:

/* This BUG_ON may seem severe, but if we just return then we
* are going to dereference garbage.
*/
BUG_ON(skb_queue_is_first(list, skb));
return skb->prev;

Should we do the same check before to avoid the BUG_ON()?

Could it be normal to hit this BUG_ON() with regular TCP or is it due to
other changes in MPTCP stack? We hope not having bothered you for
something not in the upstream kernel (yet) :)

Cheers,
Matt

> NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPWQUEUETOOBIG);
> return -ENOMEM;
> }
>

--
Matthieu Baerts | R&D Engineer
matthieu.baerts@xxxxxxxxxxxx
Tessares SA | Hybrid Access Solutions
www.tessares.net
1 Avenue Jean Monnet, 1348 Louvain-la-Neuve, Belgium