Re: [patch net v3] tcp: prohibit TCP_REPAIR_OPTIONS if data was already sent

From: luwei (O)
Date: Wed Nov 02 2022 - 22:11:31 EST

Next message: Haifeng Xu: "Re: [PATCH] cgroup: Simplify code in css_set_move_task"
Previous message: Yury Norov: "Re: [PATCH 0/3] bitmap: optimize small_const path for"
In reply to: Neal Cardwell: "Re: [patch net v3] tcp: prohibit TCP_REPAIR_OPTIONS if data was already sent"
Next in thread: Eric Dumazet: "Re: [patch net v3] tcp: prohibit TCP_REPAIR_OPTIONS if data was already sent"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On 2022/11/2 10:46 PM, Neal Cardwell wrote:

On Wed, Nov 2, 2022 at 8:23 AM Lu Wei <luwei32@xxxxxxxxxx> wrote:

If setsockopt with option name of TCP_REPAIR_OPTIONS and opt_code
of TCPOPT_SACK_PERM is called to enable sack after data is sent
and before data is acked, ...

This "before data is acked" phrase does not quite seem to match the
sequence below, AFAICT?

How about something like:

If setsockopt TCP_REPAIR_OPTIONS with opt_code TCPOPT_SACK_PERM
is called to enable SACK after data is sent and the data sender receives a
dupack, ...

Yes, thanks for the suggestion.

... it will trigger a warning in function
tcp_verify_left_out() as follows:

============================================
WARNING: CPU: 8 PID: 0 at net/ipv4/tcp_input.c:2132
tcp_timeout_mark_lost+0x154/0x160
tcp_enter_loss+0x2b/0x290
tcp_retransmit_timer+0x50b/0x640
tcp_write_timer_handler+0x1c8/0x340
tcp_write_timer+0xe5/0x140
call_timer_fn+0x3a/0x1b0
__run_timers.part.0+0x1bf/0x2d0
run_timer_softirq+0x43/0xb0
__do_softirq+0xfd/0x373
__irq_exit_rcu+0xf6/0x140

The warning is caused in the following steps:
1. a socket named socketA is created
2. socketA enters repair mode without establishing a connection
3. socketA calls connect() and its state is changed to TCP_ESTABLISHED
directly
4. socketA leaves repair mode
5. socketA calls sendmsg() to send data; packets_out and sacked_out (as
dupacks are received) increase
6. socketA enters repair mode again
7. socketA calls setsockopt with TCPOPT_SACK_PERM to enable sack
8. retransmit timer expires, it calls tcp_timeout_mark_lost(), lost_out
increases
9. the sacked_out + lost_out > packets_out check triggers, since lost_out
and sacked_out increase repeatedly

In function tcp_timeout_mark_lost(), tp->sacked_out will be cleared if
step 7 does not happen, and the warning will not be triggered. As suggested
by Denis and Eric, TCP_REPAIR_OPTIONS should be prohibited if data was
already sent. So this patch checks tp->segs_out: TCP_REPAIR_OPTIONS
can be set only if tp->segs_out is 0.
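
The accounting in the steps above can be modeled in a few lines of
userspace C. This is only a simplified sketch, not kernel code: the struct
and the function names are hypothetical stand-ins, and it assumes that the
timeout marks every outstanding segment as lost.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the counters in struct tcp_sock. */
struct tcp_counters {
	uint32_t packets_out;	/* segments sent and not yet acked */
	uint32_t sacked_out;	/* SACKed segments, or dupacks counted without SACK */
	uint32_t lost_out;	/* segments marked lost */
	bool sack_enabled;	/* true once TCPOPT_SACK_PERM has been applied */
};

/* The condition that the warning reports: sacked_out + lost_out > packets_out. */
static bool left_out_exceeds_packets_out(const struct tcp_counters *tp)
{
	return (uint64_t)tp->sacked_out + tp->lost_out > tp->packets_out;
}

/*
 * Model of the retransmit timeout (step 8). Without SACK the dupack count
 * is cleared first; with SACK enabled it is kept.
 * Assumption: every outstanding segment is then marked lost.
 */
static void model_timeout_mark_lost(struct tcp_counters *tp)
{
	assert(tp->sacked_out <= tp->packets_out);
	if (!tp->sack_enabled)
		tp->sacked_out = 0;
	tp->lost_out = tp->packets_out;
}

/* Steps 5 to 9 with made-up numbers: 3 segments in flight, 2 dupacks seen. */
static bool scenario_triggers_warning(bool enable_sack_after_send)
{
	struct tcp_counters tp = { .packets_out = 3, .sacked_out = 2 };

	tp.sack_enabled = enable_sack_after_send;	/* step 7 */
	model_timeout_mark_lost(&tp);			/* step 8 */
	return left_out_exceeds_packets_out(&tp);	/* step 9 */
}
```

In this model scenario_triggers_warning(false) stays within bounds (0 + 3),
while scenario_triggers_warning(true) ends with 2 + 3 > 3.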

The socket-tcp tests in CRIU have been run as follows:
$ sudo ./test/zdtm.py run -t zdtm/static/socket-tcp* --keep-going \
--ignore-taint

socket-tcp* matches all socket-tcp tests in test/zdtm/static/.

Fixes: b139ba4e90dc ("tcp: Repair connection-time negotiated parameters")
Signed-off-by: Lu Wei <luwei32@xxxxxxxxxx>
---
net/ipv4/tcp.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index ef14efa1fb70..1f5cc32cf0cc 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -3647,7 +3647,7 @@ int do_tcp_setsockopt(struct sock *sk, int level, int optname,
case TCP_REPAIR_OPTIONS:
if (!tp->repair)
err = -EINVAL;
- else if (sk->sk_state == TCP_ESTABLISHED)
+ else if (sk->sk_state == TCP_ESTABLISHED && !tp->segs_out)

The tp->segs_out field is only 32 bits wide. By my math, at 200
Gbit/sec with 1500 byte MTU it can wrap roughly every 260 secs. So a
caller could get unlucky or carefully sequence its call to
TCP_REPAIR_OPTIONS (based on packets sent so far) to mess up the
accounting and trigger the kernel warning.

How about using some other method to determine if this is safe?
Perhaps using tp->bytes_sent, which is a 64-bit field, which by my
math would take 23 years to wrap at 200 Gbit/sec?
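
The wrap estimates can be checked with a short calculation. This sketch
uses hypothetical helper names and assumes that every segment carries
exactly 1500 bytes at a constant line rate, ignoring header overhead.

```c
#include <assert.h>

/* Seconds until a 32-bit segment counter wraps at a constant rate. */
static double u32_segs_wrap_seconds(double bits_per_sec, double bytes_per_seg)
{
	assert(bits_per_sec > 0.0 && bytes_per_seg > 0.0);
	double segs_per_sec = bits_per_sec / (8.0 * bytes_per_seg);

	return 4294967296.0 / segs_per_sec;	/* 2^32 segments */
}

/* Years until a 64-bit byte counter wraps at a constant rate. */
static double u64_bytes_wrap_years(double bits_per_sec)
{
	assert(bits_per_sec > 0.0);
	double bytes_per_sec = bits_per_sec / 8.0;
	double seconds = 18446744073709551616.0 / bytes_per_sec;	/* 2^64 bytes */

	return seconds / (365.25 * 24.0 * 3600.0);
}
```

With bits_per_sec = 200e9 and bytes_per_seg = 1500, the first helper gives
about 258 seconds and the second about 23.4 years, in line with the figures
above.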

If we're more paranoid about wrapping we could also check
tp->packets_out, and refuse to allow TCP_REPAIR_OPTIONS if either
tp->bytes_sent or tp->packets_out are non-zero. (Or if we're even more
paranoid I suppose we could have a special new bit to track whether
we've ever sent something, but that probably seems like overkill?)

neal
.

I didn't notice that a u32 wraps so easily at high network throughput,
thank you Neal.

But tp->packets_out should not be used, because tp->packets_out decreases
when the expected ACK is received, so it can drop back to 0, and this is the
common case.
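
A small model shows the difference between the two counters. This is only
a sketch under assumptions: the struct and helpers are hypothetical, not
kernel code, and the predicate models a check on bytes_sent alone.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified connection state for illustration only. */
struct conn_model {
	bool repair;		/* socket is in repair mode */
	bool established;	/* state is TCP_ESTABLISHED */
	uint64_t bytes_sent;	/* cumulative, never decreases */
	uint32_t packets_out;	/* in flight, drops when ACKs arrive */
};

/* Sending raises both counters. */
static void model_send(struct conn_model *c, uint32_t segs, uint32_t seg_bytes)
{
	assert(segs > 0 && seg_bytes > 0);
	c->packets_out += segs;
	c->bytes_sent += (uint64_t)segs * seg_bytes;
}

/* A full ACK returns packets_out to 0 but leaves bytes_sent untouched. */
static void model_ack_all(struct conn_model *c)
{
	c->packets_out = 0;
}

/* Assumed form of the check: allow only if nothing was ever sent. */
static bool repair_options_allowed(const struct conn_model *c)
{
	return c->repair && c->established && c->bytes_sent == 0;
}
```

After model_send() followed by model_ack_all(), packets_out is 0 again, so
it cannot tell a fresh socket from one that has already sent data, while
bytes_sent still can.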

--
Best Regards,
Lu Wei
