Re: [net-next,v3] tcp: Improve setsockopt() TCP_USER_TIMEOUT accuracy

From: Eric Dumazet
Date: Tue Jul 10 2018 - 08:39:04 EST




On 07/09/2018 11:51 PM, Jon Maxwell wrote:
> v3 contains the following suggestions by Neal Cardwell:
>
> 1) Fix up units mismatch regarding msec/jiffies.
> 2) Address possiblility of time_remaining being negative.
> 3) Add a helper routine tcp_clamp_rto_to_user_timeout() to do the rto
> calculation.
> 4) Move start_ts logic into helper routine tcp_retrans_stamp() to
> validate tcp_sk(sk)->retrans_stamp.
> 5) Some u32 declation and return refactoring.
> 6) Return 0 instead of false in tcp_retransmit_stamp(), it's not a bool.
>
> Suggestions by David Laight:
>
> 1) Don't cache rto in tcp_clamp_rto_to_user_timeout().
> 2) Use conditional operator instead of min_t() in
> tcp_clamp_rto_to_user_timeout()
>
> Changes:
>
> 1) Call tcp_clamp_rto_to_user_timeout(sk) as an argument to
> inet_csk_reset_xmit_timer() to save on rto declaration.
>
> Every time the TCP retransmission timer fires. It checks to see if there is a
> timeout before scheduling the next retransmit timer. The retransmit interval
> between each retransmission increases exponentially. The issue is that in order
> for the timeout to occur the retransmit timer needs to fire again. If the user
> timeout check happens after the 9th retransmit for example. It needs to wait for
> the 10th retransmit timer to fire in order to evaluate whether a timeout has
> occurred or not. If the interval is large enough then the timeout will be
> inaccurate.
>
> For example with a TCP_USER_TIMEOUT of 10 seconds without patch:
>
> 1st retransmit:
>
> 22:25:18.973488 IP host1.49310 > host2.search-agent: Flags [.]
>
> Last retransmit:
>
> 22:25:26.205499 IP host1.49310 > host2.search-agent: Flags [.]
>
> Timeout:
>
> send: Connection timed out
> Sun Jul 1 22:25:34 EDT 2018
>
> We can see that last retransmit took ~7 seconds. Which pushed the total
> timeout to ~15 seconds instead of the expected 10 seconds. This gets more
> inaccurate the larger the TCP_USER_TIMEOUT value. As the interval increases.
>
> Add tcp_clamp_rto_to_user_timeout() to determine if the user rto has expired.
> Or whether the rto interval needs to be recalculated. Use the original interval
> if user rto is not set.
>
> Test results with the patch is the expected 10 second timeout:
>
> 1st retransmit:
>
> 01:37:59.022555 IP host1.49310 > host2.search-agent: Flags [.]
>
> Last retransmit:
>
> 01:38:06.486558 IP host1.49310 > host2.search-agent: Flags [.]
>
> Timeout:
>
> send: Connection timed out
> Mon Jul 2 01:38:09 EDT 2018
>
> Signed-off-by: Jon Maxwell <jmaxwell37@xxxxxxxxx>
> ---
> net/ipv4/tcp_timer.c | 49 +++++++++++++++++++++++++++++++++++++++----------
> 1 file changed, 39 insertions(+), 10 deletions(-)
>
> diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
> index 3b3611729928..93239e58776d 100644
> --- a/net/ipv4/tcp_timer.c
> +++ b/net/ipv4/tcp_timer.c
> @@ -22,6 +22,38 @@
> #include <linux/gfp.h>
> #include <net/tcp.h>
>
> +u32 tcp_retransmit_stamp(struct sock *sk)

const struct sock *sk;

(To clearly express the fact this helper does not touch the socket)

> +{
> + u32 start_ts = tcp_sk(sk)->retrans_stamp;
> +
> + if (unlikely(!start_ts)) {
> + struct sk_buff *head = tcp_rtx_queue_head(sk);
> +
> + if (!head)
> + return 0;
> + start_ts = tcp_skb_timestamp(head);
> + }
> + return start_ts;
> +}
> +
> +static __u32 tcp_clamp_rto_to_user_timeout(struct sock *sk)

const struct sock *sk

> +{
> + struct inet_connection_sock *icsk = inet_csk(sk);
> + __u32 elapsed, user_timeout;
> + u32 start_ts;

Mixing __u32 and u32 in new code is confusing.

__u32 are needed in uapi include files, not in the C kernel code.

u32 elapsed, user_timeout, start_ts;

> +
> + start_ts = tcp_retransmit_stamp(sk);
> + if (!icsk->icsk_user_timeout || !start_ts)
> + return icsk->icsk_rto;
> + elapsed = tcp_time_stamp(tcp_sk(sk)) - start_ts;
> + user_timeout = jiffies_to_msecs(icsk->icsk_user_timeout);

Note that if we always do jiffies_to_msecs(icsk->icsk_user_timeout) in TCP,
we also could change the convention and store msecs in this field instead of jiffies.

That would eliminate the msecs_to_jiffies() and jiffies_to_msecs() dance.

(That would be done in a patch of its own, of course)

> + if (elapsed >= user_timeout)
> + return 1; /* user timeout has passed; fire ASAP */
> + else
> + return (icsk->icsk_rto < msecs_to_jiffies(user_timeout - elapsed)) ?
> + icsk->icsk_rto : msecs_to_jiffies(user_timeout - elapsed);


return min_t(i32, icsk->icsk_rto, msecs_to_jiffies(user_timeout - elapsed));

> +}
> +
> /**
> * tcp_write_err() - close socket and save error info
> * @sk: The socket the error has appeared on.
> @@ -161,19 +193,15 @@ static bool retransmits_timed_out(struct sock *sk,
> unsigned int timeout)
> {
> const unsigned int rto_base = TCP_RTO_MIN;
> - unsigned int linear_backoff_thresh, start_ts;
> + unsigned int linear_backoff_thresh;
> + u32 start_ts;

This seems a gratuitous change, making your patch bigger than necessary.

u32 and unsigned int are the same essentially.

>
> if (!inet_csk(sk)->icsk_retransmits)
> return false;
>
> - start_ts = tcp_sk(sk)->retrans_stamp;
> - if (unlikely(!start_ts)) {
> - struct sk_buff *head = tcp_rtx_queue_head(sk);
> -
> - if (!head)
> - return false;
> - start_ts = tcp_skb_timestamp(head);
> - }
> + start_ts = tcp_retransmit_stamp(sk);
> + if (!start_ts)
> + return false;

One thing you could do is to make this refactoring in a single patch,
and provide a patch series, with small units, making the review easier.

>
> if (likely(timeout == 0)) {
> linear_backoff_thresh = ilog2(TCP_RTO_MAX/rto_base);
> @@ -535,7 +563,8 @@ void tcp_retransmit_timer(struct sock *sk)
> /* Use normal (exponential) backoff */
> icsk->icsk_rto = min(icsk->icsk_rto << 1, TCP_RTO_MAX);
> }
> - inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, icsk->icsk_rto, TCP_RTO_MAX);
> + inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS,
> + tcp_clamp_rto_to_user_timeout(sk), TCP_RTO_MAX);
> if (retransmits_timed_out(sk, net->ipv4.sysctl_tcp_retries1 + 1, 0))
> __sk_dst_reset(sk);
>
>

Thanks !