Re: NFS and kernel 2.6.x

From: Jamie Lokier
Date: Mon Apr 19 2004 - 19:11:53 EST

Trond Myklebust wrote:
> > I agree, but would still prefer more consistent behaviour if it is
> > easy -- and I explained how to do it, it's an easy algorithm.
> The reason I don't like it is that it continues to tie the major timeout
> to the resend timeouts. You've convinced me that they should not be the
> same thing.

Sorry, I don't understand that paragraph.

The algorithm I suggested _decouples_ the major timeout from the rtt
estimate. Your algorithm strongly couples them. I'm not sure what
you mean by saying the major timeout is "tied to the resend timeouts".

Your current (patched) algorithm sets the major timeout to be in the range:

[timeo << retrans, (timeo << retrans) * 2]

The suggested algorithm sets the major timeout to be in the range:

[timeo << (retrans+1), (timeo << (retrans+1)) + 2 * timeo)

I.e. with retrans set to a new default of 5 (I think that's useful),
the major timeout is approx [44.8, 46] seconds instead of [22.4, 44.8].
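
As a rough check of those numbers, here is a minimal C sketch that
evaluates both ranges. It assumes timeo is held in milliseconds and is
700 (0.7s), which is what the figures above imply; the struct and the
function names are hypothetical and are not taken from the sunrpc code.

```c
/* Bounds of the major timeout, in the same units as timeo. */
struct timeout_range {
    unsigned long lo;
    unsigned long hi;
};

/* Patched algorithm: [timeo << retrans, (timeo << retrans) * 2] */
static struct timeout_range patched_range(unsigned long timeo,
                                          unsigned int retrans)
{
    struct timeout_range r;

    r.lo = timeo << retrans;
    r.hi = (timeo << retrans) * 2;
    return r;
}

/* Suggested: [timeo << (retrans+1), (timeo << (retrans+1)) + 2 * timeo) */
static struct timeout_range suggested_range(unsigned long timeo,
                                            unsigned int retrans)
{
    unsigned long base = timeo << (retrans + 1);
    struct timeout_range r;

    r.lo = base;
    r.hi = base + 2 * timeo;  /* exclusive upper bound */
    return r;
}
```

With timeo = 700 and retrans = 5, the patched range spans 22400 ms and
the suggested one spans 1400 ms.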

I agree it's not the most important thing in the world, but it is nice
to be able to fix the parameters and say that with the defaults, major
timeout happens after about 45 seconds.

You say you don't like it because major timeout is still tied to
something. Could you explain what the ideal behaviour you have in
mind is? Right now, with the patch, I think your intention is to have
a fixed major timeout time, but it doesn't work like that.

> The other reason is that it only improves matters for the first request.
> Once we reset the RTO, all those other outstanding requests are anyway
> going to see an immediate discontinuity as their basic timeout jumps
> from 1ms to 700ms.

Yes, that's the point: after the number of retransmits passes a
threshold, we should no longer depend on the RTO estimate because it
doesn't seem to be reliable.

> So why go to all that trouble just for 1 request?

Because it's visible behaviour with "soft" mounts. Someone unplugs
the cable or the network is down, and you see the I/O errors after
about 40 seconds. This is nicer than seeing them after an unknown
period between 40 and 80 (or 20 and 40 depending on your settings).

> It was partly another "consistency" issue that initially worried me,
> partly in order to avoid problems with overflow:
> If you have more than one outstanding request, then those that get
> scheduled after the first major timeout (when we reset the RTO
> estimator) will see a "jump". If the "retries" variable is too large,
> they will either jump straight over 60 seconds, and thus trigger the cap
> or they will end up at zero due to 32-bit overflow.

Ah. So you keep track of the number of retries per request, and each
time you send a request you set its timeout to (RTO << retries)?

If you do, maybe that's why my algorithm seems overcomplicated, and
you're concerned about overflows etc.

Instead of counting retries, don't. You don't need a per-request
retries counter. Instead: keep track of the request_timeout when the
request was last issued. When retransmitting, compare that value
against the global value (timeo << retrans). When a request times out
and request_timeout >= (timeo << retrans), that's a major timeout.
Otherwise you just check if request_timeout < timeo. If yes, double
it. If no, set request_timeout = timeo << N for the smallest integer
N such that it's an increase. And try again.

Notice how that logic is based on constants: it's independent of RTO,
and so outstanding requests aren't affected by changes in RTO.
There's no jump, no overflow, and you can compute the key constant
(timeo << retrans) when initialising: retrans isn't used by itself.
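
The per-request step described above could be sketched in C roughly as
follows. This is only an illustration of the idea, not code from the
RPC client: the struct, its field names and the function name are
hypothetical, and it assumes timeo is non-zero and that
(timeo << retrans) fits in an unsigned long.

```c
#include <stdbool.h>

/* Constants fixed at mount time (hypothetical names). */
struct timeout_params {
    unsigned long timeo;        /* base timeout */
    unsigned long major_limit;  /* timeo << retrans, computed once */
};

/*
 * Called when a request times out.  *request_timeout is the timeout
 * that was used when the request was last issued.
 *
 * Returns true if this is a major timeout.  Otherwise stores the
 * timeout to use for the retransmission and returns false.
 */
static bool next_request_timeout(const struct timeout_params *p,
                                 unsigned long *request_timeout)
{
    unsigned long t = *request_timeout;

    if (t >= p->major_limit)
        return true;            /* major timeout */

    if (t < p->timeo) {
        /* Still below timeo: plain doubling (0 is bumped to 1). */
        t = t ? t * 2 : 1;
    } else {
        /* Smallest timeo << N that is an increase over t. */
        unsigned long next = p->timeo;

        while (next <= t)
            next <<= 1;
        t = next;
    }

    *request_timeout = t;
    return false;
}
```

Since t < major_limit whenever the loop runs, next never exceeds
(timeo << retrans), so there is no shift count to overflow and no
reference to the current RTO.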

-- Jamie
