Re: NFS and kernel 2.6.x

From: Trond Myklebust
Date: Thu Apr 15 2004 - 21:56:05 EST


On Thu, 15/04/2004 at 18:53, Andrew Morton wrote:
> But Charles was seeing good performance with 2.4-based clients. When he
> went to 2.6 everything fell apart.
>
> Do we know why this regression occurred?

What regression??? You have a statistic of 1 person whose 3 clients
changed from what was an apparently working setup to what has *always*
been the usual scenario for most people that tried to use the same
broken hardware/software combination whether it be in 2.2.x, 2.4.x or
2.6.x.

The whole problem is that UDP provides unreliable transport... It offers
NO guarantees that the packet will arrive at the destination.
If only 1 fragment out of the 22 that it takes to send a single
wsize=32k write request to the Sun server gets lost on the way, the
Sun's networking layer will ignore that entire packet, and so the whole
write has to time out and get resent.
Switches can usually cache a few fragments if the clients on the 100Mbit
network are sending requests at a rate that almost matches the 10Mbit
bandwidth that the Sun server supports, but if the network is swamped so
that the switch runs out of cache, then it will start to drop packets.

This is the whole reason why Sun set TCP to be their default mount
option when they changed their servers to use 32k read/write.

My biggest suspect for why this particular setup changed in 2.6.x would
therefore be the changes to the way in which writes are scheduled on the
wire. We cache them for longer, and so overall the bandwidth usage goes
down, but at the expense of more "burstiness" when the user closes the
file or does some other fsync()-like operation.



So in fact you have 2 possible workarounds:

- Use the TCP mount option (by far the better option, since TCP *does*
provide reliable transport).
- Keep UDP, but use the wsize mount option to explicitly override the
server's choice of write sizes. That works by reducing the number of
fragments per write, and so improving performance by reducing the amount
of data that need to be resent per fragment lost.
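
The effect of the second workaround can be sketched as follows. This is an
estimate under stated assumptions, not a model of the real client: it assumes
a 1500-byte Ethernet MTU, IPv4 without options, independent fragment loss,
and it ignores the RPC/NFS header by default, so fragment counts are
approximate (it gives 23 for 32k, close to the 22 quoted earlier). All names
and the 1% loss rate are illustrative.

```python
import math

IP_HEADER = 20   # bytes, IPv4 header without options (assumption)
UDP_HEADER = 8   # bytes


def fragments_per_write(wsize: int, mtu: int = 1500, rpc_overhead: int = 0) -> int:
    """Approximate number of IP fragments needed for one UDP write of wsize bytes."""
    # Fragment payloads must be a multiple of 8 bytes.
    payload_per_fragment = (mtu - IP_HEADER) // 8 * 8
    return math.ceil((wsize + UDP_HEADER + rpc_overhead) / payload_per_fragment)


def resend_overhead(wsize: int, fragment_loss_rate: float, mtu: int = 1500) -> float:
    """Average number of extra full resends per write (0.0 means none)."""
    if not 0.0 <= fragment_loss_rate < 1.0:
        raise ValueError("fragment_loss_rate must be in [0, 1)")
    n = fragments_per_write(wsize, mtu)
    p_ok = (1.0 - fragment_loss_rate) ** n
    return 1.0 / p_ok - 1.0


if __name__ == "__main__":
    LOSS = 0.01  # hypothetical per-fragment loss rate
    for wsize in (32768, 16384, 8192, 4096, 1024):
        n = fragments_per_write(wsize)
        extra = resend_overhead(wsize, LOSS)
        print(f"wsize={wsize:5d}  fragments={n:2d}  extra resends per write={extra:.3f}")
```

Smaller wsize values (for example wsize=8192) cut the fragments per write and
therefore the resend overhead, at the cost of more RPC calls per megabyte
written. TCP avoids the trade-off because it retransmits only the lost
segment.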


Cheers,
Trond