possible nfsv3 write corruption

From: Pallissard, Matthew
Date: Thu Feb 27 2020 - 11:28:49 EST



Forgive me if this is the wrong list.

Ok, I have this super infrequent data corruption on write that seems to be limited to nfsv3 async mounts. I have not tested nfsv4 yet. I _think_ I've narrowed it down to kernels in the range 5.5.0 > X >= 5.1.4 (possibly starting earlier). Some users reported random data corruption. A bit of testing shows that it's reproducible and that the corruption is nearly identical every time.

I'd like to get to the bottom of this so I can guarantee that a kernel upgrade will resolve the issue.

What winds up happening is that roughly once every several hundred GiB written, the first half of a 64 bit segment ends up corrupted. Here is some example output from a test. My test writes a few Gib, alternating between 64 bits of `0`'s and 64 bits of `1`'s. I then read it back in and check the contents. Re-reading the file shows the same corruption, so it's corrupted on write, not on read.
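For anyone who wants to try reproducing this, here is a minimal sketch of the kind of write-then-verify test described above. It is not the actual test program (that one is compiled with clang); the function names, the chunk size, the choice to start the file with a zero word, and the use of `os.fsync` are all assumptions made for illustration. The byte index it reports is the offset within the word in file order, which may not match the byte numbering in the log output.

```python
import os

WORD = 8                      # bytes in one 64-bit word
ZEROS = b"\x00" * WORD
ONES = b"\xff" * WORD
PAIR = ZEROS + ONES           # assumption: the file starts with a zero word


def write_pattern(path, n_words, chunk_words=1 << 16):
    """Write n_words alternating all-zero / all-one 64-bit words to path."""
    assert n_words % 2 == 0 and chunk_words % 2 == 0
    with open(path, "wb") as f:
        remaining = n_words
        while remaining:
            count = min(chunk_words, remaining)
            f.write(PAIR * (count // 2))
            remaining -= count
        f.flush()
        os.fsync(f.fileno())  # push the data out before verifying


def verify_pattern(path, chunk_words=1 << 16):
    """Return [(word_index, [(byte_index, got, want), ...]), ...]."""
    assert chunk_words % 2 == 0
    full_expected = PAIR * (chunk_words // 2)
    mismatches = []
    base = 0  # index of the first word in the current chunk
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_words * WORD)
            if not chunk:
                break
            expected = full_expected[:len(chunk)]
            if chunk != expected:
                # Only scan word by word when the chunk as a whole differs.
                for off in range(0, len(chunk), WORD):
                    got = chunk[off:off + WORD]
                    want = expected[off:off + WORD]
                    if got != want:
                        bad = [(i, got[i], want[i])
                               for i in range(len(got)) if got[i] != want[i]]
                        mismatches.append((base + off // WORD, bad))
            base += len(chunk) // WORD
    return mismatches
```

Calling `write_pattern(path, n_words)` followed by `verify_pattern(path)` on a file under the NFS mount should return an empty list when nothing was corrupted.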

> 2020-02-14 11:04:34 crit found mis-match on word segment 11911168 / 33554432!
> 2020-02-14 11:04:34 crit found mis-match on byte 7, 188 != 255
> 2020-02-14 11:04:34 crit found mis-match on byte 6, 0 != 255
> 2020-02-14 11:04:34 crit found mis-match on byte 5, 16 != 255
> 2020-02-14 11:04:34 crit found mis-match on byte 4, 128 != 255
> 2020-02-14 11:04:34 crit 1011110000000000000100001000000011111111111111111111111111111111

> 2020-02-14 13:38:11 crit found mis-match on word segment 1982464 / 33554432!
> 2020-02-14 13:38:11 crit found mis-match on byte 7, 188 != 255
> 2020-02-14 13:38:11 crit found mis-match on byte 6, 0 != 255
> 2020-02-14 13:38:11 crit found mis-match on byte 5, 16 != 255
> 2020-02-14 13:38:11 crit found mis-match on byte 4, 128 != 255
> 2020-02-14 13:38:11 crit 1011110000000000000100001000000011111111111111111111111111111111
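To make the relationship between the per-byte lines and the bit string explicit, the following sketch rebuilds the 64-character string from the four reported byte values. It assumes that byte 7 is the most significant byte of the word and is printed first; `render_word` is a hypothetical helper, not part of the test program.

```python
def render_word(bad_bytes, fill=0xFF):
    """Build the 64-character bit string for one 64-bit word.

    bad_bytes maps byte index -> observed value; every other byte is `fill`.
    Assumption: index 7 is the most significant byte, so it is printed first.
    """
    word = [bad_bytes.get(i, fill) for i in range(8)]
    return "".join(f"{word[i]:08b}" for i in range(7, -1, -1))


# Values copied from the "mis-match on byte N" lines in the log above.
observed = {7: 188, 6: 0, 5: 16, 4: 128}
bits = render_word(observed)
print(bits[:32])  # the half built from the four corrupted bytes
print(bits[32:])  # the half built from the four untouched bytes
```

Under that assumption, the first 32 characters come from bytes 7 through 4 (188, 0, 16, 128) and the last 32 are the untouched `1`'s, which is the layout seen in the log.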


Knowns:

* does not appear to happen on CentOS/EL 3.10 series kernel

* does not appear to happen on a 5.5 series kernel
  * I'm re-running all my tests now to confirm this.

* not hardware dependent

* not processor dependent
  * I tested 3 different Intel processors

* appears to only happen on NFS v3 async mounts
  * local disk and `-o sync` NFS v3 mounts have been tested and did not show the corruption

* It happens on random 64 bit segments

* It's *always* the same 4 byte positions (bytes 4-7 in the output above) that are corrupted

* While often identical, the corrupted byte values are not always the same
  * the identical corruption pattern can appear on separate computers.

* It's *always* on words that are written with `1`'s <- this is the part I find most interesting

* whether or not I explicitly call `fflush` and `sync` has no effect on the results.

* usually takes ~80-2000Gib to reproduce; occasionally it takes more or less than that.
  * I've been writing 2GiB files
  * sometimes I never hit the corruption case.

* I've yet to see more than one corrupted segment in a file.
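As a small worked example, the two segment numbers from the log can be converted into byte offsets within the file. This is only a sketch: it assumes the reported segment number is a zero-based count of 8-byte words from the start of the file, and it assumes a 4096-byte page size on the client. `locate` is a hypothetical helper.

```python
WORD_BYTES = 8      # one 64-bit segment
PAGE_BYTES = 4096   # assumption: 4 KiB pages on the client


def locate(word_index):
    """Return (byte offset, page number, offset within page) for a word index."""
    offset = word_index * WORD_BYTES
    return offset, offset // PAGE_BYTES, offset % PAGE_BYTES


# Segment numbers copied from the two log excerpts above.
for idx in (11911168, 1982464):
    offset, page, within = locate(idx)
    print(f"word {idx}: byte offset {offset}, page {page}, offset in page {within}")
```

Under these assumptions both reported segments start exactly on a 4096-byte boundary (byte offsets 95289344 and 15859712). With only two samples that may be a coincidence, but it is cheap to check on every new hit.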


A little bit about the build/run environments:

the software
CentOS 7
CentOS glibc 2.17
clang 9 / lld

the hardware
Dell PowerEdge R620
Dell PowerEdge C6320
Dell PowerEdge C6420
Intel(R) Xeon(R) Gold 6230 CPU @ 2.10GHz
Intel(R) Xeon(R) CPU E5-2660 v4 @ 2.00GHz
Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz

* I did compile locally on every box. I also tested every compiled binary on every box. It didn't seem to affect the results.
* I don't have a tcpdump of this yet. I'm hoping to get that started before the end of the week.
* I read and write to the same file every time, unlinking it before writing again
* I have not tried dropping the cache between any of the steps.
* I have engaged our storage vendor to see what they have to say. They're pretty good at getting useful metrics and insight so if there is anything I should have them gather server-side please let me know.
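Regarding dropping the cache between steps: a minimal sketch of how that could be done between the write and the read is below, so that the verification pass has to fetch the data from the server rather than from the client's page cache. It relies on the standard Linux `/proc/sys/vm/drop_caches` control file and needs root; the function name and its parameters are made up for illustration.

```python
import os


def drop_caches(control_path="/proc/sys/vm/drop_caches", level=3):
    """Flush dirty data, then ask the kernel to drop clean caches.

    level 1 drops the page cache, 2 drops dentries and inodes, 3 drops both.
    control_path is a parameter only so the function can be exercised
    against an ordinary file; on a real system it needs root.
    """
    os.sync()  # dirty pages are not droppable, so write them out first
    with open(control_path, "w") as f:
        f.write(f"{level}\n")
```

The intended use would be to call `drop_caches()` after the file has been written and closed, and before it is opened again for checking.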


If anyone has any insight, or can suggest additional testing I can perform, I would *greatly* appreciate it. I would be thrilled if this turned out to be some dumb configuration option or other operational thing performed incorrectly.


Thank you for your time.

Matt Pallissard
