NFS Data CORRUPTION Between Linux and SunOS 5.5.1

Ben McCann (bmccann@indusriver.com)
Thu, 17 Sep 1998 13:58:55 -0400


This is a multi-part message in MIME format.
--------------7C0C6B2DC3EC35AC68E8F1B0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit

I've attached mail which I posted to the linux kernel mailing list
about a month ago. I have some additional information to report
which I hope you can forward to the appropriate Linux NFS guru.

We see NFS data corruption between a Linux NFS client and a SunOS
NFS server. It occurs when running 'ld' which, I assume, does
extensive random access to the file. Under 2.1.102, our test case
fails on almost EVERY link with 'ld'. (BTW, it works fine with
2.1.84.)

I was unable to reexamine this problem until this week, and further
testing of 2.1.102 seemed silly given that 2.1.121 has been
released. So, I've retested with 2.1.121 compiled for both UP and
SMP. The problem is MUCH better, but it still occurs. I ran 'ld'
over our test set of objects, writing the final executable to an
NFS mounted file system. I had 3 failures in 120 trials.
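
For anyone who wants to repeat this kind of trial count, here is a
minimal sketch of a harness. It assumes the linker output is
byte-for-byte deterministic and that a known-good reference executable
(for example, one linked to a local disk) is available. The names
count_failures, link_cmd, output_path and reference_path are
hypothetical and not part of the original test setup. Note that the
check reads the file back through the same NFS client, so a cached
page could hide or show corruption differently than the server copy.

```python
import hashlib
import subprocess


def sha256_of(path):
    """Hash a file in 1 MB chunks so large executables are not read at once."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def count_failures(link_cmd, output_path, reference_path, trials):
    """Run link_cmd `trials` times and count outputs that differ from the reference.

    link_cmd is an argument list (hypothetical; supply your own 'ld' invocation)
    that is expected to rewrite output_path on every run.
    """
    expected = sha256_of(reference_path)
    failures = 0
    for _ in range(trials):
        subprocess.run(link_cmd, check=True)
        if sha256_of(output_path) != expected:
            failures += 1
    return failures
```

A run of 120 trials would then be count_failures(cmd, out, ref, 120),
with cmd, out and ref filled in for the local build.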

As before, the corruption always starts exactly on a 4K boundary
in the file. It takes a 4K block of the file and shifts it down
in memory by 1, 2, or 3 bytes, inserting zeros at the beginning
of that page.
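
The failure pattern described above can be checked for mechanically.
The following is a rough sketch, not a tool used in the original
testing: it compares a known-good copy of the executable against a
suspect copy, 4K page by 4K page, and reports pages that look like
the good page shifted down by 1 to 3 bytes with zeros at the front.
The function name find_shifted_pages is hypothetical. It assumes the
bytes pushed off the end of the page are simply lost, which the
observations above do not confirm, so it only compares data within
each page.

```python
PAGE = 4096


def find_shifted_pages(good, bad, page=PAGE, max_shift=3):
    """Return (offset, shift) for each page of `bad` that matches the same
    page of `good` shifted down by 1..max_shift bytes with leading zeros."""
    hits = []
    for off in range(0, min(len(good), len(bad)), page):
        g = good[off:off + page]
        b = bad[off:off + page]
        if g == b:
            continue  # page is intact
        for shift in range(1, max_shift + 1):
            zeros_at_front = b[:shift] == bytes(shift)
            rest_matches = b[shift:] == g[:len(b) - shift]
            if len(b) > shift and zeros_at_front and rest_matches:
                hits.append((off, shift))
                break
    return hits
```

To use it, read both files in binary mode, for example
open(path, "rb").read(), and pass the two byte strings in. Pages that
differ in some other way are not reported.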

I read on the list that substantial cleanup has occurred in the
IP and NFS areas in the last 20 point releases. They've helped.

Can those developers look at those changes, and this failure
mode, to guess where they might have missed one more fix?

-Ben McCann

-- 
Ben McCann                              Indus River Networks
                                        31 Nagog Park
                                        Acton, MA, 01720
email: bmccann@indusriver.com           web: www.indusriver.com 
phone: (978) 266-8140                   fax: (978) 266-8111
--------------7C0C6B2DC3EC35AC68E8F1B0
Content-Type: message/rfc822
Content-Transfer-Encoding: 7bit
Content-Disposition: inline

Return-Path: <bmccann@indusriver.com>
Received: from indusriver.com (209.6.112.94)
        by mcfeeley.indusriver.com (Worldmail 1.3.167);
        13 Aug 1998 18:12:48 -0400
Message-ID: <35D364D2.D3662ECD@indusriver.com>
Date: Thu, 13 Aug 1998 18:12:34 -0400
From: Ben McCann <bmccann@indusriver.com>
X-Mailer: Mozilla 4.05 [en] (Win95; I)
MIME-Version: 1.0
To: "linux-kernel@vger.rutgers.edu" <linux-kernel@vger.rutgers.edu>
Subject: NFS Data CORRUPTION Between Linux and SunOS 5.5.1
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit

We use Linux 2.1.x for software development where Linux workstations NFS mount filesystems on a Sun UltraSparc server. The Ultra runs SunOS 5.5.1.

We ran 2.1.84 with no problems. We recently upgraded our build environment to 2.1.102. (We've been using 2.1.102 in application testing for a couple of months, so we decided it was stable enough to use for compiling and linking too.)

Immediately after upgrading, we noticed that our executable files were corrupted during the link phase of a build. Remember that the objects and the executable are all stored on the UltraSparc server. If we link under Linux 2.1.84 there is no corruption; if we link under 2.1.102 there IS corruption.

====> I've repeated this with 2.1.115 so the bug is still alive
====> in the latest edition of the kernel.

This is a very puzzling bug. We do NOT see corruption when we link directly to the local hard drive and we don't see corruption when we NFS mount another 2.1.102 Linux box and link on its file system.

The only corruption occurs when running 'ld' under 2.1.102 (or 2.1.115) and writing the executable to a SunOS 5.5.1 NFS server. (BTW, we are using GNU ld version 2.8.1 (with BFD linux-2.8.1.0.1).)

I can spend some time helping with a 'remote debug' of this problem if there are tools, logs, debug switches, etc., that can be used to gather data here. I also have a set of objects which I can probably ship to a Linux developer to reproduce this bug. He/she just needs a SunOS box handy. Alternatively, the NFS/TCP/UDP developers can try to track the source differences between 2.1.84 and 2.1.102.

IMHO, it's a serious problem which needs attention.

-Ben McCann

-- 
Ben McCann                              Indus River Networks
                                        31 Nagog Park
                                        Acton, MA, 01720
email: bmccann@indusriver.com           web: www.indusriver.com 
phone: (978) 266-8140                   fax: (978) 266-8111

--------------7C0C6B2DC3EC35AC68E8F1B0--
