Re: maybe a buffers bug? - Re: NTFS module is buggy

From: Steven S. Dick (ssd@nevets.oau.org)
Date: Sat Jun 10 2000 - 00:29:59 EST


Rik van Riel <riel@conectiva.com.br> wrote:
>> find: cannot fork: Cannot allocate memory
>
>This means the kernel cannot find an 8kB contiguous (2 pages)
[...]
>
>The extremely high number of buffers, however, makes me suspect
>that something (the NTFS driver?) makes the buffers unfreeable
>so shrink_mmap cannot succeed in freeing the pages, this would
>cause exactly the memory fragmentation the fork error indicates.
[...]
>However, if try_to_free_buffers() (from fs/buffer.c) fails because
>the filesystem is doing strange things with the buffer state, then
>we'll be in deep shit ...

Here are a couple of possibilities:

1) there's a race condition between NTFS and something else
   that is only triggered during rapid context switches

2) there's a memory leak in NTFS

(and, logically)
3) there's a memory leak in NTFS caused by a race condition :)

The output from md5sum is going to a log file on an NFS partition.
There could even be a race condition between NTFS and NFS I suppose.

In watching this, I've seen several different kinds of failures.
I think I've seen the out-of-memory condition, but the machine reboots
almost instantly when I think I see that, so I'm not sure.

I've seen an outright OOPS with a stack trace that pointed to the EEPRO/100
card (I think it killed interrupt handlers), but that could easily have
been buffer corruption by NTFS. At first, I blamed the EEPRO, but then
it crashed on a 3Com card, on an RTL card, and on a tulip card,
so I doubt the network card is involved.

I've seen a panic/halt generated by the memory allocation system
saying its structures had been corrupted.

I've got 2 labs of 20 machines each with widely varied hardware on a very
busy network. The slower machines seem to not have much of a problem.
Some machines never crash--even a few fast ones. Some machines crash
every time. They seem to not crash on the same file, but frequently
in the same directory. Which ones crash or not does not seem to
be hardware related, as I have machines with identical physical configs
where one crashes every time and the other never crashes.

I strongly suspect a particular access pattern (lots of small files
all in the same directory perhaps) causes this.

And
  find /nt -type f -exec md5sum {} \;
isn't exactly the nicest thing you could do to a disk. :->
Maybe I should try xargs instead just to see if it crashes less.
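
For comparison, here is a rough sketch of the two variants side by side.
It assumes GNU findutils (for -print0, -0 and -r), and the function names
are made up for illustration. The xargs form starts far fewer md5sum
processes, so it should fork much less often; whether that changes the
crash rate is exactly what remains to be tested.

```shell
#!/bin/sh
# Sketch only: function names are hypothetical; pass the mount point
# (e.g. /nt from the post) as the first argument.

# Original approach: find forks and execs one md5sum per file.
checksum_per_file() {
    find "$1" -type f -exec md5sum {} \;
}

# Batched approach: xargs packs many file names into each md5sum call.
# -print0 / -0 keep names containing spaces or newlines intact, and
# -r skips running md5sum at all when no files are found.
checksum_batched() {
    find "$1" -type f -print0 | xargs -0 -r md5sum
}
```

Usage would be something like "checksum_batched /nt" with the output
redirected to the same log file as before. Both functions print the same
lines, so the logs can be compared after sorting.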

>It could be anything, but the fact that I've never witnessed
>such a high amount of buffer pages with ext2 tests makes the
>NTFS driver primary suspect ...

I tend to agree.
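
To see whether the buffer count really climbs during the run, something
like the following could be left running alongside the find. This is only
a sketch: the helper names are invented, and it assumes /proc/meminfo
carries "MemFree:" and "Buffers:" lines.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical.

# Print the MemFree and Buffers lines from a meminfo-style file
# (defaults to /proc/meminfo when no file is given).
meminfo_fields() {
    grep -E '^(MemFree|Buffers):' "${1:-/proc/meminfo}"
}

# Sample those fields every N seconds (default 5) until interrupted,
# printing a timestamp before each sample.
watch_buffers() {
    while :; do
        date
        meminfo_fields
        sleep "${1:-5}"
    done
}
```

Running "watch_buffers 2" in a second terminal while the md5sum pass is
going would show whether Buffers keeps growing while MemFree shrinks
before a crash.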

        Steve



