More epic100 problems in 2.2.15pre2

From: Mark Hagger (mhagger@dera.gov.uk)
Date: Tue Jan 18 2000 - 04:42:01 EST


I'm having problems with the network dropping out under high memory and network
load using an SMC9432 card. At this point the machine is swapping pretty
heavily and doing a fair bit of I/O buffering.

What happens is that everything is chugging along happily (this is a block of 8
dual-processor machines, each with 512 MBytes of RAM and another 1 GByte of
swap). During an operation that raises the memory requirements on the machines
and initiates a lot of network traffic, one of the machines (pretty much at
random) will suddenly drop off the network.

A closer investigation reveals that the machine is unable to allocate any more
memory: console logins fail with "init: unable to fork". Inspecting the memory
status using the magic-sysrq 'm' shows that pretty much all the memory is in
use, along with about half the swap space. The most interesting part is the
network statistics, which show something like:

Network buffers in use: 872
Total network allocations: 1412505
Total failed network buffer allocs: 567

The last number keeps increasing every now and then.
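
To put those figures in proportion, here is a minimal Python sketch that parses
the three lines quoted above and works out what fraction of network buffer
allocations failed. The helper names (parse_net_stats, failed_fraction) are
made up for illustration, and the sample text is just the numbers from this
message, not a general parser for magic-sysrq output.

```python
import re

# Sample text copied from the figures quoted in this message.
SAMPLE = """\
Network buffers in use: 872
Total network allocations: 1412505
Total failed network buffer allocs: 567
"""


def parse_net_stats(text):
    """Map each 'label: number' line to its integer value (hypothetical helper)."""
    stats = {}
    for line in text.splitlines():
        match = re.match(r"\s*(.+?):\s*(\d+)\s*$", line)
        if match:
            stats[match.group(1)] = int(match.group(2))
    return stats


def failed_fraction(stats):
    """Fraction of all network allocations that failed."""
    total = stats["Total network allocations"]
    failed = stats["Total failed network buffer allocs"]
    return failed / total if total else 0.0


stats = parse_net_stats(SAMPLE)
print(f"failed allocations: {failed_fraction(stats):.4%}")
```

The fraction is tiny (roughly 0.04% of allocations), so the absolute count
matters less than the fact that it keeps climbing while the machine is stuck.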

Presumably, as memory has got tight, we've reached a state where memory can no
longer be allocated for the kernel and the network stalls. My understanding is
that this shouldn't happen; at worst my userland programs should be killed to
free up some memory.

At this point I'm out of my depth, but if I send SIGTERM to processes using the
magic-sysrq then the network does come back to life.

I did try one more option, which was to increase the values in
/proc/sys/vm/freepages (from 256 512 768 to 1024 2048 3072). I still got the
same hang-up, but this time the machine would go completely dead, with a blank
screen and no response to magic-sysrq keys.
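
For anyone wanting to repeat the freepages change, a rough Python sketch
follows. It assumes the file holds three space-separated page counts in the
order min, low, high (as on 2.2 kernels; later kernels do not have this file),
and that it is run as root. The function names are hypothetical. The
pages-to-KiB conversion assumes 4 KiB pages, which is a guess for these PCs.

```python
FREEPAGES_PATH = "/proc/sys/vm/freepages"  # present on 2.2 kernels only
PAGE_KIB = 4  # assumption: 4 KiB pages


def format_freepages(min_pages, low_pages, high_pages):
    """Build the three-number line; assumes the order is min, low, high."""
    if not (0 < min_pages <= low_pages <= high_pages):
        raise ValueError("expected 0 < min <= low <= high")
    return f"{min_pages} {low_pages} {high_pages}\n"


def read_freepages(path=FREEPAGES_PATH):
    """Return the current thresholds as a tuple of ints."""
    with open(path) as f:
        return tuple(int(field) for field in f.read().split())


def write_freepages(values, path=FREEPAGES_PATH):
    """Write new thresholds (needs root when pointed at /proc)."""
    with open(path, "w") as f:
        f.write(format_freepages(*values))


def pages_to_kib(pages):
    """Convert a page count to KiB under the 4 KiB page assumption."""
    return pages * PAGE_KIB


# The old and new settings from this message, shown in KiB.
for label, values in (("old", (256, 512, 768)), ("new", (1024, 2048, 3072))):
    print(label, [pages_to_kib(v) for v in values], "KiB")
```

Under that page-size assumption the change raises the minimum free reserve
from 1 MiB to 4 MiB, which is still small next to 512 MBytes of RAM.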

I know I've said this before, and I suspect that I'll say it again, but in my
humble opinion pretty much the only part of the Linux 2.2.x kernel that is not
rock solid is the network drivers (although to be fair I only have experience
of the eepro100 and epic100 drivers). This is a bit unfortunate, since many of
us are moving towards "Beowulf" type systems (with large numbers of dual board
PCs - I have 16 such machines on a single network) and using
parallel/distributed processing techniques to achieve supercomputer-like
performance in numerical applications. Sadly, my experience is that if I hit
the machines with high CPU/memory load via computations AND high network
traffic, then the machines either lock up, fall over or occasionally produce a
kernel oops. However, high CPU/memory load by itself has never been sufficient
to cause problems. (Of course I realise that the problem above may not be
directly related to the network drivers.)

Mark
