Re: Hard Hang with __alloc_pages: 0-order allocation failed (gfp=0x20/1)- Not out of memory
From: Doug Dumitru
Date: Wed May 26 2004 - 13:55:21 EST
Marcelo Tosatti wrote:
On Tue, May 25, 2004 at 04:12:12PM -0700, David S. Miller wrote:
On Tue, 25 May 2004 15:26:30 -0700
Doug Dumitru <doug@xxxxxxxxxx> wrote:
This is the original trap dump from a __page_alloc error
__alloc_pages: 0-order allocation failed (gfp=0x20/1)
0x20 means GFP_ATOMIC which means it's fine to fail
and e1000 is doing nothing wrong. GFP_ATOMIC in interrupts
is a fine condition.
Yeap, but the crash is not a fine condition... I suspect
what can be happening is extreme gigabit traffic resulting in
memory shortage.
Doug said the load average is really high. Doug, you're not
using NAPI, right? Can you try it?
Prior to the __page_alloc hang, the loadavg shoots way up, so something
is spinning, but it is hard to tell what. This has persisted for as
long as 8-10 minutes on one hang, although it is usually shorter (1-2
minutes). One of my concerns is that the e1000 issue might only be a
symptom of the page tables getting clobbered by something else. I have
been trying to get the system to hang during more "controlled" usage,
but have been unable to. I have even run tsl (telnet scripting
language) scripts to logon 250 processes and beat the CPU and disk up,
creating and destroying processes along the way. I was able to drive
loadavg > 50 and LowFree to < 5000K, but could never create a hang. I
suspect that I might need truely "random" inbound traffic to find the
bug (but this is a guess).
In terms of network traffic, the system is busy, but not obnoxiously so.
The load on the server is primarily terminal traffic from about 200
"real humans", so there are a lot of small, random packets. In terms of
network bandwidth, it is not all that bad, maybe 2-3 megabits (a guess).
The arp table is reasonably big (> 200 entries) but this is not that
bad either. I have not looked for arp storms or other network anomolies
on the LAN. The system is on a local LAN and gets no internet traffic.
I am unfamiliar with NAPI, so I have not tried it.
On another topic, I am trying to build a 2.4.26 kernel that reserves
more LowFree. The mm/page_alloc.c file describes a boot parameter
called "lower_zone_reserve" that should tune this. Unfortunately, this
parameter seems to be read after the zone tables are initialized (which
is probably a bug).
--
--------------------------------------------------------------------
Doug Dumitru 800-470-2756 (610-237-2000)
EasyCo LLC doug@xxxxxxxxxx http://easyco.com
--------------------------------------------------------------------
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/