Hello Alan,
I know you finally found and fixed the bridging code leak somewhere in
the recent 2.0.36pre series or just before. But I haven't been able to
figure out what the fix was, by inspection. I would be deeply grateful
if you could tell me which line or lines were changed. I think some
intrepid soul managed to find the line that did the damage using the
memleak patches.
I first saw the leak on a server I had running 2.0.33: a P100 with a 3c905
and buslogic fast and wide. It went down in a week serving NFS. You
steered me to the cause and workaround.
I disabled bridging on the kernel (it only had one card) and it became
stable as a rock.
At the same time - several months ago now - I took the kernel and put it
in the server next door to it, P200 with a 3c900 and adaptec fast and
narrow. That has been stable as anything. Same binary kernel. Not
much NFS load.
Now I (by mistake) took the same binary kernel and put it in a PP200
serving heavy NFS through a single 3c905 on a 100BT net with adaptec
fast and wide scsi. That went down in 24 hours with all 128M of its
memory used up - no user space usage to speak of. It was running
mrouted and an mbone tunnel when it died.
It was clearly "network buffer" leakage. But I tried _enabling_ its
dormant bridging code near the end, and it went down in 20 minutes.
When it came back up fresh I tried the enable again, and it went down
again in about 10 mins while I watched - with lots of network pauses.
It's now looking stable on a recompiled kernel without bridging, same
configuration otherwise.
I would love to have just the fix for this as a patch. I know that's
too much to ask, so I'm asking just to be clued in on the eureka that
solved this.
Thank you if you can manage that ...
Peter