Re: kernel 2.0.30

Hubert Mantel (mantel@suse.de)
Fri, 2 May 1997 14:32:14 +0200 (MEST)


On Fri, 2 May 1997, Chel van Gennip wrote:

> Andreas Degert wrote:
> >I also suspect there is at least one bug lurking in 2.0 kernels that
> >is causing memory corruption. Some time ago on one of our servers the

[...]

> A problem occurring only once a month is horrible.

True!

> I think it is too early to exclude hardware problems.

No, at least for the sig7 problem we can _prove_ with a little test patch
that it is definitely NOT a hardware problem!

I have encountered a few errors that occur only very rarely:

- dircolors hanging in D-state after logging in (needs reboot)
- gcc saying: internal compiler error: Argument list too long. Restarted
  and everything was fine.
- xosview dying with "Cannot open /proc/net/ip_acct"; the file exists,
  xosview had run before and could be restarted without problems

The first error happened a couple of times; the other two have been
observed only once so far. They happened on a really heavily loaded
machine (but with enough VM available).
I thought these were hardware problems or incorrect setup, but as other
people are reporting similar strange problems (especially the "process in
D-state" problem) I assume there is a real problem in the kernel.

[...]

> I think not all 2.0 kernels suffer from a strange bug. Until now I
> have not encountered such a problem. One machine is running 2.0.0
> since july 1996 without problems. At this moment it is up for more
> than 3 months (we had to move the box some months ago)

We also have machines with long uptimes running linux 2.0.xx. This makes
it even harder to find the error. It is triggered only very rarely. We
don't know the circumstances that lead to the problems...

Zuse:/root # uname -a
Linux Zuse 2.0.28 #3 Wed Feb 5 13:33:57 CET 1997 i486
Zuse:/root # uptime
2:15pm up 48 days, 23:43h, 1 user, load average: 0.05, 0.04, 0.00

The problems never lead to a kernel panic. Uptimes may be high, but
processes dying is bad.

> Chel

Hubert mantel@suse.de

PS: If you get Signal 7 when compiling the kernel, apply this patch and the
problem will go away. The patch is not the correct solution for the
problem, but it proves that it is not a hardware problem. Without the
patch, the process is sent signal 7. With this patch, schedule() is called
and after that, __get_free_page does not fail again.
I get the error about once per hour on a machine that has been running two
concurrent "make -j" in an endless loop for four days now. But instead of
the compiler getting SIG7 I get the line "Ingo's Patch was..." in the syslog.

--------------------------------------------------------------------------
diff -urN linux-2.0.30/mm/memory.c linux-2.0.30-test/mm/memory.c
--- linux-2.0.30/mm/memory.c Wed Sep 11 16:57:19 1996
+++ linux-2.0.30-test/mm/memory.c Tue Apr 29 19:33:23 1997
@@ -927,7 +927,20 @@
 anonymous_page:
 	entry = pte_wrprotect(mk_pte(ZERO_PAGE, vma->vm_page_prot));
 	if (write_access) {
-		unsigned long page = __get_free_page(GFP_KERNEL);
+		/*
+		 * this is a totally incorrect patch, as the problem
+		 * is elsewhere
+		 */
+		int count=100;
+		unsigned long page;
+repeat:
+		page = __get_free_page(GFP_KERNEL);
+		if (!page && count) {
+			printk ("Ingo's Patch was triggered with count = %d\n", count);
+			count--;
+			schedule();
+			goto repeat;
+		}
 		if (!page)
 			goto sigbus;
 		memset((void *) page, 0, PAGE_SIZE);
--------------------------------------------------------------------------
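
The retry logic of the test patch can also be tried outside the kernel.
The following is only a userspace sketch of the same pattern, not kernel
code: fake_get_free_page() is a hypothetical stand-in for
__get_free_page() that fails a configurable number of times,
sched_yield() takes the place of schedule(), and SKETCH_PAGE_SIZE is an
assumed page size.

```c
#include <assert.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096	/* assumed page size, illustration only */
#define MAX_RETRIES 100		/* same bound as "count" in the patch */

/* Hypothetical stand-in for __get_free_page(): fails failures_left times. */
static int failures_left = 3;

static void *fake_get_free_page(void)
{
	if (failures_left > 0) {
		failures_left--;
		return NULL;
	}
	return malloc(SKETCH_PAGE_SIZE);
}

/*
 * Try to allocate a page; on failure, yield the CPU and try again,
 * at most MAX_RETRIES times. Stores the number of retries that were
 * needed in *retries_used. Returns a zeroed page, or NULL if every
 * attempt failed (where the kernel code would "goto sigbus").
 */
static void *get_zeroed_page_retry(int *retries_used)
{
	int count = MAX_RETRIES;
	void *page;

	assert(retries_used != NULL);
	for (;;) {
		page = fake_get_free_page();
		if (page || !count)
			break;
		fprintf(stderr, "retry triggered with count = %d\n", count);
		count--;
		sched_yield();	/* userspace analogue of schedule() */
	}
	*retries_used = MAX_RETRIES - count;
	if (!page)
		return NULL;
	memset(page, 0, SKETCH_PAGE_SIZE);
	return page;
}
```

As with the patch itself, this only papers over a transient allocation
failure; it says nothing about why the allocation failed in the first
place.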