2.1.127 Problem summary

Simon Kirby (sim@netnation.com)
Mon, 9 Nov 1998 19:00:10 -0800 (PST)

Next message: Tom Vier: "Re: A patch for linux 2.1.127"
Previous message: Chris Wedgwood: "Re: halted system still working as router."

I stuck 2.1.127 on our primary mail server to see how it handled the load
(it currently runs 2.1.123 + patches). It ran perfectly for about 10
minutes and then suddenly the disk i/o almost froze as I was running
"vmstat" on one of the consoles. The number of "running" processes went
up from 10 to 20 to 30...to 280...and the disk i/o seemed to be "plugged"
with small spurts of activity, both read and write, as the process count
went up. Some of the login sessions were frozen, but I managed to get a
"ps" from one of them, and the extra processes were just new pop3 logins
starting, which is what would happen if the /var/spool disk died. It felt as if
the aic7xxx driver was having periodic timeouts, but a "dmesg" showed no
messages from the driver. I went back to 2.1.123 for now. I'll try again
tomorrow morning when nobody's looking. ;)

I think the problem may be related to very-high-load situations. However,
I took the same kernel to another, unused machine and ran:

repeat 220 sh -c 'head -c50m /dev/zero > foo &'

(which runs 220 simultaneous "head -c50m /dev/zero > foo"s, creating quite
a bit of load). It survived and finished quite quickly, so the problem
must only show up under particular conditions.
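
For machines without a shell that has "repeat", here is a rough Python
sketch of the same kind of load. It is an approximation, not the command
above: it assumes "50m" means 50 MiB, it uses threads rather than separate
"head" processes, and the names write_zeros and hammer are made up for
illustration.

```python
from concurrent.futures import ThreadPoolExecutor

CHUNK = 1 << 20  # write in 1 MiB pieces


def write_zeros(path, nbytes):
    """Write nbytes of zeros to path, truncating it first (like `> foo`)."""
    zeros = bytes(CHUNK)
    written = 0
    with open(path, "wb") as f:
        while written < nbytes:
            n = min(CHUNK, nbytes - written)
            f.write(zeros[:n])
            written += n
    return written


def hammer(path, writers=220, nbytes=50 * 1024 * 1024):
    """Run `writers` concurrent writers that all target the same file."""
    with ThreadPoolExecutor(max_workers=writers) as pool:
        futures = [pool.submit(write_zeros, path, nbytes) for _ in range(writers)]
        return [f.result() for f in futures]
```

Calling hammer("foo") with the defaults mirrors the 220 x 50 MiB run, so
only do that on a scratch machine; smaller values of writers and nbytes
are enough to check that it works.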

---

One other small problem, which I'm guessing is related to the jiffies
change: the aic7xxx driver spits out a timeout for scsi pid 0 the instant
it initializes. I'm guessing this is the first command it sends out, and
its timeout in jiffies is initialized to (current_jiffies + timeout_time)
instead of just timeout_time (or something).

---

Also, the machine spat this out shortly after bootup:

Nov 9 08:49:29 peace kernel: schedule_timeout: wrong timeout value fffcb962 from c012c59a
Nov 9 08:50:09 peace kernel: schedule_timeout: wrong timeout value fffcbd0d from c012c59a
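
Both logged values have the top bit set, so on a 32-bit machine they are
negative when read as a signed long. A small Python sketch of that
conversion, assuming the kernel prints the timeout as a 32-bit value (the
helper name as_signed_long32 is made up for illustration):

```python
def as_signed_long32(value):
    """Reinterpret an unsigned 32-bit value as a signed 32-bit long."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value & 0x80000000 else value


# The two timeout values from the log lines above.
for raw in ("fffcb962", "fffcbd0d"):
    print(raw, as_signed_long32(int(raw, 16)))
# prints:
# fffcb962 -214686
# fffcbd0d -213747
```

That would fit a caller computing a timeout from jiffies and ending up
with a negative number, though the log lines alone do not prove it.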

System.map:

c012c348 t max_select_fd
c012c3ec T do_select
c012c5e8 T sys_select   (c012c59a)
c012ca74 t do_poll
c012cb70 T sys_poll
c012ccb8 t fifo_open
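
For anyone wanting to repeat the lookup, here is a minimal Python sketch
that resolves an address against these System.map entries. It assumes
each entry gives a symbol's start address and that an address belongs to
the nearest symbol at or below it; parse_map and resolve are made-up
helper names.

```python
import bisect

SYSTEM_MAP = """\
c012c348 t max_select_fd
c012c3ec T do_select
c012c5e8 T sys_select
c012ca74 t do_poll
c012cb70 T sys_poll
c012ccb8 t fifo_open
"""


def parse_map(text):
    """Return a sorted list of (start_address, name) pairs."""
    symbols = []
    for line in text.splitlines():
        addr, _kind, name = line.split()
        symbols.append((int(addr, 16), name))
    symbols.sort()
    return symbols


def resolve(symbols, addr):
    """Find the symbol with the highest start address that is <= addr."""
    starts = [start for start, _ in symbols]
    i = bisect.bisect_right(starts, addr) - 1
    if i < 0:
        return None  # address lies below the first symbol
    start, name = symbols[i]
    return name, addr - start


name, offset = resolve(parse_map(SYSTEM_MAP), 0xC012C59A)
print(f"{name}+{offset:#x}")  # prints do_select+0x1ae
```

By that rule c012c59a lands in the select path, 0x1ae bytes past the
start of do_select and just before sys_select begins.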

Looks like a small jiffies problem again.

---

The machine it died on does:

- POP3 and IMAP hosting (~7000 accounts, ~2 new logins/sec)
- DNS for over 11,000 zones (which takes a while to load :))
- Small services web server (handling of service orders via CGI, etc.)

The machine it is running on has:

Intel P2 233MHz (UP)
256MB SDRAM (2 x 128MB)
4 x 4.4GB UW SCSI
1 x 2.1GB SCSI
ASUS P2L97-S mainboard w/onboard AIC7880U(p) Rev. 1
Intel EtherExpress Pro 100B/+ (82557)

Kernel 2.1.127, compiled w/gcc 2.7.2.3 and binutils 2.9.1.0.15.
Patched with the small patch Linus posted a short while ago, which makes
the function that gets more memory stop based not just on time but also
on the amount of free memory available.

Simon-

| Simon Kirby               |   Systems Administration |
| mailto:sim@netnation.com  | NetNation Communications |
| http://www.netnation.com/ |     Tech: (604) 684-6892 |



-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/
