Re: new processes very slow on otherwise responsive system (2.1.

Jim Bauer (jfbauer@home.com)
Mon, 16 Nov 1998 01:04:38 -0500 (EST)


I tried several different things to track down this problem.
My results are posted below.

On 14-Nov-98 Andrea Arcangeli wrote:
> On 13 Nov 1998, Tkil wrote:
>
>>on both 2.1.127 and 2.1.128, heavy exec() (with a fair bit of I/O)
>>seems to cause the system to go into a busy wait state. during this
>>time, it is difficult/impossible/very slow to launch any new
>>processes.
>
> Could you try again on my tree
> ftp://e-mind.com/pub/linux/kernel-patches/arca-19-...diff.gz and feedback?
> I changed schedule_timeout() to avoid inserting the timeout timer if the
> function is called by a process in TASK_RUNNING state. That makes
> tons of sense. schedule_timeout() will also log if it is called with
> a state != TASK_RUNNING or != TASK_INTERRUPTIBLE.

Tried it, see below.

> I read that problems arise when the system has been idle for some time,
> maybe this is the reason everything has worked fine so far here...

I have seen it happen when running fork-test after only a minute
or two of uptime.

>>if anyone would like further information, please just drop me a line
>>and i'd be glad to get it for you.
>
> I _think_ I am the cause of the problem with my schedule_timeout(). I'll
> try to reproduce here. Thanks and excuse me. I am doing my best (and this
> night I had a long sleep so I hope to be smarter today than yesterday
> ;-).
>
> Andrea Arcangeli

This is with 2.1.127 non-SMP compiled with gcc 2.7.2.3
APM not enabled in kernel or bios.
All SCSI, no IDE

Per someone's suggestion that the problem might be interrupt related,
I ping flooded the system (200MHz K6) from a P90 for 30 minutes
without any problems. It never even lost a packet. I also generated
outgoing ping floods. That ran OK, but it did consume all the CPU
time and drove the interrupts and context switches to about
5000-6000 per second.
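
For anyone who wants to reproduce the interrupt and context-switch
rates without vmstat, here is a minimal sketch in Python. It assumes
/proc/stat exposes cumulative "intr" and "ctxt" lines whose first
number is the running total; the function names are made up for this
example and it is not the tool used for the figures above.

```python
import time


def parse_counters(stat_text):
    """Pull the cumulative interrupt and context-switch totals
    out of text in /proc/stat format (assumed layout)."""
    counters = {}
    for line in stat_text.splitlines():
        fields = line.split()
        # "intr <total> ..." and "ctxt <total>" carry the running totals.
        if len(fields) >= 2 and fields[0] in ("intr", "ctxt"):
            counters[fields[0]] = int(fields[1])
    return counters


def per_second(before, after, seconds):
    """Turn two cumulative snapshots into per-second rates."""
    return {k: (after[k] - before[k]) / seconds
            for k in before if k in after}


def sample(interval=5.0, path="/proc/stat"):
    """Read the counters twice, interval seconds apart."""
    with open(path) as f:
        before = parse_counters(f.read())
    time.sleep(interval)
    with open(path) as f:
        after = parse_counters(f.read())
    return per_second(before, after, interval)
```

Calling sample(5.0) on a Linux box should give roughly the same
interrupt and context-switch numbers that "vmstat 5" reports for one
interval.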

I am able to get fork-test (posted to l-k a while back) to cause
the freeze. I find it quite easy to do under 2.1.127.

I just got it to freeze and recover on its own. Till now the
freezes have been permanent. Perhaps that depends on how quickly you
try to kill things when the problem shows up. Or maybe I just
didn't wait long enough the other times.

During the freeze, "vmstat 5" showed 0 idle time. The interrupts
and context switches were only in the low 100s/second. Top showed
some processes stuck in a runnable state. This included fork-test,
some shells (if they were trying to run a command), and some crond-s (that
were presumably trying to run a new job). Almost all of the CPU time
was split evenly between them. Already running programs continued
to work fine. Even file I/O was working for existing programs. It
appears as if the problem was in creating or execing new processes.
Although based on what I am seeing, fork() at least returns quickly
for the parent.
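
One rough way to separate "fork() returns quickly for the parent"
from "the new process is slow to get going" is to time the two
points independently. The sketch below is only an illustration in
Python; it is not fork-test, and the function name and the choice of
a do-nothing child command are assumptions made for this example.

```python
import os
import sys
import time


def time_fork_exec(argv):
    """Return (fork_seconds, total_seconds) for one fork+exec+wait.

    fork_seconds:  time until fork() returns in the parent.
    total_seconds: time until the exec'd child has exited.
    """
    start = time.monotonic()
    pid = os.fork()
    if pid == 0:
        # Child: replace this process with the target program.
        try:
            os.execvp(argv[0], argv)
        finally:
            os._exit(127)  # only reached if the exec failed
    forked = time.monotonic()   # parent: fork() has returned
    os.waitpid(pid, 0)          # block until the child is done
    done = time.monotonic()
    return forked - start, done - start


if __name__ == "__main__":
    for i in range(10):
        f, t = time_fork_exec([sys.executable, "-c", "pass"])
        print("run %d: fork returned in %.2f ms, child done in %.2f ms"
              % (i, f * 1000, t * 1000))
```

If the behaviour described above holds, the first number should stay
small during a freeze while the second grows very large.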

When a freeze first started and before it got too bad (i.e. lots of runnable
processes), I ping flooded the system. This time it lost about 30%
of the packets. When the freeze got much more severe, a ping flood
was at 100% packet loss. Also, this time I got some errors from the
ethernet driver (tulip) complaining about timeouts.

I tried 2.1.127 SMP (on a non-SMP system). I could _not_ get the
freeze to happen at all.

I tried 2.1.128 + arca19 (non-SMP). I got it to freeze, but not as easily
as with plain 2.1.127.

While the system was frozen quite badly, I tried a ping flood. I got all
but 15% of 11031 packets. But the surprising part was that the ping
flood unfroze the system. That happened 3 different times. Once with
2.1.127 and twice with 2.1.128+arca19. I thought I imagined it the
first time. However, the ethernet driver (tulip, non-module) spit out the
following and stopped working.

eth0: 21041 transmit timed out, status fc660000, CSR12 000001c8, CSR13 ffffef05, CSR14 ffffff3f, resetting...
eth0: 21041 transmit timed out, status fc660010, CSR12 000052c8, CSR13 ffffef09, CSR14 fffff7fd, resetting...
eth0: 21041 transmit timed out, status fc660000, CSR12 000001c8, CSR13 ffffef05, CSR14 ffffff3f, resetting...
eth0: 21041 transmit timed out, status fc660010, CSR12 000052c8, CSR13 ffffef09, CSR14 fffff7fd, resetting...
eth0: 21041 transmit timed out, status fc660000, CSR12 000001c8, CSR13 ffffef05, CSR14 ffffff3f, resetting...
[repeats many times]

That's all. Hope it helps. I am certainly willing to try
out other suggestions.

Jim Bauer, jfbauer@home.com
