Re: [2.2.13aa6 (bugfix release II) ]

From: Andrea Arcangeli (andrea@suse.de)
Date: Sun Jan 09 2000 - 15:43:01 EST


On 20 Dec 1999, ursus wrote:

>I upgraded a cluster of servers (Compaq 6400R, 2 x PIII-500)
>from 2.2.13ac3 to 2.2.13aa6+raid-0.90 (and the incremental
>"set_blocksize" patch you kindly provided) and Don Becker's
>eepro.c 1.09l (not sure if this is latest?) in hopes I can
>finally have a really stable setup ... these had been running
>well for about 12 hours, but I just had one of the servers
>crash with the following error (seen before under 2.2.13ac3):
>
> wait_on_bh, CPU 3: (this is the first processor)
> irq: 0 [0 0]
> bh: 1 [0 0]
> <[8010b39d]> <[80150daa]> <[80150d46]> <[8012912b]> \
> <[8012a367]> <[801291a6]> <[8012921f]> <[801092ac]>
>
>I tried to correlate the registers above with System.map:
>
> 8010b360 T synchronize_bh
> 8010b3b0 T synchronize_irq
>
> 80150d20 t sock_close
> 80150d5c t sock_fasync
>
> 8012910c T __fput
> 80129154 T filp_close
>
> 8012a350 T fput
> 8012a398 T put_filp
>
> 80129154 T filp_close
> 801291b0 T sys_close
>
> 801291b0 T sys_close
> 80129238 T sys_vhangup
>
> 80109278 T system_call
> 801092b0 T ret_from_sys_call
>
>If I press ALT+SysRq+P, the EIP shows "0010:[<80166671>]"
>which appears to be related to functions (from System.map):
>
> 80166660 T tcp_send_delayed_ack
> 801666b4 T tcp_send_ack
>
>In some earlier posts I read that "wait_on_bh"
>means that the system is waiting on the bottom half
>(SMP-specific), so I've edited my /etc/lilo.conf
>to add "nosmp noapic", and I'll see if the servers
>run stable w/o SMP ... this isn't a real solution
>of course.
>
>Any help/pointers/patches would be greatly appreciated.

I think I spotted and fixed the bug that is soft-deadlocking your 2.2.x
compaq cluster (all seems to make sense :). Could you try the below patch
against 2.2.14 (or 2.2.14aa1 or 2.2.13 or 2.2.13aa6)?

--- 2.2.14/net/ipv4/tcp_output.c.~1~ Fri Jan 7 18:19:25 2000
+++ 2.2.14/net/ipv4/tcp_output.c Sun Jan 9 21:32:04 2000
@@ -1004,7 +1004,7 @@
         unsigned long timeout;
 
         /* Stay within the limit we were given */
- timeout = tp->ato;
+ timeout = (tp->ato << 1) >> 1;
         if (timeout > max_timeout)
                 timeout = max_timeout;
         timeout += jiffies;

I uploaded the above patch here too:

        ftp://ftp.*.kernel.org/pub/linux/kernel/people/andrea/patches/v2.2/2.2.14/delack-timer-1.gz

Have fun!

Andrea

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/



This archive was generated by hypermail 2b29 : Sat Jan 15 2000 - 21:00:15 EST