Big SMP machine hangs often [debug included]

From: Miquel van Smoorenburg (miquels@cistron.nl)
Date: Wed May 17 2000 - 12:06:43 EST


[Apologies if you see this twice, but after 3 hours I still haven't
 seen my original posting on this list]

We sold a customer a big AMI Megaplex (4xPIII/500, 2GB RAM) server
but as soon as they put any load on it, they see the following
problems:

- sometimes the I/O subsystem "hangs" for 10-20 seconds
- every few days the server just hangs. Doesn't respond to pings, nothing.
  We need to press the RESET switch....

Config is:

- AMI Megaplex 4xPIII/500 2 GB RAM
- AMI MegaRaid EC9F:1.24 controller with 4 18 GB disks in RAID5 mode
- Linux 2.2.13 kernel compiled with gcc 2.7.2.3 and SMP / 2GB support

Today the machine hung again, but it did still respond to SYSRQ, so
I got the following debug output. I'd appreciate it if someone could
take a look and say whether this is something that 2.2.14/2.2.15 should
solve, or whether it is something else. It looks like the kernel gets
stuck in add_timer/timer_bh somehow. Note also the strange "buffer hashed"
output.

Unfortunately there is no way to force an OOPS using sysrq right now,
so I do not have a complete stack trace.

[short System.map fragment]
80110cf8 T add_timer <-----
80110e94 T mod_timer
8011102c T del_timer
80111084 T schedule_timeout

80111f38 T update_one_process
8011200c t update_process_times
80112014 t timer_bh <-----
801123ac T do_timer
80112400 T sys_alarm
[.....]

SysRq: Show Regs

EIP: 0010:[<8011234c>] EFLAGS: 00000283
EAX: aa448ae0 EBX: aa4489c0 ECX: 801d6584 EDX: aa448af4
ESI: 8015d060 EDI: 00000001 EBP: 81e97cc8 DS: 0018 ES: 0018
CR0: 8005003b CR2: 2acae000 CR3: 3a30f000

SysRq: Show Regs

EIP: 0010:[<8011234c>] EFLAGS: 00000283
EAX: fbdeb72c EBX: aa4489c0 ECX: 801d6558 EDX: aa448af4
ESI: 8015d060 EDI: 00000001 EBP: 81e97cc8 DS: 0018 ES: 0018
CR0: 8005003b CR2: 2acae000 CR3: 3a30f000

SysRq: Show Regs

EIP: 0010:[<80110e78>] EFLAGS: 00000246
EAX: 801d664c EBX: 00000246 ECX: aa448af4 EDX: 000000ec
ESI: 8015d060 EDI: 00000001 EBP: 81e97c9c DS: 0018 ES: 0018
CR0: 8005003b CR2: 2acae000 CR3: 3a30f000

SysRq: Show Regs

EIP: 0010:[<8011234c>] EFLAGS: 00000287
EAX: f67045ac EBX: aa4489c0 ECX: 801d63c0 EDX: aa448af4
ESI: 8015d060 EDI: 00000001 EBP: 81e97cc8 DS: 0018 ES: 0018
CR0: 8005003b CR2: 2acae000 CR3: 3a30f000

SysRq: Show Regs

EIP: 0010:[<80110e78>] EFLAGS: 00000246
EAX: 801d6504 EBX: 00000246 ECX: aa448af4 EDX: 0000009a
ESI: 8015d060 EDI: 00000001 EBP: 81e97c9c DS: 0018 ES: 0018
CR0: 8005003b CR2: 2acae000 CR3: 3a30f000

SysRq: Show Regs

EIP: 0010:[<80110e78>] EFLAGS: 00000246
EAX: 801d6318 EBX: 00000246 ECX: aa448af4 EDX: 0000001f
ESI: 8015d060 EDI: 00000001 EBP: 81e97c9c DS: 0018 ES: 0018
CR0: 8005003b CR2: 2acae000 CR3: 3a30f000

SysRq: Show Regs

EIP: 0010:[<8011234c>] EFLAGS: 00000283
EAX: aa448ae0 EBX: aa4489c0 ECX: 801d665c EDX: aa448af4
ESI: 8015d060 EDI: 00000001 EBP: 81e97cc8 DS: 0018 ES: 0018
CR0: 8005003b CR2: 2acae000 CR3: 3a30f000
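
For reference, the EIP values above can be mapped to function names by
finding the nearest preceding symbol in System.map. The following is only a
minimal sketch: it uses just the short fragment quoted in this post, and the
helper names (parse_system_map, resolve) are made up for illustration. Since
the fragment is incomplete, a result is only trustworthy when the address
falls between two symbols that are adjacent in the real System.map.

```python
import bisect

# Symbols copied from the System.map fragment quoted in this post.
# The fragment is partial: there is a gap between schedule_timeout and
# update_one_process, so addresses in that gap would resolve incorrectly.
SYSTEM_MAP = """\
80110cf8 T add_timer
80110e94 T mod_timer
8011102c T del_timer
80111084 T schedule_timeout
80111f38 T update_one_process
8011200c t update_process_times
80112014 t timer_bh
801123ac T do_timer
80112400 T sys_alarm
"""


def parse_system_map(text):
    """Return a sorted list of (address, name) tuples."""
    symbols = []
    for line in text.splitlines():
        addr, _type, name = line.split()
        symbols.append((int(addr, 16), name))
    symbols.sort()
    return symbols


def resolve(symbols, eip):
    """Return (name, offset) of the last symbol at or below eip, or None."""
    addrs = [addr for addr, _name in symbols]
    i = bisect.bisect_right(addrs, eip) - 1
    if i < 0:
        return None
    base, name = symbols[i]
    return name, eip - base


SYMBOLS = parse_system_map(SYSTEM_MAP)

# The two distinct EIP values seen in the Show Regs output.
for eip in (0x8011234c, 0x80110e78):
    name, offset = resolve(SYMBOLS, eip)
    print("%08x -> %s+0x%x" % (eip, name, offset))
# 8011234c -> timer_bh+0x338
# 80110e78 -> add_timer+0x180
```

Both addresses fall between symbols that are listed consecutively
(timer_bh/do_timer and add_timer/mod_timer), which is what the
add_timer/timer_bh observation is based on.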

SysRq: Show Memory
Mem-info:
Free pages: 71316kB
 ( Free: 17829 (256 512 768)
NonDMA: 8179*4kB 3695*8kB 489*16kB 5*32kB 2*64kB 1*128kB 1*256kB 0*512kB 0*1024kB 0*2048kB = 70772kB)
DMA: 0*4kB 0*8kB 2*16kB 2*32kB 1*64kB 1*128kB 1*256kB 0*512kB 0*1024kB 0*2048kB = 544kB)
Swap cache: add 6339, delete 6330, find 329846/339376
Free swap: 129880kB
507904 pages of RAM
5456 reserved pages
52779 pages shared
9 pages swap cached
361640 pages in file cache
361649 pages in page cache
24 pages in page table cache
Buffer memory: 282680kB
Buffer heads: 71303
Buffer blocks: 71267
Buffer hashed: -3666981
   CLEAN: 70927 buffers, 83 used (last=70758), 0 locked, 0 protected, 0 dirty
  LOCKED: 74 buffers, 0 used (last=0), 0 locked, 0 protected, 0 dirty
   DIRTY: 190 buffers, 9 used (last=177), 0 locked, 0 protected, 190 dirty
Networking buffers in use : 780
Total network buffer allocations : 2201836468
Total failed network buffer allocs : 0
IP fragment buffer size : 0
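
As a sanity check on the Show Memory output, the free-list figures can be
recomputed from the per-size block counts. This is just a sketch of the
arithmetic, assuming 4 kB pages (the i386 page size); the variable and
function names are mine, not the kernel's.

```python
PAGE_KB = 4  # assumed page size in kB (i386)

# Block size in kB -> number of free blocks, copied from the dump above.
NONDMA = {4: 8179, 8: 3695, 16: 489, 32: 5, 64: 2,
          128: 1, 256: 1, 512: 0, 1024: 0, 2048: 0}
DMA = {4: 0, 8: 0, 16: 2, 32: 2, 64: 1,
       128: 1, 256: 1, 512: 0, 1024: 0, 2048: 0}


def free_kb(buckets):
    """Sum block_size * count over all free-list buckets."""
    return sum(size * count for size, count in buckets.items())


nondma_kb = free_kb(NONDMA)
dma_kb = free_kb(DMA)
total_kb = nondma_kb + dma_kb

print(nondma_kb, dma_kb, total_kb)   # 70772 544 71316
print(total_kb // PAGE_KB)           # 17829 free pages
print(507904 * PAGE_KB // 1024)      # 1984 MB of RAM
```

The totals agree with the "Free pages: 71316kB" and "Free: 17829" lines, so
the free lists themselves look consistent. That makes the negative
"Buffer hashed" value stand out all the more.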

Mike.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/


