OOPS in 2.0.33 find_buffer/get_hash_table/ext2fs trunc_indirect

Chris Siebenmann (cks@hawkwind.utcs.toronto.edu)
Fri, 20 Mar 1998 05:23:47 -0500


I've seen the following OOPS twice now in similar situations. I'm doing
a simplistic stress test (make a directory hierarchy; make a bunch of
copies of a file to populate the directory hierarchy; cmp all the copies
with the original; remove each copy and the directory hierarchy; repeat)
and after running for a couple of hours the following happens:

kernel: general protection: 0000
kernel: CPU: 0
kernel: EIP: 0010:[get_hash_table+52/172]
kernel: EFLAGS: 00010206
kernel: eax: 62701d18 ebx: 009f0809 ecx: 02ccc000 edx: 000001b5
kernel: esi: 00000809 edi: 000c29bc ebp: 00000400 esp: 02412ecc
kernel: ds: 0018 es: 0018 fs: 002b gs: 002b ss: 0018
kernel: Process rm (pid: 12585, process nr: 4, stackpage=02412000)
kernel: Stack: 009f092c 00000000 000c29bc 0000004b 009f1218 000001b5 0015fcd5 00000809
kernel: 000c29bc 00000400 02cc8000 02ccc000 00000000 00000005 fffffff4 00000002
kernel: 00000100 00000000 0000004b 000c2971 009f1218 001602bb 02ccc000 0000000c
kernel: Call Trace: [trunc_indirect+257/712] [ext2_truncate+91/360] [ext2_put_inode+66/104] [ext2_put_inode+88/104] [iput+213/412] [ext2_unlink+528/548] [do_unlink+260/296]
kernel: [sys_unlink+38/60] [system_call+85/124]
kernel: Code: 39 38 75 24 66 39 58 04 75 1e 39 68 20 74 22 56 e8 77 f9 ff

The second time produced additional kernel messages beforehand:
Mar 19 18:50:50 hawklords kernel: EXT2-fs error (device 08:09): ext2_free_blocks: Freeing blocks not in datazone - block = 807358203, count = 1
Mar 19 18:50:50 hawklords kernel: EXT2-fs error (device 08:09): ext2_free_blocks: Freeing blocks not in datazone - block = 3760148221, count = 1
Mar 19 18:50:50 hawklords kernel: EXT2-fs error (device 08:09): ext2_free_blocks: Freeing blocks not in datazone - block = 2954841855, count = 1
Mar 19 18:50:50 hawklords kernel: EXT2-fs error (device 08:09): ext2_free_blocks: Freeing blocks not in datazone - block = 2686406401, count = 1
And then at 20:15:13 (from syslog timestamps):
kernel: general protection: 0000
kernel: CPU: 0
kernel: EIP: 0010:[get_hash_table+52/172]
kernel: EFLAGS: 00010206
kernel: eax: 50000000 ebx: 032c0809 ecx: 03b98200 edx: 000007dc
kernel: esi: 00000809 edi: 000c2fd5 ebp: 00000400 esp: 00435e8c
kernel: ds: 0018 es: 0018 fs: 002b gs: 002b ss: 0018
kernel: Process rm (pid: 330, process nr: 25, stackpage=00435000)
kernel: Stack: 032c7114 00000000 000c2fd5 00000045 02fcc118 000007dc 0015fcd5 00000809
kernel: 000c2fd5 00000400 00000000 00000100 03b98200 ffffffff fffffef4 00000002
kernel: 00000100 00000000 00000045 000c2f90 02fcc118 0015ff8a 03b98200 0000010c
kernel: Call Trace: [trunc_indirect+257/712] [trunc_dindirect+238/476] [ext2_truncate+119/360] [ext2_put_inode+66/104] [ext2_put_inode+88/104] [iput+213/412] [ext2_unlink+528/548]
kernel: [do_unlink+260/296] [sys_unlink+38/60] [system_call+85/124]
kernel: Code: 39 38 75 24 66 39 58 04 75 1e 39 68 20 74 22 56 e8 77 f9 ff

Following the guidelines in Documentation/oops-tracing.txt, I believe that
this is coming from the first tmp dereference in
	if (tmp->b_blocknr == block && tmp->b_dev == dev)
in the inlined call of find_buffer in get_hash_table (line 454 in my
fs/buffer.c; the inlined call is at line 477).
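
For readers without the source at hand, the lookup in question can be
modelled in user space roughly as below. This is a simplified sketch, not
the kernel code: struct buf and find_buf are hypothetical stand-ins for
struct buffer_head and find_buffer, and only the fields used by the
comparison are kept. It shows why a corrupted chain pointer would fault at
the first field read through tmp.

```c
#include <stddef.h>

/* Hypothetical, trimmed-down stand-in for the kernel's struct buffer_head. */
struct buf {
    unsigned long b_blocknr;   /* block number on the device */
    unsigned short b_dev;      /* device number, e.g. 0x0809 for 08:09 */
    unsigned long b_size;      /* block size in bytes */
    struct buf *b_next;        /* next buffer on the same hash chain */
};

/*
 * Walk one hash chain looking for (dev, block, size).
 * Every iteration reads tmp->b_blocknr first, so if a b_next pointer on
 * the chain has been overwritten with garbage, that read is the first
 * access through the bad pointer.
 */
struct buf *find_buf(struct buf *chain, unsigned short dev,
                     unsigned long block, unsigned long size)
{
    struct buf *tmp;

    for (tmp = chain; tmp != NULL; tmp = tmp->b_next)
        if (tmp->b_blocknr == block && tmp->b_dev == dev)
            return tmp->b_size == size ? tmp : NULL;
    return NULL;
}
```

Calling find_buf with the head of a well-formed chain returns the matching
entry, or NULL if no entry matches or the size differs. Nothing in the loop
can detect a bad b_next value before it is dereferenced.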

Both times the system stayed up, although the rm hung unkillably. The
second time, many programs that call sscanf() from the shared library
(including ps and procinfo) started dying with SIGSEGV, which may or
may not be related.

No other kernel errors/messages were reported, and in the second case
I know none were printed on the console (the first time around, I didn't
see the console before the system was rebooted).

fsck.ext2 reports no errors in the filesystem after the forced
reboots, although the second time around it said that the filesystem
was marked as having errors. I presume the kernel marked the FS as
such when it logged the ext2_free_blocks messages.
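
The "not in datazone" complaint comes from a range check on the blocks
being freed. A rough user-space sketch of that kind of check follows. It
is not the ext2 source: struct sb_limits and in_datazone are hypothetical
names, and the two fields only mirror the idea of a first data block and a
total block count.

```c
#include <stdint.h>

/* Hypothetical, trimmed-down superblock limits (illustrative names). */
struct sb_limits {
    uint32_t first_data_block;  /* first block that may hold data */
    uint32_t blocks_count;      /* total number of blocks in the filesystem */
};

/* Return 1 if the range [block, block + count) lies inside the data zone. */
int in_datazone(const struct sb_limits *sb, uint32_t block, uint32_t count)
{
    /* Use a 64-bit sum so that block + count cannot wrap around. */
    uint64_t end = (uint64_t)block + count;

    return block >= sb->first_data_block && end <= sb->blocks_count;
}
```

Block numbers in the billions, such as those in the messages above, fail
this check on any filesystem with fewer blocks than that, which suggests
the indirect block being walked held garbage rather than real block
numbers.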

Software: uniprocessor 2.0.33 kernel plus FAT32 patches, with a
non-modular aic7xxx driver. Otherwise Red Hat 5.0. The system was
idle apart from the disk thrasher test for both oopses.
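
The actual test scripts are not included in this message. As an
illustration only, one pass of the stress test described at the top could
be sketched in C as below. All function names, the d%d/f%d naming scheme
and the directory and file counts are assumptions made for this sketch.

```c
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Copy src to dst; return 0 on success, -1 on any error. */
int copy_file(const char *src, const char *dst)
{
    char buf[4096];
    size_t n;
    int rc = 0;
    FILE *in = fopen(src, "rb");
    FILE *out = in ? fopen(dst, "wb") : NULL;

    if (!in || !out) {
        if (in)
            fclose(in);
        return -1;
    }
    while ((n = fread(buf, 1, sizeof buf, in)) > 0)
        if (fwrite(buf, 1, n, out) != n) {
            rc = -1;
            break;
        }
    if (ferror(in))
        rc = -1;
    fclose(in);
    if (fclose(out) != 0)
        rc = -1;
    return rc;
}

/* Byte-compare two files, like cmp(1): 0 if identical, 1 if not, -1 on error. */
int same_file(const char *a, const char *b)
{
    FILE *fa = fopen(a, "rb"), *fb = fopen(b, "rb");
    int ca = 0, cb = 0, rc = -1;

    if (fa && fb) {
        do {
            ca = getc(fa);
            cb = getc(fb);
        } while (ca == cb && ca != EOF);
        rc = (ca == cb) ? 0 : 1;
    }
    if (fa)
        fclose(fa);
    if (fb)
        fclose(fb);
    return rc;
}

/* One pass: build and populate the tree, compare every copy, remove it all. */
int stress_pass(const char *orig, const char *root, int ndirs, int nfiles)
{
    char dir[512], path[600];
    int phase, d, f;

    if (mkdir(root, 0755) != 0)
        return -1;
    /* phase 0: create and populate; phase 1: compare; phase 2: remove */
    for (phase = 0; phase < 3; phase++) {
        for (d = 0; d < ndirs; d++) {
            snprintf(dir, sizeof dir, "%s/d%d", root, d);
            if (phase == 0 && mkdir(dir, 0755) != 0)
                return -1;
            for (f = 0; f < nfiles; f++) {
                snprintf(path, sizeof path, "%s/f%d", dir, f);
                if (phase == 0 && copy_file(orig, path) != 0)
                    return -1;
                if (phase == 1 && same_file(orig, path) != 0)
                    return -1;
                if (phase == 2 && unlink(path) != 0)
                    return -1;
            }
            if (phase == 2 && rmdir(dir) != 0)
                return -1;
        }
    }
    return rmdir(root);
}
```

Calling stress_pass in a loop until it returns non-zero gives the "repeat"
step. The unlink calls in the last phase are the ones that would lead into
the truncate path seen in the call traces.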

Hardware: ASUS P2L97-S motherboard (integrated AIC-7880 UltraSCSI
controller), (wide) Seagate Barracuda. Oopses happened with both fast
SCSI and ultra SCSI data rates; the second time it was the only device
on the SCSI chain. The SCSI chain is properly terminated as far as I've
been able to determine.

I read linux-kernel, but I will see direct email sooner. For quick
replies, if you have additional questions or something you want me to
try, email or cc: me directly. I hope I've given enough (but not too
much) detail about this.

---
"there used to be two moons
then one of them
discovered coffee." - Curtis Yarvin
cks@hawkwind.utcs.toronto.edu ...!{utgpu,utzoo,watmath}!utgpu!cks
