Oopses / ReiserFS superblock corruption with 2.6.9

From: Marek Szuba
Date: Mon Nov 15 2004 - 20:11:52 EST


Hello again,

During two weeks of running 2.6.9 I was hit by two oopses, one of them
with rather annoying and potentially disastrous consequences. I have to
say I'm rather disappointed with this, having not seen an oops for
almost a year... Anyhow, here's what happened:

1. The first oops occurred when I attempted to log in under X (X.org
6.8). The WM (blackbox) started successfully, but when Esetroot tried to
set the background image, the X server crashed and I was returned to the
wdm prompt - a new one though, as it was located on the 8th console
rather than on the 7th. The problem was reproducible and only went away
after I rebooted the box. Here is the error message:

Unable to handle kernel paging request at virtual address 02014742
printing eip:
c0164823
*pde = 00000000
Oops: 0000 [#1]
PREEMPT
Modules linked in: mga parport_pc lp parport ohci1394 ieee1394
emu10k1_gp snd_emu10k1 snd_rawmidi snd_pcm snd_timer snd_seq_device
snd_ac97_codec snd_page_alloc snd_util_mem snd_hwdep snd soundcore
hpt366 wacom joydev usbhid uhci_hcd usbcore evdev 8139too mii crc32
ipt_REJECT ipt_state ip_conntrack iptable_filter ip_tables w83781d
eeprom i2c_sensor i2c_isa i2c_piix4 i2c_core analog gameport rtc
nls_iso8859_2 nls_cp852 vfat fat nls_base
CPU: 0
EIP: 0060:[poll_freewait+35/80] Not tainted VLI
EFLAGS: 00013212 (2.6.9)
EIP is at poll_freewait+0x23/0x50
eax: 00000000 ebx: 0201472a ecx: c14fb260 edx: c150f1f8
esi: 000f0008 edi: 000f0000 ebp: 00020000 esp: ecf69ee0
ds: 007b es: 007b ss: 0068
Process X (pid: 1958, threadinfo=ecf68000 task=ef55faa0)
Stack: 00000000 00000000 00000012 c0164bcf ecf69f40 00000000 00000000 00000000
       00020000 00000345 0003f80a 00000000 00000000 0003f80a ecf68000 ef7f7124
       ef7f7104 ef7f70e4 ef7f7184 ef7f7164 ef7f7144 0001c7d3 00000001 00000000
Call Trace:
[do_select+431/720] do_select+0x1af/0x2d0
[__pollwait+0/208] __pollwait+0x0/0xd0
[sys_select+763/1328] sys_select+0x2fb/0x530
[syscall_call+7/11] syscall_call+0x7/0xb
Code: c3 8d b4 26 00 00 00 00 57 56 53 8b 44 24 10 8b 78 04 85 ff 74 3a
89 f6 8b 5f 04 8d 77 08 8d 76 00 8d bc 27 00 00 00 00 83 eb 1c <8b> 43
18 8d 53 04 e8 42 18 fb ff 8b 03 e8 cb d9 fe ff 39 f3 77
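
A note for anyone reading the oops above, offered as a minimal sketch
rather than a diagnosis: the bytes starting at the <8b> marker in the
Code: line are "8b 43 18", which on 32-bit x86 decodes as a load from
ebx + 0x18 into eax. The Python snippet below decodes only that one
instruction form and checks that ebx + 0x18 matches the reported fault
address. The helper name decode_mov_disp8 is made up for illustration;
the register and address values are copied from the report.

```python
# Values copied from the oops report.
FAULT_ADDR = 0x02014742                  # "virtual address 02014742"
EBX = 0x0201472A                         # "ebx: 0201472a"
CODE_AT_EIP = bytes.fromhex("8b4318")    # bytes starting at the <8b> marker

# 32-bit general-purpose registers in x86 encoding order.
REGS = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]


def decode_mov_disp8(code):
    """Decode `8b /r` (mov r32, r/m32) for the mod=01, no-SIB form only.

    Returns (destination register, base register, signed displacement).
    Hypothetical helper: it deliberately handles nothing else.
    """
    opcode, modrm, disp = code[0], code[1], code[2]
    if opcode != 0x8B:
        raise ValueError("not a mov r32, r/m32 instruction")
    mod = modrm >> 6
    reg = (modrm >> 3) & 0b111
    rm = modrm & 0b111
    if mod != 0b01 or rm == 0b100:
        raise ValueError("only register base + disp8 is handled here")
    if disp >= 0x80:                     # disp8 is sign-extended
        disp -= 0x100
    return REGS[reg], REGS[rm], disp


dest, base, disp = decode_mov_disp8(CODE_AT_EIP)
effective_address = EBX + disp
print(f"mov {disp:#x}(%{base}),%{dest} -> address {effective_address:08x}")
print("matches fault address:", effective_address == FAULT_ADDR)
```

The match suggests that ebx held a bogus pointer when poll_freewait
dereferenced it; on its own it says nothing about what corrupted that
pointer.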

2. The second error showed up when, all of a sudden, I couldn't get any
programs to run. While at first there were only ReiserFS warnings on the
debug console, eventually one of the programs (a shell script, to be
exact) threw a "kernel BUG" error in preempt and asked me to reboot. The
relevant snippet of the log is attached to this message in
bzip2-compressed form to conserve bandwidth.

After the reboot things got even more interesting! The system would go
through almost the whole boot procedure, only to generate a
ReiserFS-related oops (which I cannot quote because it didn't get logged
anywhere), hang, and become completely unresponsive the moment it tried
to access the gpm executable, located on the same partition the
aforementioned filesystem warnings referred to. Again the problem was
reproducible, but this time it didn't go away even after shutting the
machine down for some time.

I ran memtest on the machine to check whether the memory chips I had
installed a few weeks earlier were the source of the problem, even
though they had been thoroughly tested with that tool just after the
installation; once again they came out clean. I then brought the system
up with a rescue disc and ran reiserfsck on all partitions. hda6 came
out clean! No such luck with the root partition though: I was told
immediately that the superblock was corrupted, which a quick look at the
df output confirmed (I only roughly counted the digits, but even so I
could see my 2 GB on that partition had magically expanded into at least
*tens of terabytes*)... Luckily it seems the data itself was intact, as
everything went back to seemingly normal after rebuilding the
superblock, not to mention that I was able to tar the contents of the
whole partition in the first place.
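
The df symptom above amounts to a simple consistency check: a
filesystem cannot be larger than the partition it lives on. A minimal
sketch of that check follows, under loud assumptions: the 4096-byte
block size and the corrupted block count are hypothetical placeholders
chosen to land in the "tens of terabytes" range, not values read from
the damaged superblock, and the function name is made up for
illustration.

```python
def reported_size_is_plausible(block_count, block_size, device_bytes):
    """Return True if the size claimed by a superblock fits on the device."""
    return block_count * block_size <= device_bytes


DEVICE_BYTES = 2 * 1024**3           # the 2 GB root partition from the report
BLOCK_SIZE = 4096                    # assumption: a common ReiserFS block size

# Block count a healthy superblock would hold for a partition of this size.
sane_block_count = DEVICE_BYTES // BLOCK_SIZE

# Hypothetical corrupted value, picked only to illustrate the symptom.
corrupt_block_count = 4_000_000_000

for label, count in (("sane", sane_block_count), ("corrupt", corrupt_block_count)):
    claimed = count * BLOCK_SIZE
    ok = reported_size_is_plausible(count, BLOCK_SIZE, DEVICE_BYTES)
    print(f"{label}: claims {claimed / 1e12:.3f} TB, plausible={ok}")
```

A check like this only flags an impossible size; it cannot tell which
other superblock fields were damaged, which is why rebuilding the
superblock with reiserfsck was still needed.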

All said and written, I went back to 2.6.7 for the time being - its
security flaws don't bother me much (the important boxes still run 2.4,
just in case) and it was the last 2.6 kernel which didn't offer me
unwanted surprises once in a while. To think of such things happening in
a supposedly stable kernel... tsk, tsk! Of course I am aware that 2.4.9
is said to have been a much greater mess; still, somehow the last
problems I had with that branch were before 2.4.0 (even the famous
"don't use" release managed to run on one of my boxes for a full day,
not to mention compiling the next version) and I constantly get hit with
2.6.x problems (AFAIR the first version I was actually able to use on a
regular basis was 2.6.4).

Anyway, hopefully the information I've provided will be useful in
debugging the problem. If you need any more data, please let me know.

Best regards,
Marek

Attachment: reiserfs.error.bz2
Description: Binary data
