Re: 2.6.17-mm5

From: Andrew Morton
Date: Sat Jul 01 2006 - 17:29:22 EST


On Sat, 1 Jul 2006 15:24:19 +0100
Grant Wilson <grant.wilson@xxxxxxxxx> wrote:

> More RAID1 problems - OOPS on shutdown.

Thanks. Please copy the mailing lists on these reports - I'm not an MD,
SCSI or SATA developer, and this is in their area.

> [ 37.482699] md: Autodetecting RAID arrays.
> [ 37.547908] md: autorun ...
> [ 37.566449] md: considering sdb2 ...
> [ 37.589664] md: adding sdb2 ...
> [ 37.610757] md: adding sda2 ...
> [ 37.632116] md: created md1
> [ 37.650587] md: bind<sda2>
> [ 37.668571] md: bind<sdb2>
> [ 37.686541] md: running: <sdb2><sda2>
> [ 37.710807] raid1: raid set md1 active with 2 out of 2 mirrors
> [ 37.747557] md: ... autorun DONE.
> [ 37.784444] EXT3-fs: INFO: recovery required on readonly filesystem.
> [ 37.824275] EXT3-fs: write access will be enabled during recovery.
> [ 38.814113] kjournald starting. Commit interval 5 seconds
> [ 38.848761] EXT3-fs: sdc1: orphan cleanup on readonly fs
> [ 38.985436] EXT3-fs: sdc1: 7 orphan inodes deleted
> [ 39.015845] EXT3-fs: recovery complete.
> [ 39.072168] EXT3-fs: mounted filesystem with ordered data mode.
> [ 44.693986] Adding 995988k swap on /dev/sda1. Priority:-1 extents:1 across:995988k
> [ 44.744558] Adding 995988k swap on /dev/sdb1. Priority:-2 extents:1 across:995988k
> [ 44.966034] EXT3 FS on sdc1, internal journal
> [ 49.305350] device-mapper: ioctl: 4.8.0-ioctl (2006-06-24) initialised: dm-devel@xxxxxxxxxx
> [ 64.091331] raid1: Disk failure on sdb2, disabling device.
> [ 64.091333] Operation continuing on 1 devices
> [ 64.212624] RAID1 conf printout:
> [ 64.233951] --- wd:1 rd:2
> [ 64.252195] disk 0, wo:0, o:1, dev:sda2
> [ 64.277712] disk 1, wo:1, o:0, dev:sdb2
> [ 64.305627] RAID1 conf printout:
> [ 64.326977] --- wd:1 rd:2
> [ 64.345220] disk 0, wo:0, o:1, dev:sda2
> [

Which device drivers are being used for these disks?

> [ 155.123022] Unable to handle kernel NULL pointer dereference at 0000000000000048 RIP:
> [ 155.155867] [<ffffffff8047157a>] md_error+0x45/0x91
> [ 155.200353] PGD 77954067 PUD 726e5067 PMD 0
> [ 155.226233] Oops: 0000 [1] PREEMPT SMP
> [ 155.249516] last sysfs file: /devices/system/cpu/cpu0/cpufreq/scaling_setspeed
> [ 155.292808] CPU 0
> [ 155.304968] Modules linked in: dm_mod evdev
> [ 155.330331] Pid: 0, comm: swapper Not tainted 2.6.17-mm5 #1
> [ 155.363697] RIP: 0010:[<ffffffff8047157a>] [<ffffffff8047157a>] md_error+0x45/0x91
> [ 155.409638] RSP: 0018:ffffffff807a0c50 EFLAGS: 00010046
> [ 155.441445] RAX: 0000000000000000 RBX: ffff81007aa34708 RCX: 000000000000003f
> [ 155.484216] RDX: 00000000fffffffb RSI: ffff81007a821d28 RDI: ffff81007aa34708
> [ 155.526989] RBP: ffffffff807a0c60 R08: 0000000000000000 R09: ffff81007aac43b0
> [ 155.569759] R10: ffffffff804221e5 R11: 0000000000000058 R12: ffff81007aac4ab0
> [ 155.612533] R13: ffff81007aac43b0 R14: ffff81007aac4ab0 R15: 00000000fffffffb
> [ 155.655303] FS: 00002aeb361606d0(0000) GS:ffffffff80a46000(0000) knlGS:0000000000000000
> [ 155.703791] CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
> [ 155.738195] CR2: 0000000000000048 CR3: 0000000070997000 CR4: 00000000000006e0
> [ 155.780969] Process swapper (pid: 0, threadinfo ffffffff80a64000, task ffffffff80696a00)
> [ 155.829404] Stack: ffff81007a821d28 ffff81007aa34708 ffffffff807a0c80 ffffffff804728d9
> [ 155.877840] ffff81007a821d28 ffff81007aa34708 ffffffff807a0cc0 ffffffff8047409c
> [ 155.922535] 00001000807a0d00 ffff81007aac4ab0 00000000fffffffb ffff81007aac4ab0
> [ 155.966085] Call Trace:
> [ 155.982416] [<ffffffff804728d9>] super_written+0x30/0x65
> [ 156.015292] [<ffffffff8047409c>] super_written_barrier+0xc4/0xd1
> [ 156.052297] [<ffffffff8023a5a5>] bio_endio+0x56/0x5b
> [ 156.082688] [<ffffffff8022d21b>] __end_that_request_first+0x1c9/0x4c9
> [ 156.122068] [<ffffffff8024a0d6>] end_that_request_first+0xc/0xe
> [ 156.158343] [<ffffffff8036a692>] blk_ordered_complete_seq+0x7c/0x8b
> [ 156.196705] [<ffffffff8036a6d1>] post_flush_end_io+0x30/0x35
> [ 156.231419] [<ffffffff8036a5b5>] end_that_request_last+0xd9/0xf6
> [ 156.268215] [<ffffffff80422204>] scsi_end_request+0xad/0xd7
> [ 156.302573] [<ffffffff80422637>] scsi_io_completion+0x3e1/0x3f0
> [ 156.339004] [<ffffffff8042266c>] scsi_blk_pc_done+0x26/0x28
> [ 156.373357] [<ffffffff8041d11e>] scsi_finish_command+0xa9/0xb2
> [ 156.409264] [<ffffffff804229f9>] scsi_softirq_done+0xf4/0xfd
> [ 156.444143] [<ffffffff80237f66>] blk_done_softirq+0x70/0x7f
> [ 156.478323] [<ffffffff80211366>] __do_softirq+0x67/0xf4
> [ 156.510224] [<ffffffff8025f95e>] call_softirq+0x1e/0x28
> [ 156.542083]
> [ 156.542083] Code: 48 8b 40 48 48 85 c0 74 3f ff d0 f0 0f ba ab e0 01 00 00 03

The barrier code is in there again.

mddev->pers is NULL in md_error(), so the test of
!mddev->pers->error_handler oopsed. Perhaps this is a real MD bug which is
now being exposed by the new barrier-handling problem.
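
The register dump agrees with that reading. The Code: bytes start with
48 8b 40 48, which decodes to mov 0x48(%rax),%rax, RAX is 0, and CR2 is
0x48. That is a load from a member 0x48 bytes into a structure reached
through a NULL pointer. A minimal userspace sketch of the arithmetic
follows. The struct is a hypothetical stand-in padded to put the member
at 0x48; it is not the kernel's definition.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout, padded so the member lands at offset 0x48. */
struct padded_pers {
	char earlier_members[0x48];
	void (*error_handler)(void);
};

/* Address the CPU would touch when reading pers->error_handler. */
static uintptr_t sketch_load_address(const struct padded_pers *pers)
{
	return (uintptr_t)pers + offsetof(struct padded_pers, error_handler);
}
```

With pers == NULL this returns the member offset itself, which is why a
NULL dereference shows up as a small nonzero fault address.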


This should get you further, but...

From: Andrew Morton <akpm@xxxxxxxx>

Cc: Neil Brown <neilb@xxxxxxx>
Signed-off-by: Andrew Morton <akpm@xxxxxxxx>
---

drivers/md/md.c | 2 ++
1 file changed, 2 insertions(+)

diff -puN drivers/md/md.c~md-oops-workaround drivers/md/md.c
--- a/drivers/md/md.c~md-oops-workaround
+++ a/drivers/md/md.c
@@ -4586,6 +4586,8 @@ void md_error(mddev_t *mddev, mdk_rdev_t
 		__builtin_return_address(0),__builtin_return_address(1),
 		__builtin_return_address(2),__builtin_return_address(3));
 */
+	if (!mddev->pers)
+		return;
 	if (!mddev->pers->error_handler)
 		return;
 	mddev->pers->error_handler(mddev,rdev);
_
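
For anyone who wants to try the guard outside the kernel, here is a
rough userspace sketch of the same check order. All of the sketch_*
names and types are hypothetical stand-ins, not the md definitions, and
the return value exists only so the behaviour can be observed.

```c
#include <stddef.h>

/* Hypothetical stand-ins for mddev_t, mdk_rdev_t and the personality. */
struct sketch_mddev;
struct sketch_rdev { int faulty; };

struct sketch_pers {
	void (*error_handler)(struct sketch_mddev *, struct sketch_rdev *);
};

struct sketch_mddev {
	struct sketch_pers *pers;	/* NULL in the oops above */
	int errors_seen;
};

/* Example handler: mark the device faulty and count the error. */
static void sketch_handler(struct sketch_mddev *mddev, struct sketch_rdev *rdev)
{
	rdev->faulty = 1;
	mddev->errors_seen++;
}

/* Returns 1 if the handler ran, 0 if the error was dropped. */
static int sketch_md_error(struct sketch_mddev *mddev, struct sketch_rdev *rdev)
{
	if (!mddev || !rdev || rdev->faulty)
		return 0;
	if (!mddev->pers)		/* the check the patch adds */
		return 0;
	if (!mddev->pers->error_handler)
		return 0;
	mddev->pers->error_handler(mddev, rdev);
	return 1;
}
```

Note that the guard only drops the error silently. It does not explain
why the completion arrived while pers was NULL.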
