Re: 2.6.20.3 AMD64 oops in CFQ code

From: Dan Williams
Date: Thu Mar 22 2007 - 20:31:38 EST


On 3/22/07, Neil Brown <neilb@xxxxxxx> wrote:
On Thursday March 22, jens.axboe@xxxxxxxxxx wrote:
> On Thu, Mar 22 2007, linux@xxxxxxxxxxx wrote:
> > > 3 (I think) separate instances of this, each involving raid5. Is your
> > > array degraded or fully operational?
> >
> > Ding! A drive fell out the other day, which is why the problems only
> > appeared recently.
> >
> > md5 : active raid5 sdf4[5] sdd4[3] sdc4[2] sdb4[1] sda4[0]
> > 1719155200 blocks level 5, 64k chunk, algorithm 2 [6/5] [UUUU_U]
> > bitmap: 149/164 pages [596KB], 1024KB chunk
> >
> > H'm... this means that my alarm scripts aren't working. Well, that's
> > good to know. The drive is being re-integrated now.
>
> Heh, at least something good came out of this bug then :-)
> But that's reaffirming. Neil, are you following this? It smells somewhat
> fishy wrt raid5.

Yes, I've been trying to pay attention....

The evidence does seem to point to raid5 and degraded arrays being
implicated. However, I'm having trouble seeing how the fact that an
array is degraded would be visible down in the elevator, other than as
a slightly different distribution of reads and writes.

One possible way is that if an array is degraded, then some read
requests will go through the stripe cache rather than direct to the
device. However, I would sooner expect the direct-to-device path to
have problems, as it is much newer code. Going through the cache for
reads is very well tested code - and reads come from the cache for
most writes anyway, so the elevator will still see lots of single-page
reads. It only ever sees single-page writes.
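
To make that concrete, the choice I'm describing is roughly the
following. This is a simplified standalone model, not the actual
raid5.c code; the structure and helper names are made up purely for
illustration:

/* Simplified model of the raid5 read-path choice described above.
 * NOT the real md/raid5 code: types and helpers are hypothetical
 * and only illustrate the decision being made. */
#include <stdbool.h>
#include <stdio.h>

struct model_array {
        bool degraded;          /* one member device is missing */
        int  failed_disk;       /* index of the missing member */
};

static bool read_can_bypass_cache(const struct model_array *conf,
                                  int target_disk)
{
        /* A fully redundant array can service a read by sending a
         * single-page request straight to the member device.  If the
         * array is degraded and the read maps onto the missing member,
         * the data has to be reconstructed from the other devices,
         * which means going through the stripe cache instead. */
        if (conf->degraded && target_disk == conf->failed_disk)
                return false;
        return true;
}

int main(void)
{
        struct model_array conf = { .degraded = true, .failed_disk = 4 };

        printf("read mapping to disk 2: %s\n",
               read_can_bypass_cache(&conf, 2) ? "direct" : "stripe cache");
        printf("read mapping to disk 4: %s\n",
               read_can_bypass_cache(&conf, 4) ? "direct" : "stripe cache");
        return 0;
}

So the shape of what the elevator sees barely changes either way.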

There might be more pressure on the stripe cache when running
degraded, so we might call the ->unplug_fn a little more often, but I
doubt that would be noticeable.

As your patch seems to suggest, it does look like some sort of
unlocked access to the cfq_queue structure. However, apart from the
comment before cfq_exit_single_io_context being in the wrong place
(it should be before __cfq_exit_single_io_context), I cannot see
anything obviously wrong with the locking around that structure.
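
For clarity, the convention I mean is the usual locked-wrapper /
__unlocked-helper split, where the locking comment belongs on the
__ variant that assumes the lock is already held. A trivial
standalone sketch of the pattern, with made-up names and a pthread
mutex standing in for the queue lock cfq actually uses:

/* Illustration of the locked-wrapper / __unlocked-helper convention
 * discussed above.  Hypothetical names; a pthread mutex stands in
 * for the kernel's queue lock. */
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;
static int queue_refcount = 1;

/*
 * Called with queue_lock held.  (This is where the locking comment
 * belongs -- on the __ variant that assumes the lock.)
 */
static void __exit_single_context(void)
{
        queue_refcount--;
        printf("context released, refcount now %d\n", queue_refcount);
}

/* Takes the lock itself, then defers to the unlocked helper. */
static void exit_single_context(void)
{
        pthread_mutex_lock(&queue_lock);
        __exit_single_context();
        pthread_mutex_unlock(&queue_lock);
}

int main(void)
{
        exit_single_context();
        return 0;
}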

So I'm afraid I'm stumped too.

NeilBrown

Not a cfq failure, but I have been able to reproduce a different oops
at array stop time while I/Os were pending. I have not dug into it
enough to suggest a patch, but I wonder if it is somehow related to
the cfq failure, since it involves congestion and drives going away:

md: md0: recovery done.
Unable to handle kernel NULL pointer dereference at virtual address 000000bc
pgd = 40004000
[000000bc] *pgd=00000000
Internal error: Oops: 17 [#1]
Modules linked in:
CPU: 0
PC is at raid5_congested+0x14/0x5c
LR is at sync_sb_inodes+0x278/0x2ec
pc : [<402801cc>] lr : [<400a39e8>] Not tainted
sp : 8a3e3ec4 ip : 8a3e3ed4 fp : 8a3e3ed0
r10: 40474878 r9 : 40474870 r8 : 40439710
r7 : 8a3e3f30 r6 : bfa76b78 r5 : 4161dc08 r4 : 40474800
r3 : 402801b8 r2 : 00000004 r1 : 00000001 r0 : 00000000
Flags: nzCv IRQs on FIQs on Mode SVC_32 Segment kernel
Control: 400397F
Table: 7B7D4018 DAC: 00000035
Process pdflush (pid: 1371, stack limit = 0x8a3e2250)
Stack: (0x8a3e3ec4 to 0x8a3e4000)
3ec0: 8a3e3f04 8a3e3ed4 400a39e8 402801c4 8a3e3f24 000129f9 40474800
3ee0: 4047483c 40439a44 8a3e3f30 40439710 40438a48 4045ae68 8a3e3f24 8a3e3f08
3f00: 400a3ca0 400a377c 8a3e3f30 00001162 00012bed 40438a48 8a3e3f78 8a3e3f28
3f20: 40069b58 400a3bfc 00011e41 8a3e3f38 00000000 00000000 8a3e3f28 00000400
3f40: 00000000 00000000 00000000 00000000 00000000 00000025 8a3e3f80 8a3e3f8c
3f60: 40439750 8a3e2000 40438a48 8a3e3fc0 8a3e3f7c 4006ab68 40069a8c 00000001
3f80: bfae2ac0 40069a80 00000000 8a3e3f8c 8a3e3f8c 00012805 00000000 8a3e2000
3fa0: 9e7e1f1c 4006aa40 00000001 00000000 fffffffc 8a3e3ff4 8a3e3fc4 4005461c
3fc0: 4006aa4c 00000001 ffffffff ffffffff 00000000 00000000 00000000 00000000
3fe0: 00000000 00000000 00000000 8a3e3ff8 40042320 40054520 00000000 00000000
Backtrace:
[<402801b8>] (raid5_congested+0x0/0x5c) from [<400a39e8>]
(sync_sb_inodes+0x278/0x2ec)
[<400a3770>] (sync_sb_inodes+0x0/0x2ec) from [<400a3ca0>]
(writeback_inodes+0xb0/0xb8)
[<400a3bf0>] (writeback_inodes+0x0/0xb8) from [<40069b58>]
(wb_kupdate+0xd8/0x160)
r7 = 40438A48 r6 = 00012BED r5 = 00001162 r4 = 8A3E3F30
[<40069a80>] (wb_kupdate+0x0/0x160) from [<4006ab68>] (pdflush+0x128/0x204)
r8 = 40438A48 r7 = 8A3E2000 r6 = 40439750 r5 = 8A3E3F8C
r4 = 8A3E3F80
[<4006aa40>] (pdflush+0x0/0x204) from [<4005461c>] (kthread+0x108/0x134)
[<40054514>] (kthread+0x0/0x134) from [<40042320>] (do_exit+0x0/0x844)
Code: e92dd800 e24cb004 e5900000 e3a01001 (e59030bc)
md: md0 stopped.
md: unbind<sda>
md: export_rdev(sda)
md: unbind<sdd>
md: export_rdev(sdd)
md: unbind<sdc>
md: export_rdev(sdc)
md: unbind<sdb>
md: export_rdev(sdb)

2.6.20-rc3-iop1 on an iop348 platform. SATA controller is sata_vsc.
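
If I had to guess at the shape of it: writeback looks like it can
still reach the array's congestion callback while the array is being
torn down, so the callback dereferences per-array data that is
already gone (the faulting address being a small offset from NULL
would fit that). A standalone sketch of that kind of race, with
made-up names rather than the real md/writeback interfaces:

/* Hypothetical sketch of the suspected race: the writeback path keeps
 * a pointer to the backing device's congested callback and its private
 * data, while array shutdown frees that private data underneath it.
 * Names are made up; this is not the actual kernel code. */
#include <stdlib.h>
#include <stdio.h>

struct model_conf {
        int stripe_cache_active;
};

struct model_bdi {
        int (*congested_fn)(void *data, int bits);
        void *congested_data;            /* points at struct model_conf */
};

static int model_raid5_congested(void *data, int bits)
{
        struct model_conf *conf = data;
        /* Oopses if conf has already been torn down by array stop. */
        return conf->stripe_cache_active > 0;
}

/* Writeback side: samples the callback without any synchronisation
 * against the array going away. */
static void model_sync_sb_inodes(struct model_bdi *bdi)
{
        if (bdi->congested_fn && bdi->congested_fn(bdi->congested_data, 0))
                printf("backing device congested, backing off\n");
}

/* Array-stop side: frees the private data the callback still needs. */
static void model_stop_array(struct model_bdi *bdi)
{
        free(bdi->congested_data);
        bdi->congested_data = NULL;      /* too late if writeback is mid-call */
}

int main(void)
{
        struct model_bdi bdi = {
                .congested_fn   = model_raid5_congested,
                .congested_data = calloc(1, sizeof(struct model_conf)),
        };

        model_sync_sb_inodes(&bdi);      /* fine while the array exists */
        model_stop_array(&bdi);          /* tear-down frees conf */
        /* A concurrent pdflush doing the same callback here would hit
         * the kind of NULL/use-after-free dereference in the oops above. */
        return 0;
}

Again, this is only a model of the suspected ordering, not a claim
about where the actual window is in md or the writeback code.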

--
Dan