Re: data corruption using raid0+lvm2+jfs with 2.6.0-test3

From: Neil Brown
Date: Sun Aug 17 2003 - 18:18:05 EST


On Sunday August 17, mfedyk@xxxxxxxxxxxxx wrote:
> On Sun, Aug 17, 2003 at 10:12:27AM +1000, Neil Brown wrote:
> > On Saturday August 16, mfedyk@xxxxxxxxxxxxx wrote:
> > > I have a raid5 with "4" 18gb drives, and one of the "drives" is two 9gb
> > > drives in a linear md "array".
> > >
> > > I'm guessing this will hit this bug too?
> >
> > This should be safe. raid5 only ever submits 1-page (4K) requests
> > that are page aligned, and linear arrays will have the boundary
> > between drives 4k aligned (actually "chunksize" aligned, and chunksize
> > is at least 4k).
> >
>
> So why is this hitting with raid0? Is lvm2 on top of md the problem and md
> on lvm2 is ok?
>

The various raid levels under md are in many ways quite independent.
You cannot generalise about "md works" or "md doesn't work"; you have
to talk about the specific raid levels.

The problem happens when
1/ The underlying device defines a merge_bvec_fn, and
2/ the driver (meta-device) that uses that device
2a/ does not honour the merge_bvec_fn, and
2b/ passes on requests larger than one page.

md/linear, md/raid0, and lvm all define a merge_bvec_fn, and so can be
a problem as an underlying device by point (1).

md/* and lvm do not honour the merge_bvec_fn, and so can be a problem
as a meta-device by point 2a.
However md/raid5 never passes on requests larger than one page, so it
escapes being a problem by point 2b.

So the problem can happen with
md/linear, md/raid0, or lvm being a component of
md/linear, md/raid0, md/raid1, md/multipath, or lvm.

(I have possibly oversimplified the situation with lvm due to lack of
precise knowledge).
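
For concreteness, here is a rough sketch (not the actual md/raid0
source) of what such a merge_bvec_fn looks like on a striped device:
it tells bio_add_page() how many more bytes may be added to a bio
before the bio would cross a chunk boundary.  The field and type names
follow the 2.6.0-era bio layer; the fixed chunk size is just an
assumption for the example.

	/*
	 * Illustrative sketch only -- not the real md/raid0 code.
	 * A striped device's merge_bvec_fn limits a bio so that it
	 * never spans a chunk boundary.  The corruption arises when a
	 * stacking driver ignores this and later hands the lower
	 * device a bio that does span chunks.
	 */
	#include <linux/bio.h>
	#include <linux/blkdev.h>

	static int example_mergeable_bvec(request_queue_t *q, struct bio *bio,
					  struct bio_vec *biovec)
	{
		unsigned int chunk_sectors = 64;	/* assume a 32k chunk */
		sector_t sector = bio->bi_sector;
		unsigned int bio_sectors = bio->bi_size >> 9;
		int max;

		/* bytes left in the current chunk once this bio is counted */
		max = (chunk_sectors - ((sector & (chunk_sectors - 1))
					+ bio_sectors)) << 9;
		if (max < 0)
			max = 0;	/* bio already reaches the boundary */
		if (max <= biovec->bv_len && bio_sectors == 0)
			return biovec->bv_len;	/* always accept at least one bvec */
		return max;
	}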

I hope that clarifies the situation.

> > So raid5 should be safe over everything (unless dm allows striping
> > with a chunk size less than pagesize).
> >
> > Thinks: as an interim solution for other raid levels - if the
> > underlying device has a merge_bvec_fn which is being ignored, we
> > could set max_sectors to PAGE_SIZE/512. This should be safe, though
> > possibly not optimal (but "safe" trumps "optimal" any day).
>
> > Assuming that sectors are always 512 bytes (true for any hard drive I've
> > seen), that will be 512 * 8 = one 4k page.
>
> Any chance sector != 512?

No. 'sector' in the kernel means '512 bytes'.
Some devices might require requests to be at least 2 sectors long and
to have even sector addresses, because their physical sectors are 1K,
but parameters like max_sectors are still expressed in multiples of
512 bytes.
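
To make the interim idea above concrete, a stacking driver could do
something like the following when it binds a component device whose
merge_bvec_fn it is going to ignore.  Only a sketch: the helper name
is made up, though blk_queue_max_sectors() and the queue fields are
the 2.6-era block layer interfaces as far as I know.

	#include <linux/blkdev.h>

	/*
	 * Sketch only: clamp the stacked queue to one-page requests
	 * whenever a component device defines a merge_bvec_fn that we
	 * are not honouring.  "Safe" if not optimal: no request larger
	 * than PAGE_SIZE is ever passed down, so the component's merge
	 * restrictions cannot be violated.
	 */
	static void limit_if_unmergeable(request_queue_t *stacked_q,
					 struct block_device *component)
	{
		request_queue_t *cq = bdev_get_queue(component);

		if (cq->merge_bvec_fn)
			blk_queue_max_sectors(stacked_q, PAGE_SIZE >> 9);
			/* PAGE_SIZE/512 sectors == one 4k page per request */
	}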

NeilBrown