Heinz, Andrea, Linus,
various ideas/patches regarding block device stacking support were
floating around in the last couple of days, here is a patch against
vanilla 2.3.47 that solves both RAID's and LVM's needs sufficiently:
http://www.redhat.com/~mingo/raid-patches/raid-2.3.47-B6
(also attached) Andrea's patch from yesterday touches some of the issues
but RAID has different needs wrt. ->make_request():
- RAID1 and RAID5 need truly recursive ->make_request() stacking because
the relationship between the request-bh and the IO-bh is not 1:1. In the
case of RAID0/linear and LVM the mapping is 1:1, so no on-stack
recursion is necessary.
- re-grabbing the device queue in generic_make_request() is necessary,
just think of RAID0+LVM stacking.
- IO-errors have to be initiated in the layer that notices them.
- i don't agree with moving the ->make_request() function to be
a per-major thing, in the (near) future i'd like to implement RAID
personalities via several sub-queues of a single RAID-blockdevice,
avoiding the current md_make_request internal step completely.
- renaming ->make_request_fn() to ->logical_volume_fn is both misleading
and unnecessary.
i've added the good bits (i hope i found all of them) from Andrea's patch
as well: the end_io() fix in md.c, the ->make_request() change returning
IO errors, and avoiding an unnecessary get_queue() in the fast path.
the patch changes blkdev->make_request_fn() semantics, but the new
semantics work pretty well for RAID0, LVM and RAID1/RAID5 alike:
(bh->b_dev, bh->b_blocknr) => just like today, never modified, this is
the 'physical index' of the buffer-cache.
internally, any special ->make_request() function is likewise forbidden
to access b_dev and b_blocknr; b_rdev and b_rsector have to be used.
ll_rw_block() correctly installs an identity mapping first, and all
stacked devices just iterate one more step.
bh->b_rdev: the 'current target device'
bh->b_rsector: the 'current target sector'
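To make the split between the 'physical index' and the 'current target' concrete, here is a minimal userspace sketch. It is not code from the patch: struct toy_bh and toy_install_identity() are hypothetical names, the types are simplified, and the conversion from blocks to 512-byte sectors is an assumption of this sketch.

```c
/* Toy stand-in for a buffer head; only the fields discussed above. */
struct toy_bh {
    int           b_dev;      /* physical index: device, never modified */
    unsigned long b_blocknr;  /* physical index: block, never modified */
    int           b_rdev;     /* current target device */
    unsigned long b_rsector;  /* current target sector */
    unsigned long b_size;     /* buffer size in bytes */
};

/*
 * Identity mapping, in the spirit of what ll_rw_block() installs before
 * any stacked device sees the buffer: the target starts out equal to
 * the physical index. Assumes 512-byte sectors.
 */
void toy_install_identity(struct toy_bh *bh)
{
    bh->b_rdev    = bh->b_dev;
    bh->b_rsector = bh->b_blocknr * (bh->b_size >> 9);
}
```

Stacked devices would then rewrite only b_rdev and b_rsector, so the buffer-cache index stays intact no matter how deep the stack is.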
the return values of ->make_request_fn():
ret == 0: don't continue iterating and don't submit IO
ret > 0: continue iterating
ret < 0: IO error (already handled by the layer which noticed it)
we explicitly rely on ll_rw_blk getting the BH_Lock and not calling
->make_request() on this bh more than once.
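The iteration driven by these return values can be modelled roughly as follows. This is a userspace toy, not the patched generic_make_request(): every toy_* name is hypothetical, a NULL table entry stands in for a plain disk, and passing a negative return value up to the caller is an assumption of the sketch.

```c
#include <stddef.h>

struct toy_bh {
    int           b_rdev;     /* current target device */
    unsigned long b_rsector;  /* current target sector */
};

typedef int (*toy_make_request_fn)(struct toy_bh *bh);

#define TOY_NR_DEVS 4

/* Per-device remapping function; NULL stands for a plain disk. */
toy_make_request_fn toy_queue_fn[TOY_NR_DEVS];

/* Records where the IO ended up, so the behaviour can be inspected. */
int           toy_submitted_dev = -1;
unsigned long toy_submitted_sector;

/* Sample stacked device: resolves one mapping step and returns 1. */
int toy_remap_step(struct toy_bh *bh)
{
    bh->b_rdev    += 1;
    bh->b_rsector += 100;
    return 1;
}

/* Sample device that notices an IO error and handles it itself. */
int toy_fail(struct toy_bh *bh)
{
    (void)bh;
    return -1;
}

int toy_generic_make_request(struct toy_bh *bh)
{
    for (;;) {
        toy_make_request_fn fn;
        int ret;

        if (bh->b_rdev < 0 || bh->b_rdev >= TOY_NR_DEVS)
            return -1;

        /* re-grab the queue on every step: b_rdev may have changed */
        fn = toy_queue_fn[bh->b_rdev];
        if (fn == NULL) {
            /* bottom of the stack: "submit" the IO */
            toy_submitted_dev    = bh->b_rdev;
            toy_submitted_sector = bh->b_rsector;
            return 0;
        }

        ret = fn(bh);
        if (ret <= 0)
            return ret;  /* 0: stop, nothing to submit; < 0: IO error */
        /* ret > 0: continue iterating with the new target */
    }
}
```

With toy_remap_step installed on devices 0 and 1, a request aimed at device 0 is remapped twice and lands on device 2 without any on-stack recursion.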
with these semantics all the variations are possible, it's up to the
device to use the one it likes best:
- device resolves one mapping step and returns 1 (RAID0, LVM)
- device calls generic_make_request() and returns 1 (RAID1, RAID5)
- device resolves recursion internally and returns 0 (future RAID0),
returns 1 if recursion cannot be resolved internally.
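As an illustration of the first variation (one mapping step, then return 1), here is a sketch of a striping map. It is not raid0.c: struct toy_raid0 and toy_raid0_make_request() are hypothetical, all members are assumed to be equally sized, and the 8-sector chunk simply corresponds to a 4k chunk with 512-byte sectors.

```c
struct toy_bh {
    int           b_rdev;     /* current target device */
    unsigned long b_rsector;  /* current target sector */
};

/* Hypothetical striped array with equally sized members. */
struct toy_raid0 {
    unsigned long nr_members;
    int           member_dev[4];
    unsigned long chunk_sects;  /* chunk size in 512-byte sectors */
};

/* Two members (made-up device numbers 10 and 11), 4k chunks. */
struct toy_raid0 toy_md0 = { 2, { 10, 11 }, 8 };

/*
 * Resolve exactly one mapping step: rewrite the target in place and
 * return 1 so that the caller continues iterating on the new target.
 */
int toy_raid0_make_request(struct toy_raid0 *conf, struct toy_bh *bh)
{
    unsigned long chunk  = bh->b_rsector / conf->chunk_sects;
    unsigned long offset = bh->b_rsector % conf->chunk_sects;

    bh->b_rdev    = conf->member_dev[chunk % conf->nr_members];
    bh->b_rsector = (chunk / conf->nr_members) * conf->chunk_sects + offset;
    return 1;
}
```

Because the mapping is 1:1, the function never needs a second bh and never calls back into the generic layer.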
generic_make_request() returns 0 if it has submitted IO - thus
generic_make_request() can also be used as a queue's ->make_request_fn()
function - it's completely symmetric. (not that anyone would want to do
this)
NOTE: a device might still resolve stacking internally, if it can. Eg. the
next version of raid0.c will do a while loop internally if we map
RAID0->RAID0. The performance advantage is obvious: no indirect function
calls and no get_queue(). LVM could do the same as well.
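The internal while loop could look roughly like the sketch below. It is a guess at the idea, not the next raid0.c: the device numbering, the toy_devs table and toy_raid0_resolve() are all hypothetical, and the function only counts the mapping steps it performed; which value a real ->make_request_fn() would return afterwards is not modelled.

```c
struct toy_bh {
    int           b_rdev;     /* current target device */
    unsigned long b_rsector;  /* current target sector */
};

#define TOY_NR_DEVS 8

struct toy_raid0 {
    int           is_raid0;       /* 0 for plain disks */
    int           member_dev[2];
    unsigned long chunk_sects;    /* chunk size in 512-byte sectors */
};

/*
 * Made-up numbering: 0..3 are disks, 4 = md0 over (0,1),
 * 5 = md1 over (2,3), 6 = md2 over (md0, md1).
 */
struct toy_raid0 toy_devs[TOY_NR_DEVS] = {
    [4] = { 1, { 0, 1 }, 8 },
    [5] = { 1, { 2, 3 }, 8 },
    [6] = { 1, { 4, 5 }, 8 },
};

/*
 * Keep remapping for as long as the target is itself a RAID0 device:
 * no indirect function call and no queue lookup per step.
 * Returns the number of mapping steps resolved.
 */
int toy_raid0_resolve(struct toy_bh *bh)
{
    int steps = 0;

    while (bh->b_rdev >= 0 && bh->b_rdev < TOY_NR_DEVS &&
           toy_devs[bh->b_rdev].is_raid0) {
        struct toy_raid0 *conf = &toy_devs[bh->b_rdev];
        unsigned long chunk  = bh->b_rsector / conf->chunk_sects;
        unsigned long offset = bh->b_rsector % conf->chunk_sects;

        bh->b_rdev    = conf->member_dev[chunk % 2];
        bh->b_rsector = (chunk / 2) * conf->chunk_sects + offset;
        steps++;
    }
    return steps;
}
```

A request to the stacked device 6 is resolved down to a disk in two steps inside a single call.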
(the patch modifies lvm.c to reflect these new semantics, to not rely on
b_dev and b_blocknr and to not call generic_make_request(), and fixes the
lvm.c hack avoiding MD<->LVM stacking. These changes are untested.)
with this method it was pretty straightforward to add stacked RAID0 and
linear device support, here is a sample RAID0+RAID0 => RAID0 stacking:
[root@moon /root]# cat /proc/mdstat
Personalities : [linear] [raid0]
read_ahead 1024 sectors
md2 : active raid0 mdb[1] mda[0]
      1661472 blocks 4k chunks
md1 : active raid0 sdf1[1] sde1[0]
      830736 blocks 4k chunks
md0 : active raid0 sdd1[1] sdc1[0]
      830736 blocks 4k chunks
unused devices: <none>

[root@moon /root]# df /mnt
Filesystem           1k-blocks      Used Available Use% Mounted on
/dev/md2               1607473        13   1524387   0% /mnt
The LVM changes are not tested. The RAID0/linear changes compile/boot/work
just fine and are reasonably well-tested and understood.
any objections?
Ingo
This archive was generated by hypermail 2b29 : Wed Feb 23 2000 - 21:00:33 EST