Re: RFC: from FIBMAP to FIONDEV

Werner Almesberger (almesber@lrc.di.epfl.ch)
Mon, 14 Jun 1999 18:50:20 +0200


Matthew Wilcox wrote:
> On Fri, Jun 11, 1999 at 12:42:15PM +0200, Werner Almesberger wrote:
>> I'm not so sure what to do about RAIDs.

Well, the longer I think about it, the less I'm convinced anything special
should be done for RAIDs, except for issuing a warning and trying to get
things to work in a non-redundant way.

IMHO, a useful "generic" approach is to replicate the boot loader simply
by installing it on a non-RAID partition on each disk of the RAID, and
perhaps to have a means to fail over to another disk if the boot disk is
the one with problems. In the case of LILO, the failover can either be at
the level of picking a different boot image, or it could also be a special
MBR that is run before LILO and that checks for some keypress or such.

Once we have a compact and reliable scheme for detecting the valid copy in
a partially defunct RAID, this can be added to that special MBR.

One thing that may be useful then is to be able to boot a kernel directly
from Linux. That way, kernels can be stored on the RAID and the boot
partitions have a known small size.

Does this sound good to all the RAIDers out there ?

Another thing I've realized is that, for consistency, FIONDEV should also
map partitions to whole disks. I've attached the updated description.

I'm in no particular rush on this one, but given that it takes quite a
while to propagate such an extension, I'd like to get as much right as
possible, so please comment if you see anything wrong.

- Werner

---------------------------------- cut here -----------------------------------

About two years ago, when there was much discussion about reiserfs, md,
and other advanced file system structures, it became apparent that good
old FIBMAP wasn't sufficient. So I proposed a successor for it called
FIPHYLOC. Fortunately, this one never made it into the mainstream
kernel ;-)

Now I have a much improved design. It adds the following new features
(with respect to FIBMAP):

- works with byte granularity, so it can handle files with data starting
at a non-zero offset in blocks (e.g. "block splitting" in reiserfs and
SGI's XFS)
- also returns the device number, so it can handle files spread over
multiple block devices (e.g. md, LVM)
- translates partitions to the whole disk device, so the caller doesn't
have to know the numbering structure
- returns areas of consecutive bytes, not individual block addresses, so
fewer ioctls are required
- can report various anomalies, such as unstable or ambiguous storage
(see below; with FIBMAP, the caller had to know every single special
case)

I've attached a detailed description. I'll work on a kernel patch and the
corresponding extension of LILO once this has stabilized.

Opinions welcome.

- Werner

----- Items to be added to linux/fs.h -----------------------------------------

#define FIONDEV _IOW(0x00,3,struct fiondev_result)
/* get the location of the contiguous area on the underlying block
device, which contains data of the file or block device, starting at
the current file position. */

/*
* Flags returned by FIONDEV
*/

#define FIONDEV_ERROR 1 /* request failed, no data was returned */
#define FIONDEV_VOLATILE 2 /* location is temporary (e.g. scratch file,
RAM disk) */
#define FIONDEV_COPIES 4 /* location is replicated */
#define FIONDEV_MISALIGNED 8 /* current file position it not at a boundary */
#define FIONDEV_FINAL 16 /* "device" is a physical device requiring no
further translation */

struct fiondev_result {
int flags; /* FIONDEV_* flags */
dev_t device; /* underlying block device */
loff_t start; /* byte offset in block device */
loff_t size; /* area size (in bytes); 0 at EOF; may exceed
file or block device size */
};

-------------------------------------------------------------------------------

FIONDEV can be applied to files and to block devices. It returns the
location of data at the current file position on the underlying block
device. The information returned describes a consecutive area on the
block device, determined by its start and its size. FIONDEV advances
the file pointer to the end of this data area.

FIONDEV can be applied recursively to obtain the actual physical
location of a file, e.g. file -> loop device -> (file) -> md (RAID) ->
disk partition -> SCSI disk. "(file)" is implicitly done by the loop
device.

Note that the size returned may actually exceed the size of the file
or the block device being queried. It is the responsibility of the
caller to use only information that corresponds to actually available
data areas. Likewise, the size returned be may less than the actual
contiguous area. (E.g. in the case of consecutive extents.)

If the current file position is at the end of the file or device,
"size" is set to zero. The value of the fields "device" and "start"
is undefined in this case.

Unmapped areas ("holes") should be returned with "device" and "start"
set to zero.

"Whole disk" block devices with no underlying devices should set
FIONDEV_FINAL, return their own device number in the device field,
the current file position in "start", and a value corresponding at
least to the size of the device in "size".

Block devices corresponding to a partition should return the
respective "whole disk" device. This way, the user does not have to
know the device number structure used for partitions.

FIONDEV may return the following flags:

FIONDEV_ERROR
The content of the other fields in the result structure is
undefined. Additional flags may be set to indicate the failure
reason. The current file position may be undefined. No further
FIONDEV ioctls should be issued before closing the file.

FIONDEV_VOLATILE
Location is on unstable storage, e.g. in a designated temporary
file, a RAM disk, etc. Boot loaders must not map such areas.
Files or block devices that are completely unsuitable for mapping
should set FIONDEV_ERROR in addition to FIONDEV_VOLATILE.

FIONDEV_COPIES
Data is contained in multiple areas, e.g. replicated on several
volumes in a RAID array. A boot loader may use this information,
but by doing so, it may not preserve all of the semantics of the
storage. (E.g. in the case of a RAID, the RAID storage itself may
survive the failure of some disks, but a boot loader may fail if
the disks it mapped are among the ones that failed.)

If there are multiple locations at which the physically same data
block can be found, FIONDEV_COPIES should only be set if none of
them can be relied to remain valid. Otherwise, the most long-lived
location should be returned.

FIONDEV_MISALIGNED
The current position is not at the beginning of a contiguous data
area on the underlying block device. Setting this flag is optional.
However, if no data is returned as a result of this condition (i.e.
if FIONDEV_ERROR is set), FIONDEV_MISALIGNED must be set to
indicate the reason.

FIONDEV_FINAL
No further resolution is necessary on the device returned. This
flag must be set by all physical "whole disk" devices when FIONDEV
is invoked on them. It may also be set on the last translation step
leading to a physical "whole disk" device, thereby eliminating a
redundant FIONDEV ioctl.

If the physical location of a file or device changes during mapping,
subsequent FIONDEV ioctls may fail. Note however that extensively
probing for such a condition would be a wasted effort, because changes
could always occur right after mapping.

During migration from FIBMAP to FIONDEV (i.e. for the next few years),
any errno-style errors returned by FIONDEV should be interpreted as lack
of support for FIONDEV. In order to ease migration, FIONDEV should only
return a negative value (i.e. report an errno-style error) for
conditions that cannot be described in more detail using flags.

-- 
  _________________________________________________________________________
 / Werner Almesberger, ICA, EPFL, CH       werner.almesberger@ica.epfl.ch /
/_IN_R_131__Tel_+41_21_693_6621__Fax_+41_21_693_6610_____________________/

- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.rutgers.edu Please read the FAQ at http://www.tux.org/lkml/