Re: [PATCH 00/22 take 3] UBI: Unsorted Block Images

From: Matt Mackall
Date: Mon Mar 19 2007 - 18:45:37 EST


On Mon, Mar 19, 2007 at 10:05:29PM +0100, Thomas Gleixner wrote:
> On Mon, 2007-03-19 at 14:54 -0500, Matt Mackall wrote:
> > > (UBI also has static volumes which LVM doesn't but that is an aside.)
> >
> > If a static volume is simply a non-dynamic volume, then device mapper
> > can do that too. And countless other things. Which is not an aside.
> > UBI growing to do all the things that device mapper does is exactly
> > the thing we should be seeking to avoid.
>
> No it can't and device mapper sits on top of block devices. FLASH is no
> block device. Period.

Which of the following two properties does it lack?

- discrete blocks
- non-sequential access to blocks

When you do the obvious s/blocks/eraseblocks/, this appears to be
true.

Saying "but I can't do I/O smaller than the blocksize" doesn't change
this any more than it would for disks.

Saying "but I can do smaller I/O efficiently in some circumstances"
also doesn't change it.

In historical UNIX, some tapes were block devices too. Because they
supported seek().

> Device mapper can not provide a simple easy to decode scheme for boot
> loaders. We need to be able to boot out of 512 - 2048 byte of NAND FLASH
> and be able to find the kernel or second stage boot loader in this
> unordered device.
>
> And no, fixed addresses do not work. Do you want to implement device
> mapper into your Initialial Bootloader stage ?

This is exactly the same problem as booting on a desktop PC. But
somehow LILO manages. My first Linux box had a hell of a lot less disk
than the platform I bootstrapped (and wrote NAND drivers for) last
month had in NAND.

> > > That's why I suggested fixing the MTD layers that present block devices
> > > first in the part of my reply that you cut off. It seems to me that
> > > you're really after getting flash to look like a block device, which
> > > would enable device mapper to be used for something similar to UBI.
> > > That's fine, but until someone does that work UBI fills a need, has
> > > users, and has an existing implementation.
> >
> > False starts that get mainlined delay or prevent things getting done
> > right. The question is and remains "is UBI the right way to do
> > things?" Not "is UBI the easiest way to do things?" or "is UBI
> > something people have already adopted?"
> >
> > If the right way is instead to extend the block layer and device
> > mapper to encompass the quirks of NAND in a sensible fashion, then UBI
> > should not go in.
>
> No, block layer on top of FLASH needs 80% of the functionality of UBI in
> the first place.

Incorrect. A block-based filesystem on top of flash needs this
functionality. But a block device suitable to device mapper layering
(which then provides the functionality) does not.

> You need to implement a clever journalling block device
> emulator in order to keep the data alive and the FLASH not weared out
> within no time. You need the wear levelling, otherwise you can throw
> away your FLASH in no time.

And that's why it's in my picture.

> > Let me draw a picture so we have something to argue about:
> >
> > iSCSI/nbd(6)
> > |
> > filesystem { swap | ext3 ext3 jffs2
> > \ | | | /
> > / \ | dm-crypt->snapshot(5) /
> > device mapper -| \ \ | /
> > | partitioning /
> > | | partitioning(4)
> > | wear leveling(3) /
> > | | /
> > | block concatenation
> > | | | | |
> > \ bad block remapping(2)
> > | | | |
> > MTD raw block { raw block devices with no smarts(1)
> > / | \ \
> > hardware { NAND NAND NAND NAND
> >
> > Notes:
> > 1. This would provide a block device that allowed writing pages and
> > a secondary method for erasing whole blocks as well as a method for
> > querying/setting out of band information.
>
> Forget about OOB data. OOB data is reserved for ECC. Please read the
> recommendations of the NAND FLASH manufacturers. NAND gets less reliable
> with higher density devices and smaller processes.
>
> > 2. This would hide erase blocks either by using an embedded table or
> > out of band info. This could stack on top of block concatenation if
> > desired.
>
> Hide erase blocks ? UBI does not hide anything. It maps logical
> eraseblocks, which are exposed to the clients to arbitrary physical
> eraseblocks on the FLASH device in order to provide across device wear
> levelling.

Sorry, I meant hiding bad blocks here. That's why this layer was
labeled "bad block remapping".

> > 3. This would provide wear leveling, and probably simultaneously
> > provide relatively efficient and safe access to write sector
> > and page-sized I/O. Below this level, things had better be
> > comfortable with the limitations of NAND if they want to work well.
>
> I don't see how this provides across device wear levelling.

Because the layer immediately beneath it ("block concatenation") takes
N devices and presents one logical device.

> > 4. JFFS2 has its own wear-leving scheme, as do several other
> > filesystems, so they probably want to bypass this piece of the stack.
>
> JFFS2 on top of UBI delegates the wear levelling to UBI, as JFFS2s own
> wear levelling sucks.

Ok, fine. How about LogFS, then?

> > 5. We don't reimplement higher pieces of the stack (dm-crypt,
> > snapshot, etc.).
>
> Why should we reimplement that ?

So that you can get encryption and snapshot, etc.?

> > 6. We make some things possible that simply aren't otherwise.
> >
> > And this picture isn't even interesting yet. Imagine a dm-cache layer
> > that caches data read from disks in high-speed flash. Or using
> > dm-mirror to mirror writes to local flash over NBD or to a USB drive.
> > Neither of these can be done 'right' in a stack split between device
> > mapper and UBI.
>
> Err. Implement a clever block layer on top of UBI and use all the
> goodies you want including device mapper.

If I wanted to have both device mapper and device mapper's little
brother in my kernel, I wouldn't have started this thread.

--
Mathematics is the supreme nostalgia of our time.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/