Re: NVM Mapping API

From: Boaz Harrosh
Date: Fri May 18 2012 - 06:13:19 EST


On 05/18/2012 12:03 PM, James Bottomley wrote:

> On Thu, 2012-05-17 at 14:59 -0400, Matthew Wilcox wrote:
>> On Thu, May 17, 2012 at 10:54:38AM +0100, James Bottomley wrote:
>>> On Wed, 2012-05-16 at 13:35 -0400, Matthew Wilcox wrote:
>>>> I'm not talking about a specific piece of technology, I'm assuming that
>>>> one of the competing storage technologies will eventually make it to
>>>> widespread production usage. Let's assume what we have is DRAM with a
>>>> giant battery on it.
>>>>
>>>> So, while we can use it just as DRAM, we're not taking advantage of the
>>>> persistent aspect of it if we don't have an API that lets us find the
>>>> data we wrote before the last reboot. And that sounds like a filesystem
>>>> to me.
>>>
>>> Well, it sounds like a unix file to me rather than a filesystem (it's a
>>> flat region with a beginning and end and no structure in between).
>>
>> That's true, but I think we want to put a structure on top of it.
>> Presumably there will be multiple independent users, and each will want
>> only a fraction of it.
>>
>>> However, I'm not precluding doing this, I'm merely asking that if it
>>> looks and smells like DRAM with the only additional property being
>>> persistency, shouldn't we begin with the memory APIs and see if we can
>>> add persistency to them?
>>
>> I don't think so. It feels harder to add useful persistent
>> properties to the memory APIs than it does to add memory-like
>> properties to our file APIs, at least partially because for
>> userspace we already have memory properties for our file APIs (ie
>> mmap/msync/munmap/mprotect/mincore/mlock/munlock/mremap).
>
> This is what I don't quite get. At the OS level, it's all memory; we
> just have to flag one region as persistent. This is easy, I'd do it in
> the physical memory map. Once this is done, we need either to tell the
> allocators only use volatile, only use persistent, or don't care (I
> presume the latter would only be if you needed the extra ram).
>
> The missing thing is persistent key management of the memory space (so
> if a user or kernel wants 10Mb of persistent space, they get the same
> 10Mb back again across boots).
>
> The reason a memory API looks better to me is because a memory API can
> be used within the kernel. For instance, I want a persistent /var/tmp
> on tmpfs, I just tell tmpfs to allocate it in persistent memory and it
> survives reboots. Likewise, if I want an area to dump panics, I just
> use it ... in fact, I'd probably always place the dmesg buffer in
> persistent memory.
>
> If you start off with a vfs API, it becomes far harder to use it easily
> from within the kernel.
>
> The question, really, is all about space management: how many persistent
> spaces would there be. I think, given the use cases above it would be a
> small number (it's basically one for every kernel use and one for every
> user use ... a filesystem mount counting as one use), so a flat key to
> space management mapping (probably using u32 keys) makes sense, and
> that's similar to our current shared memory API.
>
>>> Imposing a VFS API looks slightly wrong to me
>>> because it's effectively a flat region, not a hierarchical tree
>>> structure, like a FS. If all the use cases are hierarchical trees, that
>>> might be appropriate, but there hasn't really been any discussion of use
>>> cases.
>>
>> Discussion of use cases is exactly what I want! I think that a
>> non-hierarchical attempt at naming chunks of memory quickly expands
>> into cases where we learn we really do want a hierarchy after all.
>
> OK, so enumerate the uses. I can be persuaded the namespace has to be
> hierarchical if there are orders of magnitude more users than I think
> there will be.
>
>>>>> Or is there some impediment (like durability, or degradation on rewrite)
>>>>> which makes this unsuitable as a complete DRAM replacement?
>>>>
>>>> The idea behind using a different filesystem for different NVM types is
>>>> that we can hide those kinds of impediments in the filesystem. By the
>>>> way, did you know DRAM degrades on every write? I think it's on the
>>>> order of 10^20 writes (and CPU caches hide many writes to heavily-used
>>>> cache lines), so it's a long way away from MLC or even SLC rates, but
>>>> it does exist.
>>>
>>> So are you saying it does or doesn't have an impediment to being used like
>>> DRAM?
>>
>> From the consumer's point of view, it doesn't. If the underlying physical
>> technology does (some of the ones we've looked at have worse problems
>> than others), then it's up to the driver to disguise that.
>
> OK, so in a pinch it can be used as normal DRAM, that's great.
>
>>>>> Alternatively, if it's not really DRAM, I think the UNIX file
>>>>> abstraction makes sense (it's a piece of memory presented as something
>>>>> like a filehandle with open, close, seek, read, write and mmap), but
>>>>> it's less clear that it should be an actual file system. The reason is
>>>>> that to present a VFS interface, you have to already have fixed the
>>>>> format of the actual filesystem on the memory because we can't nest
>>>>> filesystems (well, not without doing artificial loopbacks). Again, this
>>>>> might make sense if there's some architectural reason why the flash
>>>>> region has to have a specific layout, but your post doesn't shed any
>>>>> light on this.
>>>>
>>>> We can certainly present a block interface to allow using unmodified
>>>> standard filesystems on top of chunks of this NVM. That's probably not
>>>> the optimum way for a filesystem to use it though; there's really no
>>>> point in constructing a bio to carry data down to a layer that's simply
>>>> going to do a memcpy().
>>>
>>> I think we might be talking at cross purposes. If you use the memory
>>> APIs, this looks something like an anonymous region of memory with a get
>>> and put API; something like SYSV shm if you like except that it's
>>> persistent. No filesystem semantics at all. Only if you want FS
>>> semantics (or want to impose some order on the region for unplugging and
>>> replugging), do you put an FS on the memory region using loopback
>>> techniques.
>>>
>>> Again, this depends on use case. The SYSV shm API has a global flat
>>> keyspace. Perhaps your envisaged use requires a hierarchical key space
>>> and therefore a FS interface looks more natural with the leaves being
>>> divided memory regions?
>>
>> I've really never heard anybody hold up the SYSV shm API as something
>> to be desired before. Indeed, POSIX shared memory is much closer to
>> the filesystem API;
>
> I'm not really ... I was just thinking this needs key -> region mapping
> and SYSV shm does that. The POSIX anonymous memory API needs you to
> map /dev/zero and then pass file descriptors around for sharing. It's
> not clear how you manage a persistent key space with that.
>
>> the only difference being use of shm_open() and
>> shm_unlink() instead of open() and unlink() [see shm_overview(7)].
>> And I don't really see the point in creating specialised nvm_open()
>> and nvm_unlink() functions ...
>
> The internal kernel API addition is simply a key -> region mapping.
> Once that's done, you need an allocation API for userspace and you're
> done. I bet most userspace uses will be either give me xGB and put a
> tmpfs on it or give me xGB and put a something filesystem on it, but if
> the user wants an xGB mmap'd region, you can give them that as well.
>
> For a vfs interface, you have to do all of this as well, but in a much
> more complex way because the file name becomes the key and the metadata
> becomes the mapping.
>


Matthew is making very good points, and so is James. For one, a very
strong point is "why not use NVM in an OOM situation, as a slower NUMA
node?"

I think the best approach is both, and layered.

0. An NVM Driver

1. Define well, and marry, the notion of "persistent memory" into
the memory model: layers, speeds, and everything. Now you have one
or more flat regions of NVM.

So this is just one or more NVM memory zones, persistent being
a property of a zone.

2. Define a new NvmFS, which is like the RamFS we have today,
uses page_cache semantics, and is in bed with the page allocators.
This layer gives you the key-to-buffer management as well as a
transparent POSIX API for existing applications.

Layers 1, 2 can be generic, if Layer 0 is well parametrized.

There might be a layer 2.5 where, similar to a partition, you
have a flat UUIDed sub-region for the likes of kernel subsystems.
The NvmFS layer is mounted on an allocated UUIDed region, but so could
be a swap space, a journal, or whatever hybrid idea anyone has.
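
To make layer 1 a bit more concrete, here is a minimal userspace sketch
of "persistent is a property of a zone" combined with the three allocator
policies James lists above (only volatile, only persistent, don't care).
All names here (struct mem_zone, enum mem_policy, pick_zone) are
hypothetical and made up for illustration; this is not an existing kernel
interface, and the "prefer volatile first" rule for don't-care is my
assumption:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical policies: only volatile, only persistent, or don't care. */
enum mem_policy { MEM_VOLATILE_ONLY, MEM_PERSISTENT_ONLY, MEM_DONT_CARE };

/* Hypothetical zone descriptor: persistence is just one more property. */
struct mem_zone {
    const char *name;
    bool persistent;
    size_t free_bytes;
};

/*
 * Return the first zone that satisfies the policy and has room, or NULL.
 * With MEM_DONT_CARE, volatile zones are tried first (an assumption), so
 * persistent memory is only used when ordinary RAM cannot satisfy the
 * request.
 */
static struct mem_zone *pick_zone(struct mem_zone *zones, size_t nzones,
                                  enum mem_policy policy, size_t size)
{
    assert(zones != NULL || nzones == 0);

    /* pass 0 scans volatile zones, pass 1 scans persistent zones */
    for (int pass = 0; pass < 2; pass++) {
        bool want_persistent = (pass == 1);

        if (policy == MEM_VOLATILE_ONLY && want_persistent)
            continue;
        if (policy == MEM_PERSISTENT_ONLY && !want_persistent)
            continue;

        for (size_t i = 0; i < nzones; i++) {
            if (zones[i].persistent == want_persistent &&
                zones[i].free_bytes >= size)
                return &zones[i];
        }
    }
    return NULL;
}
```

With one small DRAM zone and one large NVM zone, a don't-care request
that no longer fits in DRAM falls through to the NVM zone, which is the
"NVM as a slower node in an OOM situation" case.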

> James
>


Because, you see, I like and completely agree with what Matthew
said, and I want it.

But I also want all of what James said. Something like:
nvm_kalloc(struct uuid *uuid, size_t size, gfp);
(A new uuid creates a region, but an existing one returns the
region already allocated for it. And we might want to open
exclusive/shared and so on.)
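
A rough userspace sketch of the lookup-or-create semantics I have in
mind, under loud assumptions: nvm_region_get(), struct nvm_uuid and the
fixed-size table are hypothetical names for illustration only, calloc()
stands in for real persistent memory (so nothing here survives a
reboot), and the gfp argument and exclusive/shared modes are left out:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define NVM_MAX_REGIONS 16

/* Hypothetical 16-byte key identifying a region. */
struct nvm_uuid { unsigned char b[16]; };

/* Hypothetical table entry: the key -> region mapping. */
struct nvm_region {
    struct nvm_uuid uuid;
    void *base;
    size_t size;
    bool in_use;
};

static struct nvm_region nvm_table[NVM_MAX_REGIONS];

/*
 * Return the region registered under uuid, creating it if it does not
 * exist yet. Returns NULL if size is 0, if the table is full, if the
 * allocation fails, or if an existing region is smaller than requested.
 */
static void *nvm_region_get(const struct nvm_uuid *uuid, size_t size)
{
    struct nvm_region *free_slot = NULL;

    assert(uuid != NULL);
    if (size == 0)
        return NULL;

    for (size_t i = 0; i < NVM_MAX_REGIONS; i++) {
        struct nvm_region *r = &nvm_table[i];

        if (!r->in_use) {
            if (!free_slot)
                free_slot = r;      /* remember the first empty slot */
            continue;
        }
        if (memcmp(r->uuid.b, uuid->b, sizeof(uuid->b)) == 0)
            return r->size >= size ? r->base : NULL;
    }

    if (!free_slot)
        return NULL;

    free_slot->base = calloc(1, size);  /* stand-in for NVM backing */
    if (!free_slot->base)
        return NULL;
    free_slot->uuid = *uuid;
    free_slot->size = size;
    free_slot->in_use = true;
    return free_slot->base;
}
```

Calling it twice with the same uuid hands back the same buffer; a
different uuid gets its own. A real version would keep the table itself
in the persistent region so the mapping is still there after a reboot.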

Just my $0.017
Boaz
