Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions

From: Dave Chinner
Date: Tue Dec 11 2018 - 01:18:57 EST


On Sat, Dec 08, 2018 at 10:09:26AM -0800, Dan Williams wrote:
> On Sat, Dec 8, 2018 at 8:48 AM Christoph Hellwig <hch@xxxxxxxxxxxxx> wrote:
> >
> > On Sat, Dec 08, 2018 at 11:33:53AM -0500, Jerome Glisse wrote:
> > > A patchset to use HMM inside nouveau has already been posted, some
> > > of the bits have already made it upstream, and more are lined up for
> > > the next merge window.
> >
> > Even with that, it is a relatively fringe feature compared to making
> > something like get_user_pages(), which is used literally everywhere,
> > actually work properly.
> >
> > So I think we need to kick out HMM here and just find another place for
> > it to store data.
> >
> > And just to make clear that I'm not picking just on this - the same is
> > true, to a slightly smaller extent, for the pgmap...
>
> Fair enough, I cringed as I took a full pointer for that use case; I'm
> happy to look at ways of consolidating or dropping that usage.
>
> Another fix that may put pressure on 'struct page' is resolving the
> untenable situation of dax being incompatible with reflink, i.e.
> reflink currently requires page-cache pages. Dave has talked about
> silently establishing page-cache entries when a dax-page is cow'd for
> reflink,

I think you've got it the wrong way around there :)

Think of a set of files with the following physical block mappings:

index   0 1 2 3 4 5
inode W A B C D E F
inode X B C D E F A
inode Y C D E F A B
inode Z D E F A B C

Basically, each block has 4 references (one from each file), and
each reference to a block is from a different file offset. Now, with
DAX, each inode wants to put the same struct page into its own
address space mapping tree, but with a different page index.

i.e. for block A, inode W wants page->index = 0, X wants 5, Y wants
4 and Z wants 3.

This is not possible with a single struct page, and that is where the
problem with DAX, struct pages and physically shared data lies.
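
In code-ish terms, the conflict boils down to the two fields below.
This is a cut-down, illustrative sketch only - the real struct page
has many more fields - but it shows why one struct page cannot carry
four different (mapping, index) pairs at once:

        /*
         * Illustrative sketch only - not the real struct page definition,
         * which has many more fields (see <linux/mm_types.h>).
         */
        struct address_space;                   /* opaque here */
        typedef unsigned long pgoff_t;          /* same typedef as the kernel's */

        struct page_sketch {
                struct address_space *mapping;  /* the one owning mapping... */
                pgoff_t index;                  /* ...and the one offset within it */
        };

        /*
         * Block A can record only one (mapping, index) pair, say
         * { W's mapping, 0 }.  Inodes X, Y and Z would need the very same
         * struct page to simultaneously carry index 5, 4 and 3, which a
         * single index field cannot express.
         */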

This is where the page cache is currently required - each mapping
gets its own copy of the shared block in volatile RAM, but when
sharing is broken (by COW) we can toss the volatile copy and go back
to using DAX for the newly allocated, single owner {block, struct
page} tuple that replaces the shared page.
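
As a rough sketch of that lifecycle - all of the helpers below are
made-up names for illustration, not existing kernel interfaces:

        /* Hypothetical helpers, named purely for illustration. */
        struct inode;
        struct page;
        typedef unsigned long pgoff_t;

        struct page *pagecache_copy_alloc(struct inode *inode, pgoff_t index);
        void copy_shared_block(struct page *dst, struct inode *inode, pgoff_t index);
        void drop_pagecache_copy(struct page *copy);
        void map_private_dax_block(struct inode *inode, pgoff_t index);

        /* Read of a shared extent: each inode gets its own volatile copy. */
        struct page *shared_read(struct inode *inode, pgoff_t index)
        {
                struct page *copy = pagecache_copy_alloc(inode, index);

                copy_shared_block(copy, inode, index);
                return copy;
        }

        /* COW break: allocate a private, single-owner block for this inode,
         * toss the volatile copy and go back to using DAX for the new
         * {block, struct page} tuple. */
        void cow_break(struct inode *inode, pgoff_t index, struct page *copy)
        {
                map_private_dax_block(inode, index);
                drop_pagecache_copy(copy);
        }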

> but I wonder if we could go the other way and introduce a
> mechanism for a page to belong to multiple mappings simultaneously,
> managed by the filesystem.

That's pretty much what I suggested at LSFMM. We do lookups for
shared extent mappings through the filesystem buffer cache (which is
indexed by physical location) and hold the primary struct page in
the filesystem buffer cache. We then hand out dynamically allocated
struct pages back to the caller that point to the same physical page
and place them in each inode's address space. When a write fault
occurs, we allocate a new block, grab the physical struct page, copy
the data across, and release the dynamically allocated read-only
struct page and reference to the primary struct page held in the
filesystem buffer cache.
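
Again purely as an illustrative sketch - none of these helpers or
structures exist, it's just the shape of the model:

        /* Hypothetical sketch of the proposed model; not real interfaces. */
        struct inode;
        struct address_space;
        struct fs_buf;                          /* fs buffer cache entry,
                                                   indexed by physical location */
        typedef unsigned long pgoff_t;

        struct alias_page {
                struct address_space *mapping;  /* this inode's mapping */
                pgoff_t index;                  /* this inode's file offset */
                struct fs_buf *primary;         /* ref on the primary page held
                                                   in the fs buffer cache */
        };

        struct fs_buf *fs_buf_lookup_shared(struct inode *inode, pgoff_t index);
        struct fs_buf *fs_alloc_new_block(struct inode *inode, pgoff_t index);
        struct alias_page *alias_page_alloc(struct fs_buf *primary,
                                            struct address_space *mapping,
                                            pgoff_t index);
        void alias_page_release(struct alias_page *alias);
        void copy_block(struct fs_buf *dst, struct fs_buf *src);

        /* Read fault on a shared extent: hand out a per-inode alias page
         * that points at the same physical page as the primary. */
        struct alias_page *shared_read_fault(struct inode *inode,
                                             struct address_space *mapping,
                                             pgoff_t index)
        {
                struct fs_buf *primary = fs_buf_lookup_shared(inode, index);

                return alias_page_alloc(primary, mapping, index);
        }

        /* Write fault: break sharing by allocating a new block, copying the
         * data across, then dropping the read-only alias and its reference
         * on the primary page. */
        void shared_write_fault(struct inode *inode, struct alias_page *alias)
        {
                struct fs_buf *newbuf = fs_alloc_new_block(inode, alias->index);

                copy_block(newbuf, alias->primary);
                alias_page_release(alias);
        }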

It's essentially the same "cached page per inode address space" model
as using volatile RAM copies via the page cache, except that the
struct pages point back to the same physical location rather than
each having its own temporary, volatile copy of the data.

Cheers,

Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx