Re: [RFC][PATCHSET] mremap/mmap mess

From: Hugh Dickins
Date: Tue Dec 08 2009 - 06:43:21 EST


On Tue, 8 Dec 2009, Al Viro wrote:
>
> Why do we want user_get_pages(), anyway? It's not that we lacked an
> easy way to do large arrays, especially since the use is purely sequential.
> Even a linked list of vmalloc'ed pages would do just fine (i.e. start with
> static array in bprm, keep the pointer to last filled entry + number of
> entries left before the next allocation; use the last pointer in array
> for finding the next page-sized chunk).
>
> What do we lose if we go that way? Inserting all these pages into mm
> at once shouldn't be slower. Memory overhead is not really an issue
> (one page per 511 or 1023 pages of argv). Am I missing something?

I think what you lose that way is swappability.

Since we're supporting unlimited args and env here, it is important
that those pages can belong to an mm, be discoverable by rmap, and
be swapped out if really necessary. Whereas I think you're proposing
an internal list of those pages, unknown to rmap, unswappable.

Of course, a page is pinned in core between get_user_pages() and
put_page(), but unless I've got it wrong, get_user_pages() is being
applied one by one to these pages, each unpinned as the next is pinned.

It is conceivable that it actually doesn't work in the way I'm
expecting: that although it's all designed to leave those pages
swappable, some mod here or there has interfered with that. But if
so, that would be a bug: the intention, and I believe it's important,
is that those pages are swappable.

I have a different reason for wanting to change how it's done:
it's the major user of non-atomic kmap() and its global kmap_lock,
and rather swamps other uses of kmap() (which have better use for
the cached virtual address). So I'd be happy with a better way
of doing it, but not at the cost of losing swappability.

Hugh

>
> The benefit, AFAICS, is that we get rid of the mess with forced high
> address use, get *sane* get_user_pages() (we always have matching
> task_struct with the right personality, so we can avoid massive PITA
> for doing checks right) and we get unified mmu/nommu code in fs/exec.c
> out of that.
>
> If you see serious problems I've missed, please tell. Otherwise I'm
> going to hack up a prototype and post it here...
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/