Re: Fwd: Control page reclaim granularity

From: Minchan Kim
Date: Mon Mar 12 2012 - 22:51:15 EST


On Mon, Mar 12, 2012 at 11:15:43PM +0800, Zheng Liu wrote:
> On Mon, Mar 12, 2012 at 10:42:26PM +0900, Minchan Kim wrote:
> > On Mon, Mar 12, 2012 at 04:14:14PM +0800, Zheng Liu wrote:
> > > On 03/12/2012 02:20 PM, Konstantin Khlebnikov wrote:
> > > > Minchan Kim wrote:
> > > >> On Mon, Mar 12, 2012 at 10:06:09AM +0800, Zheng Liu wrote:
> > > >>> On Mon, Mar 12, 2012 at 09:29:34AM +0900, Minchan Kim wrote:
> > > >>>> I forgot to Ccing you.
> > > >>>> Sorry.
> > > >>>>
> > > >>>> ---------- Forwarded message ----------
> > > >>>> From: Minchan Kim<minchan@xxxxxxxxxx>
> > > >>>> Date: Mon, Mar 12, 2012 at 9:28 AM
> > > >>>> Subject: Re: Control page reclaim granularity
> > > >>>> To: Minchan Kim<minchan@xxxxxxxxxx>, linux-mm<linux-mm@xxxxxxxxx>,
> > > >>>> linux-kernel<linux-kernel@xxxxxxxxxxxxxxx>, Konstantin Khlebnikov<
> > > >>>> khlebnikov@xxxxxxxxxx>, riel@xxxxxxxxxx, kosaki.motohiro@xxxxxxxxxxxxxx
> > > >>>>
> > > >>>>
> > > >>>> On Fri, Mar 09, 2012 at 12:54:03AM +0800, Zheng Liu wrote:
> > > >>>>> Hi Minchan,
> > > >>>>>
> > > >>>>> Sorry, I forgot to say that I don't subscribe linux-mm and
> > > >>>>> linux-kernel
> > > >>>>> mailing list. So please Cc me.
> > > >>>>>
> > > >>>>> IMHO, maybe we should re-think about how does user use mmap(2). I
> > > >>>>> describe the cases I known in our product system. They can be
> > > >>>>> categorized into two cases. One is mmaped all data files into memory
> > > >>>>> and sometime it uses write(2) to append some data, and another uses
> > > >>>>> mmap(2)/munmap(2) and read(2)/write(2) to manipulate the files. In
> > > >>>>> the
> > > >>>>> second case, the application wants to keep mmaped page into memory
> > > >>>>> and
> > > >>>>> let file pages to be reclaimed firstly. So, IMO, when application
> > > >>>>> uses
> > > >>>>> mmap(2) to manipulate files, it is possible to imply that it wants
> > > >>>>> keep
> > > >>>>> these mmaped pages into memory and do not be reclaimed. At least
> > > >>>>> these
> > > >>>>> pages do not be reclaimed early than file pages. I think that
> > > >>>>> maybe we
> > > >>>>> can recover that routine and provide a sysctl parameter to let the
> > > >>>>> user
> > > >>>>> to set this ratio between mmaped pages and file pages.
> > > >>>>
> > > >>>> I am not convinced why we should handle mapped page specially.
> > > >>>> Sometimem, someone may use mmap by reducing buffer copy compared to
> > > >>>> read
> > > >>>> system call.
> > > >>>> So I think we can't make sure mmaped pages are always win.
> > > >>>>
> > > >>>> My suggestion is that it would be better to declare by user explicitly.
> > > >>>> I think we can implement it by madvise and fadvise's WILLNEED option.
> > > >>>> Current implementation is just readahead if there isn't a page in
> > > >>>> memory
> > > >>>> but I think
> > > >>>> we can promote from inactive to active if there is already a page in
> > > >>>> memory.
> > > >>>>
> > > >>>> It's more clear and it couldn't be affected by kernel page reclaim
> > > >>>> algorithm change
> > > >>>> like this.
> > > >>>
> > > >>> Thank you for your advice. But I still have question about this
> > > >>> solution. If we improve the madvise(2) and fadvise(2)'s WILLNEED
> > > >>> option, it will cause an inconsistently status for pages that be
> > > >>> manipulated by madvise(2) and/or fadvise(2). For example, when I call
> > > >>> madvise with WILLNEED flag, some pages will be moved into active list if
> > > >>> they already have been in memory, and other pages will be read into
> > > >>> memory and be saved in inactive list if they don't be in memory. Then
> > > >>> pages that are in inactive list are possible to be reclaim. So from the
> > > >>> view of users, it is inconsistent because some pages are in memory and
> > > >>> some pages are reclaimed. But actually the user hopes that all of pages
> > > >>> can be kept in memory. IMHO, this inconsistency is weird and makes
> > > >>> users
> > > >>> puzzled.
> > > >>
> > > >> Now problem is that
> > > >>
> > > >> 1. User want to keep pages which are used once in a while in memory.
> > > >> 2. Kernel want to reclaim them because they are surely reclaim target
> > > >> pages in point of view by LRU.
> > > >>
> > > >> The most desriable approach is that user should use mlock to guarantee
> > > >> them in memory. But mlock is too big overhead and user doesn't want to
> > > >> keep
> > > >> memory all pages all at once.(Ie, he want demand paging when he need
> > > >> the page)
> > > >> Right?
> > > >>
> > > >> madvise, it's a just hint for kernel and kernel doesn't need to make
> > > >> sure madvise's behavior.
> > > >> In point of view, such inconsistency might not be a big problem.
> > > >>
> > > >> Big problem I think now is that user should use madvise(WILLNEED)
> > > >> periodically because such
> > > >> activation happens once when user calls madvise. If user doesn't use
> > > >> page frequently after
> > > >> user calls it, it ends up moving into inactive list and even could be
> > > >> reclaimed.
> > > >> It's not good. :-(
> > > >>
> > > >> Okay. How about adding new VM_WORKINGSET?
> > > >> And reclaimer would give one more round trip in active/inactive list
> > > >> erwhen reclaim happens
> > > >> if the page is referenced.
> > > >>
> > > >> Sigh. We have no room for new VM_FLAG in 32 bit.
> > > > p
> > > > It would be nice to mark struct address_space with this flag and export
> > > > AS_UNEVICTABLE somehow.
> > > > Maybe we can reuse file-locking engine for managing these bits =)
> > >
> > > Make sense to me. We can mark this flag in struct address_space and check
> > > it in page_refereneced_file(). If this flag is set, it will be cleard and
> >
> > Disadvantage is that we could set reclaim granularity as per-inode.
> > I want to set it as per-vma, not per-inode.
>
> I don't think this is a disadvantage. This per-inode reclaim
> granularity is useful for us. Actually I have thought to implement a
> per-inode memcg to let different file sets to be reclaimed separately.
> So maybe we can provide two mechanisms to let the user to choose how to
> use them.

I don't oppose supporting both mechanism but I don't want to give only per-inode
approach.

>
> >
> > > the function returns referenced > 1. Then this page can be promoted into
> > > activate list. But I prefer to set/clear this flag in madvise.
> >
> > Hmm, My idea is following as,
> > If we can set new VM flag into VMA or something, reclaimer can check it when shrink_[in]active_list
> > and he can prevent to deactivate/reclaim if he takes a look the page is in VMA which
> > are set by new VM flag and the page is referenced recently at least once.
> > It means it gives one more round trip in his list(ie, active/inactive list)
> > rather than activation so that the page would become less reclaimable.
>
> No matter what the page is given one more round trip or is promoted into
> active list, it can satisfy our current requirement. So now the
> question is which is better. If we add a new VM flag, as you said
> before, vma->vm_flags has no room for it in 32 bit. I have noticed that
> this topic has been discussed [1] and the result is that vm_flags is
> still a unsigned long type. So we need to use a tricky technique to solve
> it. If we add a new flag in struct addpress_space, it might be easy to
> implement it.

In case of per-inode, it's good but it doesn't work for per-vma and file-range.

>
> 1. http://lkml.indiana.edu/hypermail/linux/kernel/1104.1/00975.html
>
> Regards,
> Zheng
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/