Re: [PATCH 5/7] mm: page_alloc: Make zone distribution page aging policy configurable
From: Johannes Weiner
Date: Tue Dec 17 2013 - 17:57:30 EST
On Tue, Dec 17, 2013 at 09:22:16PM +0000, Mel Gorman wrote:
> On Tue, Dec 17, 2013 at 12:43:02PM -0500, Johannes Weiner wrote:
> > > > > When looking at this closer I found that sysv is a weird exception. It's
> > > > > file-backed as far as most of the VM is concerned but looks anonymous to
> > > > > most applications that care. That and MAP_SHARED anonymous pages should
> > > > > not be treated like files but we still want tmpfs to be treated as
> > > > > files. Details will be in the changelog of the next series.
> > > >
> > > > In what sense is it seen as file-backed?
> > >
> > > sysv and anonymous pages are backed by an internal shmem mount point. In
> > > lots of respects, it looks like a file and quacks like a file but I expect
> > > developers think of it being anonymous and chunks of the VM treat it like
> > > it's anonymous. tmpfs uses the same paths and they get treated similar to
> > > the VM as anon but users may think that tmpfs should be subject to the
> > > fair allocation zone policy "because they're files." It's a sufficiently
> > > weird case that any action we take there should be deliberate. It'll be
> > > a bit clearer when I post the patch that special cases this.
> >
> > The line I see here is mostly derived from performance expectations.
> >
> > People and programs expect anon, shmem/tmpfs etc. to be fast and avoid
> > their reclaim at great costs, so they size this part of their workload
> > according to memory size and locality. Filesystem cache (on-disk) on
> > the other hand is expected to be slow on the first fault and after it
> > has been displaced by other data, but the kernel is mostly expected to
> > maximize the caching effects in a predictable manner.
> >
>
> Part of their performance expectations is that memory referenced from the
> local node will be allocated locally. Consider NUMA-aware applications that
> partition their data usage appropriately and share that data between threads
> using processes and shared memory (some MPI implementations). They have
> an expectation that the memory will be local and a further expectation
> that it will not be reclaimed because they sized it appropriately.
> Automatically interleaving such memory by default will be surprising to
> NUMA aware applications even if NUMA-oblivious applications benefit.

That's exactly why I want to exclude any type of data that is
typically sized to memory capacity. Are we talking past each other?

> Similarly, the pagecache sysctl is documented to affect files, at least
> that's how I wrote it. It's inconsistent to explain that as "the sysctl
> controls files, except for tmpfs ones because ...... whatever".

I documented it as affecting secondary storage cache.

> > The round-robin policy makes the displacement predictable (think of
> > the aging artifacts here where random pages do not get displaced
> > reliably because they ended up on remote nodes) and it avoids IO by
> > maximizing memory utilization.
> >
> > I.e. it improves behavior associated with a cache, but I don't expect
> > shmem/tmpfs to be typically used as a disk cache. I could be wrong
> > about that, but I figure if you need named shared memory that is
> > bigger than your memory capacity (the point where your tmpfs would
> > actually turn into a disk cache), you'd be better off using a more
> > efficient on-disk filesystem.
>
> I am concerned with semantics like "all files except tmpfs files" or
> alternatively regressing performance of NUMA-aware applications and their
> use of MAP_SHARED and sysv.

I'm really not following. MAP_SHARED, sysv, shmem, tmpfs, whatever is
entirely unaffected by my proposal. I never claimed "all files except
tmpfs". It's about what backs the data, which is what makes a difference
in people's performance expectations, which makes a difference in how
they size the workloads.

Tmpfs files that may overflow into swap on heavy memory pressure have
an entirely different trade-off than actual cache that is continuously
replaced as part of its size management, and in that sense they are
much closer to anon and sysv shared memory. I don't believe that the
difference between virtual in-core filesystems and actual secondary
storage filesystems is so obscure to users that this behavioral
difference would violate expectations of the term "file".

Is that what you are saying or am I missing something?