Re: [PATCH] zap_pte_range should not mark non-uptodate pages dirty

From: Andrea Arcangeli
Date: Thu Oct 21 2004 - 19:43:45 EST


On Thu, Oct 21, 2004 at 04:42:45PM -0700, Andrew Morton wrote:
> Andrea Arcangeli <andrea@xxxxxxxxxx> wrote:
> >
> > On Thu, Oct 21, 2004 at 04:02:33PM -0700, Andrew Morton wrote:
> > > Andrea Arcangeli <andrea@xxxxxxxxxx> wrote:
> > > >
> > > > On Thu, Oct 21, 2004 at 02:45:31PM -0700, Andrew Morton wrote:
> > > > > Maybe we should revisit invalidate_inode_pages2(). It used to be an
> > > > > invariant that "pages which are mapped into process address space are
> > > > > always uptodate". We broke that (good) invariant and we're now seeing
> > > > > some fallout. There may be more.
> > > >
> > > > such invariant doesn't exists since 2.4.10. There's no way to get mmaps
> > > > reload data from disk without breaking such an invariant.
> > >
> > > There are at least two ways:
> > >
> > > a) Set a new page flag in invalidate, test+clear that at fault time
> >
> > What's the point of adding a new page flag when the invariant
> > !PageUptodate && page_mapcount(page) already provides the information?
>
> Step back and think about this. What earthly sense is there in permitting
> userspace access to non uptodate pages?

this is exactly why new page faults re-read from disk. Istantiating not
uptodate pages is definitely a mistake. But after the pte is istantiated
the uptodate information becomes pointless. It's up to the page fault to
make sure the page is uptodate before mapping it into userspace.

> None. It's completely wrong and the invariant was a good one. We
> broke it
> by introducing some kluge to force new I/O when someone does a new fault
> against the page.

the invariant that is important, is that the page fault must never map
not uptodate pages into userspace. That invariant is still obeyed, it's
just after the page is mapped that we start making good use of the
uptodate bit again.

> (A new PG_needs_rereading flag isn't sufficient btw - we'd also need
> BH_Needs_Rereading and associated code. ug.)

clearing uptodate bits for bh too would fix it. Anyways I doubt in
practice the lack of bh clearing could ever trigger thanks to
mpage_readpages and the page aligned API. Peraphs the 2.6 more relaxed
API could expose the problem, and that's why we need to address it in
2.6.

> I don't get it. invalidate has the pageframe. All it need to do is to
> lock the page, examine mapcount and if it's non-zero, do the shootdown.
> The only way in which we would be performing the shootdown a significant
> number of times would be if someone was repeatedly faulting the thing back
> in anyway, and in that case the physical I/O cost would dominate. Where's
> the performance overhead??
>
> Plus it makes the currently incorrect code correct for existing mmaps.
>
> Plus it avoids the idiotic situation of having non uptodate pages
> accessible to user processes.

peraphs we can implement it for 2.6, but for 2.4 it was not doable, and
clearing PG_uptodate in 2.4 is what made O_DIRECT possible at all, so I
wouldn't call it idiotic situation. It was a big new feature and it
still is, since your new bitflag is not needed: what the VM asks for is
to clear such uptodate bit because the page in the pagecache is not
uptodate anymore. What happens is that the disk has changed under you so
it's idotic to set another bitflag and to leave the uptodate bit set,
when the page is obviously not uptodate anymore.

If you want to shootdown ptes before clearing the bitflag, that's fine
with me, but you will still have to clear the uptodate bitflag on the
page after that, and the mmap coherency has never been provided by
O_DIRECT in linux, this would be a new feature, sure not a bugfix. the
mmap case is a controlled race by design that can cause no harm. btw,
nfs is doing it too in some 2.4 tree, we've got complains that mmaps
weren't refreshed. If you change invalidate_inode_pages2 to shootdown
ptes then they'll get much better refreshing for nfs too, but again, not
doable in the 2.4 backport: in 2.4 all I could do is to clear uptodate
on mapped pages, since after the page is mapped the uptodate bit becomes
useless and we can re-use it. Adding a new bitflag wouldn't change a
thing except to make it more complicated than it already is.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/