Re: swap cache

Stephen C. Tweedie (sct@redhat.com)
Fri, 18 Dec 1998 13:05:13 GMT


Hi,

On Fri, 18 Dec 1998 13:05:59 -0500 (GMT), Prasun Kapoor
<prasun@wipinfo.soft.net> said:

> Yes, I fully agree with the above description. But there has to be a
> similar arrangement for sharing pages of other mmap()ed files. And
> why is the same arrangement not used for SWAP instead of having a
> separate swap cache?

The same arrangement _is_ used. The swap cache used to be a small
extra side-function in the VM, but the current swap cache uses the
page cache mechanisms, with pages indexed on a reserved swapper_inode
for swap pages. The term "swap cache" refers to that subset of pages,
and the only reason there is a separate term for it is that the old VM
did have a separate mechanism here.
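
As a rough illustration of that indexing scheme, here is a toy userspace
model in C. It is a sketch under stated assumptions, not kernel code: every
name in it (toy_inode, toy_page, toy_swapper_inode, toy_find_page and so on)
is invented for this example. It only shows the idea that a single lookup
keyed on (inode, offset) can serve both file pages and swap pages, with the
"swap cache" being the pages whose inode is the reserved swapper inode.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy userspace model; none of these names are the kernel's own. */
struct toy_inode { int id; };

struct toy_page {
    const struct toy_inode *inode;  /* owning file, or the swapper inode */
    unsigned long offset;           /* file offset, or swap entry number */
    struct toy_page *next_hash;     /* next page on the same hash chain */
};

#define TOY_HASH_SIZE 64

/* Hypothetical stand-in for the reserved inode that indexes swap pages. */
static struct toy_inode toy_swapper_inode = { 0 };
static struct toy_page *toy_hash[TOY_HASH_SIZE];

static unsigned toy_hashfn(const struct toy_inode *inode, unsigned long offset)
{
    return (unsigned)((((uintptr_t)inode >> 4) ^ offset) % TOY_HASH_SIZE);
}

/* Insert a page into the one shared cache under the key (inode, offset). */
static void toy_add_to_cache(struct toy_page *page,
                             const struct toy_inode *inode,
                             unsigned long offset)
{
    unsigned h = toy_hashfn(inode, offset);

    assert(page != NULL && inode != NULL);
    page->inode = inode;
    page->offset = offset;
    page->next_hash = toy_hash[h];
    toy_hash[h] = page;
}

/* Look up a page by (inode, offset); returns NULL if it is not cached. */
static struct toy_page *toy_find_page(const struct toy_inode *inode,
                                      unsigned long offset)
{
    struct toy_page *p;

    for (p = toy_hash[toy_hashfn(inode, offset)]; p != NULL; p = p->next_hash)
        if (p->inode == inode && p->offset == offset)
            return p;
    return NULL;
}

/* A page belongs to the "swap cache" subset if its inode is the swapper's. */
static int toy_is_swap_cache(const struct toy_page *page)
{
    return page->inode == &toy_swapper_inode;
}
```

In this model a file page and a swap page go through exactly the same
toy_add_to_cache() and toy_find_page() calls; only the inode used as the key
differs.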

Deep inside the VM, the routes we take when faulting these different
types of page are sometimes very different from each other, and at that
level, the swap cache is definitely a quite distinct beast.

On Fri, 18 Dec 1998 13:01:14 -0500 (GMT), Prasun Kapoor
<prasun@wipinfo.soft.net> said:

> All segments that require sharing (stack, heap et al.) are MAP_PRIVATE.
> What fork() does is that it TURNS OFF write permission for all these
> addresses on the PAGE TABLE ( via hat_chgprot() ) for both parent and
> child.

You don't say!

That is _not_ what I'm talking about. Yes, fork() sets up the original
COW. Of course, part of that involves marking the pages readonly. The
question is over what happens during subsequent swapping of the shared page.

> Now if either child or parent faults on any of these address
> ranges, ( as you say much after fork()) they get a fault. This fault is a
> COW fault

I wrote the swap cache. I believe I know what a copy-on-write fault is.

>> No, we can't do that. The swap cache allows us to have multiple
>> processes sharing the same page of swap (think of a process which gets
>> partially swapped out and which then forks). The whole point of loading
>> the pages read-only in the first instance is so that if we do have such
>> sharing of the page, any attempt by one process to write to the page
>> causes a page fault and gives that process a new, private copy of the
>> page.

> The pages are not loaded RO in the first instance. Their page table
> entries are made RO at fork() itself thus taking care of any future
> writes.

They _are_ loaded RO in the first instance. "Loaded" from disk, not
"generated" when the original data is created. I'm talking about
physical pages in memory. When we read in ("load") a page from swap,
then we mark that newly-swapped-in page readonly. There are several
reasons for this:

First of all, it means that any future write to that page can be spotted
by the kernel. In turn, this means that even if the page is fully
faulted back in by all processes referencing the swap entry, we can
still keep the swap entry valid on disk knowing that no processes have
changed the data. This avoids an IO if we want to swap that data back
to disk later: we know the disk is already uptodate. This optimisation
is useful even for unshared pages, where there is no COW as such.

Secondly, if we have a swap page which is shared on-disk, then any one
process which writes to that page takes a COW page fault, and we copy
the data to a new location in memory. In this case, we still have the
original (still-shared) swap page in memory in the swap cache. This
eliminates some rather unpleasant behaviour if you have a process which
is partially swapped out and which regularly creates child processes
(eg, sendmail/apache). Without this behaviour, you'd fork a child, read
in the swap page, modify it in place (because the copy in physical
memory is unshared) and lose the cache of the on-disk entry; so _every_
fork results in a disk access.
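
A minimal sketch of this second point, again as a toy userspace model rather
than kernel code. The names cow_page, cow_pte, cow_map_readonly and
cow_write_byte are assumptions invented for the example, and the model
assumes the page is shared on disk, so a write through a read-only mapping
always copies. What it shows is that the writer gets a private copy while
the cached page keeps its original contents for later sharers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define COW_PAGE_SIZE 16

/* Toy page and page-table entry; all names are illustrative only. */
struct cow_page {
    char data[COW_PAGE_SIZE];
    int users;                /* number of toy PTEs pointing at this page */
};

struct cow_pte {
    struct cow_page *page;
    bool writable;
};

/* Map the swap-cache page read-only into one process's toy PTE. */
static void cow_map_readonly(struct cow_pte *pte, struct cow_page *cached)
{
    assert(pte != NULL && cached != NULL);
    pte->page = cached;
    pte->writable = false;
    cached->users++;
}

/*
 * Write one byte. A write through a read-only PTE is treated as a COW
 * fault: the data is copied to a new private page and the cached page is
 * left untouched. Returns 0 on success, -1 on error.
 */
static int cow_write_byte(struct cow_pte *pte, size_t off, char value)
{
    assert(pte != NULL && pte->page != NULL);
    if (off >= COW_PAGE_SIZE)
        return -1;
    if (!pte->writable) {
        struct cow_page *copy = malloc(sizeof *copy);

        if (copy == NULL)
            return -1;
        memcpy(copy->data, pte->page->data, COW_PAGE_SIZE);
        copy->users = 1;
        pte->page->users--;   /* the cached original stays where it is */
        pte->page = copy;
        pte->writable = true;
    }
    pte->page->data[off] = value;
    return 0;
}
```

In the model, a child that writes ends up on its own page, and the parent
(or the next forked child) still finds the original data in the cached page
without any disk read.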

> Potential sharers can swap pages IN independently of each other ( and in
> this case page goes to the vnode-offset hash list or as you say swap cache
> to be found by others). But processes cannot swap pages OUT independently
> of each other. Swap out is done on a global scale and all referencing PTEs
> are invalidated when that happens.

Wrong. You've been looking at other Unix source code or books about
other Unixes. Linux does not work this way. Linux supports a number of
VM optimisations not found elsewhere, such as mremap() to map an
existing set of pages elsewhere in the VM (giving kernel support to
realloc(3), for example). This means that a shared anonymous page may
be found at different VAs in different processes. It is distinctly
non-trivial to find all ptes referencing a given physical page in the
Linux VM, but we don't need that functionality: in Linux we have always
swapped out ptes on a per-page-table basis, not on a per-physical-page
basis. We *DO* swap processes out independently.
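
To make the per-page-table idea concrete, here is a toy userspace sketch. It
is not how the kernel is written; pt_page, pt_pte, pt_table and pt_swap_out
are hypothetical names, and a "swap entry" is just a nonzero number. The
same page is mapped at different slots (VAs) in two tables, and swapping it
out of one table never has to find or touch the other.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define PT_SLOTS 8

/* Toy physical page shared by several page tables; names are illustrative. */
struct pt_page {
    int mappers;               /* page tables that still map this page */
    unsigned long swap_entry;  /* 0 until a swap entry has been assigned */
};

/* Toy PTE: either points at a present page or records a swap entry. */
struct pt_pte {
    struct pt_page *page;      /* non-NULL while the page is mapped */
    unsigned long swap_entry;  /* valid once page == NULL after swap-out */
};

struct pt_table {
    struct pt_pte slot[PT_SLOTS];  /* slot index stands in for the VA */
};

/*
 * Swap one PTE out of ONE page table. No other table is searched or
 * modified. The first swap-out assigns new_entry to the page; later
 * swap-outs from other tables reuse the entry already assigned.
 * Returns true once no table maps the page, so its memory could be freed.
 */
static bool pt_swap_out(struct pt_table *t, size_t slot,
                        unsigned long new_entry)
{
    struct pt_pte *pte;
    struct pt_page *page;

    assert(t != NULL && slot < PT_SLOTS);
    pte = &t->slot[slot];
    page = pte->page;
    assert(page != NULL);

    if (page->swap_entry == 0)
        page->swap_entry = new_entry;
    pte->page = NULL;
    pte->swap_entry = page->swap_entry;
    page->mappers--;
    return page->mappers == 0;
}
```

After swapping the page out of the first table, the second table's PTE still
points at the page in memory; only when the second table is swapped out as
well does the model report that the page can be freed.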

--Stephen
