Re: [test patch] dirty shared mappings (was Re: ... fragmentation)

Eric W. Biederman (ebiederm+eric@npwt.net)
07 Jan 1998 23:37:59 -0600


>>>>> "Linus" == Linus Torvalds <torvalds@transmeta.com> writes:

Linus> On Thu, 8 Jan 1998, Alan Cox wrote:
>> Is there any reason you can't simply invent a device with an mmap method
>> to attach an inode too for anonymous maps so that they cease to be anonymous ?

Linus> That works and is the simple solution, but then you would essentially have
Linus> a _separate_ swap area for anonymous maps. That may be good enough, and
Linus> it's certainly the simple solution.

Linus> The traditional UNIX solution as far as I know is to have the anonymous
Linus> mappings be backed in the general swapfile, and then the problem is one of
Linus> trying to keep track of which entries are used for shared anonymous pages
Linus> and which are used for private anonymous pages (and giving a number to the
Linus> shared anonymous ones).

You can give a number to all of them. The swap is logically contiguous
as seen through the internal kernel interface. Of course doing a
swapoff of part of the swap device might be a challenge.

Linus> Having a separate swap area for shared anonymous pages would certainly
Linus> make sense, and would get rid of all the problems, but would require the
Linus> machine to be set up correctly that way by somebody (essentially it turns
Linus> the anonymous shared mappings to a real shared mapping of a special
Linus> "magic" object - which can be either a real file or a partition or
Linus> something, but the point is that the "something" will have to be set up by
Linus> the maintainer of the machine rather than being the global swap device).

Linus> Would that be acceptable to people?

I don't think it will scale. Individual inodes have a small set of
mappings to walk through. Having all shared memory allocations on the
same inode could be nearly as bad as walking the page tables.

A slight modification, having a single `inode' per anonymous shared
mapping, is I think better: multiple inodes for a single device. The
only trouble here is that you might wind up with a VMA per page due to
fragmentation (which isn't a problem for one-time allocation), and a
VMA design which assumes contiguous mappings.

get_swap_page does a decent job of allocating contiguous swap pages, so
VMA fragmentation shouldn't be too bad. And the more contiguous the
pages actually are the better, usually, because they are related.

Just allocate the swap pages when you allocate the mapping (all at
once). Then never reallocate.
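
A minimal sketch of the allocate-once idea, as a user-space toy rather
than kernel code. The names (anon_shared_map, map_reserve_swap,
map_slot_for_page) and the fixed-size slot table are made up for
illustration; none of this is get_swap_page or the actual patch.

```c
#include <assert.h>
#include <stddef.h>

#define SWAP_SLOTS 64   /* hypothetical size of the toy swap area */

static unsigned char slot_used[SWAP_SLOTS];   /* 1 = slot is allocated */

/* Hypothetical per-mapping record: one run of swap slots, fixed for life. */
struct anon_shared_map {
    size_t first_slot;   /* first swap slot backing the mapping */
    size_t npages;       /* number of pages (and slots) in the mapping */
};

/*
 * Reserve npages contiguous slots, first fit, when the mapping is created.
 * Returns 0 on success, -1 if no free run of that length exists.
 */
static int map_reserve_swap(struct anon_shared_map *m, size_t npages)
{
    size_t run = 0;

    assert(m != NULL);
    if (npages == 0 || npages > SWAP_SLOTS)
        return -1;

    for (size_t i = 0; i < SWAP_SLOTS; i++) {
        run = slot_used[i] ? 0 : run + 1;   /* length of current free run */
        if (run == npages) {
            size_t first = i + 1 - npages;
            for (size_t j = first; j <= i; j++)
                slot_used[j] = 1;
            m->first_slot = first;
            m->npages = npages;
            return 0;
        }
    }
    return -1;
}

/* Page index to swap slot is plain arithmetic; nothing is reallocated. */
static size_t map_slot_for_page(const struct anon_shared_map *m, size_t page)
{
    assert(page < m->npages);
    return m->first_slot + page;
}
```

Because the run is reserved once and never moved, a single
(first_slot, npages) pair describes the whole mapping, which is what
keeps it down to one contiguous range instead of one per page.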

I have a patch that I have been running on 2.0.32 continually for a
couple of weeks. It modifies the various kinds of sync to also sync
the shared mappings. If anonymous shared mappings get their own
`inode' this could be useful for writing them. It walks all of the
page tables, but it isn't looking for a specific page so this isn't a
problem.

For handling the problem of writing pages too often I have another
patch that allows dirty pages in the page cache and uses writepage to
write them. A dirty bit is kept on the page. The only trick in
2.1.78 is finding a dentry :(

This way the standard approach of writing multiple times just sets the
dirty bit. And constantly msyncing all of memory (when we sync all
devices) updates that bit so it is set appropriately.
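
A rough user-space model of the dirty-bit scheme described above. It
is only a sketch: toy_page, toy_mark_dirty, toy_writepage and toy_sync
are hypothetical names, and the real patch works on page cache pages
and the writepage operation, not on this struct.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a page cache page. */
struct toy_page {
    int dirty;           /* set while the page differs from backing store */
    int writes_issued;   /* how many times the page was really written out */
};

/* A store to the page only flags it; no I/O happens here. */
static void toy_mark_dirty(struct toy_page *p)
{
    assert(p != NULL);
    p->dirty = 1;
}

/* Stand-in for writepage: do the write-out once and clear the flag. */
static void toy_writepage(struct toy_page *p)
{
    p->writes_issued++;
    p->dirty = 0;
}

/* Sync pass: write every dirty page exactly once, skip clean ones. */
static size_t toy_sync(struct toy_page *pages, size_t n)
{
    size_t written = 0;

    for (size_t i = 0; i < n; i++) {
        if (pages[i].dirty) {
            toy_writepage(&pages[i]);
            written++;
        }
    }
    return written;
}
```

Dirtying the same page many times between two sync passes costs one
write-out in this model, and a page that was never dirtied costs none.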

I have been playing with all of this stuff in a filesystem I am
writing. It has two main goals: one is to allow dirty pages in the
page cache, the other is to allow POSIX.4 type shared memory mappings.

It would likely not be too difficult to modify it to allocate an
inode, etc for anonymous shared mappings. But this would be overkill
because you don't need to track which logical page is near another.

I have it all stable against 2.0.32 and am porting to 2.1.78 (I have
made the changes needed but haven't removed the bugs introduced :().
And those dentries for writepage are a nuisance.

When the filesystem works I will have proof my kernel patch works ;)

Just a note: There is currently a race between swapoff and SYSV
shared memory. If the shared memory segment is (a) partly swapped
out, and (b) not mapped at the moment, swapoff won't find it.

I have a patch for swapoff that allows registration of functions to
call at swapoff time too.
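
One way such a registration hook could look, as a user-space sketch
only. The interface here (swapoff_hook_t, register_swapoff_hook,
run_swapoff_hooks) is assumed for illustration and is not taken from
the patch; count_calls_hook is a dummy standing in for something like
"pull SYSV shm pages back in from this swap area".

```c
#include <assert.h>
#include <stddef.h>

#define MAX_SWAPOFF_HOOKS 8   /* arbitrary limit for the toy table */

/* Hypothetical hook type: called with the swap area being removed. */
typedef int (*swapoff_hook_t)(unsigned int swap_type, void *data);

struct swapoff_hook {
    swapoff_hook_t fn;
    void *data;
};

static struct swapoff_hook swapoff_hooks[MAX_SWAPOFF_HOOKS];
static size_t nr_swapoff_hooks;

/* Register a function to be called at swapoff time. 0 on success. */
static int register_swapoff_hook(swapoff_hook_t fn, void *data)
{
    if (fn == NULL || nr_swapoff_hooks == MAX_SWAPOFF_HOOKS)
        return -1;
    swapoff_hooks[nr_swapoff_hooks].fn = fn;
    swapoff_hooks[nr_swapoff_hooks].data = data;
    nr_swapoff_hooks++;
    return 0;
}

/* Call every hook in registration order; stop at the first failure. */
static int run_swapoff_hooks(unsigned int swap_type)
{
    for (size_t i = 0; i < nr_swapoff_hooks; i++) {
        const struct swapoff_hook *h = &swapoff_hooks[i];
        int err;

        assert(h->fn != NULL);
        err = h->fn(swap_type, h->data);
        if (err)
            return err;
    }
    return 0;
}

/* Dummy hook: just counts how often it was called. */
static int count_calls_hook(unsigned int swap_type, void *data)
{
    (void)swap_type;
    (*(int *)data)++;
    return 0;
}
```

A subsystem that keeps pages in swap without having them mapped would
register once at init time, so swapoff can reach its pages even when
no page table refers to them.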

I have it all at
http://www.npwt.net/~ebiederm/files/shmfs.0.0.020.tar.gz
for my `stable' version.

A version against 2.1.78 should appear in a couple of days, when it
works. I just have this week before school starts back up and slows
me down :(

Eric