I'm messing around with shared anonymous mappings at the moment, although
I'm going back to finish off page-colouring before continuing.
The idea is to introduce anonymous-inodes.
When an anonymous mapping is created, a dummy dentry and inode are
attached to the vm_area_struct with the vm-area linked into the inode's
share ring.
An inode is extended to contain a "struct anon_inode_info" structure
unioned with the other fs specific structures in an inode structure, and a
delete operation is added to the dentry.
The "struct anon_inode_info" contains a memory area which describes where
the pages for the vm_area are;
1) In-core (a struct page *).
2) On-swap.
3) Not yet allocated.
Extend the vm_operations structure to contain a "dup" function, which is
called for each area during a fork. For shared anonymous areas, this
dup() function increases the reference count to the area's dentry.
When the area is closed, the dentry ref-count is decremented. If this
becomes zero, then the dentry's delete operation is invoked, which cleans
up the "anon_inode_info" structure.
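A minimal user-space sketch of the dup/close reference counting described
above. Every name here (anon_dentry, anon_dup, anon_close, anon_delete) is
hypothetical and only models the idea; none of it is existing kernel code.

```c
#include <stddef.h>

/* Hypothetical model of the dummy dentry attached to a shared-anon area. */
struct anon_dentry {
    int count;                               /* references held by vm-areas */
    int deleted;                             /* set once the delete op has run */
    void (*d_delete)(struct anon_dentry *);  /* cleans up "anon_inode_info" */
};

struct vm_area {
    struct anon_dentry *dentry;
};

/* Stand-in for the dentry delete operation; a real version would free
 * the "anon_inode_info" structure. */
static void anon_delete(struct anon_dentry *d)
{
    d->deleted = 1;
}

/* Called for each area during a fork: the child shares the parent's
 * dentry, so the reference count goes up by one. */
static void anon_dup(const struct vm_area *parent, struct vm_area *child)
{
    child->dentry = parent->dentry;
    child->dentry->count++;
}

/* Called when an area is closed: drop the reference and invoke the
 * delete operation when the count reaches zero. */
static void anon_close(struct vm_area *area)
{
    struct anon_dentry *d = area->dentry;

    area->dentry = NULL;
    if (--d->count == 0 && d->d_delete)
        d->d_delete(d);
}
```

With one area holding the only reference, a fork takes the count to 2, and
the delete operation runs only when the second of the two areas is closed.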
For shared-anon areas, the ZERO_PAGE _cannot_ be used (well, not
efficiently with the current design of the MM sub-system). Example;
1) Task A creates an anonymous mapping
2) Task A forks to create Task B
3) Task A reads from an unloaded page in the anon-area
4) Task B then writes to the same location Task A
has read from.
If, at step 3), the zero-page is loaded, then when B write-protect faults
at step 4), the PTE in Task A will need to be updated to refer to the new
page. Basically, a loaded PTE in another context needs to be updated
because a page has been moved.
It would be possible to go to the anon-inode for the area, and walk along
the shared ring of vm-areas updating the PTEs for each task. But this is
time consuming, and would introduce a few races. The races are not too
difficult to handle, but I'd rather avoid them - at least for the time
being.
The design of the "anon_inode_info" structure is a challenge.
It obviously needs an array, indexed via "vm_start-vm_end+vm_offset".
The offset is needed for when tasks are sharing the anonymous area
and one unmaps part of the area (the unmapping will create one or two new
vm-areas, which will have a reference to the original dentry).
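A rough sketch of why the offset matters. It assumes the index is meant to
be the faulting address's offset within the vm-area plus vm_offset, counted
in pages, which is only one reading of the expression above. PAGE_SHIFT of
12, the trimmed-down vm_area struct and both helper names are assumptions
for illustration.

```c
#include <stddef.h>

#define PAGE_SHIFT 12   /* assumed 4 KiB pages */

/* Trimmed-down stand-in for vm_area_struct. */
struct vm_area {
    unsigned long vm_start;   /* first address of the area */
    unsigned long vm_end;     /* first address past the area */
    unsigned long vm_offset;  /* byte offset of vm_start in the original mapping */
};

/* Slot in the shared per-mapping array for a faulting address. */
static size_t anon_index(const struct vm_area *vma, unsigned long addr)
{
    return (size_t)(((addr - vma->vm_start) + vma->vm_offset) >> PAGE_SHIFT);
}

/* Unmapping the front of an area leaves a tail area. Its vm_offset grows
 * by the amount removed, so it still indexes the same slots as before. */
static struct vm_area anon_split_tail(const struct vm_area *vma,
                                      unsigned long new_start)
{
    struct vm_area tail = *vma;

    tail.vm_offset += new_start - vma->vm_start;
    tail.vm_start = new_start;
    return tail;
}
```

A task that has unmapped the first two pages and a task that still maps the
whole area both resolve the same address to the same array slot.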
I do not intend to support merging of adjacent, shared, anonymous areas.
The design of the structure is also dependent on how swapping for the
shared areas is to be supported. Ideally, we do not want to write out a
loaded page for each context that has it mapped.
Instead, in try_to_swap_out(), a PTE which is found to reference a loaded
page is simply cleared (zeroed). When another scanning function, which is
dedicated to handling shared areas, finds the page no longer has any
loaded PTEs the page can be written to swap. I _do_not_ like this, but it
will do for now (there is swap-page caching, which I won't go into here).
So "anon_inode_info" contains an array of "struct page" ptrs. If, for a
given index, the ptr is NULL it means the page is on swap. When it is
non-NULL, the page->count is the number of loaded PTEs which refer to the
page, plus 1. The "plus 1" is for the "anon_inode_info" reference.
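The counting rule can be modelled in a few lines. This is a toy sketch, not
kernel code: a PTE is modelled as a plain "struct page" pointer, and the
helper names are made up.

```c
#include <stddef.h>

struct page {
    int count;   /* loaded PTEs referring to the page, plus 1 */
};

/* try_to_swap_out()-style step: clear one loaded PTE and drop the
 * reference it held on the page. */
static void anon_unload_pte(struct page **pte)
{
    if (*pte != NULL) {
        (*pte)->count--;
        *pte = NULL;
    }
}

/* The scanner for shared areas may write the page to swap once the only
 * reference left is the one held by "anon_inode_info". */
static int anon_can_swap(const struct page *pg)
{
    return pg->count == 1;
}
```

A page mapped by two tasks starts with a count of 3, and only becomes a
swap candidate after both PTEs have been cleared.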
Of course, when the page ptr is NULL, there needs to be a way to find where
on swap the page is held. So the structure also contains an array of
"unsigned long"s, which are the swap 'entries'. If the 'entry' is zero,
it means the page has _never_ been loaded, so a zero-filled page is needed
on first access (we could also do eager swap-space allocation here).
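A sketch of the two-array layout and the lookup it implies. The field names,
the enum and anon_lookup() are all invented for illustration, and "struct
page" is reduced to a stub.

```c
#include <stddef.h>

struct page {
    int count;   /* stub for the kernel's struct page */
};

enum anon_state {
    ANON_IN_CORE,       /* page ptr is non-NULL */
    ANON_ON_SWAP,       /* page ptr is NULL, swap entry is non-zero */
    ANON_NEVER_LOADED   /* page ptr is NULL, swap entry is zero */
};

struct anon_inode_info {
    size_t nr_pages;            /* number of slots in each array */
    struct page **pages;        /* NULL means not in core */
    unsigned long *swap_entry;  /* 0 means never loaded */
};

/* Classify one slot; a fault handler would zero-fill a fresh page for
 * ANON_NEVER_LOADED and read from swap for ANON_ON_SWAP. */
static enum anon_state anon_lookup(const struct anon_inode_info *info,
                                   size_t idx)
{
    if (info->pages[idx] != NULL)
        return ANON_IN_CORE;
    if (info->swap_entry[idx] != 0)
        return ANON_ON_SWAP;
    return ANON_NEVER_LOADED;
}
```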
Now, a pointer and an unsigned long for each page is a bit fat. They
could be union-ed together, with a bitmap indicating what the value
represents...I'll have more idea when I get a working model to play with.
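One possible shape for the union-plus-bitmap variant, as a sketch only. The
names anon_slot, anon_slots and the slot_* helpers are hypothetical; a zero
swap entry with the bit clear would still mean "never loaded".

```c
#include <limits.h>
#include <stddef.h>

struct page;   /* opaque here; only pointers are stored */

/* One word per page: either a page pointer or a swap entry. */
union anon_slot {
    struct page *page;
    unsigned long swap_entry;
};

#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

struct anon_slots {
    union anon_slot *slot;
    unsigned long *in_core;   /* bit set: the slot holds a page pointer */
};

/* Returns 1 if the slot currently holds a page pointer, else 0. */
static int slot_in_core(const struct anon_slots *s, size_t idx)
{
    return (int)((s->in_core[idx / BITS_PER_LONG]
                  >> (idx % BITS_PER_LONG)) & 1UL);
}

/* Store a page pointer and mark the slot as in-core. */
static void slot_set_page(struct anon_slots *s, size_t idx, struct page *pg)
{
    s->slot[idx].page = pg;
    s->in_core[idx / BITS_PER_LONG] |= 1UL << (idx % BITS_PER_LONG);
}

/* Store a swap entry and mark the slot as not in-core. */
static void slot_set_swap(struct anon_slots *s, size_t idx,
                          unsigned long entry)
{
    s->slot[idx].swap_entry = entry;
    s->in_core[idx / BITS_PER_LONG] &= ~(1UL << (idx % BITS_PER_LONG));
}
```

This trades the second array for one bit per page, at the cost of a bitmap
test before every slot access.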
Yes, it is possible to build "Shm memory" onto this - which is what I want
to do. It is also possible to use this for private, anonymous mappings.
(vm_ops->dup(), which will be used during a fork(), would create a new
dentry/inode and copy the "anon_inode_info" - or it could be copied on the
first fault to the area, but this would be ugly).
To be complete, "anon_inode_info" would also need to be used for private,
named, mappings (these create an anonymous page upon the first
write-access). This gets non-trivial. For starters mappings already have
a dentry indicating their load-store (the named-file). A vm_area_struct
would need to be extended to have a load-store dentry and a backing-store
dentry.
During a not-present fault on a private, named, mapping, the backing
store dentry would first need to be checked to see if the page is already
loaded, or is on swap. (Note: A page may already be loaded if
try_to_swap_out() has unloaded the PTE but has not yet written the page to
swap). If the backing-store dentry didn't contain the page, then the
front-store would be checked (which is the existing page-cache). If not
present in this store, the page would be loaded from the front-store's
dentry via i_op->readpage().
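The lookup order for such a fault can be written down as a small decision
function. This is a sketch of the ordering only; the struct, the enum and
resolve_fault() are hypothetical, and the three flags stand in for real
lookups in the backing store and the page cache.

```c
/* Where the page for a not-present fault would come from. */
enum page_source {
    FROM_BACKING_CORE,   /* backing store still has the page loaded */
    FROM_BACKING_SWAP,   /* backing store has the page on swap */
    FROM_PAGE_CACHE,     /* front-store page is in the page cache */
    FROM_READPAGE        /* must be read in via i_op->readpage() */
};

/* Results of the individual lookups, modelled as flags. */
struct fault_view {
    int backing_has_page;
    int backing_has_swap;
    int page_cache_has_page;
};

/* Backing store first (loaded, then swap), then the page cache, and
 * only then a read from the named file. */
static enum page_source resolve_fault(const struct fault_view *v)
{
    if (v->backing_has_page)
        return FROM_BACKING_CORE;
    if (v->backing_has_swap)
        return FROM_BACKING_SWAP;
    if (v->page_cache_has_page)
        return FROM_PAGE_CACHE;
    return FROM_READPAGE;
}
```

The backing store has to win over the page cache, since once a private page
has been written to, the copy in the page cache is stale for this mapping.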
Moving the location of swap-page out of the PTE, and into a structure
associated with a dentry, has both advantages and dis-advantages.
One of the advantages would be the ability of releasing a page-table after
all the PTEs have been unloaded - as the PTEs in the table are no longer
required to locate a swap-page.
A couple of dis-advantages are increased memory usage (for the extra
structure) and slightly longer code paths in some cases (but not all
cases).
If it could _always_ be guaranteed that a swap area is present, then a lot
of details could be pushed down into the swap-area management layer.
In fact, the swap code could be made into a true file-system - mapping the
existing file-system operation function ptrs into swap-system operation
functions. While this sounds nice-and-pretty, it can be horribly
inefficient.
There are likely to be mistakes in the above. Please simply take this as
a quick brain dump - not a design document!
Back to the original point, of shared, anonymous, mappings. I've already
re-arranged the code under /mm to get it "into shape (or rather, how I
like it)". I've got to finish off page-colouring this week, but I
should get a working implementation (but not the _correct_ implementation)
out by the end of the week.
markhe