Re: Ideas for reducing memory copying and zeroing times

Jamie Lokier (jamie@rebellion.co.uk)
Wed, 17 Apr 96 02:55 BST


>>>>> "Werner" == Werner Almesberger <almesber@lrc.epfl.ch> writes:

Werner> For slightly different wording, you may also want to read
Werner> ftp://lrcftp.epfl.ch/pub/linux/atm/papers/atm_on_lowend.ps.gz
Werner> or have a look at net/atm/mmuio.c in
Werner> ftp://lrcftp.epfl.ch/pub/linux/atm/dist/atm-0.10.tar.gz

Werner> See http://lrcwww.epfl.ch/linux-atm/ for the general
Werner> picture.

I shall do my best. I hope your WWW pages look good in Lynx! :-)
I shall certainly read your paper and source (when I get the time...).

Werner> Also, my code only works with the ATM side of networking -
Werner> the whole IP stack is a bit more complex, but I think Alan
Werner> has started working on that.

It sounds like you have networking specifically in mind. Certainly,
file serving and routing are obvious applications to benefit from this
(they're what I had in mind too), but general I/O in all applications
is likely to benefit as well. There is a potential reduction in
swapping too (see "Swapping benefits").

I've noticed that when performance becomes important, people start
adding `mmap' optimisations to their programs. INND springs to mind.
And all my non-networking major I/O programs prefer to use `mmap' :-).
Also there's the new `mmap' capability in the sound driver. Page
sharing would reduce the need for that sort of thing, with the remaining
major bottleneck being the need for `read' to wait for the status of the
read operation. Even that could be eliminated when it is acceptable for
an application to receive a SEGV when it tries to access a page that it
thought it had read successfully, but for which asynchronous I/O failed.
If it doesn't reference the page but does another `read' instead, it can
receive the error code then.
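
As a concrete illustration, here is the sort of `mmap' read such
programs do, in user space (the file name is made up): the kernel
shares its cached pages with the process instead of copying into a
private buffer.

    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/mman.h>
    #include <sys/stat.h>

    int main(void)
    {
        const char *path = "/var/spool/news/history"; /* made-up input file */
        struct stat st;
        int fd = open(path, O_RDONLY);

        if (fd < 0 || fstat(fd, &st) < 0) {
            perror(path);
            return 1;
        }

        /* Map the file instead of read()ing it: the process sees the
         * cached pages directly, with no copy into a private buffer. */
        char *data = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        if (data == MAP_FAILED) {
            perror("mmap");
            return 1;
        }

        /* Each first touch faults a page in.  If an asynchronous read
         * had failed, an access like this is where the signal would
         * be delivered. */
        long sum = 0;
        for (off_t i = 0; i < st.st_size; i++)
            sum += (unsigned char)data[i];

        printf("sum %ld over %ld bytes, no user-space copies\n",
               sum, (long)st.st_size);
        munmap(data, st.st_size);
        close(fd);
        return 0;
    }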

The point is that these page sharing optimisations might well provide a
worthwhile performance improvement to existing I/O intensive software,
without requiring any changes to the software. They'd also leave more
CPU free for everything else.

>>>>> "Ingo" == Ingo Molnar <mingo@pc5829.hil.siemens.co.at> writes:

Ingo> havent checked out your source yet, but under AIX similar
Ingo> approach is used for async IO. Under Linux, if processes that
Ingo> get clone()-d with VM_SHARE, and run on another CPU (or even
Ingo> the same if the socket operation sleeps) and use the same
Ingo> page(s) which are locked, then we might get problems. If there
Ingo> is a local variable right in the same page as your IO buffer,
Ingo> doesnt this cause problems or unnecessary/unexpected sleeps?

Sharing semantics
=================

This is what I meant when I said that the sharing semantics get a bit
more complicated. In this case, if either cloned task writes to the
page, and they are not supposed to be able to modify the I/O buffer, a
copy is made (copy-on-write), but the copy is still shared between the
two tasks (though not the I/O buffer). The original page may still be
shared between I/O buffers, other tasks, and even other addresses within
the same cloned tasks. Sometimes the tasks are allowed to modify the
I/O buffer, even during I/O (see "Asynchronicity").
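
Here is a rough model of the bookkeeping I have in mind; the
structures and names are mine for illustration, not anything in the
kernel. A write by either cloned task copies the page out from under
the I/O buffer, and because the copy is installed in the mapping the
two tasks share, they go on sharing it with each other:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define PAGE_SIZE 4096

    struct page {
        int count;                  /* mappings + I/O references */
        char data[PAGE_SIZE];
    };

    /* One mapping slot; tasks cloned with VM_SHARE see the same slot. */
    struct mapping {
        struct page *pg;
        int writable;               /* clear => fault and copy on write */
    };

    static struct page *page_copy(struct page *old)
    {
        struct page *new = malloc(sizeof *new);
        memcpy(new->data, old->data, PAGE_SIZE);
        new->count = 1;
        old->count--;
        return new;
    }

    /* Write fault: the page is still referenced by the I/O buffer, so
     * copy it.  The faulting mapping is shared by both cloned tasks,
     * so they share the copy -- but the I/O keeps the original. */
    static void write_fault(struct mapping *m)
    {
        if (!m->writable && m->pg->count > 1)
            m->pg = page_copy(m->pg);
        m->writable = 1;
    }

    int main(void)
    {
        static struct page buf = { 2 };      /* one mapping + queued I/O */
        struct mapping shared = { &buf, 0 }; /* seen by both tasks */

        strcpy(buf.data, "packet payload");
        write_fault(&shared);                /* either task writes... */
        strcpy(shared.pg->data, "modified");

        printf("I/O still sees:   \"%s\"\n", buf.data);
        printf("cloned tasks see: \"%s\"\n", shared.pg->data);
        return 0;
    }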

By the way, with this model a single page can be shared between
different places within a single task, even within the same VM area,
without being shared in the traditional sense. That is, a write to one
instance of the page should not be reflected to other instances. This
is easy to produce: the task writes some pages to a file, and then reads
them back into another part of its data area. The data then occurs in
two places in a single VM area in a single task, but modifications to
each instance must be kept independent by copying the page on demand.
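
It's also easy to demonstrate from user space (the file name is
invented); with page sharing, the kernel could back both buffers with
one physical page right up until the final write:

    #include <stdio.h>
    #include <string.h>
    #include <fcntl.h>
    #include <unistd.h>

    #define PAGE_SIZE 4096

    int main(void)
    {
        static char first[PAGE_SIZE], second[PAGE_SIZE];
        int fd = open("/tmp/share-demo", O_RDWR | O_CREAT | O_TRUNC, 0600);

        if (fd < 0)
            return 1;
        memset(first, 'x', PAGE_SIZE);

        /* Write a page out, then read it back into a different part of
         * the same data area: two addresses could now share one page. */
        write(fd, first, PAGE_SIZE);
        lseek(fd, 0, SEEK_SET);
        read(fd, second, PAGE_SIZE);

        /* ...but a store through one address must not show through the
         * other, so the page has to be copied on the first write. */
        second[0] = 'y';
        printf("first[0]=%c second[0]=%c\n", first[0], second[0]);

        close(fd);
        unlink("/tmp/share-demo");
        return 0;
    }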

Of course, a single page can also be properly shared in the traditional
sense, in all sorts of ways using the traditional mechanisms for this
(`mmap' of /proc/../mem, `mmap' of files, shared-memory calls, VM_SHARE,
etc.). Both sorts of sharing occur at the same time, so it's all quite
complicated.

Asynchronicity
==============

It is wise to treat all I/O as asynchronous, with synchronous I/O being
a special case where the task sleeps until notified later. Whatever
optimisations apply to the synchronous case (with no other "traditional"
sharing), try to apply them in general.

Being asynchronous, it doesn't really matter if other tasks see the data
being written directly into their VM space, does it? Similarly, other
tasks are welcome to modify the data during an asynchronous write
operation. It is only when some task (or group of them sharing in the
traditional sense) or I/O subsystem considers a page to be all its own
that you have to mark it copy-on-write.

Writing
=======

For example, just before the net subsystem calculates the checksum for a
region (during a `write'), it will have to claim a "private" copy of the
pages in question. This simply involves marking the pages in all VM
areas in all tasks as requiring copy-on-write. Other information that
applies to the tasks can be calculated lazily later. While the data is
merely being moved around, though (IP routing, no room in the Ethernet
card's TX buffer, etc.), there is no need to mark the pages in this way.
Disk
drivers, on the other hand, don't care if a task modifies the data as it
is written to disk, as this is presumably allowed by asynchronous write
semantics. No sleeping on behalf of the I/O systems is involved -- if a
page is copied, it is the responsibility of the task that wrote to it to
do the copy.
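
A sketch of what "claiming" might look like, with invented structures
standing in for the real VM data: write-protect every mapping,
checksum, and let any later writer pay for its own copy.

    #include <stdio.h>
    #include <string.h>

    #define PAGE_SIZE 4096
    #define MAX_MAPPINGS 8

    struct mapping { int writable; };

    struct page {
        char data[PAGE_SIZE];
        struct mapping *maps[MAX_MAPPINGS]; /* every mapping, every task */
        int nmaps;
    };

    /* Claim a "private" copy: write-protect every mapping so the data
     * cannot change under the checksum.  Nothing is copied and nobody
     * sleeps; a task that writes later takes a fault and copies then. */
    static void claim_private(struct page *pg)
    {
        int i;
        for (i = 0; i < pg->nmaps; i++)
            pg->maps[i]->writable = 0;
    }

    static unsigned int checksum(const struct page *pg)
    {
        unsigned int sum = 0;
        int i;
        for (i = 0; i < PAGE_SIZE; i++)
            sum += (unsigned char)pg->data[i];
        return sum;
    }

    int main(void)
    {
        struct mapping task_a = { 1 }, task_b = { 1 };
        static struct page pg;

        pg.maps[pg.nmaps++] = &task_a;
        pg.maps[pg.nmaps++] = &task_b;
        strcpy(pg.data, "outgoing packet data");

        claim_private(&pg);      /* just before computing the checksum */
        printf("checksum %u; mappings read-only: %d %d\n",
               checksum(&pg), !task_a.writable, !task_b.writable);
        return 0;
    }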

You get the maximum benefit from page sharing by delaying copying and
copy-on-write marking (which isn't free either) as much as possible.

Reading
=======

When data is read, I/O appears to be modifying a page. If all of the
page-mappings in all tasks involved in the read expect to see the new
data, I/O can go ahead and simply modify the page. If not, the I/O can
write to a new page, and when it has finished the page is mapped into
all the places where the data is supposed to be seen. If the I/O
doesn't fill the page with data, then after the I/O some data is copied
from the old page to the new one (which is then remapped). If you know
how much data you're expecting, then in principle this latter copying
can happen in parallel with the I/O. Even if not, the initial part of
the page can always be copied in parallel. Actually, after the I/O has
completed you can check whether the copy is still needed at all; if it
isn't, it may or may not be faster to copy the newly read data into the
original page instead.
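
In outline, assuming the read of `len' bytes at offset `off' has
landed in a fresh page (the function and names are hypothetical):

    #include <stdio.h>
    #include <string.h>

    #define PAGE_SIZE 4096

    /* Fill in the parts of the new page the I/O didn't cover from the
     * old page, then let the caller remap.  A whole-page read copies
     * nothing; the head copy could have run in parallel with the I/O. */
    static char *finish_read(const char *old_page, char *new_page,
                             int off, int len)
    {
        if (off == 0 && len == PAGE_SIZE)
            return new_page;                /* whole page: no copying */

        memcpy(new_page, old_page, off);                    /* head */
        memcpy(new_page + off + len, old_page + off + len,  /* tail */
               PAGE_SIZE - off - len);
        return new_page;    /* mapped in wherever the data is expected */
    }

    int main(void)
    {
        static char old_page[PAGE_SIZE] = "existing data existing data";
        static char new_page[PAGE_SIZE];

        memcpy(new_page + 9, "DATA", 4);    /* pretend I/O read 4 bytes */
        printf("%.27s\n", finish_read(old_page, new_page, 9, 4));
        return 0;
    }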

If you're reading from the network, involving headers and checksums and
things, then the I/O won't be happening directly into running tasks'
pages. In this case, everything happens as in the second case above:
I/O reads the data, headers are processed, and eventually the data is
copied into the desired page, or the rest of the original page copied
into the newly read data page.

In the special case of I/O reading a whole number of pages, both of the
above scenarios avoid copying data. In general, they minimise the
amount of data copied.

As for partial page copies, as ever it is worth treating the copying of
a zero-mapped page as a special case.
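
Concretely, the special case is just that "copying" from the zero
page degenerates into clearing, or into nothing at all if the new
data went into a pre-zeroed page (a sketch with invented parameters):

    #include <string.h>

    #define PAGE_SIZE 4096

    static void fill_remainder(char *new_page, const char *old_page,
                               int off, int len,
                               int old_is_zero_page, int new_is_prezeroed)
    {
        if (old_is_zero_page) {             /* clear instead of copy... */
            if (!new_is_prezeroed) {
                memset(new_page, 0, off);
                memset(new_page + off + len, 0, PAGE_SIZE - off - len);
            }
            return;                         /* ...or do nothing at all */
        }
        memcpy(new_page, old_page, off);
        memcpy(new_page + off + len, old_page + off + len,
               PAGE_SIZE - off - len);
    }

    int main(void)
    {
        static char old_page[PAGE_SIZE], new_page[PAGE_SIZE];
        fill_remainder(new_page, old_page, 100, 12, 1, 0);
        return 0;
    }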

Asynchronicity again
====================

In all situations, I think that whatever copying needs doing can be done
without locking the rest of the kernel. That is, the copying can be
done on behalf of some task or other, and thus pre-empted safely if
there are more pressing things to be doing. If you want to get silly,
the post-read copies could be delayed until a task actually
references the copied data, though in general tasks tend to reference
the data they just requested very soon afterwards, so there's no point.

I see what you mean about unexpected sleeps -- in a device driver, when
data would like to get written to a page, for example. Then there is
sometimes a need to allocate a new page for the incoming data. This is
a consequence of delaying (and sometimes avoiding) page allocations. It
shouldn't ever lead to more sleeping, allocation or copying than the
current system, but it can occur at less predictable times. Perhaps the
best solution is to allocate the page as late as possible outside
bottom halves, etc., shortly before commencing an I/O operation that
needs a new page. At that point, check if the page needs to be
duplicated, if a new one needs to be allocated, or if the existing one
can be used. That is also a good time to check if the existing page is
swapped out :-).
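
The decision at that point might look something like this; the states
and names are only illustrative:

    #include <stdio.h>

    enum page_state { EXCLUSIVE, SHARED, SWAPPED_OUT };
    enum action     { USE_IN_PLACE, ALLOC_FRESH, SWAP_IN };

    /* Resolve the target page at process level, never in a bottom
     * half, just before the read is started: any sleeping for
     * allocation or swap-in happens here, not in the interrupt path. */
    static enum action resolve_target(enum page_state st, int whole_page)
    {
        switch (st) {
        case EXCLUSIVE:
            return USE_IN_PLACE;    /* nobody else sees it: I/O in place */
        case SHARED:
            return ALLOC_FRESH;     /* fresh page now, remap afterwards */
        case SWAPPED_OUT:
            /* A read that overwrites the whole page means the swap
             * copy can simply be discarded (see "Swapping benefits"). */
            return whole_page ? ALLOC_FRESH : SWAP_IN;
        }
        return SWAP_IN;
    }

    int main(void)
    {
        printf("%d %d %d\n",
               resolve_target(EXCLUSIVE, 0),
               resolve_target(SHARED, 1),
               resolve_target(SWAPPED_OUT, 1));
        return 0;
    }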

Page allocation and the zero-page pool
======================================

Ingo> currently, apart from a (usually very small) initial time, all
Ingo> buffers are in use. There is a small amount of free pages
Ingo> reserved for interrupt handlers and alike, but "pages that
Ingo> could be zeroed out" are not available.

As ever, it is worth maintaining a small pool of unallocated pages, so
that last-minute page allocations tend not to have to sleep. This is
done already of course. These pages can be shared with the zero-page
pool. Uninitialised pages turn into zeroed ones when there's nothing
better to do. If you need a new page and there's none in the
uninitialised pool, just take one of the zeroed ones. If there aren't
any of those, _then_ you have to steal another page, hopefully one
mirrored in swap, but potentially requiring swapping something out and
sleeping.
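
A sketch of the two pools and the fallback order; the steal-and-swap
path is omitted and all the names are mine:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define PAGE_SIZE 4096

    struct page {
        struct page *next;
        char data[PAGE_SIZE];
    };

    static struct page *uninit_pool;    /* free, contents undefined */
    static struct page *zeroed_pool;    /* free, already zero-filled */

    static struct page *pop(struct page **list)
    {
        struct page *pg = *list;
        if (pg)
            *list = pg->next;
        return pg;
    }

    static void push(struct page **list, struct page *pg)
    {
        pg->next = *list;
        *list = pg;
    }

    /* Idle-time work: an uninitialised page turns into a zeroed one. */
    static int zero_one_page(void)
    {
        struct page *pg = pop(&uninit_pool);
        if (!pg)
            return 0;
        memset(pg->data, 0, PAGE_SIZE);
        push(&zeroed_pool, pg);
        return 1;
    }

    /* Allocation: prefer the matching pool, fall back to the other,
     * and only when both are empty steal a page (that path, which may
     * sleep and swap, is omitted here). */
    static struct page *get_page(int want_zeroed)
    {
        struct page *pg;

        if (want_zeroed) {
            pg = pop(&zeroed_pool);
            if (pg)
                return pg;          /* no zeroing needed now */
            pg = pop(&uninit_pool);
            if (pg)
                memset(pg->data, 0, PAGE_SIZE);
            return pg;
        }
        pg = pop(&uninit_pool);
        if (!pg)
            pg = pop(&zeroed_pool); /* wastes the pre-zeroing, but works */
        return pg;
    }

    int main(void)
    {
        int i;
        for (i = 0; i < 4; i++)             /* seed the free pool */
            push(&uninit_pool, malloc(sizeof(struct page)));

        while (zero_one_page())             /* "nothing better to do" */
            ;
        printf("zeroed: %p\n", (void *)get_page(1));
        printf("plain:  %p\n", (void *)get_page(0));
        return 0;
    }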

If you're doing a partial fill of a zero-mapped page, then you have the
option of fetching a pre-zeroed page and partially filling it, or doing
the partial fill into an uninitialised page and zeroing out the
remainder, either in parallel or afterwards. The former avoids any
zero-fill right now, but wastes the benefit of having pre-zeroed the
part that will be written over. That is bad if the pre-zeroed page
could have been used later by something that needs more of it
zero-filled.
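
Spelled out, the two options look like this (illustrative code, with
static arrays standing in for the pools):

    #include <string.h>

    #define PAGE_SIZE 4096

    static char page_pool[2][PAGE_SIZE];  /* stand-ins for the two pools */

    /* Option A: take a pre-zeroed page and overwrite part of it.  No
     * zeroing happens now, but `len' bytes of earlier zeroing are
     * wasted, and a whole pre-zeroed page is consumed. */
    static char *fill_from_prezeroed(const char *src, int off, int len)
    {
        char *pg = page_pool[0];          /* pretend this was pre-zeroed */
        memcpy(pg + off, src, len);
        return pg;
    }

    /* Option B: take an uninitialised page, fill it, and zero the
     * remainder now (or in parallel with, or after, the fill). */
    static char *fill_from_uninitialised(const char *src, int off, int len)
    {
        char *pg = page_pool[1];
        memcpy(pg + off, src, len);
        memset(pg, 0, off);
        memset(pg + off + len, 0, PAGE_SIZE - off - len);
        return pg;
    }

    int main(void)
    {
        fill_from_prezeroed("abc", 10, 3);
        fill_from_uninitialised("abc", 10, 3);
        return 0;
    }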

Swapping benefits
=================

If I/O (or other data transfer) will update pages that are swapped out,
all this memory-copying avoidance also has the advantage that the pages
don't need to be read from disk before they're updated. The disk copies
can simply be discarded.

Another benefit is that by sharing more data, everything requires less
RAM and swap space. So as well as cutting down on memory copy times,
you also reduce swap times.

The cost vs. benefit of maintaining pre-zeroed and free pages is a more
difficult thing to quantify: it seems to be a good idea in some
circumstances, and a bad one in others. Then there are decisions
relating to which way to copy data, and when to use pre-zeroed pages for
partial page fills. Someone will be able to find some reasonable
heuristics to control all of this, I'm sure. One thing is clear: apart
from the complexity of the implementation, this is unlikely to lead to
worse performance than the current system under any reasonable
circumstances.

Optimising competing strategies
===============================

There are costs and benefits to all of the various competing strategies.
Delaying page copying, and thus avoiding much of it, seems always to
be a good idea -- apart from the complexity, that is, though there is
probably some clean way to model it. There is also some overhead in
maintaining
the more complex page data structures, and updating the mappings in all
tasks sharing the pages. That also applies now, when a shared page is
swapped out, but will be worse with more sharing. Of course, with more
sharing there is less swapping anyway.

You can delay adding page mappings to a task (though not unmapping or
copy-on-write marking), which has its own benefits in that your
statistics regarding a process' working set are more accurate (so you
swap out the wrong pages less often), at the cost of more page fault
processing. (There's something to be said for deliberately unmapping
other, random pages from a task for this reason). Again, costs
vs. benefits. There is the decision as to whether to use a pre-zeroed
page now, or to zero-fill part of a page either now or later. Not to
mention the problem of how many free and pre-zeroed pages to aim for.
Probably the best strategy is to maintain statistics estimating how
effective each strategy would have been in the recent past, and use
those to guide decisions when decisions need to be made. This is really
a different problem though, and is tied in with optimising swapping
strategies.
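
For what it's worth, one guess at the sort of bookkeeping that might
do: an exponentially decayed score per strategy, updated as outcomes
are observed and compared when a decision is needed. Entirely
hypothetical:

    #include <stdio.h>

    /* Fixed-point decayed score: each outcome nudges the running
     * estimate towards benefit*256, so old history fades away. */
    struct strategy_stat { long score; };

    static void record(struct strategy_stat *s, int benefit)
    {
        s->score += ((long)benefit * 256 - s->score) / 8;
    }

    static int prefer_prezeroed(const struct strategy_stat *prezeroed,
                                const struct strategy_stat *zero_later)
    {
        return prezeroed->score >= zero_later->score;
    }

    int main(void)
    {
        struct strategy_stat a = { 0 }, b = { 0 };
        int i;

        for (i = 0; i < 20; i++) {
            record(&a, 3);   /* pre-zeroed page saved 3 units recently */
            record(&b, 1);   /* zeroing later saved only 1 */
        }
        printf("use pre-zeroed page: %s\n",
               prefer_prezeroed(&a, &b) ? "yes" : "no");
        return 0;
    }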

Finally
=======

The following silly shell command will go much, much faster:

dd if=/dev/zero bs=1024k count=200 | cat | cat | cat | cat > /dev/null

Thanks for taking an interest, BTW.

Enjoy,
-- Jamie Lokier