Re: Possible kernel bug

Bernd Schmidt (crux@Pool.Informatik.RWTH-Aachen.DE)
Thu, 26 Jun 1997 10:50:50 +0200 (MET DST)


Hello,

> I have discovered that working in memory mmap()ed from files can lead
> to really poor performance. I have put more details and a test program
> at:
>
> http://www.acs.uncwil.edu/~jlnance/mapsort/index.html

I can't currently download that (DNS name lookup failure). I've been hacking
a bit in this area, and I've come to the following conclusion:

Shared file mappings are broken in current kernels.

There are several reasons. Most of them are related to the fact that the
page cache does not know about dirty pages. A page is marked dirty in the
page table by the CPU whenever someone writes to the page. The swap-out code
(try_to_swap_out) looks for pages with the dirty bit set, and calls the
vma->vm_ops->swapout function for these pages. This is where the problems
start.
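
For anyone who wants to see the setup from user space, here is a minimal
sketch (not taken from the test program mentioned above) of a shared file
mapping: one file page mapped twice with MAP_SHARED, so two ptes refer to the
same page cache page. The helper name shared_mapping_demo and the /tmp scratch
path are made up for illustration, and a POSIX system is assumed.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if a store through one MAP_SHARED mapping
 * is visible through a second mapping of the same file page, 0 if it is
 * not, and -1 if any system call fails. */
static int shared_mapping_demo(void)
{
    char path[] = "/tmp/mapdemoXXXXXX";   /* assumed scratch location */
    long page = sysconf(_SC_PAGESIZE);
    int fd, visible;
    char *a, *b;

    if (page <= 0)
        return -1;
    fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);                         /* file lives until fd is closed */
    if (ftruncate(fd, page) != 0) {       /* one zero-filled page */
        close(fd);
        return -1;
    }

    /* Map the same file page twice. */
    a = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    b = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (a == MAP_FAILED || b == MAP_FAILED) {
        if (a != MAP_FAILED)
            munmap(a, (size_t)page);
        if (b != MAP_FAILED)
            munmap(b, (size_t)page);
        close(fd);
        return -1;
    }
    assert(a != b);   /* two virtual addresses, hence two separate ptes */

    /* The store goes through mapping a only, so only a's pte gets the
     * hardware dirty bit, yet the data is visible through b as well. */
    strcpy(a, "dirty");
    visible = (strcmp(b, "dirty") == 0);

    msync(a, (size_t)page, MS_SYNC);      /* explicit write-back request */
    munmap(a, (size_t)page);
    munmap(b, (size_t)page);
    close(fd);
    return visible;
}
```

The point of the sketch is only that the dirty state lives in the pte that was
written through, not in the page that both mappings share.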

1. If you have two processes, and the page is written to by both, you have
   two ptes in which the page is marked dirty. It will be written to disk
   twice (and it scales nicely - if you have n dirty ptes, the kernel will
   write it n times).
2. You can't get rid of a page unless it has been removed from _all_ process
   page tables. There was a (broken) patch to address this on the list today.
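
Point 1 can be shown with a toy model. This is a sketch in plain C, not kernel
code: struct toy_page and the toy_* functions are invented names, and the
per-page dirty flag is an assumption about what a fix could look like, not
something the current kernel has.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_MAPPERS 8

/* Toy model of one file page mapped by several processes. */
struct toy_page {
    bool pte_dirty[MAX_MAPPERS];  /* one dirty bit per mapping pte */
    size_t nr_mappers;            /* number of ptes mapping the page */
    bool page_dirty;              /* hypothetical per-page dirty flag */
    unsigned writes_to_disk;      /* how often the page hit the disk */
};

/* A process stores through its own mapping: only its pte is marked dirty. */
static void toy_write(struct toy_page *p, size_t mapper)
{
    assert(mapper < p->nr_mappers);
    p->pte_dirty[mapper] = true;
}

/* Behaviour described in point 1: every dirty pte causes its own write-out. */
static unsigned toy_swapout_per_pte(struct toy_page *p)
{
    for (size_t i = 0; i < p->nr_mappers; i++) {
        if (p->pte_dirty[i]) {
            p->pte_dirty[i] = false;
            p->writes_to_disk++;
        }
    }
    return p->writes_to_disk;
}

/* Assumed alternative: fold the pte bits into one page flag, write once. */
static unsigned toy_swapout_per_page(struct toy_page *p)
{
    for (size_t i = 0; i < p->nr_mappers; i++) {
        if (p->pte_dirty[i]) {
            p->pte_dirty[i] = false;
            p->page_dirty = true;
        }
    }
    if (p->page_dirty) {
        p->page_dirty = false;
        p->writes_to_disk++;
    }
    return p->writes_to_disk;
}
```

With three mappers that have all written to the page, toy_swapout_per_pte
counts three disk writes, while toy_swapout_per_page counts one. The model
compresses time: in the kernel the ptes are visited one process at a time, but
the number of writes comes out the same.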

Swapping out a page is implemented by making up a dummy struct file and
calling the file write function. There are _many_ problems with this.

3. First, filemap_write_page gets the inode semaphore. But it could already be
   held by the same process (if it entered sys_write, then needed some
   memory and thus caused the swapout) ==> deadlock.
4. The filesystem-specific file write function allocates buffers for the data
   and copies it into the buffers from the page cache (overhead).
5. The filesystem-specific file write function calls update_vm_cache to copy
   the data from the page cache page to the same page cache page (overhead).
6. The filesystem-specific file write function doesn't even call ll_rw_blk
   to write out the buffers directly - it just marks them dirty.
   (You could work around this by setting O_SYNC in the dummy struct file made
   up in filemap_write_page, but this can kill performance since you want to
   cache the blocks containing the block bitmaps.)
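
The deadlock in point 3 can be sketched with a toy non-recursive semaphore.
Again this is an illustration in plain C and not the kernel's code: toy_sem,
toy_sys_write and toy_write_page are invented stand-ins for the inode
semaphore, sys_write and filemap_write_page, and where a real down() would
sleep forever the model returns an error code so the case can be observed.

```c
#include <assert.h>
#include <stdbool.h>

enum toy_result { TOY_OK, TOY_WOULD_BLOCK, TOY_SELF_DEADLOCK };

/* Toy non-recursive semaphore that remembers who holds it. */
struct toy_sem {
    bool held;
    int owner;
};

static enum toy_result toy_down(struct toy_sem *s, int pid)
{
    if (s->held) {
        /* A real down() would sleep here. If the holder is the caller
         * itself, nobody is left to call up(): that is the deadlock. */
        return s->owner == pid ? TOY_SELF_DEADLOCK : TOY_WOULD_BLOCK;
    }
    s->held = true;
    s->owner = pid;
    return TOY_OK;
}

static void toy_up(struct toy_sem *s)
{
    assert(s->held);
    s->held = false;
}

/* Stand-in for filemap_write_page: takes the inode semaphore to write. */
static enum toy_result toy_write_page(struct toy_sem *inode_sem, int pid)
{
    enum toy_result r = toy_down(inode_sem, pid);
    if (r != TOY_OK)
        return r;
    /* ... the page would be written to the file here ... */
    toy_up(inode_sem);
    return TOY_OK;
}

/* Stand-in for sys_write: holds the inode semaphore and, if memory is
 * short, ends up in the swap-out path while still holding it. */
static enum toy_result toy_sys_write(struct toy_sem *inode_sem, int pid,
                                     bool low_memory)
{
    enum toy_result r = toy_down(inode_sem, pid);
    if (r != TOY_OK)
        return r;
    if (low_memory)
        r = toy_write_page(inode_sem, pid);  /* second down(), same pid */
    toy_up(inode_sem);
    return r;
}
```

Calling toy_sys_write with low_memory set returns TOY_SELF_DEADLOCK, which is
the spot where the real process would hang; without memory pressure the same
call returns TOY_OK.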

Finally, the whole process needs plenty of buffers. This may lead to
additional memory allocations which are not guaranteed to succeed (after all,
swapping only happens in low-memory situations). Swapping out a page can
fail for this reason, and you end up getting SIGBUS errors.

bernd