Cyrus mmap vs lseek/write usage - (WAS: BUG: mmapfile/writevspurious zero bytes (x86_64/not i386, bisected, reproducable))

From: Bron Gondwana
Date: Wed Jun 18 2008 - 01:12:21 EST


On Tue, Jun 17, 2008 at 09:03:17PM -0700, Linus Torvalds wrote:
> On Wed, 18 Jun 2008, Bron Gondwana wrote:
> >
> > For my sins, I appear to be becoming the world expert on
> > that particular file.
>
> Heh. Congrats ;)
>
> > I've debugged skiplist bugs many times over, and completely rewritten
> > the locking code. It really does some pretty evil things - the memory
> > accesses look something like this:
> >
> > [file...................]
> > [mmap^....^.^........^^..................................]
> > [file...................++++++++++++]
> > [mmap^....^.^........^^.^^ ^ ^^.....................]
> >
> > Where (^) is the bits that get accessed. All reads are via
> > the mmap, all writes are done with retry_write or
> > retry_writev (Cyrus library functions that keep hammering
> > until all the bytes are written)
>
> Is there any reason it doesn't use mmap(MAP_SHARED) and make the
> modifications that way too?

Portability[tm].

It actually does use MAP_SHARED already, but only for reading.
Writing is all done with seeks and writes, followed by a map
"refresh", which is really just an unmmap/mmap if the file has
extended past the "SLOP" (between 8 and 16 k after the end of
the file length at last mapping).

I can't just right now find the place where the reasoning behind
this was explained to me. The theory I think was that mmap write
support is unreliable across systems, but read support is pretty
good (except HPUX for which there is map_stupidshared.c)

> Because quite frankly, the mixture of doing mmap() and write() system
> calls is quite fragile - and I'm not saying that just because of this
> particular bug, but because there are all kinds of nasty cache aliasing
> issues with virtually indexed caches etc that just fundamentally mean that
> it's often a mistake to mix mmap with read/write at the same time.

I'm not actually a maintainer for Cyrus, I just write patches to
keep our mail servers working. I'll pass this on.

> (For the same reason it's not a good idea to mix writing through an mmap()
> and then using read() to read it - again, you can have some nasty aliasing
> going on there).
>
> So this particular issue was definitely a kernel bug (and big thanks for
> making such a good test-case), but in general, it does sound like Cyrus is
> actively trying to dig itself into a nasty hole there.

Yeah, indeed.

I suspect the response from the Cyrus side might include a
small dose of "POSIX says it's valid to do this, and making
it work is the kernel programmers' lookout".

Ahh - I found the explaination in doc/internal/hacking in
the Cyrus source tree. While 'ack' is a nice tool, it
doesn't check files with no extention by default. Ho hum:

- map_refresh and map_free

- In many cases, it is far more effective to read a file via the operating
system's mmap facility than it is to via the traditional read() and
lseek system calls. To this end, Cyrus provides an operating system
independent wrapper around the mmap() services (or lack thereof) of the
operating system.

- Cyrus currently only supports read-only memory maps, all writes back
to a file need to be done via the more traditional facilities. This is
to enable very low-performance support for operating systems which do not
provide an mmap() facility via a fake userspace mmap.

- To create a map, simply call map_refresh on the map (details are in
lib/map.h). To free it, call map_free on the same map.

- Despite the fact that the maps are read-only, it is often useful to open
the file descriptors O_RDWR, especially if the file decriptors could
possibly be used for writing elsewhere in the code. Some operating
systems REQUIRE file descriptors that are mmap()ed to be opened
O_RDWR, so just do it.

If I was God in the Cyrus world (woot) I suspect I'd
provide some sort of OS independent wrapper around the
various write functions, using mmap where appropriate,
while still working via lseek/write for those systems
without mmap support.

(Added CC: Ken Murchison, on the grounds that he actually
is God in the Cyrus world)

Thanks for the good explaination. I'll have a look at the
Cyrus code and see just how tricky that would actually be
(even just doing the skiplist, index and cache code would
hit 99% of the cases where files are both mmaped and
written concurrently)

Bron.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/