Re: 2.6.19 file content corruption on ext3

From: Linus Torvalds
Date: Tue Dec 19 2006 - 12:44:03 EST




Btw,
here's a totally new tangent on this: it's possible that user code is
simply BUGGY.

There is one case where the kernel actually forcibly writes zeroes into a
file: when we're writing a page that straddles the "inode->i_size"
boundary. See the various writepages in fs/buffer.c, they all contain
variations on that theme (although most of them aren't as well commented
as this snippet):

/*
* The page straddles i_size. It must be zeroed out on each and every
* writepage invocation because it may be mmapped. "A file is mapped
* in multiples of the page size. For a file that is not a multiple of
* the page size, the remaining memory is zeroed when mapped, and
* writes to that region are not written out to the file."
*/
kaddr = kmap_atomic(page, KM_USER0);
memset(kaddr + offset, 0, PAGE_CACHE_SIZE - offset);
flush_dcache_page(page);
kunmap_atomic(kaddr, KM_USER0);

Now, this should _matter_ only for user processes that are buggy, and that
have written to the page _before_ extending it with ftruncate(). That's
definitely a serious bug, but it's one that can do totally undetected
depending on when the actual write-out happens.

So what I'm saying is that if we end up writing things earlier thanks to
the more aggressive dirty-page-management thing in 2.6.19, we might
actually just expose a long-time userspace bug that was just a LOT harder
to trigger before..

I'm not saying this is the cause of all this, but we've been tearing our
hair out, and it migth be worthwhile trying this really really really
stupid patch that will notice when that happens at truncate() time, and
tell the user that he's a total idiot. Or something to that effect.

Maybe the reason this is so easy to trigger with rtorrent is not because
rtorrent does some magic pattern that triggers a kernel bug, but simply
because rtorrent itself might have a bug.

Ok, so it's a long shot, but it's still worth testing, I suspect. The
patch is very simple: whenever we do an _expanding_ truncate, we check the
last page of the _old_ size, and if there were non-zero contents past the
old size, we complain.

As an attachement is a test-program that _should_ trigger a
kernel message like

a.out: BADNESS: truncate check 17000

for good measure, just so that you can verify that the patch works and
actually catches this case.

(The 17000 number is just the one-hundred _invalid_ 0xaa bytes - out of
the 200 we wrote - that were summed up: 100*0xaa == 17000. Anything
non-zero is always a bug).

I doubt this is really it, but it's worth trying. If you fill out a page,
and only do "ftruncate()" in response to SIGBUS messages (and don't
truncate to whole pages), you could potentially see zeroes at the end of
the page exactly because _writeout_ cleared the page for you! So it
_could_ explain the symptoms, but only if user-space was horribly horribly
broken.

Linus

----
diff --git a/mm/memory.c b/mm/memory.c
index c00bac6..79cecab 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1842,6 +1842,33 @@ void unmap_mapping_range(struct address_space *mapping,
}
EXPORT_SYMBOL(unmap_mapping_range);

+static void check_last_page(struct address_space *mapping, loff_t size)
+{
+ pgoff_t index;
+ unsigned int offset;
+ struct page *page;
+
+ if (!mapping)
+ return;
+ offset = size & ~PAGE_MASK;
+ if (!offset)
+ return;
+ index = size >> PAGE_SHIFT;
+ page = find_lock_page(mapping, index);
+ if (page) {
+ unsigned int check = 0;
+ unsigned char *kaddr = kmap_atomic(page, KM_USER0);
+ do {
+ check += kaddr[offset++];
+ } while (offset < PAGE_SIZE);
+ kunmap_atomic(kaddr,KM_USER0);
+ unlock_page(page);
+ page_cache_release(page);
+ if (check)
+ printk("%s: BADNESS: truncate check %u\n", current->comm, check);
+ }
+}
+
/**
* vmtruncate - unmap mappings "freed" by truncate() syscall
* @inode: inode of the file used
@@ -1875,6 +1902,7 @@ do_expand:
goto out_sig;
if (offset > inode->i_sb->s_maxbytes)
goto out_big;
+ check_last_page(mapping, inode->i_size);
i_size_write(inode, offset);

out_truncate:#include <sys/mman.h>
#include <sys/fcntl.h>
#include <unistd.h>
#include <string.h>

int main(int argc, char **argv)
{
char *mapping;
int fd;

fd = open("mapfile", O_RDWR | O_TRUNC | O_CREAT, 0666);
if (fd < 0)
return -1;
if (ftruncate(fd, 10) < 0)
return -1;
mapping = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
if (-1 == (int)(long)mapping)
return -1;
memset(mapping, 0x55, 10);
if (ftruncate(fd, 100) < 0)
return -1;
memset(mapping, 0xaa, 200);
if (ftruncate(fd, 200) < 0)
return -1;
return 0;
}