faster fsync and real fdatasync [patch against 2.3.44]

From: Martin Schenk (schenkm@ping.at)
Date: Sun Feb 13 2000 - 12:11:24 EST


I have ported my patches from last October to the current kernel
version.

Here is a short description of what problem they address, how they do it
and why I think it is important that this problem finally gets solved:

What are the problems ?

- slow fsync on large files

  If you call fsync(fd) on a file descriptor the kernel scans a list of
  all cached pages for this file and writes those that are changed to
disk.

  If a lot of pages are cached, the scan through this list can take a
very
  long time (about 1ms/Mbyte on my dual PIII-450).
  This leads to bizarre situations where fsync is not bound by hard disk
  seek time, but main memory access speed ! [compare benchmarks at the
end]

- no real fdatasync

  fdatasync is a faster version of fsync that does not write the inode
  every time, if only unimportant information (file access/modification
  times) is changed. This eliminates the need for a lot of seeks from
the
  data to the inode (and back). [compare benchmarks at the end]

How did I implement this ?

- faster fsync

  The patch avoids going through the whole list of cached pages
  "inode->mapping->pages" by moving all "dirty" pages to the end of this
  list and maintaining a pointer to the first "dirty" page:
  "inode->mapping->i_drtypages".
  All these "dirty" pages have a status bit (pdirty) set.

  Maintaining this list does take almost no overhead:
  - when a buffer head goes into the dirty list [in refile_buffer],
    the pdirty bit on the page associated with this buffer is checked.
    If it is not yet set, this page is moved to the end of
    "inode->mapping->pages" and the pdirty bit is set.
  - at fsync time, only the pages after i_drtypages have to be checked,
    the pdirty bit of these pages is cleared and i_drtypages is moved
    to point at the end of "inode->mapping->pages".

- fdatasync

  the patch adds a flag I_RDIRTY that indicates that "important" data in
  the inode has been changed.
  make_inode_dirty is changed so that it sets both I_DIRTY and I_RDIRTY
  an additional function: make_inode_time_dirty only sets I_DIRTY
  ext2_sync_file checks for I_RDIRTY before syncing the inode
  (fsync explicitly sets I_RDIRTY to force writing the inode)

Why is this important ?

Some applications suffer a lot from slow fsyncs, especially if they
are transaction based and have to sync a lot.
There is a database vendor (Solid), who suggests to customers
complaining about slow checkpoints that they should change their
kernels to use file_fsync (which syncs a whole device) instead of
the normal fsync !

How to use this patch ?

Apply it and define "CONFIG_FASTSYNC" in order to enable the patch.
If you want to be sure that no dirty pages are missed, you can test
the patch by defining both "CONFIG_FASTSYNC" and "TEST_FASTSYNC".
This scans through all of [inode->mapping->pages] like the "normal"
fsync and detects whether the "new fsync" would have missed a page
(of course it is as slow as the old version).

Hope some people test this patch,

Martin

Please give me feedback, and tell me what I should do to have this
patch(or something better) in 2.4.

[ Or should I just publish my benchmark comparing the "old fsync"
with NT in a suitable newsgroup ;-) ]

Here the promised benchmarks [these are from last October]:

The following data has been generated on a dual PIII-450 with
128Mbyte memory
(IBM-DJNA-351520 ide hard disk ~9msec, write cache on disk
***disabled***)

The test creates a file and then does 1000 individual writes in
the file, syncing after every write.

blocks (2Kb each) 2.3.19 2.3.19+fastsync NT
--------------------------------------------------------
5000 28 17 37
50000 137 27 42
5000 (fdatasync) 0.2-9
50000(fdatasync) 3-9

2.3.19 takes about 30% CPU at 5000, >80% at 50000 blocks, the
others don't take significant CPU time.
With 2.3.19+fastsync, the time corresponds to 2 seeks (one to
the inode, one to the data) in the fsync case (the distance is
bigger in the bigger file).
The time in the fdatasync case depends a lot on the actual place
of the file on disk (as there is only a seek from data to data).
With the smaller file, you need almost no arm movements if you
are lucky :-)

[ if your times are a lot lower and fdatasync makes no difference,
try to disable hard disk write caching (on a production system you
should do it anyway, unless you have same special battery cached RAID)]

btw: I assume NTFS is slower because it needs to commit its metadata -
does anybody know whether there is a fdatasync equivalent on NT ?

If somebody is interested in my benchmark, just mail me or search the
archives of October for it.

--- ./fs/buffer.c Fri Feb 11 20:02:03 2000
+++ ../linuxelf-2.3/./fs/buffer.c Sun Feb 13 15:41:31 2000
@@ -372,6 +372,18 @@
 
         /* We need to protect against concurrent writers.. */
         down(&inode->i_sem);
+
+#ifdef CONFIG_FASTSYNC
+ { /* mark inode as dirty, so it always gets synced
+ this is the only difference to fdatasync !
+ in the moment this is used only by ext2
+ */
+ extern spinlock_t inode_lock;
+ spin_lock(&inode_lock);
+ inode->i_state |= I_RDIRTY;
+ spin_unlock(&inode_lock);
+ }
+#endif
         err = file->f_op->fsync(file, dentry);
         up(&inode->i_sem);
 
@@ -404,8 +416,9 @@
         err = -EINVAL;
         if (!file->f_op || !file->f_op->fsync)
                 goto out_putf;
-
+#ifndef CONFIG_FASTSYNC
         /* this needs further work, at the moment it is identical to fsync() */
+#endif
         down(&inode->i_sem);
         err = file->f_op->fsync(file, dentry);
         up(&inode->i_sem);
@@ -924,7 +937,19 @@
         if (buffer_locked(bh))
                 dispose = BUF_LOCKED;
         if (buffer_dirty(bh))
+#ifdef CONFIG_FASTSYNC
+ {
+ void move_page_to_inode_queue_end(struct page *);
+ dispose = BUF_DIRTY;
+ /* make sure the page is in the "dirty" section of the page list */
+ if (!PagePDirty(bh->b_page))
+ { /* mark Page as dirty, move to inode queue end */
+ move_page_to_inode_queue_end(bh->b_page);
+ }
+ }
+#else
                 dispose = BUF_DIRTY;
+#endif
         if (buffer_protected(bh))
                 dispose = BUF_PROTECTED;
         if (dispose != bh->b_list) {
--- ./fs/inode.c Sun Feb 13 16:46:21 2000
+++ ../linuxelf-2.3/./fs/inode.c Sun Feb 13 16:54:11 2000
@@ -116,6 +116,19 @@
 
         if (sb) {
                 spin_lock(&inode_lock);
+#ifdef CONFIG_FASTSYNC
+ /* we set I_DIRTY and I_RDIRTY so that fdatasync syncs the inode.
+ we test against I_RDIRTY, which is stricter */
+ if (!(inode->i_state & I_RDIRTY)) {
+ /* only add valid nodes that were not already added by
+ mark_inode_time_dirty */
+ if (!(inode->i_state & I_DIRTY) && !list_empty(&inode->i_hash)) {
+ list_del(&inode->i_list);
+ list_add(&inode->i_list, &sb->s_dirty);
+ }
+ inode->i_state |= I_DIRTY | I_RDIRTY;
+ }
+#else
                 if (!(inode->i_state & I_DIRTY)) {
                         inode->i_state |= I_DIRTY;
                         /* Only add valid (ie hashed) inodes to the dirty list */
@@ -124,10 +137,31 @@
                                 list_add(&inode->i_list, &sb->s_dirty);
                         }
                 }
+#endif
                 spin_unlock(&inode_lock);
         }
 }
 
+#ifdef CONFIG_FASTSYNC
+void __mark_inode_time_dirty(struct inode *inode)
+{
+ struct super_block *sb = inode->i_sb;
+
+ if (sb) {
+ spin_lock(&inode_lock);
+ if (!(inode->i_state & I_DIRTY)) {
+ inode->i_state |= I_DIRTY;
+ /* Only add valid (ie hashed) inodes to the dirty list */
+ if (!list_empty(&inode->i_hash)) {
+ list_del(&inode->i_list);
+ list_add(&inode->i_list, &sb->s_dirty);
+ }
+ }
+ spin_unlock(&inode_lock);
+ }
+}
+#endif
+
 static void __wait_on_inode(struct inode * inode)
 {
         DECLARE_WAITQUEUE(wait, current);
@@ -168,6 +202,9 @@
                          inode->i_count ? &inode_in_use : &inode_unused);
                 /* Set I_LOCK, reset I_DIRTY */
                 inode->i_state ^= I_DIRTY | I_LOCK;
+#ifdef CONFIG_FASTSYNC /* clear I_RDIRTY */
+ inode->i_state &= ~I_RDIRTY;
+#endif
                 spin_unlock(&inode_lock);
 
                 write_inode(inode);
--- ./fs/ext2/fsync.c Fri Feb 11 20:02:04 2000
+++ ../linuxelf-2.3/./fs/ext2/fsync.c Sun Feb 13 16:07:32 2000
@@ -136,7 +136,11 @@
                  */
                 goto skip;
 
+#ifdef CONFIG_FASTSYNC
+ err = generic_buffer_fast_fdatasync(inode, 0, ~0UL);
+#else
         err = generic_buffer_fdatasync(inode, 0, ~0UL);
+#endif
 
         for (wait=0; wait<=1; wait++)
         {
@@ -151,7 +155,13 @@
                                       wait);
         }
 skip:
+#ifdef CONFIG_FASTSYNC
+ /* only sync if marked really dirty (more than ctime/mtime changed) */
+ if (inode->i_state & I_RDIRTY)
+ err |= ext2_sync_inode(inode);
+#else
         err |= ext2_sync_inode (inode);
+#endif
         unlock_kernel();
         return err ? -EIO : 0;
 }
--- ./mm/filemap.c Fri Feb 11 20:02:12 2000
+++ ../linuxelf-2.3/./mm/filemap.c Sun Feb 13 15:34:09 2000
@@ -76,6 +76,59 @@
         atomic_dec(&page_cache_size);
 }
 
+#ifdef CONFIG_FASTSYNC
+/* do the work for remove_page_from_inode_queue,
+ when i_drtypages has to be changed
+*/
+void change_dirty_pointer(struct address_space *mapping, struct page *page)
+{
+ if (!mapping->nrpages) mapping->i_drtypages=NULL;
+ else
+ {
+ struct page *first=list_entry(&mapping->pages,struct page,list);
+ struct page *next=list_entry(page->list.next,struct page,list);
+ if (next==first) { /* dirtypages was last entry */
+ struct page *prev=list_entry(page->list.prev,struct page,list);
+ mapping->i_drtypages=prev;
+ }
+ else {
+ mapping->i_drtypages=next;
+ }
+ }
+}
+#endif
+
+#ifdef CONFIG_FASTSYNC
+/* this has to be called whenever a page might get dirty, or fsync
+ wont sync everything (define TEST_FASTSYNC to check whether you
+ got all the places)
+ it moves the page to the end of the page list (ie after i_drtypages)
+ this gets called in __refile_buffer, when the bh gets marked dirty
+ and the page is not yet "pdirty"
+ I hope it's ok to take the pagecache_lock there
+*/
+
+void move_page_to_inode_queue_end(struct page * page)
+{
+ spin_lock(&pagecache_lock);
+
+ SetPagePDirty(page); /* mark page as "possible dirty" */
+
+ /* check whether already at the right position */
+ if (!page->mapping) goto end;
+ if (page==page->mapping->i_drtypages) goto end;
+
+ /* remove from the current position */
+ list_del(&page->list);
+ /* insert at the end */
+ list_add_tail(&page->list, &page->mapping->pages);
+
+end:
+ spin_unlock(&pagecache_lock);
+}
+
+#endif
+
 /*
  * Remove a page from the page cache and free it. Caller has to make
  * sure the page is locked and that nobody else uses it - or that usage
@@ -388,6 +441,11 @@
  *
  * Start the IO..
  */
+
+#ifdef TEST_FASTSYNC
+static int drty_ok;
+#endif
+
 static int writeout_one_page(struct page *page)
 {
         struct buffer_head *bh, *head = page->buffers;
@@ -397,6 +455,11 @@
                 if (buffer_locked(bh) || !buffer_dirty(bh) || !buffer_uptodate(bh))
                         continue;
 
+#ifdef TEST_FASTSYNC
+ if (!drty_ok) { /* uh,uh this should not happen :-( */
+ printk("FASTSYNC does not work :-(\n");
+ }
+#endif
                 bh->b_flushtime = 0;
                 ll_rw_block(WRITE, 1, &bh);
         } while ((bh = bh->b_this_page) != head);
@@ -417,6 +480,80 @@
         return error;
 }
 
+#ifdef CONFIG_FASTSYNC
+static int do_buffer_fast_fdatasync(struct inode *inode, unsigned long start, unsigned long end, int (*fn)(struct page *))
+{
+ struct list_head *head, *curr;
+ struct page *page=NULL;
+ int retval = 0;
+ int clear_drty = 0;
+
+ head = &inode->i_mapping->pages;
+
+ if (fn==waitfor_one_page && !start && (end==~0UL)) {
+ /* we can now clear the pdirty bits and move i_drtypages to the end */
+ clear_drty = 1;
+ }
+ spin_lock(&pagecache_lock);
+#ifdef TEST_FASTSYNC /* test: scan everything for missed pages */
+ drty_ok=0;
+ curr = head->next;
+#else
+ /* start only at the first dirty pages */
+ if (!&inode->i_mapping->i_drtypages) {
+ /* no pages ! */
+ curr = head->next;
+ }
+ else curr = &inode->i_mapping->i_drtypages->list;
+#endif
+ while (curr != head) {
+ page = list_entry(curr, struct page, list);
+ curr = curr->next;
+#ifdef TEST_FASTSYNC /* drty pages are ok after i_drtypages */
+ if (page == inode->i_mapping->i_drtypages) drty_ok=1;
+#endif
+ if (clear_drty) ClearPagePDirty(page); /* clear the dirty bit ! */
+ if (!page->buffers)
+ continue;
+ if (page->index >= end)
+ continue;
+ if (page->index < start)
+ continue;
+
+ get_page(page);
+ spin_unlock(&pagecache_lock);
+ lock_page(page);
+
+ /* The buffers could have been free'd while we waited for the page lock */
+ if (page->buffers)
+ retval |= fn(page);
+
+ UnlockPage(page);
+ spin_lock(&pagecache_lock);
+ curr = page->list.next;
+ page_cache_release(page);
+ }
+ /* set pointer to end of list */
+ if (clear_drty) inode->i_mapping->i_drtypages=page;
+ spin_unlock(&pagecache_lock);
+
+ return retval;
+}
+
+/*
+ * Two-stage data sync: first start the IO, then go back and
+ * collect the information..
+ */
+int generic_buffer_fast_fdatasync(struct inode *inode, unsigned long start_idx, unsigned long end_idx)
+{
+ int retval;
+
+ retval = do_buffer_fast_fdatasync(inode, start_idx, end_idx, writeout_one_page);
+ retval |= do_buffer_fast_fdatasync(inode, start_idx, end_idx, waitfor_one_page);
+ return retval;
+}
+#endif
+
 static int do_buffer_fdatasync(struct inode *inode, unsigned long start, unsigned long end, int (*fn)(struct page *))
 {
         struct list_head *head, *curr;
@@ -1902,7 +2039,12 @@
         if (count) {
                 remove_suid(inode);
                 inode->i_ctime = inode->i_mtime = CURRENT_TIME;
+#ifdef CONFIG_FASTSYNC
+ mark_inode_time_dirty(inode); /* only ctime/mtime changed ->
+ fdatasync does not need to sync inode */
+#else
                 mark_inode_dirty(inode);
+#endif
         }
 
         while (count) {
@@ -1946,7 +2088,14 @@
                         pos += status;
                         buf += status;
                         if (pos > inode->i_size)
+#ifdef CONFIG_FASTSYNC /* file sizes changes -> fdatasync has to sync inode */
+ {
+ inode->i_size = pos;
+ mark_inode_dirty(inode);
+ }
+#else
                                 inode->i_size = pos;
+#endif
                 }
 unlock:
                 /* Mark it unlocked again and drop the page.. */
--- ./include/linux/fs.h Sun Feb 13 16:46:22 2000
+++ ../linuxelf-2.3/./include/linux/fs.h Sun Feb 13 16:54:12 2000
@@ -350,6 +350,14 @@
         struct address_space_operations *a_ops; /* methods */
         void *host; /* owner: inode, block_device */
         void *private; /* private data */
+#ifdef CONFIG_FASTSYNC
+ struct page *i_drtypages; /* pointer to first "dirty" page in
+ the list of pages
+ the FASTSYNC patch keeps all "dirty"
+ pages at the end of this list so that
+ fsync can skip all "non-dirty" pages
+ */
+#endif
 };
 
 struct block_device {
@@ -436,13 +444,31 @@
 #define I_DIRTY 1
 #define I_LOCK 2
 #define I_FREEING 4
+#ifdef CONFIG_FASTSYNC
+#define I_RDIRTY 8 /* I_RDIRTY: more than just ctime/mtime is changed ->
+ fdatasync has to sync the inode
+ */
+#endif
 
 extern void __mark_inode_dirty(struct inode *);
 static inline void mark_inode_dirty(struct inode *inode)
 {
+#ifdef CONFIG_FASTSYNC /* make sure I_RDIRTY is set */
+ if (!(inode->i_state & I_RDIRTY))
+#else
         if (!(inode->i_state & I_DIRTY))
+#endif
                 __mark_inode_dirty(inode);
 }
+
+#ifdef CONFIG_FASTSYNC /* only mtime/ctime changed */
+extern void __mark_inode_time_dirty(struct inode *);
+static inline void mark_inode_time_dirty(struct inode *inode)
+{
+ if (!(inode->i_state & I_DIRTY))
+ __mark_inode_time_dirty(inode);
+}
+#endif
 
 struct fown_struct {
         int pid; /* pid or -pgrp where SIGIO should be sent */
--- ./include/linux/mm.h Sun Feb 13 16:48:24 2000
+++ ../linuxelf-2.3/./include/linux/mm.h Sun Feb 13 16:56:04 2000
@@ -163,6 +163,9 @@
 #define PG_skip 10
 #define PG_swap_entry 11
 #define PG_highmem 12
+#ifdef CONFIG_FASTSYNC
+#define PG_pdirty 13 /* page possibly dirty (after i_drtypages) */
+#endif
                                 /* bits 21-30 unused */
 #define PG_reserved 31
 
@@ -201,6 +204,12 @@
 #define PageHighMem(page) test_bit(PG_highmem, &(page)->flags)
 #else
 #define PageHighMem(page) 0 /* needed to optimize away at compile time */
+#endif
+
+#ifdef CONFIG_FASTSYNC
+#define PagePDirty(page) test_bit(PG_pdirty, &(page)->flags)
+#define SetPagePDirty(page) set_bit(PG_pdirty, &(page)->flags)
+#define ClearPagePDirty(page) clear_bit(PG_pdirty, &(page)->flags)
 #endif
 
 #define SetPageReserved(page) set_bit(PG_reserved, &(page)->flags)
--- ./include/linux/pagemap.h Sun Feb 13 16:48:24 2000
+++ ../linuxelf-2.3/./include/linux/pagemap.h Sun Feb 13 16:56:04 2000
@@ -93,6 +93,10 @@
         if (!mapping->nrpages++) {
                 if (!list_empty(head))
                         BUG();
+#ifdef CONFIG_FASTSYNC
+ // empty pagelist -> i_drtypages has to point to the first page
+ mapping->i_drtypages=page;
+#endif
         } else {
                 if (list_empty(head))
                         BUG();
@@ -101,11 +105,22 @@
         page->mapping = mapping;
 }
 
+#ifdef CONFIG_FASTSYNC
+void change_dirty_pointer(struct address_space *mapping, struct page *page);
+#endif
+
 extern inline void remove_page_from_inode_queue(struct page * page)
 {
         struct address_space * mapping = page->mapping;
 
         mapping->nrpages--;
+
+#ifdef CONFIG_FASTSYNC
+ if (mapping->i_drtypages == page) { /* have to change i_drtypages */
+ change_dirty_pointer(mapping, page);
+ }
+#endif
+
         list_del(&page->list);
 }
 

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/



This archive was generated by hypermail 2b29 : Tue Feb 15 2000 - 21:00:24 EST