[PATCH 3/6] writeback: remove pages_skipped accounting in __block_write_full_page()

From: Fengguang Wu
Date: Sun Aug 12 2007 - 05:23:03 EST


Miklos Szeredi <miklos@xxxxxxxxxx> and me identified a writeback bug:

> The following strange behavior can be observed:
>
> 1. large file is written
> 2. after 30 seconds, nr_dirty goes down by 1024
> 3. then for some time (< 30 sec) nothing happens (disk idle)
> 4. then nr_dirty again goes down by 1024
> 5. repeat from 3. until whole file is written
>
> So basically a 4Mbyte chunk of the file is written every 30 seconds.
> I'm quite sure this is not the intended behavior.

It can be produced by the following test scheme:

# cat bin/test-writeback.sh
grep nr_dirty /proc/vmstat
echo 1 > /proc/sys/fs/inode_debug
dd if=/dev/zero of=/var/x bs=1K count=204800&
while true; do grep nr_dirty /proc/vmstat; sleep 1; done

# bin/test-writeback.sh
nr_dirty 19207
nr_dirty 19207
nr_dirty 30924
204800+0 records in
204800+0 records out
209715200 bytes (210 MB) copied, 1.58363 seconds, 132 MB/s
nr_dirty 47150
nr_dirty 47141
nr_dirty 47142
nr_dirty 47142
nr_dirty 47142
nr_dirty 47142
nr_dirty 47205
nr_dirty 47214
nr_dirty 47214
nr_dirty 47214
nr_dirty 47214
nr_dirty 47214
nr_dirty 47215
nr_dirty 47216
nr_dirty 47216
nr_dirty 47216
nr_dirty 47154
nr_dirty 47143
nr_dirty 47143
nr_dirty 47143
nr_dirty 47143
nr_dirty 47143
nr_dirty 47142
nr_dirty 47142
nr_dirty 47142
nr_dirty 47142
nr_dirty 47134
nr_dirty 47134
nr_dirty 47135
nr_dirty 47135
nr_dirty 47135
nr_dirty 46097 <== -1038
nr_dirty 46098
nr_dirty 46098
nr_dirty 46098
[...]
nr_dirty 46091
nr_dirty 46092
nr_dirty 46092
nr_dirty 45069 <== -1023
nr_dirty 45056
nr_dirty 45056
nr_dirty 45056
[...]
nr_dirty 37822
nr_dirty 36799 <== -1023
[...]
nr_dirty 36781
nr_dirty 35758 <== -1023
[...]
nr_dirty 34708
nr_dirty 33672 <== -1024
[...]
nr_dirty 33692
nr_dirty 32669 <== -1023


% ls -li /var/x
847824 -rw-r--r-- 1 root root 200M 2007-08-12 04:12 /var/x

% dmesg|grep 847824 # generated by a debug printk
[ 529.263184] redirtied inode 847824 line 548
[ 564.250872] redirtied inode 847824 line 548
[ 594.272797] redirtied inode 847824 line 548
[ 629.231330] redirtied inode 847824 line 548
[ 659.224674] redirtied inode 847824 line 548
[ 689.219890] redirtied inode 847824 line 548
[ 724.226655] redirtied inode 847824 line 548
[ 759.198568] redirtied inode 847824 line 548

# line 548 in fs/fs-writeback.c:
543 if (wbc->pages_skipped != pages_skipped) {
544 /*
545 * writeback is not making progress due to locked
546 * buffers. Skip this inode for now.
547 */
548 redirty_tail(inode);
549 }

More debug efforts show that __block_write_full_page()
never has the chance to call submit_bh() for that big dirty file:
the buffer head is *clean*. So basicly no page io is issued by
__block_write_full_page(), hence pages_skipped goes up.

This patch fixes this bug. I'm not quite sure about it.
But at least the comment in generic_sync_sb_inodes():

544 /*
545 * writeback is not making progress due to locked
546 * buffers. Skip this inode for now.
547 */

and the comment in __block_write_full_page():

1713 /*
1714 * The page was marked dirty, but the buffers were
1715 * clean. Someone wrote them back by hand with
1716 * ll_rw_block/submit_bh. A rare case.
1717 */

do not quite agree with each other. The page writeback is skipped not because
of 'locked buffer', but 'clean buffer'.

This is the new behavior after the patch:

# bin/test-writeback.sh
nr_dirty 60
847824 /var/x
nr_dirty 60
nr_dirty 31139
204800+0 records in
204800+0 records out
209715200 bytes (210 MB) copied, 1.55338 seconds, 135 MB/s
nr_dirty 47137
nr_dirty 46147
nr_dirty 46147
nr_dirty 46147
nr_dirty 46148
nr_dirty 46148
nr_dirty 46148
nr_dirty 46148
nr_dirty 46193
nr_dirty 46193
nr_dirty 46193
nr_dirty 46193
nr_dirty 46126
nr_dirty 46126
nr_dirty 46126
nr_dirty 46126
nr_dirty 46126
nr_dirty 46109
nr_dirty 46109
nr_dirty 46109
nr_dirty 46113
nr_dirty 46113
nr_dirty 46106
nr_dirty 46106
nr_dirty 46106
nr_dirty 46106
nr_dirty 46106
nr_dirty 46089
nr_dirty 46089
nr_dirty 46090
nr_dirty 46093
nr_dirty 46093
nr_dirty 15
nr_dirty 15
nr_dirty 15
nr_dirty 15

It is pretty numbers: wait 30s and write ALL:)

But another run is not so good:

# sh bin/test-writeback.sh
mount: proc already mounted
nr_dirty 223
nr_dirty 223
nr_dirty 23664
204800+0 records in
204800+0 records out
209715200 bytes (210 MB) copied, 1.51092 seconds, 139 MB/s
nr_dirty 47299
nr_dirty 47271
nr_dirty 47260
nr_dirty 47260
nr_dirty 47267
nr_dirty 47267
nr_dirty 47329
nr_dirty 47352
nr_dirty 47352
nr_dirty 47352
nr_dirty 47352
nr_dirty 47352
nr_dirty 47352
nr_dirty 47352
nr_dirty 47352
nr_dirty 47352
nr_dirty 47606
nr_dirty 47604
nr_dirty 47604
nr_dirty 47604
nr_dirty 47604
nr_dirty 47604
nr_dirty 47604
nr_dirty 47604
nr_dirty 47604
nr_dirty 47604
nr_dirty 47480
nr_dirty 47492
nr_dirty 47492
nr_dirty 47492
nr_dirty 47492
nr_dirty 46470
nr_dirty 46473
nr_dirty 46473
nr_dirty 46473
nr_dirty 46473
nr_dirty 45428
nr_dirty 45435
nr_dirty 45436
nr_dirty 45436
nr_dirty 45436
nr_dirty 257
nr_dirty 259
nr_dirty 259
nr_dirty 259
nr_dirty 259
nr_dirty 16
nr_dirty 16
nr_dirty 16
nr_dirty 16
nr_dirty 16

Basicly they are
- during the dd: ~16M
- after 30s: ~4M
- after 5s: ~4M
- after 5s: ~176M

The box has 2G memory.

Question 1:
How come the 5s delays? I run 4 tests in total, 2 of which have such 5s delays.

Question 2:
__block_write_full_page() is virtually doing nothing for the whole dirty file.
Isn't it abnormal? Who did the actual write back for us? The jounal? How to fix it?

Any suggestions? Thank you.

Cc: Ken Chen <kenchen@xxxxxxxxxx>
Cc: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
Signed-off-by: Fengguang Wu <wfg@xxxxxxxxxxxxxxxx>
---
fs/buffer.c | 1 -
1 file changed, 1 deletion(-)

--- linux-2.6.23-rc2-mm2.orig/fs/buffer.c
+++ linux-2.6.23-rc2-mm2/fs/buffer.c
@@ -1713,7 +1713,6 @@ done:
* The page and buffer_heads can be released at any time from
* here on.
*/
- wbc->pages_skipped++; /* We didn't write this page */
}
return err;


--
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/