PATCH - ll_rw_blk + kio + raid - no more WRITERAW

From: Neil Brown (neilb@cse.unsw.edu.au)
Date: Mon Aug 14 2000 - 01:31:56 EST


Hi,
 I have been hunting through ll_rw_block and related code trying to
 understand exactly what is going on and, in particular, what the
 important difference between calling ll_rw_block and
 generic_make_request is.

 The following patch tidies some things up as a result of my
 explorations, and in particular removes the need for WRITERAW.

 Explanation of the patch:

  There seem to be two sorts of calls to the code in ll_rw_blk.c for
  performing block io.

  The first, and most common, is doing io on block devices (as in
  /dev/hda1) and in filesystems. These calls use buffers that are in
  the buffer_cache or the page_cache. WRITE calls are typically only
  used when a "sync" is required, otherwise mark_buffer_dirty is used.
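
  For example, a cache-buffer caller looks roughly like this (a
  simplified sketch, not lifted from any particular filesystem; dev,
  block, blocksize, data and need_sync are just placeholders):

        struct buffer_head *bh = getblk(dev, block, blocksize);

        memcpy(bh->b_data, data, blocksize);
        mark_buffer_uptodate(bh, 1);
        mark_buffer_dirty(bh);          /* normal case: written back later */

        if (need_sync) {                /* only a "sync" pushes it out now */
                ll_rw_block(WRITE, 1, &bh);
                wait_on_buffer(bh);
        }
        brelse(bh);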

  The second sort of call is made by raid1 and raid5 calling
  generic_make_request directly, and by lvm-snap and drivers/char/raw
  calling brw_kiovec. These IO requests are *NOT* out of the
  buffer_cache or page_cache, but are out of special purpose buffers
  created by the driver, used, and then discarded.
  Writes for each of these use WRITERAW to avoid the refile_buffer
  done by __make_request. As these are very similar sorts of IO, the
  fact that the brw_kiovec ones go through ll_rw_block and the raid
  ones go straight to generic_make_request is a bit untidy.
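
  In code terms, the raid drivers currently do something like this for
  each buffer they have built themselves (paraphrasing raid5; conf, sh
  and i come from the driver's own context):

        atomic_inc(&sh->nr_pending);
        bh->b_dev = bh->b_rdev = conf->disks[i].dev;
        generic_make_request(WRITERAW, bh);     /* never via ll_rw_block */

  while lvm-snap and drivers/char/raw go through the kiovec layer with
  something like (again simplified; iobuf, dev, blocks and blocksize
  are the caller's own):

        err = brw_kiovec(WRITE, 1, &iobuf, dev, blocks, blocksize);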

  After thinking this through, it appears to me that the best approach
  is to treat all IO that goes to generic_make_request as "raw":
  generic_make_request, and anything below it, should not fiddle with
  BH_{Dirty,Lock,Uptodate} (or probably anything else), and should
  definitely not refile the buffer.
  Any IO requests that are on cache buffers and want the refiling etc
  should use ll_rw_block.

  To achieve this I:
   - lifted the tests on BH_Uptodate and BH_Dirty, and the
     refile_buffer out of __make_request and put them in ll_rw_block.
   - also lifted kstat.pgpg{in,out}++ out into ll_rw_block, which
     feels correct.
   - changed brw_kiovec to call generic_make_request on each buffer
     rather than gathering them and passing them to ll_rw_block. They
     are still gathered for waiting (see the condensed sketch after
     this list).
   - changed raid1/5 to use WRITE instead of WRITERAW as the things
     that WRITERAW protects against are no longer under
     generic_make_request.
   - ripped the "allclean" tests out of raid5 as they cannot be done
     anymore. That shouldn't be an issue though, as they were for
     optimising performance when parity rebuild went through the
     buffer cache, and it doesn't any more.
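
  For reference, the submission loop in brw_kiovec ends up shaped
  roughly like this after the patch (a condensed paraphrase of the
  diff below, with the setup of each buffer_head trimmed):

        bh[bhind++] = tmp;              /* still gathered, but only for waiting */
        atomic_inc(&iobuf->io_count);
        generic_make_request(rw, tmp);  /* each buffer is submitted right away */

        if (bhind >= KIO_MAX_SECTORS) {
                err = wait_kio(rw, bhind, bh, size);    /* wait + clean up only */
                if (err >= 0)
                        transferred += err;
                else
                        goto finished;
                bhind = 0;
        }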

  In doing this, I noticed that IO errors aren't propagated properly
  from wait_kio (previously do_kio) up through brw_kiovec.
  If wait_kio needs to be called more than once (because more than
  KIO_MAX_SECTORS buffers are used), and if there is an IO error in
  the first set of buffers, then the value returned by brw_kiovec will
  suggest that the failed block had actually been transferred correctly.
  I don't really want to dive in and figure out the best way to fix
  this, and I don't know who the "maintainer" is... any takers?
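
  To make the problem concrete: as far as I can see, wait_kio returns
  the number of bytes that completed before the first failure in its
  own batch, and never a negative value, so across two batches the
  accounting in brw_kiovec amounts to roughly this:

        transferred = 0;

        err = wait_kio(rw, KIO_MAX_SECTORS, bh, size);  /* batch 1: one bh fails */
        if (err >= 0)
                transferred += err;     /* partial count, but still >= 0 */

        err = wait_kio(rw, bhind, bh, size);            /* batch 2: all succeed */
        if (err >= 0)
                transferred += err;     /* full count added on top */

        return transferred;             /* the caller reads this as "the first
                                         * transferred bytes are good", even
                                         * though there is a hole where the
                                         * failed buffer from batch 1 was */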

The code compiles fine, and raid5 has no problems running. I am not
in a position to exercise the code paths in brw_kiovec, but the
changes are sufficiently simple that I feel convinced of the
correctness. Anyone want to second the motion?

NeilBrown

--- ./fs/buffer.c 2000/08/13 23:54:00 1.1
+++ ./fs/buffer.c 2000/08/13 23:58:02
@@ -1781,15 +1781,12 @@
  * for them to complete. Clean up the buffer_heads afterwards.
  */
 
-static int do_kio(int rw, int nr, struct buffer_head *bh[], int size)
+static int wait_kio(int rw, int nr, struct buffer_head *bh[], int size)
 {
         int iosize;
         int i;
         struct buffer_head *tmp;
 
-        if (rw == WRITE)
-                rw = WRITERAW;
-        ll_rw_block(rw, nr, bh);
 
         iosize = 0;
         spin_lock(&unused_list_lock);
@@ -1905,12 +1902,13 @@
                                 offset += size;
 
                                 atomic_inc(&iobuf->io_count);
-
+
+                                generic_make_request(rw, tmp);
                                 /*
-                                 * Start the IO if we have got too much
+                                 * Wait for IO if we have got too much
                                  */
                                 if (bhind >= KIO_MAX_SECTORS) {
-                                        err = do_kio(rw, bhind, bh, size);
+                                        err = wait_kio(rw, bhind, bh, size);
                                         if (err >= 0)
                                                 transferred += err;
                                         else
@@ -1928,7 +1926,7 @@
 
         /* Is there any IO still left to submit? */
         if (bhind) {
-                err = do_kio(rw, bhind, bh, size);
+                err = wait_kio(rw, bhind, bh, size);
                 if (err >= 0)
                         transferred += err;
                 else
--- ./include/linux/fs.h 2000/08/13 23:58:50 1.1
+++ ./include/linux/fs.h 2000/08/14 04:35:02
@@ -71,8 +71,6 @@
 #define READA 2 /* read-ahead - don't block if no resources */
 #define SPECIAL 4 /* For non-blockdevice requests in request queue */
 
-#define WRITERAW 5 /* raw write - don't play with buffer lists */
-
 #define SEL_IN 1
 #define SEL_OUT 2
 #define SEL_EX 4
--- ./drivers/block/ll_rw_blk.c 2000/08/13 23:40:48 1.2
+++ ./drivers/block/ll_rw_blk.c 2000/08/14 05:04:56
@@ -696,19 +696,7 @@
                         rw_ahead = 1;
                         rw = READ; /* drop into READ */
                 case READ:
-                        if (buffer_uptodate(bh)) /* Hmmph! Already have it */
-                                goto end_io;
-                        kstat.pgpgin++;
-                        break;
-                case WRITERAW:
-                        rw = WRITE;
-                        goto do_write; /* Skip the buffer refile */
                 case WRITE:
-                        if (!test_and_clear_bit(BH_Dirty, &bh->b_state))
-                                goto end_io; /* Hmmph! Nothing to write */
-                        refile_buffer(bh);
-                do_write:
-                        kstat.pgpgout++;
                         break;
                 default:
                         BUG();
@@ -940,6 +928,30 @@
                         continue;
 
                 set_bit(BH_Req, &bh->b_state);
+
+                switch(rw) {
+                case WRITE:
+                        if (!atomic_set_buffer_clean(bh))
+                                /* Hmmph! Nothing to write */
+                                goto end_io;
+                        __mark_buffer_clean(bh);
+                        kstat.pgpgout++;
+                        break;
+
+                case READA:
+                case READ:
+                        if (buffer_uptodate(bh))
+                                /* Hmmph! Already have it */
+                                goto end_io;
+                        kstat.pgpgin++;
+                        break;
+                default:
+                        BUG();
+        end_io:
+                        bh->b_end_io(bh, test_bit(BH_Uptodate, &bh->b_state));
+                        continue;
+
+                }
 
                 /*
                  * First step, 'identity mapping' - RAID or LVM might
--- ./drivers/block/raid1.c 2000/08/13 23:59:08 1.1
+++ ./drivers/block/raid1.c 2000/08/14 00:01:23
@@ -564,28 +564,6 @@
         if (rw == READA)
                 rw = READ;
 
-        if (rw == WRITE) {
-                rw = WRITERAW;
-                /*
-                 * we first clean the bh, then we start the IO, then
-                 * when the IO has finished, we end_io the bh and
-                 * mark it uptodate. This way we do not miss the
-                 * case when the bh got dirty again during the IO.
-                 *
-                 * We do an important optimization here - if the
-                 * buffer was not dirty and we are during resync or
-                 * reconstruction, then we can skip writing it back
-                 * to the master disk! (we still have to write it
-                 * back to the other disks, because we are not sync
-                 * yet.)
-                 */
-                if (atomic_set_buffer_clean(bh))
-                        __mark_buffer_clean(bh);
-                else {
-                        bh->b_end_io(bh, test_bit(BH_Uptodate, &bh->b_state));
-                        return 0;
-                }
-        }
         r1_bh = raid1_alloc_r1bh (conf);
 
         spin_lock_irq(&conf->segment_lock);
--- ./drivers/block/raid5.c 2000/08/14 00:06:43 1.1
+++ ./drivers/block/raid5.c 2000/08/14 00:11:43
@@ -863,8 +863,7 @@
                 if (!sh->bh_copy[i])
                         sh->bh_copy[i] = raid5_alloc_buffer(sh, sh->size);
                 raid5_build_block(sh, sh->bh_copy[i], i);
-                if (atomic_set_buffer_clean(sh->bh_new[i]))
-                        atomic_set_buffer_dirty(sh->bh_copy[i]);
+                atomic_set_buffer_dirty(sh->bh_copy[i]);
                 memcpy(sh->bh_copy[i]->b_data, sh->bh_new[i]->b_data, sh->size);
         }
         if (sh->bh_copy[pd_idx] == NULL) {
@@ -995,28 +994,12 @@
 }
 
 
-static int is_stripe_allclean(struct stripe_head *sh, int disks)
-{
-        int i;
-
-        return 0;
-        for (i = 0; i < disks; i++) {
-                if (sh->bh_new[i])
-                        if (test_bit(BH_Dirty, &sh->bh_new[i]))
-                                return 0;
-                if (sh->bh_old[i])
-                        if (test_bit(BH_Dirty, &sh->bh_old[i]))
-                                return 0;
-        }
-        return 1;
-}
-
 static void handle_stripe_write (mddev_t *mddev , raid5_conf_t *conf,
         struct stripe_head *sh, int nr_write, int * operational, int disks,
         int parity, int parity_failed, int nr_cache, int nr_cache_other,
         int nr_failed_other, int nr_cache_overwrite, int nr_failed_overwrite)
 {
-        int i, allclean;
+        int i;
         unsigned int block;
         struct buffer_head *bh;
         int method1 = INT_MAX, method2 = INT_MAX;
@@ -1068,7 +1051,6 @@
         PRINTK("handle_stripe(), sector %lu, nr_write %d, method1 %d, method2 %d\n", sh->sector, nr_write, method1, method2);
 
         if (!method1 || !method2) {
-                allclean = is_stripe_allclean(sh, disks);
                 sh->phase = PHASE_WRITE;
                 compute_parity(sh, method1 <= method2 ? RECONSTRUCT_WRITE : READ_MODIFY_WRITE);
 
@@ -1087,24 +1069,11 @@
                                         PRINTK("writing spare %d\n", i);
                                         atomic_inc(&sh->nr_pending);
                                         bh->b_dev = bh->b_rdev = conf->spare->dev;
-                                        generic_make_request(WRITERAW, bh);
+                                        generic_make_request(WRITE, bh);
                                 } else {
-#if 0
                                         atomic_inc(&sh->nr_pending);
                                         bh->b_dev = bh->b_rdev = conf->disks[i].dev;
-                                        generic_make_request(WRITERAW, bh);
-#else
-                                        if (!allclean || (i==sh->pd_idx)) {
-                                                PRINTK("writing dirty %d\n", i);
-                                                atomic_inc(&sh->nr_pending);
-                                                bh->b_dev = bh->b_rdev = conf->disks[i].dev;
-                                                generic_make_request(WRITERAW, bh);
-                                        } else {
-                                                PRINTK("not writing clean %d\n", i);
-                                                raid5_end_request(bh, 1);
-                                                sh->new[i] = 0;
-                                        }
-#endif
+                                        generic_make_request(WRITE, bh);
                                 }
                                 atomic_dec(&bh->b_count);
                         }
@@ -1282,7 +1251,7 @@
                                 atomic_inc(&sh->nr_pending);
                                 lock_get_bh(bh);
                                 bh->b_dev = bh->b_rdev = conf->spare->dev;
-                                generic_make_request(WRITERAW, bh);
+                                generic_make_request(WRITE, bh);
                                 md_sync_acct(bh->b_rdev, bh->b_size/512);
                                 atomic_dec(&bh->b_count);
                 PRINTK("handle_stripe_sync() %lu, phase WRITE, pending %d buffers\n", sh->sector, md_atomic_read(&sh->nr_pending));
@@ -1308,7 +1277,7 @@
                 lock_get_bh(bh);
                 atomic_inc(&sh->nr_pending);
                 bh->b_dev = bh->b_rdev = conf->disks[pd_idx].dev;
-                generic_make_request(WRITERAW, bh);
+                generic_make_request(WRITE, bh);
                 md_sync_acct(bh->b_rdev, bh->b_size/512);
                 atomic_dec(&bh->b_count);
                 PRINTK("handle_stripe_sync() %lu phase WRITE, pending %d buffers\n",
