Re: [PATCH 2/3] dio: scale unaligned IO tracking via multiple lists

From: Dave Chinner
Date: Tue Nov 09 2010 - 18:06:41 EST

Next message: Minchan Kim: "Re: [PATCH 2/6] memcg: pass mem_cgroup to mem_cgroup_dirty_info()"
Previous message: Andrew Morton: "Re: [PATCH 3/5] writeback: stop background/kupdate works from livelocking other works"
In reply to: Jeff Moyer: "Re: [PATCH 2/3] dio: scale unaligned IO tracking via multiple lists"
Next in thread: Jeff Moyer: "Re: [PATCH 2/3] dio: scale unaligned IO tracking via multiple lists"

On Tue, Nov 09, 2010 at 04:04:41PM -0500, Jeff Moyer wrote:
> Dave Chinner <david@xxxxxxxxxxxxx> writes:
>
> > On Mon, Nov 08, 2010 at 10:36:06AM -0500, Jeff Moyer wrote:
> >> Dave Chinner <david@xxxxxxxxxxxxx> writes:
> >>
> >> > From: Dave Chinner <dchinner@xxxxxxxxxx>
> >> >
> >> > To avoid concerns that a single list and lock tracking the unaligned
> >> > IOs will not scale appropriately, create multiple lists and locks
> >> > and choose them by hashing the unaligned block being zeroed.
> >> >
> >> > Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
> >> > ---
> >> > fs/direct-io.c | 49 ++++++++++++++++++++++++++++++++++++-------------
> >> > 1 files changed, 36 insertions(+), 13 deletions(-)
> >> >
> >> > diff --git a/fs/direct-io.c b/fs/direct-io.c
> >> > index 1a69efd..353ac52 100644
> >> > --- a/fs/direct-io.c
> >> > +++ b/fs/direct-io.c
> >> > @@ -152,8 +152,28 @@ struct dio_zero_block {
> >> > atomic_t ref; /* reference count */
> >> > };
> >> >
> >> > -static DEFINE_SPINLOCK(dio_zero_block_lock);
> >> > -static LIST_HEAD(dio_zero_block_list);
> >> > +#define DIO_ZERO_BLOCK_NR 37LL
> >>
> >> I'm always curious to know how these numbers are derived. Why 37?
> >
> > It's a prime number large enough to give enough lists to minimise
> > contention whilst providing decent distribution for 8 byte aligned
> > addresses with low overhead. XFS uses the same sort of waitqueue
> > hashing for global IO completion wait queues used by truncation
> > and inode eviction (see xfs_ioend_wait()).
> >
> > Seemed reasonable (and simple!) just to copy that design pattern
> > for another global IO completion wait queue....
>
> OK. I just had our performance team record some statistics for me on an
> unmodified kernel during an OLTP-type workload. I've attached the
> systemtap script that I had them run. I wanted to see just how common
> the sub-page-block zeroing was, and I was frightened to find that, in a
> 10 minute period, over 1.2 million calls were recorded. If we're
> lucky, my script is buggy. Please give it a look-see.

Well, it's just counting how many blocks are candidates for zeroing
on entry to dio_zero_block(), i.e. the function gets called on every
newly allocated block at the start of an IO. Your result only
implies that there were 1.2 million IOs requiring allocation in ten
minutes, because the next check in dio_zero_block():

	dio_blocks_per_fs_block = 1 << dio->blkfactor;
	this_chunk_blocks = dio->block_in_file & (dio_blocks_per_fs_block - 1);

	if (!this_chunk_blocks)
		return;

determines whether the IO is unaligned and zeroing is really
necessary. Your script needs to take this into account, not just
count the number of times the function is called with a new buffer.
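
To make the condition concrete, here is a minimal userspace sketch of
that check. It is not the kernel code: the helper name
dio_needs_zeroing() is hypothetical, and it assumes blkfactor is
log2 of the number of dio blocks per filesystem block and
block_in_file is the starting offset in dio block units. It only
models the early return quoted above, nothing else in
dio_zero_block().

```c
#include <stdbool.h>

/*
 * Hypothetical userspace model of the early-return check in
 * dio_zero_block(). Returns true only when the starting block is not
 * aligned to a filesystem block boundary, i.e. when zeroing of the
 * rest of the filesystem block would actually be considered.
 *
 * Assumptions (not taken from the mail): blkfactor is log2(dio blocks
 * per filesystem block) and block_in_file is in dio block units.
 */
static bool dio_needs_zeroing(unsigned long long block_in_file,
			      unsigned int blkfactor)
{
	unsigned long long dio_blocks_per_fs_block = 1ULL << blkfactor;
	unsigned long long this_chunk_blocks =
		block_in_file & (dio_blocks_per_fs_block - 1);

	/* Zero offset within the filesystem block: aligned, nothing to do. */
	return this_chunk_blocks != 0;
}
```

With 512 byte dio blocks on a 4k filesystem block (blkfactor of 3),
dio_needs_zeroing(16, 3) is false and dio_needs_zeroing(17, 3) is
true. A fixed script would count only the second kind of call.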

> I'm all ears for next steps. We can check to see how deep the hash
> chains get. We could also ask the folks at Intel to run this through
> their database testing rig to get a quantification of the overhead.
>
> What do you think?

Let's run a fixed script first - if databases really are doing that
much unaligned sub-block IO, then they need to be fixed as a matter
of major priority because they are doing far more IO than they need
to....
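
On the hash chain question, a rough userspace sketch of why a prime
list count helps with 8 byte aligned keys follows. It assumes a list
is picked with a plain modulo of the key by the number of lists; the
hunk quoted above does not show the hash itself, so that is an
assumption, and buckets_used() is a hypothetical helper, not
something from the patch.

```c
#include <stddef.h>

/*
 * Count how many distinct buckets are hit when 8 byte aligned keys
 * (0, 8, 16, ...) are hashed with "key % nr_buckets".
 *
 * Assumption: modulo hashing, as a stand-in for whatever the patch
 * really does. Supports at most 64 buckets; returns 0 otherwise.
 */
static unsigned int buckets_used(unsigned long long nr_buckets)
{
	unsigned char seen[64] = { 0 };
	unsigned int used = 0;
	unsigned long long key;

	if (nr_buckets == 0 || nr_buckets > 64)
		return 0;

	/* 1024 keys, each a multiple of 8. */
	for (key = 0; key < 8 * 1024; key += 8) {
		size_t bucket = (size_t)(key % nr_buckets);

		if (!seen[bucket]) {
			seen[bucket] = 1;
			used++;
		}
	}
	return used;
}
```

Under that assumption buckets_used(37) reports all 37 buckets in use,
while a power-of-two count such as buckets_used(32) reports only 4,
because every multiple of 8 lands on bucket 0, 8, 16 or 24.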

Cheers,

Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx
