Re: [RFC PATCH] Bio Throttling support for block IO controller

From: Paul E. McKenney
Date: Fri Sep 03 2010 - 19:36:33 EST


On Thu, Sep 02, 2010 at 09:57:39PM -0400, Vivek Goyal wrote:
> On Thu, Sep 02, 2010 at 11:39:32AM -0700, Paul E. McKenney wrote:
> > On Wed, Sep 01, 2010 at 01:58:30PM -0400, Vivek Goyal wrote:
> > > Hi,
> > >
> > > Currently CFQ provides the weight based proportional division of bandwidth.
> > > People also have been looking at extending block IO controller to provide
> > > throttling/max bandwidth control.
> > >
> > > I have started to write the support for throttling in block layer on
> > > request queue so that it can be used both for higher level logical
> > > devices as well as leaf nodes. This patch is still work in progress but
> > > I wanted to post it for early feedback.
> > >
> > > Currently I have hooked into the __make_request() function to check
> > > which cgroup a bio belongs to and whether that group is exceeding its
> > > specified BW rate. If not, the submitting thread can continue to
> > > dispatch the bio as before; otherwise the bio is queued internally and
> > > dispatched later with the help of a worker thread.
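> > >
> > > In rough pseudo-C, the hook works like this (helper names here are
> > > illustrative, not the exact ones in the patch):
> > >
> > >         /* called from __make_request() for each incoming bio */
> > >         bool blk_throtl_bio(struct request_queue *q, struct bio *bio)
> > >         {
> > >                 struct throtl_grp *tg = throtl_lookup_tg(q->td, bio);
> > >
> > >                 if (tg_within_bps_limit(tg, bio)) {
> > >                         tg_charge_bytes(tg, bio); /* account dispatched bytes */
> > >                         return false;             /* caller dispatches bio as usual */
> > >                 }
> > >
> > >                 /*
> > >                  * Over the limit: queue the bio and let the worker
> > >                  * thread dispatch it once the group has credit again.
> > >                  */
> > >                 tg_queue_bio(tg, bio);
> > >                 throtl_schedule_delayed_work(q->td);
> > >                 return true;                      /* bio was consumed */
> > >         }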
> > >
> > > HOWTO
> > > =====
> > > - Mount blkio controller
> > > mount -t cgroup -o blkio none /cgroup/blkio
> > >
> > > - Specify a bandwidth rate on a particular device for the root group. The
> > > format for the policy is "<major>:<minor> <bytes_per_second>".
> > >
> > > echo "8:16 1048576" > /cgroup/blkio/blkio.read_bps_device
> > >
> > > The above will put a limit of 1MB/second on reads happening for the root
> > > group on the device having major/minor number 8:16.
> > >
> > > - Run dd to read a file and see if rate is throttled to 1MB/s or not.
> > >
> > > # dd if=/mnt/common/zerofile of=/dev/null bs=4K count=1024 iflag=direct
> > > 1024+0 records in
> > > 1024+0 records out
> > > 4194304 bytes (4.2 MB) copied, 4.0001 s, 1.0 MB/s
> > >
> > > Limits for writes can be put using blkio.write_bps_device file.
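> > >
> > > For example, to limit writes on the same device to 1MB/second:
> > >
> > > echo "8:16 1048576" > /cgroup/blkio/blkio.write_bps_device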
> > >
> > > Open Issues
> > > ===========
> > > - Do we need to provide additional queue congestion semantics? Since we
> > > are throttling and queuing bios at the request queue, we probably don't
> > > want a user space application to consume all available memory allocating
> > > bios and bombarding the request queue with them.
> > >
> > > - How to handle the current blkio cgroup stats files with two policies
> > > operating in the background. If for some reason both the throttling and
> > > proportional BW policies are operating on a request queue, then the
> > > stats will be very confusing.
> > >
> > > Maybe we can allow activating either the throttling or the proportional
> > > BW policy per request queue, and create a /sys tunable to list and
> > > choose between policies (something like choosing the IO scheduler). The
> > > only downside of this approach is that the user also needs to be aware
> > > of the storage hierarchy and activate the right policy at each
> > > node/request queue.
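> > >
> > > For example, such a tunable (purely hypothetical at this point, nothing
> > > like it exists in this patch) could look like the scheduler interface:
> > >
> > > # cat /sys/block/sdb/queue/blkio_policy
> > > [proportional] throttle
> > > # echo throttle > /sys/block/sdb/queue/blkio_policy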
> > >
> > > TODO
> > > ====
> > > - Lots of testing, bug fixes.
> > > - Provide support for enforcing limits in IOPS.
> > > - Extend the throttling support for dm devices also.
> > >
> > > Any feedback is welcome.
> > >
> > > Thanks
> > > Vivek
> > >
> > > o IO throttling support in block layer.
> > >
> > > Signed-off-by: Vivek Goyal <vgoyal@xxxxxxxxxx>
> > > ---
> > > block/Makefile            |    2
> > > block/blk-cgroup.c        |  282 +++++++++++--
> > > block/blk-cgroup.h        |   44 ++
> > > block/blk-core.c          |   28 +
> > > block/blk-throttle.c      |  928 ++++++++++++++++++++++++++++++++++++++++++++++
> > > block/blk.h               |    4
> > > block/cfq-iosched.c       |    4
> > > include/linux/blk_types.h |    3
> > > include/linux/blkdev.h    |   22 +
> > > 9 files changed, 1261 insertions(+), 56 deletions(-)
> > >
> >
> > [ . . . ]
> >
> > > +void blk_throtl_exit(struct request_queue *q)
> > > +{
> > > +        struct throtl_data *td = q->td;
> > > +        bool wait = false;
> > > +
> > > +        BUG_ON(!td);
> > > +
> > > +        throtl_shutdown_timer_wq(q);
> > > +
> > > +        spin_lock_irq(q->queue_lock);
> > > +        throtl_release_tgs(td);
> > > +        blkiocg_del_blkio_group(&td->root_tg.blkg);
> > > +
> > > +        /* If there are other groups */
> > > +        if (td->nr_undestroyed_grps >= 1)
> > > +                wait = true;
> > > +
> > > +        spin_unlock_irq(q->queue_lock);
> > > +
> > > +        /*
> > > +         * Wait for tg->blkg->key accessors to exit their grace periods.
> > > +         * Do this wait only if there are other undestroyed groups out
> > > +         * there (other than the root group). This can happen if the
> > > +         * cgroup deletion path claimed the responsibility of cleaning
> > > +         * up a group before the queue cleanup code got to the group.
> > > +         *
> > > +         * Do not call synchronize_rcu() unconditionally, as there are
> > > +         * drivers which create/delete request queues hundreds of times
> > > +         * during scan/boot, and synchronize_rcu() can take significant
> > > +         * time and slow down boot.
> > > +         */
> > > +        if (wait)
> > > +                synchronize_rcu();
> >
> > The RCU readers are presumably not accessing the structure referenced
> > by td? If they can access it, then they will be accessing freed memory
> > after the following function call!!!
>
> Hi Paul,
>
> Thanks for the review.
>
> As per my understanding, if wait == false then there should not be any
> RCU readers of tg->blkg->key (the key is basically struct throtl_data *td)
> out there, hence it should be safe to free "td" without calling
> synchronize_rcu() or call_rcu().
>
> Following are some details.
>
> - We instantiate some throtl_grp structures as IO happens in a cgroup, and
> these objects are put on a hash list (td->tg_list). These objects are
> also put on another per-cgroup list (blkcg->blkg_list, blk-cgroup.c).
>
> The root group is the only exception; it is not allocated dynamically but
> instead is statically embedded in the throtl_data structure
> (struct throtl_grp root_tg).
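>
> Roughly, the relevant parts of throtl_data look like this (field order
> and exact types reconstructed from the discussion, so treat this as a
> sketch):
>
>         struct throtl_data {
>                 struct hlist_head tg_list;        /* hash list of throtl_grps */
>                 struct throtl_grp root_tg;        /* statically embedded root group */
>                 struct request_queue *queue;      /* owning request queue */
>                 unsigned int nr_undestroyed_grps; /* groups not yet destroyed */
>         };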
>
> - There are two group deletion paths. One: if a cgroup is being deleted,
> we need to clean up the associated group. The other: if the device is
> going away, we need to clean up all groups, td, the request queue,
> etc.
>
> - The only user of the RCU-protected tg->blkg->key is the cgroup deletion
> path, and that path will access the key only if it got ownership of
> the group it wants to delete. Basically, a cgroup deletion event and
> the device going away can race to delete the same group.
>
> In this case, both paths will want to clean up the group, and some kind
> of arbitration is needed. Whichever path first takes blkcg->lock and
> manages to delete the group from blkcg->blkg_list takes responsibility
> for cleaning up the group (a sketch of that claim step follows).
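>
> A minimal sketch of the claim step (names are illustrative; the real
> code lives in blk-cgroup.c / blk-throttle.c):
>
>         /* Returns true if the caller won ownership of the group. */
>         static bool claim_blkio_group(struct blkio_cgroup *blkcg,
>                                       struct blkio_group *blkg)
>         {
>                 bool claimed = false;
>
>                 spin_lock_irq(&blkcg->lock);
>                 /* whichever path unhashes the group first wins */
>                 if (!hlist_unhashed(&blkg->blkcg_node)) {
>                         hlist_del_init(&blkg->blkcg_node);
>                         claimed = true;
>                 }
>                 spin_unlock_irq(&blkcg->lock);
>
>                 return claimed;
>         }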
>
> Now if there are no undestroyed groups (other than the root group, which
> the cgroup path will never try to destroy since the root cgroup cannot
> be deleted), that means the cgroup path will not try to free up any
> groups. That also means there will be no RCU readers of tg->blkg->key,
> and hence it should be safe to free "td" without synchronize_rcu()
> or call_rcu(). Am I missing something?

If I understand you correctly, RCU is used only for part of the data
structure, and if you are not freeing up an RCU-traversed portion of
the data structure, then there is no need to wait for a grace period.
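For example, as a schematic (not your actual structures): if RCU readers
only ever traverse one pointer out of a larger object, only that portion
needs to wait out a grace period before being freed:

	struct outer {
		struct rcu_part __rcu *rp;   /* readers traverse under rcu_read_lock() */
		struct other_part *op;       /* never reached by RCU readers */
	};

Freeing what ->op points to requires no grace period; freeing what ->rp
points to does.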

Thanx, Paul

> > If they can access it, I suggest using call_rcu() instead of
> > synchronize_rcu(). One way of doing this would be:
> >
> > if (!wait) {
> >         call_rcu(&td->rcu, throtl_td_deferred_free);
>
> If !wait, then as per my current understanding there are no RCU readers
> out there, and the above step should not be required. The reason I don't
> want to use call_rcu() is that although it would keep "td" around, the
> request queue (td->queue) would be gone, and the RCU reader path takes
> the request queue spin lock, so it would be trying to take a lock that
> has already been freed:
>
> throtl_unlink_blkio_group() {
>         /* RCU reader side: td->queue would point at a freed queue */
>         spin_lock_irqsave(td->queue->queue_lock, flags);
> }
>
>
> > } else {
> >         synchronize_rcu();
> >         throtl_td_free(td);
> > }
>
> This is the step my code is already doing. If wait == true, then there
> are RCU readers out there, and we wait for them to finish before freeing
> up td.
>
> Thanks
> Vivek