Re: [External] Re: [PATCH v5] blk-throtl: Introduce sync and async queues for blk-throtl

From: hanjinke
Date: Thu Jan 12 2023 - 03:45:06 EST

On 2023/1/12 1:05 AM, Tejun Heo wrote:
On Thu, Jan 12, 2023 at 12:20:30AM +0800, Jinke Han wrote:
From: Jinke Han <hanjinke.666@xxxxxxxxxxxxx>

Now we don't distinguish sync write ios from normal buffered write ios
in blk-throtl. A bio tagged with REQ_SYNC always means its submitter
will wait for its completion soon after submitting it. So it's
reasonable for sync io to complete as soon as possible.

In our test, fio writes a 100g file in sequential 4k blocksize in
a container with a low bps limit configured (wbps=10M). More than 1200
ios were throttled in the blk-throtl queue and the average throttle
time of each io was 140s. At the same time, the operation of saving a
small file in vim was blocked for almost 140s. As an fsync is sent by
vim, the sync ios of the fsync are blocked by the huge amount of
buffered write ios ahead of them. This is also a priority inversion
problem within one cgroup. In database scenarios, things get really
bad with blk-throttle enabled as fsync is called very often.

This patch splits the bio queue into sync and async queues for
blk-throtl and gives sync write ios a much higher priority. The sync
queue only makes sense for write ios, as we treat all read ios as sync
ios. I think this is a nice response to the semantics of REQ_SYNC. Bios
with REQ_META and REQ_PRIO gain the same priority as they are important
to the fs. This may avoid some potential priority inversion problems.
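
The intended dispatch policy can be sketched roughly as below. This is
a hypothetical userspace model with illustrative names and an assumed
factor of 4; it is not the actual patch code.

    /*
     * Hypothetical model: one qnode holds two FIFO bio lists. Writes
     * are served from the SYNC list first, but after SYNC_FACTOR
     * consecutive SYNC dispatches one ASYNC bio is forced through so
     * async writeback cannot starve.
     */
    #include <stdbool.h>
    #include <stddef.h>

    #define SYNC_FACTOR 4                   /* assumed value */

    struct bio { struct bio *next; bool sync; };

    struct qnode {
            struct bio *q[2];               /* [0] = ASYNC, [1] = SYNC */
            struct bio *next_to_disp;       /* pinned next bio to keep the
                                             * peek stable across lock drops */
            unsigned int disp_sync_cnt;     /* consecutive SYNC dispatches */
    };

    static struct bio *fifo_pop(struct bio **head)
    {
            struct bio *b = *head;

            if (b)
                    *head = b->next;
            return b;
    }

    /* Peek the next write bio without losing it if the lock is dropped. */
    static struct bio *qnode_peek_write(struct qnode *qn)
    {
            int first = qn->disp_sync_cnt < SYNC_FACTOR ? 1 : 0;

            if (qn->next_to_disp)
                    return qn->next_to_disp;
            /* Prefer SYNC until the streak limit, then force one ASYNC;
             * fall back to the other list if the preferred one is empty. */
            qn->next_to_disp = fifo_pop(&qn->q[first]);
            if (!qn->next_to_disp)
                    qn->next_to_disp = fifo_pop(&qn->q[!first]);
            return qn->next_to_disp;
    }

    /* After a real dispatch: clear the pin and track the SYNC streak. */
    static void qnode_dispatched(struct qnode *qn, struct bio *bio)
    {
            qn->next_to_disp = NULL;
            qn->disp_sync_cnt = bio->sync ? qn->disp_sync_cnt + 1 : 0;
    }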

With this patch, doing the same test as above, the duration of the
fsync sent by vim drops to several hundred milliseconds.

Signed-off-by: Jinke Han <hanjinke.666@xxxxxxxxxxxxx>
Signed-off-by: Xin Yin <yinxin.x@xxxxxxxxxxxxx>

Acked-by: Tejun Heo <tj@xxxxxxxxxx>

with some nits below:

+/**
+ * throtl_qnode_bio_peek - peek a bio from a qn
+ * @qn: the qnode to peek from
+ *
+ * For read, always peek bio from the SYNC queue.
+ *
+ * For write, we always peek bio from next_to_disp. If it's NULL, a bio
^
first

+ * will be popped from SYNC or ASYNC queue to fill it. The next_to_disp
+ * is used to make sure that the peeked bio and the next popped bio are
^
previously

+ * always the same even in case that the spinlock of queue was released
+ * and re-holded.
^
re-grabbed / re-acquired
+ *
+ * Without the next_to_disp, consider the following situation:
^^^^^^^^^^^^^^^^^^^^^^^^^^
maybe drop this part and move the latter part to the end of the
previous para?

+ * Assumed that there are only bios queued in ASYNC queue and the SYNC
^
Assume

+ * queue is empty and all ASYNC bios are 1M in size and the bps limit is
+ * 1M/s. The throtl_slice is 100ms. The current slice is [jiffies1,
+ * jiffies1+100] and the bytes_disp[w] is 0.
+ *
+ * The disp_sync_cnt is 0 as it was set to 0 after each dispatch of an
+ * ASYNC bio. An ASYNC bio will be peeked and checked in tg_may_dispatch.
+ * Obviously, it can't be dispatched in the current slice and the wait
+ * time is 900ms. The slice will be extended to [jiffies1, jiffies1+1000]
+ * in tg_may_dispatch. The spinlock of the queue will be released after
+ * the dispatching process gives up. A 4k SYNC bio is queued in and the
+ * SYNC queue becomes non-empty. After 900ms, when it's time to dispatch
+ * the tg, the SYNC bio will be popped and dispatched as the disp_sync_cnt
+ * is 0 and the SYNC queue is non-empty. The slice will be extended to
^
Maybe combine the previous several sentences like:

The queue lock is released and a 4k SYNC bio gets queued during the 900ms
wait.

+ * [jiffies1, jiffies1+1100] in tg_may_dispatch. Then the slice will be
+ * trimmed to [jiffies1+1000, jiffies1+1100] after the SYNC bio is
+ * dispatched. Then the former 1M ASYNC bio will be peeked and checked
+ * again and still can't be dispatched because it is over the limit
+ * within the current slice. The same thing may happen DISPACH_SYNC_FACTOR
+ * times if there is always a SYNC bio queued in the SYNC queue while
+ * the ASYNC bio is waiting. This means that in nearly 5s, we have
+ * dispatched four 4k SYNC bios and one 1M ASYNC bio. It is hard to fill
+ * up the bandwidth considering that the bps limit is 1M/s.

Similarly, I think the information can be conveyed in a more compact form.

Thanks.

The comment will be further adjusted based on your suggestions and
v6 with your Acked-by will be sent.
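
As a quick sanity check of the numbers in that example: dispatching
four 4k SYNC bios plus one 1M ASYNC bio over nearly 5s is roughly
(4 * 4k + 1M) / 5s, about 0.2M/s of actual throughput against the
1M/s limit, which is why pinning the peeked bio in next_to_disp
matters for actually reaching the configured bandwidth.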

Thanks
Jinke