Re: [PATCH 8/8] blk-mq: drain I/O when all CPUs in a hctx are offline

From: Paul E. McKenney
Date: Fri May 29 2020 - 17:12:47 EST


On Fri, May 29, 2020 at 12:55:43PM -0700, Bart Van Assche wrote:
> On 2020-05-29 11:13, Paul E. McKenney wrote:
> > On Fri, May 29, 2020 at 11:53:15AM +0800, Ming Lei wrote:
> >> Another pair is in blk_mq_get_tag(), and we expect the following two
> >> memory operations to be ordered:
> >>
> >> 1) setting the bit in a successful test_and_set_bit_lock(), which is
> >> called from sbitmap_get()
> >>
> >> 2) test_bit(BLK_MQ_S_INACTIVE, &data->hctx->state)
> >>
> >> Do you think that the above two operations are ordered?
> >
> > Given that Bart has been through the code, I would like to hear his
> > thoughts, actually.
>
> Hi Paul,
>
> My understanding of the involved instructions is as follows (see also
> https://lore.kernel.org/linux-block/b98f055f-6f38-a47c-965d-b6bcf4f5563f@xxxxxxxxxx/T/#t
> for the entire e-mail thread):
> * blk_mq_hctx_notify_offline() sets the BLK_MQ_S_INACTIVE bit in
> hctx->state, calls smp_mb__after_atomic() and waits in a loop until all
> tags have been freed. Each tag is an integer number that has a 1:1
> correspondence with a block layer request structure. The code that
> iterates over block layer request tags relies on
> __sbitmap_for_each_set(). That function examines both the 'word' and
> 'cleared' members of struct sbitmap_word.
> * What blk_mq_hctx_notify_offline() waits for is freeing of tags by
> blk_mq_put_tag(). blk_mq_put_tag() frees a tag by setting a bit in
> sbitmap_word.cleared (see also sbitmap_deferred_clear_bit()).
> * Tag allocation by blk_mq_get_tag() relies on test_and_set_bit_lock().
> The actual allocation happens in sbitmap_get(), which sets a bit in
> sbitmap_word.word. blk_mq_get_tag() tests the BLK_MQ_S_INACTIVE bit
> after tag allocation has succeeded.
>
> What confuses me is that blk_mq_hctx_notify_offline() uses
> smp_mb__after_atomic() to enforce the order of memory accesses while
> blk_mq_get_tag() relies on the acquire semantics of
> test_and_set_bit_lock(). Usually ordering is enforced by combining two
> smp_mb() calls or by combining a store-release with a load-acquire.
>
> Does the Linux memory model provide the expected ordering guarantees
> when combining load-acquire with smp_mb__after_atomic() as used in patch
> 8/8 of this series?
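
Just to make sure we are talking about the same thing, here is my reading
of the pattern you describe above, reduced to a sketch. This is not the
actual blk-mq code, and the exact primitives may differ, so please correct
me if I have it wrong:

void offline_side(unsigned long *state)
{
        /* Mark the hctx inactive, then wait for in-flight tags to drain. */
        set_bit(BLK_MQ_S_INACTIVE, state);      /* non-value-returning atomic */
        smp_mb__after_atomic();
        /* ... loop until blk_mq_put_tag() has freed every tag ... */
}

bool alloc_side(unsigned long *tags, unsigned int tag, unsigned long *state)
{
        /* Allocate a tag, then check whether the hctx went inactive. */
        if (test_and_set_bit_lock(tag, tags))   /* acquire on success */
                return false;                   /* tag already in use */
        if (test_bit(BLK_MQ_S_INACTIVE, state)) {
                clear_bit_unlock(tag, tags);    /* give the tag back */
                return false;
        }
        return true;
}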

Strictly speaking, smp_mb__after_atomic() works only in combination
with a non-value-returning atomic operation. Let's look at a (silly)
example where smp_mb__after_atomic() would not help in conjunction
with smp_store_release():

void thread1(void)
{
        smp_store_release(&x, 1);
        smp_mb__after_atomic();
        r1 = READ_ONCE(y);
}

void thread2(void)
{
        smp_store_release(&y, 1);
        smp_mb__after_atomic();
        r2 = READ_ONCE(x);
}

Even on x86 (or perhaps especially on x86) it is quite possible that
execution could end with r1 == r2 == 0, because on x86 there is no
ordering whatsoever from smp_mb__after_atomic(): x86's atomic
read-modify-write instructions are already fully ordered, so that macro
need not (and does not) emit any barrier instruction there. Each CPU is
therefore well within its rights to reorder its thread's store with the
later load. Yes, even x86.
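
If ordering really were needed in this first example, one way to get it
(just a sketch, and not necessarily the right tool for your actual use
case) would be a full smp_mb() between the store and the load on each
side, which does forbid the r1 == r2 == 0 outcome:

void thread1(void)
{
        smp_store_release(&x, 1);
        smp_mb();       /* full barrier orders the store before the later load */
        r1 = READ_ONCE(y);
}

void thread2(void)
{
        smp_store_release(&y, 1);
        smp_mb();
        r2 = READ_ONCE(x);
}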

On the other hand, suppose that the stores are non-value-returning
atomics:

void thread1(void)
{
        atomic_inc(&x);
        smp_mb__after_atomic();
        r1 = READ_ONCE(y);
}

void thread2(void)
{
        atomic_inc(&y);
        smp_mb__after_atomic();
        r2 = READ_ONCE(x);
}

In this case, for all architectures, there would be the equivalent
of an smp_mb() full barrier associated with either the atomic_inc()
or the smp_mb__after_atomic(), which would rule out the case where
execution ends with r1 == r2 == 0.
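
The same holds when the non-value-returning atomic is a bitop, which I
gather is closer to what blk_mq_hctx_notify_offline() actually does with
BLK_MQ_S_INACTIVE (again a sketch rather than the actual code, with x and
y now unsigned long):

void thread1(void)
{
        set_bit(0, &x);                 /* non-value-returning atomic bitop */
        smp_mb__after_atomic();         /* gives the equivalent of smp_mb() here */
        r1 = READ_ONCE(y);
}

void thread2(void)
{
        set_bit(0, &y);
        smp_mb__after_atomic();
        r2 = READ_ONCE(x);
}

So on that side of your pairing you do get the equivalent of a full
barrier. Whether that pairs as you need with the acquire from
test_and_set_bit_lock() plus the later test_bit() on the allocation side
is the part that deserves a careful look.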

Does that help?

Thanx, Paul