Re: [PATCH v2] EXP rcu: Move expedited grace period (GP) work to RT kthread_worker

From: Paul E. McKenney
Date: Wed Apr 13 2022 - 14:07:18 EST


On Wed, Apr 13, 2022 at 01:21:20PM -0400, Joel Fernandes wrote:
> Hi Paul,
>
>
> On Wed, Apr 13, 2022 at 8:07 AM Paul E. McKenney <paulmck@xxxxxxxxxx> wrote:
> >
> > On Wed, Apr 13, 2022 at 07:37:11PM +0800, Hillf Danton wrote:
> > > On Sat, 9 Apr 2022 08:56:12 -0700 Paul E. McKenney wrote:
> > > > On Sat, Apr 09, 2022 at 03:17:40PM +0800, Hillf Danton wrote:
> > > > > On Fri, 8 Apr 2022 10:53:53 -0700 Kalesh Singh wrote
> > > > > > Thanks for the discussion everyone.
> > > > > >
> > > > > > > We didn't fully switch to kthread workers to avoid changing the
> > > > > > > behavior for users that don't need these low-latency exp GPs. Another
> > > > > > > (and perhaps more important) reason is that kthread_worker offers
> > > > > > > less concurrency than workqueues, which Paul reported can pose
> > > > > > > issues on systems with a large number of CPUs.
> > > > >
> > > > > A second ... what issues were reported wrt concurrency, given the output
> > > > > of grep -nr workqueue block mm drivers.
> > > > >
> > > > > Feel free to post a URL link to the issues.
> > > >
> > > > The issues can be easily seen by inspecting kthread_queue_work() and
> > > > the functions that it invokes. In contrast, normal workqueues use
> > > > per-CPU mechanisms to avoid contention, as can equally easily be seen
> > > > by inspecting queue_work_on() and the functions that it invokes.
> > >
> > > The worker from kthread_create_worker() roughly matches unbound workqueue
> > > that can get every CPU overloaded, thus the difference in implementation
> > > details between kthread worker and WQ worker (either bound or unbound) can
> > > be safely ignored if the kthread method works, given that priority is barely
> > > a cure to concurrency issues.
> >
> > Please look again, this time taking lock contention into account,
> > keeping in mind that systems with several hundred CPUs are reasonably
> > common and that systems with more than a thousand CPUs are not unheard of.
>
> You are talking about lock contention in the kthread_worker infra
> which unbound WQ does not suffer from, right? I don't think the worker
> lock contention will be an issue unless several
> synchronize_rcu_expedited() calls are trying to queue work at the same
> time. Did I miss something? Considering synchronize_rcu_expedited()
> can block in the normal case (blocking is a pretty heavy operation
> involving the scheduler and load balancers), I don't see how
> contending on the worker infra locks can be an issue. If it was
> call_rcu(), then I can relate to any contention since that executes
> much more often.

Think in terms of a system with 1536 CPUs (which IBM would be extremely
happy to sell you, last I checked). This has 96 leaf rcu_node structures.
Keeping that in mind, take another look at that code.

And in the past there have been real systems with 256 leaf rcu_node
structures.
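
The leaf counts above follow from simple arithmetic, which can be sketched
as below. This is a minimal illustration, not kernel code: it assumes a leaf
fanout of 16 CPUs per leaf rcu_node (the usual CONFIG_RCU_FANOUT_LEAF
default, which a given build may override), and both leaf_rcu_nodes() and
ASSUMED_LEAF_FANOUT are hypothetical names made up for this example.

```c
#include <assert.h>

/*
 * Assumption: 16 CPUs per leaf rcu_node, the usual default for
 * CONFIG_RCU_FANOUT_LEAF. Kernels may be configured differently.
 */
#define ASSUMED_LEAF_FANOUT 16

/*
 * Hypothetical helper (not a kernel function): number of leaf rcu_node
 * structures needed to cover nr_cpus CPUs at the given leaf fanout.
 */
static unsigned int leaf_rcu_nodes(unsigned int nr_cpus,
				   unsigned int leaf_fanout)
{
	assert(leaf_fanout > 0);	/* guard against division by zero */
	/* Round up so that a partially filled leaf still counts. */
	return (nr_cpus + leaf_fanout - 1) / leaf_fanout;
}
```

With these assumptions, leaf_rcu_nodes(1536, ASSUMED_LEAF_FANOUT) evaluates
to 96, matching the figure above. The leaf count matters here because, as I
read the expedited grace-period code, work is queued per leaf rcu_node, so
on the order of 96 work items could be queued at about the same time, all
contending for the one lock of a single kthread_worker.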

> I think the argument about too many things being RT is stronger though.

Fair enough. Except that this could be dealt with by conditionally
setting SCHED_FIFO. But the lock contention would remain.

Thanx, Paul

> Thanks,
>
> Joel
>
>
> >
> >
> > Thanx, Paul
> >
> > > Hillf
> > > >
> > > > Please do feel free to take a look.
> > > >
> > > > If taking a look does not convince you, please construct some in-kernel
> > > > benchmarks to test the scalability of these two mechanisms. Please note
> > > > that some care will be required to make sure that you are doing a valid
> > > > apples-to-apples comparison.
> > > >
> > > > Thanx, Paul
> > > >