Re: [PATCH v1 0/3] Avoid scheduling cache draining to isolated cpus

From: Leonardo Brás
Date: Fri Nov 04 2022 - 21:47:06 EST


On Fri, 2022-11-04 at 09:41 +0100, Michal Hocko wrote:
> On Thu 03-11-22 13:53:41, Leonardo Brás wrote:
> > On Thu, 2022-11-03 at 16:31 +0100, Michal Hocko wrote:
> > > On Thu 03-11-22 11:59:20, Leonardo Brás wrote:
> [...]
> > > > I understand there will be a locking cost being paid in the isolated CPUs when:
> > > > a) The isolated CPU is requesting the stock drain,
> > > > b) When the isolated CPUs do a syscall and end up using the protected structure
> > > > the first time after a remote drain.
> > >
> > > And anytime the charging path (consume_stock resp. refill_stock)
> > > contends with the remote draining which is out of control of the RT
> > > task. It is true that the RT kernel will turn that spin lock into a
> > > sleeping RT lock and that could help with potential priority inversions
> > > but still quite costly thing I would expect.
> > >
> > > > Both (a) and (b) should happen during a syscall, and IIUC the a rt workload
> > > > should not expect the syscalls to be have a predictable time, so it should be
> > > > fine.
> > >
> > > Now I am not sure I understand. If you do not consider charging path to
> > > be RT sensitive then why is this needed in the first place? What else
> > > would be populating the pcp cache on the isolated cpu? IRQs?
> >
> > I am mostly trying to deal with drain_all_stock() calling schedule_work_on() at
> > isolated_cpus. Since the scheduled drain_local_stock() will be competing for cpu
> > time with the RT workload, we can have preemption of the RT workload, which is a
> > problem for meeting the deadlines.
>
> Yes, this is understood. But it is not really clear to me why would any
> draining be necessary for such an isolated CPU if no workload other than
> the RT (which pressumably doesn't charge any memory?) is running on that
> CPU? Is that the RT task during the initialization phase that leaves
> that cache behind or something else?

(I am new to this part of the code, so please correct me when I miss something.)

IIUC, if a process belongs to a control group with memory control, the 'charge'
will happen when a memory page starts getting used by it.

So, if we assume a RT load in a isolated CPU will not charge any memory, we are
assuming it will never be part of a memory-controlled cgroup.

I mean, can we just assume this?

If I got that right, would not that be considered a limitation? like
"If you don't want your workload to be interrupted by perCPU cache draining,
don't put it in a cgroup with memory control".

> Sorry for being so focused on this
> but I would like to understand on whether this is avoidable by a
> different startup scheme or it really needs to be addressed in some way.

No worries, I am in fact happy you are giving it this much attention :)

I also understand this is a considerable change in the locking strategy, and
avoiding that is the first thing that should be tried.

>
> > One way I thought to solve that was introducing a remote drain, which would
> > require a different strategy for locking, since not all accesses to the pcp
> > caches would happen on a local CPU.
>
> Yeah, I am not supper happy about additional spin lock TBH. One
> potential way to go would be to completely avoid pcp cache for isolated
> CPUs. That would have some performance impact of course but on the other
> hand it would give a more predictable behavior for those CPUs which
> sounds like a reasonable compromise to me. What do you think?

You mean not having a perCPU stock, then?
So consume_stock() for isolated CPUs would always return false, causing
try_charge_memcg() always walking the slow path?

IIUC, both my proposal and yours would degrade performance only when we use
isolated CPUs + memcg. Is that correct?

If so, it looks like the impact would be even bigger without perCPU stock ,
compared to introducing a spinlock.

Unless, we are counting to this case where a remote CPU is draining an isolated
CPU, and the isolated CPU faults a page, and has to wait for the spinlock to be
released in the remote CPU. Well, this seems possible to happen, but I would
have to analyze how often would it happen, and how much would it impact the
deadlines. I *guess* most of the RT workload's memory pages are pre-faulted
before its starts, so it can avoid the faulting latency, but I need to confirm
that.

On the other hand, compared to how it works now now, this should be a more
controllable way of introducing latency than a scheduled cache drain.

Your suggestion on no-stocks/caches in isolated CPUs would be great for
predictability, but I am almost sure the cost in overall performance would not
be fine.

With the possibility of prefaulting pages, do you see any scenario that would
introduce some undesirable latency in the workload?

Thanks a lot for the discussion!
Leo