Re: [PATCH 8/9] sched/psi: add kernel cmdline parameter psi_inner_cgroup

From: Johannes Weiner
Date: Wed Aug 03 2022 - 15:22:25 EST


On Wed, Aug 03, 2022 at 07:58:27AM -1000, Tejun Heo wrote:
> Hello,
>
> On Wed, Aug 03, 2022 at 08:17:22PM +0800, Chengming Zhou wrote:
> > > Assuming the above isn't wrong, if we can figure out how we can re-enable
> > > it, which is more difficult as the counters need to be resynchronized with
> > > the current state, that'd be ideal. Then, we can just allow each cgroup to
> > > enable / disable PSI reporting dynamically as they see fit.
> >
> > This method is more fine-grained but more difficult like you said above.
> > I think it may meet most needs to disable PSI stats in intermediate cgroups?
>
> So, I'm not necessarily against implementing something easier but we at
> least wanna get the interface right, so that if we decide to do the full
> thing later we can easily expand on the existing interface. ie. let's please
> not be too hacky. I don't think it'd be that difficult to implement
> per-cgroup disable-only operation that we can later expand to allow
> re-enabling, right?

It should be relatively straight-forward to disable and re-enable
state aggregation, time tracking, averaging on a per-cgroup level, if
we can live with losing history from while it was disabled. I.e. the
avgs will restart from 0, total= will have gaps - should be okay, IMO.

Where it gets trickier is also stopping the tracking of task counts in
a cgroup. For re-enabling afterwards, we'd have to freeze scheduler
and cgroup state and find all tasks of interest across all CPUs for
the given cgroup to recreate the counts. I'm not quite sure whether
that's feasible, and if so, whether it's worth the savings.

It might be good to benchmark the two disabling steps independently.
Maybe stopping aggregation while keeping task counts is good enough,
and we can commit to a disable/re-enable interface from the start.

Or maybe it's all in the cachelines and iteration, and stopping the
aggregation while still writing task counts isn't saving much. In that
case we'd have to look closer at reconstructing task counts, to see if
later re-enabling is actually a practical option or whether a one-off
kill switch is more realistic.

Chengming, can you experiment with disabling: record_times(), the
test_state() loop and state_mask construction, and the averaging
worker - while keeping the groupc->tasks updates?