Re: [PATCH v2 6/6] zswap: memcg accounting

From: Michal Koutný
Date: Mon May 16 2022 - 10:35:09 EST


On Fri, May 13, 2022 at 01:08:13PM -0400, Johannes Weiner <hannes@xxxxxxxxxxx> wrote:
> Right, it's accounted as a subset rather than fully disjointed. But it
> is a limitable counter of its own, so I exported it as such, with a
> current and a max knob. This is comparable to the kmem counter in v1.

That counter and limit didn't turn out well. I liked the analogy to
writeback (and dirty limit) better.

> From an API POV it would be quite strange to have max for a counter
> that has no current. Likewise it would be strange for a major memory
> consumer to be missing from memory.stat.

My understanding would be to have all memory.stat entries as you
propose, no extra .current counter and the .max knob for zswap
configuration.

> It needs to be configured to the workload's access frequency curve,
> which can be done with trial-and-error (reasonable balance between
> zswpins and pswpins) or in a more targeted manner using tools such as
> page_idle, damon etc.
> [...]
> Because for load tuning, bytes make much more sense. That's how you
> measure the workingset, so a percentage is an awkward indirection. At
> the cgroup level, it makes even less sense: all memcg tunables are in
> bytes, it would be quite weird to introduce a "max" that is 0-100. Add
> the confusion of how percentages would propagate down the hierarchy...

Thanks for the explanation. I guess there's no simple tranformation of
in-kernel available information that'd allow a more semantic
configuration of this value. The rather crude absolute value requires
(but also simply allows) some calibration or responsive tuning.

> I don't traverse all ancestors, I bail on disabled groups and skip
> unlimited ones.

I admit I missed that.

> Flushing unnecessary groups with a ratelimit doesn't sound like an
> improvement to me.

Then I'm only concerned about a situation when there's a single deep
memcg that undergoes both workingset_refault() and zswap querying.
The latter (bare call to cgroup_rstat_flush()) won't reset
stats_flush_threshold, so the former (or the async flush more likely)
would attempt a flush too. The flush work (on the leaf memcg) would be
done twice even though it may be within the tolerance of cumulated
error the second time.

This is a thing that might require attention in the future (depending on
some data how it actually performs). I see how the current approach is
justified.


Regards,
Michal