Re: [PATCH v2] mm: emit tracepoint when RSS changes by threshold

From: Daniel Colascione
Date: Thu Sep 05 2019 - 18:12:44 EST


On Thu, Sep 5, 2019 at 2:14 PM Tom Zanussi <zanussi@xxxxxxxxxx> wrote:
>
> On Thu, 2019-09-05 at 13:24 -0700, Daniel Colascione wrote:
> > On Thu, Sep 5, 2019 at 12:56 PM Tom Zanussi <zanussi@xxxxxxxxxx>
> > wrote:
> > > On Thu, 2019-09-05 at 13:51 -0400, Joel Fernandes wrote:
> > > > On Thu, Sep 05, 2019 at 01:47:05PM -0400, Joel Fernandes wrote:
> > > > > On Thu, Sep 05, 2019 at 01:35:07PM -0400, Steven Rostedt wrote:
> > > > > >
> > > > > >
> > > > > > [ Added Tom ]
> > > > > >
> > > > > > On Thu, 5 Sep 2019 09:03:01 -0700
> > > > > > Suren Baghdasaryan <surenb@xxxxxxxxxx> wrote:
> > > > > >
> > > > > > > > On Thu, Sep 5, 2019 at 7:43 AM Michal Hocko <mhocko@kernel.org> wrote:
> > > > > > > >
> > > > > > > > [Add Steven]
> > > > > > > >
> > > > > > > > On Wed 04-09-19 12:28:08, Joel Fernandes wrote:
> > > > > > > > > > On Wed, Sep 4, 2019 at 11:38 AM Michal Hocko <mhocko@kernel.org> wrote:
> > > > > > > > > >
> > > > > > > > > > On Wed 04-09-19 11:32:58, Joel Fernandes wrote:
> > > > > > > >
> > > > > > > > [...]
> > > > > > > > > > > but also for reducing
> > > > > > > > > > > tracing noise. Flooding the traces makes it less
> > > > > > > > > > > useful
> > > > > > > > > > > for long traces and
> > > > > > > > > > > post-processing of traces. IOW, the overhead
> > > > > > > > > > > reduction
> > > > > > > > > > > is a bonus.
> > > > > > > > > >
> > > > > > > > > > This is not really anything special for this
> > > > > > > > > > tracepoint
> > > > > > > > > > though.
> > > > > > > > > > Basically any tracepoint in a hot path is in the same
> > > > > > > > > > situation and I do
> > > > > > > > > > not see a point why each of them should really invent
> > > > > > > > > > its
> > > > > > > > > > own way to
> > > > > > > > > > throttle. Maybe there is some way to do that in the
> > > > > > > > > > tracing subsystem
> > > > > > > > > > directly.
> > > > > > > > >
> > > > > > > > > I am not sure if there is a way to do this easily. Add
> > > > > > > > > to
> > > > > > > > > that, the fact that
> > > > > > > > > you still have to call into trace events. Why call into
> > > > > > > > > it
> > > > > > > > > at all, if you can
> > > > > > > > > filter in advance and have a sane filtering default?
> > > > > > > > >
> > > > > > > > > The bigger improvement with the threshold is the number
> > > > > > > > > of
> > > > > > > > > trace records are
> > > > > > > > > almost halved by using a threshold. The number of
> > > > > > > > > records
> > > > > > > > > went from 4.6K to
> > > > > > > > > 2.6K.
> > > > > > > >
> > > > > > > > Steven, would it be feasible to add a generic tracepoint
> > > > > > > > throttling?
> > > > > > >
> > > > > > > I might misunderstand this but is the issue here actually
> > > > > > > throttling
> > > > > > > of the sheer number of trace records or tracing large
> > > > > > > enough
> > > > > > > changes
> > > > > > > to RSS that user might care about? Small changes happen all
> > > > > > > the
> > > > > > > time
> > > > > > > but we are likely not interested in those. Surely we could
> > > > > > > postprocess
> > > > > > > the traces to extract changes large enough to be
> > > > > > > interesting
> > > > > > > but why
> > > > > > > capture uninteresting information in the first place? IOW
> > > > > > > the
> > > > > > > throttling here should be based not on the time between
> > > > > > > traces
> > > > > > > but on
> > > > > > > the amount of change of the traced signal. Maybe a generic
> > > > > > > facility
> > > > > > > like that would be a good idea?
> > > > > >
> > > > > > You mean like add a trigger (or filter) that only traces if a
> > > > > > field has
> > > > > > changed since the last time the trace was hit? Hmm, I think
> > > > > > we
> > > > > > could
> > > > > > possibly do that. Perhaps even now with histogram triggers?
> > > > >
> > > > >
> > > > > Hey Steve,
> > > > >
> > > > > > Something like an analog to digital conversion function where
> > > > > you
> > > > > lose the
> > > > > granularity of the signal depending on how much trace data:
> > > > > > https://www.globalspec.com/ImageRepository/LearnMore/20142/9ee38d1a85d37fa23f86a14d3a9776ff67b0ec0f3b.gif
> > > >
> > > > s/how much trace data/what the resolution is/
> > > >
> > > > > so like, if you had a counter incrementing with values after
> > > > > the
> > > > > increments
> > > > > as: 1,3,4,8,12,14,30 and say 5 is the threshold at which to
> > > > > emit a
> > > > > trace,
> > > > > then you would get 1,8,12,30.
> > > > >
> > > > > > So I guess what is needed is a way to reduce the quantity of trace
> > > > > data this
> > > > > way. For this usecase, the user mostly cares about spikes in
> > > > > the
> > > > > counter
> > > > > changing that accurate values of the different points.
> > > >
> > > > s/that accurate/than accurate/
> > > >
> > > > I think Tim, Suren, Dan and Michal are all saying the same thing
> > > > as
> > > > well.
> > > >
> > >
> > > There's not a way to do this using existing triggers (histogram
> > > triggers have an onchange() that fires on any change, but that
> > > doesn't
> > > help here), and I wouldn't expect there to be - these sound like
> > > very
> > > specific cases that would never have support in the simple trigger
> > > 'language'.
> >
> > I don't see the filtering under discussion as some "very specific"
> > esoteric need. You need this general kind of mechanism any time you
> > want to monitor at low frequency a thing that changes at high
> > frequency. The general pattern isn't specific to RSS or even memory
> > in
> > general. One might imagine, say, wanting to trace large changes in
> > TCP
> > window sizes. Any time something in the kernel has a "level" and that
> > level changes at high frequency and we want to learn about big swings
> > in that level, the mechanism we're talking about becomes useful. I
> > don't think it should be out of bounds for the histogram mechanism,
> > which is *almost* there right now. We already have the ability to
> > accumulate values derived from ftrace events into tables keyed on
> > various fields in these events and things like onmax().
> >
>
> OK, so with the histograms we already have onchange(), which triggers
> on any change.
>
> Would it be sufficient to just add a 'threshold' param to that i.e.
> onchange(x) means trigger whenever the difference between the new value
> and the previous value is >= x?

By previous value, do you mean previously-reported value or the
previously-set value? If the former, that's good, but we may be able
to do better (see below). If the latter, I don't think that's quite
right, because then we could miss an arbitrarily-large change in
"level" so long as it occurred in sufficiently small steps.

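To make the difference between the two readings concrete, here is a
rough sketch in Python rather than kernel code. The function names are
made up for illustration, and treating the first sample as an
already-reported baseline is an assumption of the sketch, not something
the histogram code does today.

```python
def report_vs_previous_set(values, threshold):
    """Fire when the new value differs from the immediately preceding
    value (the previously-set value) by at least `threshold`."""
    fired = []
    prev = None
    for v in values:
        if prev is not None and abs(v - prev) >= threshold:
            fired.append(v)
        prev = v
    return fired


def report_vs_last_reported(values, threshold):
    """Fire when the new value differs from the last value we actually
    reported by at least `threshold`."""
    fired = []
    last_reported = None
    for v in values:
        if last_reported is None:
            # Assumption: the first sample only establishes a baseline.
            last_reported = v
            continue
        if abs(v - last_reported) >= threshold:
            fired.append(v)
            last_reported = v
    return fired


# A level that creeps from 0 to 20 in steps of 1, with a threshold of 5.
samples = list(range(0, 21))
drift_prev_set = report_vs_previous_set(samples, 5)
drift_last_reported = report_vs_last_reported(samples, 5)
print(drift_prev_set)       # no single step is large enough
print(drift_last_reported)  # fires every time the level has moved by 5
```

With the previously-set reading the whole 0-to-20 drift goes unreported;
with the previously-reported reading it is caught in increments.
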
Basically, what I have in mind is this:

1) attach a trigger to a tracepoint that contains some
absolutely-specified "level" (say, task private RSS),

2) in the trigger, find the absolute value of the difference between
the new "level" (some field in that tracepoint) and the last "level"
we recorded for the same combination of configurable partitioning
criteria (e.g., pid, uid, cpu, NUMA node), yielding an absdelta,

3) accumulate absdelta values in some table partitioned on the same
fields as in #2, and

4) after updating accumulated absdelta, evaluate a filter expression,
and if the filter expression evaluates to true, emit a tracepoint
saying "the new value of $LEVEL for $PARTITION is $VALUE" and reset
the accumulated absdelta to zero.
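
Steps 1-4 could be modelled roughly as follows. This is a userspace
Python sketch of the idea only, not a proposed kernel interface: the
class name AbsDeltaTrigger, the on_event() method and the should_fire
callback are all hypothetical, and letting the first sample for a
partition merely set the baseline is an assumption.

```python
from collections import defaultdict


class AbsDeltaTrigger:
    """Accumulate |delta| of a level per partition and report when a
    caller-supplied filter says the accumulated change is interesting."""

    def __init__(self, should_fire):
        self.should_fire = should_fire       # filter on accumulated absdelta
        self.last_level = {}                 # partition -> last level seen
        self.accumulated = defaultdict(int)  # partition -> summed absdelta

    def on_event(self, partition, level):
        prev = self.last_level.get(partition)
        self.last_level[partition] = level
        if prev is None:
            # Assumption: first sample for a partition is just a baseline.
            return None
        absdelta = abs(level - prev)                 # step 2
        self.accumulated[partition] += absdelta      # step 3
        if self.should_fire(self.accumulated[partition]):  # step 4
            self.accumulated[partition] = 0
            return (partition, level)  # "new value of LEVEL for PARTITION"
        return None


# Partition on pid; report once 10 units of change have accumulated.
trigger = AbsDeltaTrigger(lambda acc: acc >= 10)
events = [(1, 100), (2, 50), (1, 104), (1, 100), (2, 53), (1, 104), (1, 100)]
emitted = []
for pid, level in events:
    report = trigger.on_event(pid, level)
    if report is not None:
        emitted.append(report)
print(emitted)
```

In this run pid 1 only ever oscillates between 100 and 104, so no
single step and no net change is large, but the accumulated absdelta
still crosses the limit and one report is emitted for it.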

I think this mechanism would give us what we wanted in a general and
powerful way, and we can dial the granularity up or down however we want.
The reason I want to accumulate absdelta values instead of just firing
when the previously-reported value differs sufficiently from the
last-set value is so we can tell whether a counter is fluctuating a
lot without its value actually changing. The trigger expression could
then allow any combination of conditions, e.g., "fire a tracepoint
when the accumulated change is greater than 2MB _OR_ a single change
is greater than 1MB _OR_ when we've gone 10 changes of this level
value without reporting the level's new value".
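
As an illustration only, that example expression might look like this
in Python. LevelState, should_fire() and update() are hypothetical
names, the thresholds are the ones from the sentence above, and reading
"gone 10 changes without reporting" as "fire on the tenth unreported
change" is one possible interpretation.

```python
from dataclasses import dataclass

MB = 1024 * 1024


@dataclass
class LevelState:
    last_level: int
    accumulated: int = 0         # summed absdelta since the last report
    unreported_changes: int = 0  # updates seen since the last report


def should_fire(state, absdelta):
    """The example filter: any one of the three conditions is enough."""
    return (state.accumulated > 2 * MB
            or absdelta > 1 * MB
            or state.unreported_changes >= 10)


def update(state, level):
    """Feed one new level value; return True if a report would be emitted."""
    absdelta = abs(level - state.last_level)
    state.last_level = level
    state.accumulated += absdelta
    state.unreported_changes += 1
    if should_fire(state, absdelta):
        # Reporting resets the per-partition bookkeeping.
        state.accumulated = 0
        state.unreported_changes = 0
        return True
    return False


# Three steps of 0.75MB: no single step exceeds 1MB, but the third
# pushes the accumulated change past 2MB.
state = LevelState(last_level=0)
step = 768 * 1024
results = [update(state, step * i) for i in (1, 2, 3)]
print(results)
```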

Let me know if this description doesn't make sense.