Re: [RFC PATCH 4/7] memcg: sleep during flushing stats in safe contexts

From: Shakeel Butt
Date: Thu Mar 23 2023 - 15:35:40 EST


On Thu, Mar 23, 2023 at 11:08 AM Yosry Ahmed <yosryahmed@xxxxxxxxxx> wrote:
>
> On Thu, Mar 23, 2023 at 10:27 AM Johannes Weiner <hannes@xxxxxxxxxxx> wrote:
> >
> > On Thu, Mar 23, 2023 at 09:01:12AM -0700, Yosry Ahmed wrote:
> > > On Thu, Mar 23, 2023 at 8:56 AM Johannes Weiner <hannes@xxxxxxxxxxx> wrote:
> > > >
> > > > On Thu, Mar 23, 2023 at 04:00:34AM +0000, Yosry Ahmed wrote:
> > > > > @@ -644,26 +644,26 @@ static void __mem_cgroup_flush_stats(void)
> > > > > return;
> > > > >
> > > > > flush_next_time = jiffies_64 + 2*FLUSH_TIME;
> > > > > - cgroup_rstat_flush(root_mem_cgroup->css.cgroup, false);
> > > > > + cgroup_rstat_flush(root_mem_cgroup->css.cgroup, may_sleep);
> > > >
> > > > How is it safe to call this with may_sleep=true when it's holding the
> > > > stats_flush_lock?
> > >
> > > stats_flush_lock is always called with trylock, it is only used today
> > > so that we can skip flushing if another cpu is already doing a flush
> > > (which is not 100% correct as they may have not finished flushing yet,
> > > but that's orthogonal here). So I think it should be safe to sleep as
> > > no one can be blocked waiting for this spinlock.
> >
> > I see. It still cannot sleep while the lock is held, though, because
> > preemption is disabled. Make sure you have all lock debugging on while
> > testing this.
>
> Thanks for pointing this out, will do.
>
> >
> > > Perhaps it would be better semantically to replace the spinlock with
> > > an atomic test and set, instead of having a lock that can only be used
> > > with trylock?
> >
> > It could be helpful to clarify what stats_flush_lock is protecting
> > first. Keep in mind that locks should protect data, not code paths.
> >
> > Right now it's doing multiple things:
> >
> > 1. It protects updates to stats_flush_threshold
> > 2. It protects updates to flush_next_time
> > 3. It serializes calls to cgroup_rstat_flush() based on those ratelimits
> >
> > However,
> >
> > 1. stats_flush_threshold is already an atomic
> >
> > 2. flush_next_time is not atomic. The writer is locked, but the reader
> > is lockless. If the reader races with a flush, you could see this:
> >
> > if (time_after(jiffies, flush_next_time))
> > spin_trylock()
> > flush_next_time = now + delay
> > flush()
> > spin_unlock()
> > spin_trylock()
> > flush_next_time = now + delay
> > flush()
> > spin_unlock()
> >
> > which means we already can get flushes at a higher frequency than
> > FLUSH_TIME during races. But it isn't really a problem.
> >
> > The reader could also see garbled partial updates, so it needs at
> > least READ_ONCE and WRITE_ONCE protection.
> >
> > 3. Serializing cgroup_rstat_flush() calls against the ratelimit
> > factors is currently broken because of the race in 2. But the race
> > is actually harmless, all we might get is the occasional earlier
> > flush. If there is no delta, the flush won't do much. And if there
> > is, the flush is justified.
> >
> > In summary, it seems to me the lock can be ditched altogether. All the
> > code needs is READ_ONCE/WRITE_ONCE around flush_next_time.
>
> Thanks a lot for this analysis. I agree that the lock can be removed
> with proper READ_ONCE/WRITE_ONCE, but I think there is another purpose
> of the lock that we are missing here.
>
> I think one other purpose of the lock is avoiding a thundering herd
> problem on cgroup_rstat_lock, particularly from reclaim context, as
> mentioned by the log of commit aa48e47e3906 ("memcg: infrastructure
> to flush memcg stats").
>
> While testing, I did notice that removing this lock indeed causes a
> thundering herd problem if we have a lot of concurrent reclaimers. The
> trylock makes sure we abort immediately if someone else is flushing --
> which is not ideal because that flusher might have just started, and
> we may end up reading stale data anyway.
>
> This is why I suggested replacing the lock by an atomic, and do
> something like this if we want to maintain the current behavior:
>
> static void __mem_cgroup_flush_stats(void)
> {
> ...
> if (atomic_xchg(&ongoing_flush, 1))
> return;
> ...
> atomic_set(&ongoing_flush, 0)
> }
>
> Alternatively, if we want to change the behavior and wait for the
> concurrent flusher to finish flushing, we can maybe spin until
> ongoing_flush goes back to 0 and then return:
>
> static void __mem_cgroup_flush_stats(void)
> {
> ...
> if (atomic_xchg(&ongoing_flush, 1)) {
> /* wait until the ongoing flusher finishes to get updated stats */
> while (atomic_read(&ongoing_flush) {};
> return;
> }
> /* flush the stats ourselves */
> ...
> atomic_set(&ongoing_flush, 0)
> }
>
> WDYT?

I would go with your first approach i.e. no spinning.