Re: [PATCH 6/6] mm: Drain LRUs upon resume to userspace on nohz_full CPUs
From: Vlastimil Babka
Date: Thu Jul 17 2025 - 15:33:46 EST
On 7/3/25 18:12, Michal Hocko wrote:
> On Thu 03-07-25 15:28:23, Matthew Wilcox wrote:
>> On Thu, Jul 03, 2025 at 04:07:17PM +0200, Frederic Weisbecker wrote:
>> > +unsigned int folio_batch_add(struct folio_batch *fbatch,
>> > + struct folio *folio)
>> > +{
>> > + unsigned int ret;
>> > +
>> > + fbatch->folios[fbatch->nr++] = folio;
>> > + ret = folio_batch_space(fbatch);
>> > + isolated_task_work_queue();
>>
>> Umm. LRUs use folio_batches, but they are definitely not the only user
>> of folio_batches. Maybe you want to add a new lru_batch_add()
>> abstraction, because this call is definitely being done at the wrong
>> level.
>
> You have answered one of my question in other response. My initial
> thought was that __lru_add_drain_all seems to be a better fit. But then
__lru_add_drain_all is the part where we queue the drain work to other cpus.
In order to not disrupt isolated cpus, the isolated cpus have to self-
schedule the work to be done on resume, which indeed means at the moment
they are filling the batches to be drained, as this patch does.
> we have a problem that draining will become an unbounded operation which
> will become a problem for lru_cache_disable which will never converge
> until isolated workload does the draining. So it indeed seems like we
> need to queue draining when a page is added. Are there other places
> where we put folios into teh folio_batch than folio_batch_add? I cannot
> seem to see any...
The problem Matthew points out isn't that there would be other places that
add folios to a fbatch, other than folio_batch_add(). The problem is that
many of the folio_batch_add() add something to a temporary batch on a stack,
which is not subject to lru draining, so it's wasteful to queue the
draining for those.
The below diff should address this by creating folio_batch_add_lru() which
is only used in the right places that do fill a lru-drainable folio batch.
It also makes the function static inline again, in mm/internal.h, which
means it's adding some includes to it. For that I had to fix up some wrong
include places of internal.h in varous mm files - these includes should not
appear below "#define CREATE_TRACE_POINTS".
Frederic, feel free to fold this in your patch with my Co-developed-by:
I also noted that since this now handles invalidate_bh_lrus_cpu() maybe we
can drop the cpu_is_isolated() check from bh_lru_install(). I'm not even
sure if the check is equivalent or stronger than the check used in
isolated_task_work_queue().
I have a bit of remaining worry for __lru_add_drain_all() simply skipping
the isolated cpus because then it doesn't wait for the flushing to be
finished via flush_work(). It can thus return before the isolated cpu does
its drain-on-resume, which might violate some expectations of
lru_cache_disable(). Is there a possibility to make it wait for the drain-
on-resume or determine there's not any, in a reliable way?
----8<----