Re: [PATCH v2] mm: limit mmu_gather batching to fix soft lockups on !CONFIG_PREEMPT

From: Michal Hocko
Date: Thu Dec 20 2012 - 07:47:03 EST

On Wed 19-12-12 13:13:16, Andrew Morton wrote:
> On Wed, 19 Dec 2012 16:04:37 +0100
> Michal Hocko <mhocko@xxxxxxx> wrote:
> > Since e303297 (mm: extended batches for generic mmu_gather) we are batching
> > pages to be freed until either tlb_next_batch cannot allocate a new batch or we
> > are done.
> >
> > This works just fine most of the time but we can get into trouble
> > on large machines, where too aggressive batching might lead to soft
> > lockups during the process exit path (exit_mmap) because there are no
> > scheduling points down the free_pages_and_swap_cache path and so the
> > freeing can take long enough to trigger the soft lockup.
> >
> > The lockup is harmless except when the system is set up to panic on
> > softlockup, which is not that unusual.
> >
> > The simplest way to work around this issue is to limit the maximum
> > number of batches in a single mmu_gather for !CONFIG_PREEMPT kernels.
> > Let's use 1G of resident memory for the limit for now. This shouldn't
> > make the batching less effective and it shouldn't trigger lockups
> > either, because freeing 262144 pages should be OK.
> >
> > ...
> >
> > diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
> > index ed6642a..5843f59 100644
> > --- a/include/asm-generic/tlb.h
> > +++ b/include/asm-generic/tlb.h
> > @@ -78,6 +78,19 @@ struct mmu_gather_batch {
> > #define MAX_GATHER_BATCH \
> > ((PAGE_SIZE - sizeof(struct mmu_gather_batch)) / sizeof(void *))
> >
> > +/*
> > + * Limit the maximum number of mmu_gather batches for non-preemptible kernels
> > + * to reduce a risk of soft lockups on huge machines when a lot of memory is
> > + * zapped during unmapping.
> > + * 1GB of resident memory should be safe to free up at once even without
> > + * explicit preemption point.
> > + */
> > +#if defined(CONFIG_PREEMPT_COUNT)
> > +#else
> Geeze. I spent waaaaay too long staring at that expression trying to
> work out "how many pages is in a batch" and gave up.
> Realistically, I don't think we need to worry about CONFIG_PREEMPT here
> - if we just limit the thing to, say, 64k pages per batch then that
> will be OK for preemptible and non-preemptible kernels.

I wanted the fix to be as non-intrusive as possible so I didn't want to
touch PREEMPT (which is default in many configs) at all. I am OK with a
single limit, of course.

> The performance difference between "64k" and "infinite" will be
> minuscule and unmeasurable.
> Also, the batch count should be independent of PAGE_SIZE. Because
> PAGE_SIZE can vary by a factor of 16 and you don't want to fix the
> problem on 4k page size but leave it broken on 64k page size.

MAX_GATHER_BATCH depends on the page size, so I didn't want to deviate
from that without a good reason.
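
To make the page size dependence concrete, here is a small userspace
sketch of the arithmetic, not kernel code. It assumes a 64-bit build
where sizeof(struct mmu_gather_batch) is 16 bytes and pointers are 8
bytes. Both constants and all helper names are assumptions made for
illustration only.

```c
#include <assert.h>

/* Assumed sizes for a 64-bit build; not taken from the patch. */
#define BATCH_HEADER_SIZE 16UL	/* assumed sizeof(struct mmu_gather_batch) */
#define POINTER_SIZE       8UL	/* assumed sizeof(void *) */

/* Pages one batch can track; mirrors the MAX_GATHER_BATCH expression. */
static unsigned long pages_per_batch(unsigned long page_size)
{
	assert(page_size > BATCH_HEADER_SIZE);
	return (page_size - BATCH_HEADER_SIZE) / POINTER_SIZE;
}

/* Number of pages of the given size in 1G of resident memory. */
static unsigned long pages_in_1g(unsigned long page_size)
{
	assert(page_size > 0);
	return (1UL << 30) / page_size;
}

/* Whole batches needed to cover 1G (rounded down). */
static unsigned long whole_batches_in_1g(unsigned long page_size)
{
	return pages_in_1g(page_size) / pages_per_batch(page_size);
}
```

Under these assumptions a 4k page size gives 510 pages per batch and
262144 pages in 1G, while a 64k page size gives 8190 pages per batch and
16384 pages in 1G, so the number of pages per batch differs by roughly
the same factor of 16 as the page size.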

> Also, while the patch might prevent softlockup warnings, the kernel
> will still exhibit large latency glitches and those are undesirable.

Not really. cond_resched is called per pmd. This patch just helps the
case where there is enough free memory to batch too much, which then
triggers a soft lockup while flushing mmu_gather after the whole zapping
is done. With the limit, tlb_flush_mmu is called more often.

> Also, does this patch actually work? It doesn't add a scheduling
> point. It assumes that by returning zero from tlb_next_batch(), the
> process will back out to some point where it hits a cond_resched()?

No, as mentioned above. cond_resched is called per pmd independently
of how much batching we do, but then after free_pgtables is done we
call tlb_finish_mmu and that one needs to free all the gathered
pages. Without the limit we can have too many pages to free and that is
what triggers the soft lockup. My original patch was more obvious because
it added the cond_resched, but as you pointed out it could be problematic,
so this patch tries to eliminate the problem at the very beginning instead.
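
The effect of capping the number of batches can be modelled in
userspace. The sketch below is a simplified model and not the kernel
implementation: struct gather_model, the model_*() helpers and both
constants are hypothetical names and values chosen for illustration.
It only shows that once the batch allocator refuses to hand out another
batch, the gathered pages are freed in bounded chunks instead of all at
once at the end.

```c
#define MODEL_BATCH_PAGES 510UL	/* assumed pages per batch */
#define MODEL_MAX_BATCHES   4UL	/* hypothetical cap, tiny for the demo */

struct gather_model {
	unsigned long batch_count;	/* batches currently held */
	unsigned long nr_in_batch;	/* pages in the active batch */
	unsigned long pending;		/* pages gathered, not yet freed */
	unsigned long max_pending;	/* most pages freed by one flush */
	unsigned long flushes;		/* flushes that freed something */
};

static void model_init(struct gather_model *g)
{
	g->batch_count = 1;	/* start with one batch in hand */
	g->nr_in_batch = 0;
	g->pending = 0;
	g->max_pending = 0;
	g->flushes = 0;
}

/* Returns 0 when the cap is hit, which forces the caller to flush. */
static int model_next_batch(struct gather_model *g)
{
	if (g->batch_count == MODEL_MAX_BATCHES)
		return 0;
	g->batch_count++;
	g->nr_in_batch = 0;
	return 1;
}

/* Frees everything gathered so far and goes back to a single batch. */
static void model_flush(struct gather_model *g)
{
	if (g->pending > g->max_pending)
		g->max_pending = g->pending;
	if (g->pending)
		g->flushes++;
	g->pending = 0;
	g->batch_count = 1;
	g->nr_in_batch = 0;
}

/* Gathers one page; flushes early if no further batch is available. */
static void model_remove_page(struct gather_model *g)
{
	g->pending++;
	g->nr_in_batch++;
	if (g->nr_in_batch == MODEL_BATCH_PAGES && !model_next_batch(g))
		model_flush(g);
}

/* Zaps nr_pages pages, then does the final flush at the end. */
static void model_zap(struct gather_model *g, unsigned long nr_pages)
{
	unsigned long i;

	for (i = 0; i < nr_pages; i++)
		model_remove_page(g);
	model_flush(g);
}
```

With these demo values, zapping 10000 pages never frees more than
4 * 510 = 2040 pages in a single flush. Without the cap, the model
would hold all 10000 pages until the final flush.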

> So I'm thinking that to address both the softlockup-detector problem
> and the large-latency-glitch problem we should do something like:
> if (need_resched() && tlb->batch_count > 64k)
> return 0;

need_resched is not needed because of cond_resched in zap_pmd_range. I
am OK with a fixed limit.

> and then ensure that there's a cond_resched() at a safe point between
> batches?

Michal Hocko