Re: munmap extremely slow even with untouched mapping.

From: Robin Holt
Date: Mon Oct 31 2005 - 07:21:12 EST


On Mon, Oct 31, 2005 at 08:10:50PM +1100, Nick Piggin wrote:
> Hugh Dickins wrote:
> >On Sun, 30 Oct 2005, Nick Piggin wrote:
>
> >>I wonder if we should go with Robin's fix (+/- my variation)
> >>as a temporary measure for 2.6.15?
> >
> >
> >You're right, I was too dismissive. I've now spent a day looking into
> >the larger rework, and it's a bigger job than I'd thought - partly the
> >architecture variations, partly the fast/slow paths and other "tlb" cruft,
> >partly the truncation case's i_mmap_lock (and danger of making no progress
> >whenever we drop it). I'll have to set all that aside for now.
> >
>
> Yep, I took a quick look before adding my bugs to Robin's patch, and
> basically came to the same conclusion.
>
> >I've taken another look at the two patches. The main reason I preferred
> >yours was that I misread Robin's! But yes, yours takes it a bit further,
> >and I think that is worthwhile.
> >
>
> Sure - the neat way to solve it is Robin's method, of course. I just
> preferred to change the accounting slightly so we wouldn't, for example,
> re-walk the whole call stack for each empty pgd.
>
> >But a built and tested version would be better. Aren't you trying to
> >return addr from each level (that's what I liked, and what I'll want to
> >do in the end)? But some levels are returning nothing,
>
> Hmph, seem to have got half a patch there :P
>
> >and unmap_vmas
> >does start +=, and the huge case leaves start unchanged, and
>
> Dang, thanks.
>
> >zap_work
> >should be a long so it doesn't need casting almost everywhere, and...
> >given all that, I bet there's more!
> >
>
> Yep, good idea. New tested patch attached. Robin, if you agree with the
> changes I made, would you sign this one and send it on to Linus and/or
> Andrew, please?
>
> >As to whether p??_none should count for 1 where !pte_none counts for
> >PAGE_SIZE, well, they say a picture is worth a thousand words, and I'm
> >sure that's entered your calculation ;-) I'd probably make the p??_none
>
> That didn't, but it confirms my theory. I was working on the basis
> that a *page* is worth a thousand words - but I think we can assume
> any picture worth looking at will be given its own page.
>
> I don't know if we should worry about increasing the cost of the p??_none
> case - those walks largely operate on contiguous memory, free of atomic
> ops. And the cost for present pages must also include the TLB and possible
> freeing work.
>
> Trivial to change once the patch is in though.
>
> >count for a little more. Perhaps we should get everyone involved in a
> >great profiling effort across the architectures to determine it.
> >Config option. Sys tunable. I'll shut up.
> >
>
> I was going to make it a tunable, but then I realised someone already
> has - /dev/kmem ;)
>
> --
> SUSE Labs, Novell Inc.

> The address-based work estimate for unmapping (used for lock breaking) is,
> and always was, horribly inefficient for sparse mappings. The problem is
> most simply explained with an example:
>
> If we find a pgd is clear, we still have to call into unmap_page_range
> PGDIR_SIZE / ZAP_BLOCK_SIZE times, each time checking the clear pgd, in
> order to progress the working address to the next pgd.
>
> The fundamental way to solve the problem is to keep track of the end address
> we've processed and pass it back to the higher layers.
>
> From: Robin Holt <holt@xxxxxxx>
>
> Modification to move completely away from the address-based work estimate
> and instead use an abstract work count, with a very small cost charged for
> empty entries as opposed to present pages.
>
> On 2.6.14-git2, ppc64, and CONFIG_PREEMPT=y, mapping and unmapping 1TB of
> virtual address space takes 1.69s; with the following patch applied, the
> same operation can be done 1000 times in less than 0.01s.
>
> Signed-off-by: Nick Piggin <npiggin@xxxxxxx>

Thanks, Nick, for doing all the work for me. I still haven't tested this,
but it is definitely what I had hoped would be done to speed this up.

Signed-off-by: Robin Holt <holt@xxxxxxx>
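
For anyone reading the patch cold, the shape of the change is: every level
of the zap walk now returns the address it actually reached and charges its
progress against a shared budget (*zap_work) - an empty entry costs 1, a
present page costs PAGE_SIZE - so unmap_vmas() can resume exactly where the
walk stopped instead of re-entering from a recomputed address. Below is a
rough, self-contained illustration of that pattern; it uses made-up names
(zap_level, WORK_BUDGET, EMPTY_COST, PAGE_COST) and plain user-space C, and
is only a sketch of the idea, not the kernel code itself.

#include <stdio.h>

#define PAGE_COST	4096		/* a present page is "a page's worth" of work */
#define EMPTY_COST	1		/* an empty (pte_none-style) entry is nearly free */
#define WORK_BUDGET	(8 * 4096)	/* analogous to the 8-page ZAP_BLOCK_SIZE under PREEMPT */

/*
 * One level of the walk, in miniature: process entries from addr to end,
 * charge each one against the budget, and return the address actually
 * reached.  The real patch does this in zap_pte_range() and friends.
 */
static unsigned long zap_level(const int *present, unsigned long addr,
			       unsigned long end, long *zap_work)
{
	do {
		if (present[addr / 4096])
			*zap_work -= PAGE_COST;		/* real unmapping work */
		else
			*zap_work -= EMPTY_COST;	/* skipping is nearly free */
		addr += 4096;
	} while (addr != end && *zap_work > 0);

	return addr;				/* caller resumes from here */
}

int main(void)
{
	int present[1024] = { 0 };		/* a fully sparse "mapping" */
	unsigned long start = 0, end = 1024 * 4096UL;
	long zap_work = WORK_BUDGET;
	int passes = 0;

	while (start != end) {
		start = zap_level(present, start, end, &zap_work);
		passes++;
		if (zap_work <= 0)
			zap_work = WORK_BUDGET;	/* the "lock break" point */
	}

	/*
	 * Per-entry accounting crosses the whole empty region in one pass,
	 * since 1024 * EMPTY_COST fits inside one budget.  The old
	 * address-based estimate would have taken end / WORK_BUDGET = 128
	 * passes, each re-checking the same empty entries.
	 */
	printf("passes: %d\n", passes);
	return 0;
}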


Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c
+++ linux-2.6/mm/memory.c
@@ -549,10 +549,10 @@ int copy_page_range(struct mm_struct *ds
return 0;
}

-static void zap_pte_range(struct mmu_gather *tlb,
+static unsigned long zap_pte_range(struct mmu_gather *tlb,
struct vm_area_struct *vma, pmd_t *pmd,
unsigned long addr, unsigned long end,
- struct zap_details *details)
+ long *zap_work, struct zap_details *details)
{
struct mm_struct *mm = tlb->mm;
pte_t *pte;
@@ -563,10 +563,15 @@ static void zap_pte_range(struct mmu_gat
pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
do {
pte_t ptent = *pte;
- if (pte_none(ptent))
+ if (pte_none(ptent)) {
+ (*zap_work)--;
continue;
+ }
if (pte_present(ptent)) {
struct page *page = NULL;
+
+ (*zap_work) -= PAGE_SIZE;
+
if (!(vma->vm_flags & VM_RESERVED)) {
unsigned long pfn = pte_pfn(ptent);
if (unlikely(!pfn_valid(pfn)))
@@ -624,16 +629,18 @@ static void zap_pte_range(struct mmu_gat
if (!pte_file(ptent))
free_swap_and_cache(pte_to_swp_entry(ptent));
pte_clear_full(mm, addr, pte, tlb->fullmm);
- } while (pte++, addr += PAGE_SIZE, addr != end);
+ } while (pte++, addr += PAGE_SIZE, (addr != end && *zap_work > 0));

add_mm_rss(mm, file_rss, anon_rss);
pte_unmap_unlock(pte - 1, ptl);
+
+ return addr;
}

-static inline void zap_pmd_range(struct mmu_gather *tlb,
+static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
struct vm_area_struct *vma, pud_t *pud,
unsigned long addr, unsigned long end,
- struct zap_details *details)
+ long *zap_work, struct zap_details *details)
{
pmd_t *pmd;
unsigned long next;
@@ -641,16 +648,21 @@ static inline void zap_pmd_range(struct
pmd = pmd_offset(pud, addr);
do {
next = pmd_addr_end(addr, end);
- if (pmd_none_or_clear_bad(pmd))
+ if (pmd_none_or_clear_bad(pmd)) {
+ (*zap_work)--;
continue;
- zap_pte_range(tlb, vma, pmd, addr, next, details);
- } while (pmd++, addr = next, addr != end);
+ }
+ next = zap_pte_range(tlb, vma, pmd, addr, next,
+ zap_work, details);
+ } while (pmd++, addr = next, (addr != end && *zap_work > 0));
+
+ return addr;
}

-static inline void zap_pud_range(struct mmu_gather *tlb,
+static inline unsigned long zap_pud_range(struct mmu_gather *tlb,
struct vm_area_struct *vma, pgd_t *pgd,
unsigned long addr, unsigned long end,
- struct zap_details *details)
+ long *zap_work, struct zap_details *details)
{
pud_t *pud;
unsigned long next;
@@ -658,15 +670,21 @@ static inline void zap_pud_range(struct
pud = pud_offset(pgd, addr);
do {
next = pud_addr_end(addr, end);
- if (pud_none_or_clear_bad(pud))
+ if (pud_none_or_clear_bad(pud)) {
+ (*zap_work)--;
continue;
- zap_pmd_range(tlb, vma, pud, addr, next, details);
- } while (pud++, addr = next, addr != end);
+ }
+ next = zap_pmd_range(tlb, vma, pud, addr, next,
+ zap_work, details);
+ } while (pud++, addr = next, (addr != end && *zap_work > 0));
+
+ return addr;
}

-static void unmap_page_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
+static unsigned long unmap_page_range(struct mmu_gather *tlb,
+ struct vm_area_struct *vma,
unsigned long addr, unsigned long end,
- struct zap_details *details)
+ long *zap_work, struct zap_details *details)
{
pgd_t *pgd;
unsigned long next;
@@ -679,11 +697,16 @@ static void unmap_page_range(struct mmu_
pgd = pgd_offset(vma->vm_mm, addr);
do {
next = pgd_addr_end(addr, end);
- if (pgd_none_or_clear_bad(pgd))
+ if (pgd_none_or_clear_bad(pgd)) {
+ (*zap_work)--;
continue;
- zap_pud_range(tlb, vma, pgd, addr, next, details);
- } while (pgd++, addr = next, addr != end);
+ }
+ next = zap_pud_range(tlb, vma, pgd, addr, next,
+ zap_work, details);
+ } while (pgd++, addr = next, (addr != end && *zap_work > 0));
tlb_end_vma(tlb, vma);
+
+ return addr;
}

#ifdef CONFIG_PREEMPT
@@ -724,7 +747,7 @@ unsigned long unmap_vmas(struct mmu_gath
unsigned long end_addr, unsigned long *nr_accounted,
struct zap_details *details)
{
- unsigned long zap_bytes = ZAP_BLOCK_SIZE;
+ long zap_work = ZAP_BLOCK_SIZE;
unsigned long tlb_start = 0; /* For tlb_finish_mmu */
int tlb_start_valid = 0;
unsigned long start = start_addr;
@@ -745,27 +768,25 @@ unsigned long unmap_vmas(struct mmu_gath
*nr_accounted += (end - start) >> PAGE_SHIFT;

while (start != end) {
- unsigned long block;
-
if (!tlb_start_valid) {
tlb_start = start;
tlb_start_valid = 1;
}

- if (is_vm_hugetlb_page(vma)) {
- block = end - start;
+ if (unlikely(is_vm_hugetlb_page(vma))) {
unmap_hugepage_range(vma, start, end);
- } else {
- block = min(zap_bytes, end - start);
- unmap_page_range(*tlbp, vma, start,
- start + block, details);
+ zap_work -= (end - start) /
+ (HPAGE_SIZE / PAGE_SIZE);
+ start = end;
+ } else
+ start = unmap_page_range(*tlbp, vma,
+ start, end, &zap_work, details);
+
+ if (zap_work > 0) {
+ BUG_ON(start != end);
+ break;
}

- start += block;
- zap_bytes -= block;
- if ((long)zap_bytes > 0)
- continue;
-
tlb_finish_mmu(*tlbp, tlb_start, start);

if (need_resched() ||
@@ -779,7 +800,7 @@ unsigned long unmap_vmas(struct mmu_gath

*tlbp = tlb_gather_mmu(vma->vm_mm, fullmm);
tlb_start_valid = 0;
- zap_bytes = ZAP_BLOCK_SIZE;
+ zap_work = ZAP_BLOCK_SIZE;
}
}
out:
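
To put rough numbers on the old behaviour (assuming the 8-page ZAP_BLOCK_SIZE
that CONFIG_PREEMPT selected and 4kB pages): skipping a single clear top-level
entry covering, say, 1GB of address space meant 1GB / 32kB = 32768 trips
through unmap_page_range(), each one re-checking the same clear entry just to
advance the working address. With the per-entry accounting above, that clear
entry is charged a single unit of zap_work and the walk moves straight past
it, which is where the difference between 1.69s per unmap and 1000 unmaps in
under 0.01s comes from.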

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/