2.4.15pre6aa1 (fixes google VM problem)

From: Andrea Arcangeli (andrea@suse.de)
Date: Sun Nov 18 2001 - 03:24:34 EST

Next message: Andrea Arcangeli: "Re: VM-related Oops: 2.4.15pre1"
Previous message: Roy S.C. Ho: "Re: Raw access to block devices"
Next in thread: Andrea Arcangeli: "Re: 2.4.15pre6aa1 (fixes google VM problem)"
Reply: Andrea Arcangeli: "Re: 2.4.15pre6aa1 (fixes google VM problem)"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

It would be interesting if people experiencing the VM problems
originally reported by google (but also trivially reproducible with
simple cache operations) could verify that this update fixes those
troubles. I wrote some documentation on the bug and the relevant fix in
the vm-14 section below. Thanks.

If all works right on Monday I will port the fix to mainline (it's
basically only a matter of extracting a few bits from the vm-14 patch;
it's not really controversial, but I haven't had much time to extract it
yet, and the reason it's not a self-contained patch in the first place
is the way it was written). Comments are welcome of course. I
don't think there's another way around it though: even if we
generated a logical swap cache that is not a function of the swap entry,
that still wouldn't solve the problem of mlocked highmem users [or very
frequently accessed ptes] in the lowmem zones. The lowmem ram wasted for
this purpose is very minor compared to the total waste of all the
highmem zones, and the algorithm I implemented adapts as a function of the
amount of highmem, so the lowmem waste is proportional to the potential
highmem waste. However the lower_zone_reserve defaults could be changed;
I chose the current defaults in a conservative manner.

URL:

ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.15pre6aa1.bz2
ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.15pre6aa1/

Only in 2.4.15pre1aa1: 00_lvm-1.0.1-rc4-3.bz2
Only in 2.4.15pre6aa1: 00_lvm-1.0.1-rc4-4.bz2

Rest of the rc4 diffs rediffed.

Only in 2.4.15pre1aa1: 00_rwsem-fair-23
Only in 2.4.15pre6aa1: 00_rwsem-fair-24
Only in 2.4.15pre1aa1: 00_rwsem-fair-23-recursive-4
Only in 2.4.15pre6aa1: 00_rwsem-fair-24-recursive-5

Rediffed.

Only in 2.4.15pre1aa1: 00_strnlen_user-x86-ret1-1

Merged in mainline.

Only in 2.4.15pre1aa1: 10_lvm-deadlock-fix-1

Now in mainline.

Only in 2.4.15pre1aa1: 10_lvm-incremental-1
Only in 2.4.15pre6aa1: 10_lvm-incremental-2

Part of it in mainline, rediffed the rest.

Only in 2.4.15pre1aa1: 10_vm-13
Only in 2.4.15pre6aa1: 10_vm-14

        This should be the first kernel out there without the google VM
        troubles (which affect more than just the google testcase). The
        broken piece of VM was this kind of loop in the allocator:

        for (;;) {
                zone_t *z = *(zone++);
                if (!z)
                        break;

                if (zone_free_pages(z, order) > z->pages_low) {
                        page = rmqueue(z, order);
                        if (page)
                                return page;
                }
        }

        and the above logic is present in all 2.4 kernels out there (2.3 as well).
        So the bug has nearly nothing to do with the memory balancing engine, as
        most of us would expect; it's instead an allocator zone balancing bug, in
        a piece of code that one would assume to be obviously correct.

        The problem comes from the fact that all of ZONE_NORMAL can be allocated
        to unfreeable highmem users (like anon pages when no swap is available).
        If that happens the machine runs out of memory no matter what (even if
        there are 63G of clean cache ready to be freed). Mainline deadlocks
        because of the infinite loop in the allocator; -aa was "correctly"
        just killing tasks as soon as the normal zone was filled with mlocked
        cache or anon pages with no swap.
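
        The starvation described above can be reproduced in a toy model.
        This is only a sketch under stated assumptions: struct old_zone and
        old_alloc_page() are hypothetical names, not kernel code, and each
        zone is reduced to a free page counter plus a single pages_low
        value that is checked the same way for every allocation class.

```c
#include <assert.h>

/* Toy model (hypothetical, not the kernel's zone_t): one pages_low per
 * zone, applied regardless of which classzone the request was aimed at. */
struct old_zone {
	unsigned long free_pages;
	unsigned long pages_low;
};

/*
 * zones[] is ordered like a fallback zonelist: highest zone first.
 * "first" is the index where the walk starts (the classzone).
 * Returns the index of the zone that supplied the page, or -1.
 */
static int old_alloc_page(struct old_zone *zones, int nr_zones, int first)
{
	int i;

	assert(first >= 0 && first < nr_zones);
	for (i = first; i < nr_zones; i++) {
		/* Same test as the loop quoted above: only pages_low counts. */
		if (zones[i].free_pages > zones[i].pages_low) {
			zones[i].free_pages--;
			return i;
		}
	}
	return -1;
}
```

        With zones[0] standing for highmem and zones[1] for normal,
        repeated requests starting at index 0 first drain highmem and then
        drain normal down to its pages_low. If those pages are unfreeable,
        a later request that starts at index 1 finds nothing left.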

        The fix is to have a per-classzone per-zone set of watermarks (see the
        zone->watermarks[class_idx] array). Seems to work fine here. Of course
        this means potentially wasting some memory when the highmem zone is
        huge, but there's no other way around it, and the potential waste of all
        the highmem memory is huge compared to a very small waste of the normal
        zone (it could be more fine-grained of course, for example we don't keep
        track of whether an allocation will generate a page freeable by the VM
        or not, but those are minor issues and not easily solvable anyway [we pin
        pages with a get_page and we certainly don't want to migrate pages
        across zones within get_page], and the core problem should now be fixed).
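
        A minimal sketch of how a per-classzone watermark changes the
        allocator walk, assuming a simplified three-zone layout. Only the
        watermarks[class_idx] idea comes from the patch description; struct
        toy_zone, toy_alloc_page() and the array layout are illustrative
        assumptions and not the code in vm-14.

```c
#include <assert.h>

/* Toy zone indices; a classzone index doubles as an index into watermarks[]. */
enum { ZONE_DMA, ZONE_NORMAL, ZONE_HIGHMEM, MAX_ZONES };

/* Hypothetical stand-in for zone_t. */
struct toy_zone {
	unsigned long free_pages;
	/* watermarks[class_idx]: pages that must stay free in this zone
	 * when the request was aimed at classzone class_idx. */
	unsigned long watermarks[MAX_ZONES];
};

/*
 * Walk from the classzone down to ZONE_DMA and take one page from the
 * first zone that is above the watermark for this allocation class.
 * Returns the index of the zone used, or -1 if every zone is too low.
 */
static int toy_alloc_page(struct toy_zone zones[MAX_ZONES], int class_idx)
{
	int i;

	assert(class_idx >= 0 && class_idx < MAX_ZONES);
	for (i = class_idx; i >= 0; i--) {
		struct toy_zone *z = &zones[i];

		/* The threshold depends on who is asking, not only on the zone. */
		if (z->free_pages > z->watermarks[class_idx]) {
			z->free_pages--;
			return i;
		}
	}
	return -1;
}
```

        If the normal zone carries a large watermarks[ZONE_HIGHMEM] and a
        small watermarks[ZONE_NORMAL], highmem-class requests stop falling
        back into it early, while normal-class requests can still use the
        pages held back.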

        Since the logic is generic and also applies to zone dma vs zone
        normal (not only zone normal vs zone highmem), this should be tested a
        bit on the lowmem boxes too (I took care of the lowmem boxes in
        theory, but I haven't tested them in practice).

        In short, we now reserve a part of the lower zones for the lower
        classzone allocations. The algorithm I wrote calculates the "reserved
        portion" as a function of the size of the higher zone (higher zone means
        the "zone" that matches the "classzone"). For example a 1G machine will
        reserve a very small part of the zone_normal. A 64G machine is instead
        going to reserve all the 800mbyte of zone normal for allocations from
        the normal classzone (this is fine because it would be a total
        waste if a 64G machine risked running OOM because the zone normal
        is all occupied by unfreeable highmem users that would much better stay
        in the highmem zone). The ratio between higher zone size and
        reserved lower zone size is selectable via a boot option ala memfrac=
        (the new option is called lower_zone_reserve=). Default values should
        work well (as usual they don't need to be perfect, but they can be
        changed if you have suggestions); the boot option is there just in case.
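
        The sizing rule above can be sketched as a single division. This is
        an illustration under assumptions, not the vm-14 code:
        toy_lower_zone_reserve() is a hypothetical helper, the cap at the
        lower zone's size is inferred from the 64G example, and the ratio
        passed in is whatever the caller picks (the real defaults are not
        stated here).

```c
#include <assert.h>

/*
 * Pages of a lower zone held back from allocations aimed at a higher
 * classzone.
 *   lower_pages:  size of the lower zone, in pages
 *   higher_pages: size of the zone matching the classzone, in pages
 *   ratio:        higher zone size : reserved lower zone size
 * The reserve can never exceed the lower zone itself.
 */
static unsigned long toy_lower_zone_reserve(unsigned long lower_pages,
					    unsigned long higher_pages,
					    unsigned long ratio)
{
	unsigned long reserve;

	assert(ratio > 0);
	reserve = higher_pages / ratio;
	return reserve < lower_pages ? reserve : lower_pages;
}
```

        Assuming 4K pages, an 800mbyte normal zone (204800 pages) and an
        illustrative ratio of 32: a box with 128mbyte of highmem (32768
        pages) would reserve 1024 pages, about 4mbyte, while a box with
        63G of highmem (16515072 pages) would hit the cap and reserve the
        whole normal zone.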

Only in 2.4.15pre6aa1: 10_vm-14-no-anonlru-1
Only in 2.4.15pre6aa1: 10_vm-14-no-anonlru-1-simple-cache-1

        Backed out the anon pages from the lru again, mainly to avoid
        swapping out too easily and because this is going to be tested on the
        big boxes with no swap at all anyway.

Only in 2.4.15pre1aa1: 50_uml-patch-2.4.13-5.bz2
Only in 2.4.15pre6aa1: 50_uml-patch-2.4.14-2.bz2

Jeff's latest uml update.

Only in 2.4.15pre1aa1: 60_tux-2.4.13-ac5-B0.bz2
Only in 2.4.15pre6aa1: 60_tux-2.4.13-ac5-B1.bz2

Ingo's latest tux update.

Andrea
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

This archive was generated by hypermail 2b29 : Fri Nov 23 2001 - 21:00:17 EST