Re: Avoiding external fragmentation with a placement policy Version 12

From: Martin J. Bligh
Date: Wed Jun 01 2005 - 19:09:49 EST


>>> It adds a lot of complexity to the page allocator and while
>>> it might be very good, the only improvement we've been shown
>>> yet is allocating lots of MAX_ORDER allocations I think? (ie.
>>> not very useful)
>>
>>
>> I agree that MAX_ORDER allocs aren't interesting, but we can hit
>> frag problems easily at way less than max order. CIFS does it, NFS
>> does it, jumbo frame gigabit ethernet does it, to name a few. The
>> most common failure I see is order 3.
>>
>
> Still? We had a lot of problems with kswapd not doing its
> job properly, and min_free_kbytes reserve was buggy...
>
> But if you still trigger it, I would be interested to see
> traces. I don't frequently test things like XFS, or heavy
> gige+jumbo loads.

It gets very messy when CIFS requires a large buffer to write back
to disk in order to free memory ...

[c0000000af5e6590] [c00000000008f780] .__alloc_pages+0x3a4/0x40c (unreliable)
[c0000000af5e6670] [c0000000000afae8] .alloc_pages_current+0xac/0xd0
[c0000000af5e6700] [c00000000008f808] .__get_free_pages+0x20/0x98
[c0000000af5e6780] [c000000000094594] .kmem_getpages+0x48/0x200
[c0000000af5e6800] [c0000000000959cc] .cache_grow+0xf0/0x1f0
[c0000000af5e68b0] [c000000000095d4c] .cache_alloc_refill+0x280/0x2fc
[c0000000af5e6960] [c000000000096114] .kmem_cache_alloc+0x9c/0xc0
[c0000000af5e69f0] [c00000000008db54] .mempool_alloc_slab+0x1c/0x30
[c0000000af5e6a70] [c00000000008d96c] .mempool_alloc+0x154/0x234
[c0000000af5e6b80] [d0000000004f1ee0] .cifs_buf_get+0x28/0x74 [cifs]
[c0000000af5e6c00] [d0000000004d77d0] .smb_init+0x358/0x3c4 [cifs]
[c0000000af5e6d20] [d0000000004d90a8] .CIFSSMBWrite+0x7c/0x34c [cifs]
[c0000000af5e6e00] [d0000000004ec394] .cifs_write+0x204/0x374 [cifs]
[c0000000af5e6ef0] [d0000000004ec6d0] .cifs_partialpagewrite+0x1cc/0x2b8 [cifs]
[c0000000af5e6fc0] [d0000000004ec87c] .cifs_writepage+0xc0/0x148 [cifs]
[c0000000af5e7050] [c000000000099834] .pageout+0x138/0x1c4
[c0000000af5e7130] [c000000000099b90] .shrink_list+0x2d0/0x608
[c0000000af5e7280] [c00000000009a24c] .shrink_cache+0x384/0x610
[c0000000af5e73c0] [c00000000009ad1c] .shrink_zone+0x104/0x140
[c0000000af5e7460] [c00000000009ade0] .shrink_caches+0x88/0xac
[c0000000af5e74f0] [c00000000009af54] .try_to_free_pages+0x10c/0x280
[c0000000af5e75f0] [c00000000008f660] .__alloc_pages+0x284/0x40c
[c0000000af5e76d0] [c0000000000afae8] .alloc_pages_current+0xac/0xd0
[c0000000af5e7760] [c000000000093b30] .do_page_cache_readahead+0x12c/0x210
[c0000000af5e7840] [c000000000093e38] .page_cache_readahead+0x224/0x280
[c0000000af5e78d0] [c00000000008a664] .do_generic_mapping_read+0x118/0x470
[c0000000af5e7a30] [c00000000008adc0] .__generic_file_aio_read+0x1c0/0x208
[c0000000af5e7b00] [c00000000008ae4c] .generic_file_aio_read+0x44/0x54
[c0000000af5e7b90] [c0000000000b6dd4] .do_sync_read+0xb8/0xfc
[c0000000af5e7cf0] [c0000000000b6f60] .vfs_read+0x148/0x1ac
[c0000000af5e7d90] [c0000000000b72b8] .sys_read+0x4c/0x8c
[c0000000af5e7e30] [c000000000011180] syscall_exit+0x0/0x18

There's one example ... we can probably work around it if we try hard
enough. However, the fundamental question becomes "do we support higher
order allocs, or not?". If not fine ... but we ought to quit pretending
we do. If so, then we need to make them more reliable.

>> Keep a machine up for a while, get it thoroughly fragmented, then
>> push it reasonably hard constant pressure, and try allocating anything
>> large.
>>
>> Seems to me we're basically pointing a blunderbuss at memory, and
>> blowing away large portions, and *hoping* something falls out the
>> bottom that's a big enough chunk?
>
> Yeah more or less. But with the fragmentation patch, it by
> no means becomes an exact science ;) I wouldn't have thought
> it would make it hugely easier to free an order 2 or 3 area
> memory block on a loaded machine.

Ummm. so the blunderbuss is an exact science? ;-) At least it fairly
consistently doesn't work, I suppose ;-) ;-)

> It does make MAX_ORDER allocations _possible_ when previously
> they wouldn't have been, simply by virtue of trying to put all
> memory that it knows is reclaimable in a MAX_ORDER area. When
> memory fills up and you need an order 3 allocation, you're
> more or less in the same boat AFAIKS.

If we could target specific "clustered blobs" of pages, we can stand
a hope of getting some big chunks back. I think the intent is to
separate out the reclaimable from the non-reclaimable, to some extent
at least ... give us much better odds.

> Why not just have kernel allocations going from the bottom
> up, and user allocations going from the top down. That would
> get you most of the way there, wouldn't it? (disclaimer: I
> could well be talking shit here).

Not sure it's quite that simple, though I haven't looked in detail
at these patches. My point was merely that we need to do *something*.
Off the top of my head ... what happens when kernel meets user in
the middle. where do we free and allocate from now ? ;-) Once we've
been up for a while, mem is nearly all used, nearly all of the time.

Is a good discussion to have though ;-)

M.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/