Pathetic Imitation of a Zone Allocator

Kevin Buhr (buhr@stat.wisc.edu)
20 Jul 1998 11:17:24 -0500


Well, here's some actual code I put together this weekend (before I
even read Linus's demand for real code, so it must have been some sort
of premonition).

I've modified the current buddy allocator to make it a bit more like a
zone allocator (at the risk of creating an allocator nobody will
like).

The result is a kind of multilegged mutant: a unified buddy allocator
from the waist up and a collection of parallel, per-zone buddy
allocators from the waist down. The "waist" here is an arbitrarily
chosen page order (currently the top order, 128k). See below for a
little more detail on the design.

The nice things about this mutant are that it acts enough like a zone
allocator that it should help things on the fragmentation front, yet
its implementation is so similar to the old buddy allocator that, if
we're going to release 2.2.0 with a new allocator, it seems safer to
release it with *this* than with a completely retooled and relatively
untested zone allocator. In short, I see this as a kind of stop-gap
measure for 2.2.x until we get an industrial-strength alternative
working and debugged in 2.3.x.

For allocations that can be satisfied by a per-zone allocator, the
critical code path is almost the same as under the current scheme. Of
course, we pay a cost for allocations that must fall back to the
unified allocator, and we pay a cost if, when the pages are freed,
gluing of buddies requires returning pages to the unified queue.

The patch, against 2.1.106, isn't huge, so I've included it (and a
one-page design document) below.

Any comments?

Kevin <buhr@stat.wisc.edu>

* * *

Piza -- Pathetic Imitation of a Zone Allocator
==============================================

Piza is a modified version of the existing 2.1.x buddy system that may
give some of the benefits of a true zone allocator, particularly with
respect to memory fragmentation.

It currently supports two zones. By convention, one zone (the
"fragmentation zone") is used for memory objects of a page or less
that can fragment badly, including certain slab caches and the inode
and dentry caches. The other zone (the "main zone") is used for (a)
memory objects of 2 or more (consecutive) pages; and (b) memory
objects, like the page and buffer caches, that can be easily freed.
(NOTE: In the current implementation, only allocations for slab caches
of order 0 go into the fragmentation zone; everything else goes into
the main zone. There are also two test zones that aren't used for
anything.)
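
A caller selects a zone with the new gfp bits added to mm.h (see the
patch); leaving them clear means the main zone, so unmodified callers
are unaffected. The slab change, for instance, boils down to this
(the actual hunk is at the end of the patch):

    /* single-page slabs are assumed to fragment badly, so take them
       from the fragmentation zone; everything else defaults to the
       main zone, since __GFP_ZONEMAIN is 0 */
    if (!cachep->c_gfporder)
            flags |= GFP_FRAG;
    addr = (void *) __get_free_pages(flags, cachep->c_gfporder);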

Above a certain page order called the "zone order" (currently 128k),
piza has a single, unified buddy allocator. Below this page order,
piza operates multiple per-zone buddy allocators in parallel.
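
For concreteness: with 4k pages and the ZONE_ORDER of 5 used in the
patch, the zone order works out to PAGE_SIZE << ZONE_ORDER = 4k * 32 =
128k, and the free lists are split up like this (declarations as in
the patch):

    #define NR_ZONES   4
    #define ZONE_ORDER 5    /* the "waist"; must be < NR_MEM_LISTS */

    /* orders ZONE_ORDER..NR_MEM_LISTS-1: one shared set of free lists */
    static struct free_area_struct unified_free_area[NR_MEM_LISTS-ZONE_ORDER];

    /* orders 0..ZONE_ORDER-1: a private set of free lists per zone */
    static struct free_area_struct zone_free_area[NR_ZONES][ZONE_ORDER];

    /* records which zone currently owns each 128k-aligned chunk */
    static unsigned char * zone_map;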

"get_free_pages"
----------------

An allocation request larger than the zone order is always satisfied
by the unified allocator. When a smaller allocation is requested for
pages from a specified zone, the zone allocator for that zone is used.
If the zone allocator can satisfy the request, the operation is nearly
as efficient as under the stock kernel allocator.

However, if during the queue search we exceed the zone order, then the
unified allocator is used to satisfy the request. If this succeeds,
the pages taken from the unified allocator become the "property" of a
particular zone allocator, and it will use them to satisfy future
memory requests.

If both the zone allocator and the unified allocator fail to satisfy
the request, things are pretty desperate. We fall back on checking
each of the other zone allocators in turn. If even that fails, the
request would have failed under the stock kernel allocator anyway, so
we can give up. (NOTE: the current implementation doesn't do this
fall-back.)
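
Putting the pieces together, the search order is roughly the
following. This is only a user-space toy sketch -- per-list free
counts stand in for the real queues and bitmaps, and toy_alloc,
zone_free, and unified_free are names made up for the sketch -- the
real thing is the RMQUEUE macro and the two EXPAND macros in the
patch:

    #define NR_MEM_LISTS 6      /* as in the stock allocator */
    #define ZONE_ORDER   5      /* the "waist" */
    #define NR_ZONES     4

    /* toy model: how many free blocks each list currently holds */
    static int zone_free[NR_ZONES][ZONE_ORDER];
    static int unified_free[NR_MEM_LISTS - ZONE_ORDER];

    /* carve a block of (1 << order) pages on behalf of "zone";
       returns 0 on success, -1 on failure */
    static int toy_alloc(int zone, int order)
    {
            int o;

            /* 1: this zone's own lists, smallest usable order first */
            for (o = order; o < ZONE_ORDER; o++)
                    if (zone_free[zone][o])
                            goto found_in_zone;

            /* 2: the unified lists (only orders >= ZONE_ORDER live there) */
            for (o = order > ZONE_ORDER ? order : ZONE_ORDER;
                 o < NR_MEM_LISTS; o++)
                    if (unified_free[o - ZONE_ORDER])
                            goto found_unified;

            /* 3: scanning the other zones' lists would go here
                  (not implemented, as noted above) */
            return -1;

    found_in_zone:
            zone_free[zone][o]--;
            while (o > order)               /* split; the leftover buddies */
                    zone_free[zone][--o]++; /* stay on this zone's lists   */
            return 0;

    found_unified:
            unified_free[o - ZONE_ORDER]--;
            /* split: buddies at or above the waist go back to the unified
               pool, buddies below it become this zone's property (the real
               code records that in zone_map) */
            while (o > order) {
                    --o;
                    if (o >= ZONE_ORDER)
                            unified_free[o - ZONE_ORDER]++;
                    else
                            zone_free[zone][o]++;
            }
            return 0;
    }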

"free_pages"
------------

When a page is freed, its zone is determined automatically, and it can
be returned to the zone allocator that allocated it. If, as the buddy
system glues together "buddies", we exceed the zone order, the current
page group becomes the "property" of the unified allocator again, and
it continues the process of gluing its buddies together.
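
In code terms, the only new decision on the free path is which set of
lists a coalesced block should end up on. Using the declarations
quoted above, the rule is roughly this (home_area is just an
illustration; the real bookkeeping lives in free_pages_ok in the
patch):

    /* a free block of (1 << order) pages ends up on... */
    static struct free_area_struct *
    home_area(unsigned long map_nr, int order)
    {
            if (order >= ZONE_ORDER)    /* a full 128k chunk (or bigger) */
                    return unified_free_area + (order - ZONE_ORDER);
            /* anything smaller stays with whichever zone owns its chunk */
            return zone_free_area[zone_map[map_nr >> ZONE_ORDER]] + order;
    }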

"free_memory_available"
-----------------------

Piza indicates that sufficient memory is free only if, for every
zone, any reasonable page allocation request could be satisfied from
that zone's allocator or the unified allocator, without resorting to
scanning the other zone allocators. This forces "kswapd" to work to
free up pages in, say, the fragmentation zone, even when the main
zone has plenty of free pages that could satisfy fragmentation-zone
allocation requests.
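
In other words, the check is per-zone. A rough sketch of the policy,
where zone_can_satisfy() and unified_can_satisfy() are hypothetical
helpers standing in for the list walks in the patch's
free_memory_available():

    #define NR_ZONES 4

    extern int zone_can_satisfy(int zone, int nr);  /* hypothetical */
    extern int unified_can_satisfy(int nr);         /* hypothetical */

    static int piza_memory_ok(int nr)
    {
            int zone;

            for (zone = 0; zone < NR_ZONES; zone++)
                    if (!zone_can_satisfy(zone, nr) &&
                        !unified_can_satisfy(nr))
                            return 0;  /* kswapd must free pages here */
            return 1;  /* every zone covered without raiding the others */
    }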

* * *

diff -u -r linux-raspberry/mm/page_alloc.c linux-blackberry/mm/page_alloc.c
--- linux-raspberry/mm/page_alloc.c Fri May 8 01:44:25 1998
+++ linux-blackberry/mm/page_alloc.c Mon Jul 20 10:03:42 1998
@@ -3,6 +3,7 @@
*
* Copyright (C) 1991, 1992, 1993, 1994 Linus Torvalds
* Swap reorganised 29.12.95, Stephen Tweedie
+ * Pathetic Imitation of a Zone Allocator 20/7/98, Kevin Buhr <buhr@stat.wisc.edu>
*/

#include <linux/config.h>
@@ -46,6 +47,9 @@
#define NR_MEM_LISTS 6
#endif

+#define NR_ZONES 4
+#define ZONE_ORDER 5 /* must be < NR_MEM_LISTS */
+
/* The start of this MUST match the start of "struct page" */
struct free_area_struct {
struct page *next;
@@ -55,7 +59,10 @@

#define memory_head(x) ((struct page *)(x))

-static struct free_area_struct free_area[NR_MEM_LISTS];
+static struct free_area_struct unified_free_area[NR_MEM_LISTS-ZONE_ORDER];
+static struct free_area_struct zone_free_area[NR_ZONES][ZONE_ORDER];
+/* BOOKMARK -- pack this, if convenient */
+static unsigned char * zone_map;

static inline void init_mem_queue(struct free_area_struct * head)
{
@@ -114,6 +121,7 @@
int retval = 0;
unsigned long flags;
struct free_area_struct * list;
+ int nr2, zone;

/*
* If we have more than about 3% to 5% of all memory free,
@@ -126,33 +134,61 @@
if (nr_free_pages > (nr ? freepages.low : freepages.high))
return nr+1;

- list = free_area + NR_MEM_LISTS;
+ list = unified_free_area + NR_MEM_LISTS - ZONE_ORDER;
spin_lock_irqsave(&page_alloc_lock, flags);
+
/* We fall through the loop if the list contains one
* item. -- thanks to Colin Plumb <colin@nyx.net>
*/
do {
list--;
/* Empty list? Bad - we need more memory */
if (list->next == memory_head(list))
- break;
- /* One item on the list? Look further */
- if (list->next->next == memory_head(list))
- continue;
+ goto done;
/* More than one item? We're ok */
- retval = nr + 1;
- break;
- } while (--nr >= 0);
+ if (list->next->next != memory_head(list)) {
+ retval = nr + 1;
+ goto done;
+ }
+ /* One item on the list? Look further */
+ } while (--nr >= 0 && list > unified_free_area);
+
+ retval = nr + 1;
+ for (zone = 0; zone < NR_ZONES; ++zone) {
+ list = zone_free_area[zone] + ZONE_ORDER;
+ nr2 = nr;
+ while (nr2 >= 0) {
+ --list;
+ /* as above */
+ if (list->next == memory_head(list)) {
+ retval = 0;
+ goto done;
+ }
+ if (list->next->next != memory_head(list)) {
+ /* return minimum value */
+ if (nr2 + 1 < retval)
+ retval = nr2 + 1;
+ break; /* check next zone */
+ }
+ --nr2;
+ }
+ /* we fell through -- insufficient memory in this zone */
+ retval = 0;
+ goto done;
+ }
+
+ done:
spin_unlock_irqrestore(&page_alloc_lock, flags);
return retval;
}

static inline void free_pages_ok(unsigned long map_nr, unsigned long order)
{
- struct free_area_struct *area = free_area + order;
+ struct free_area_struct *area;
unsigned long index = map_nr >> (1 + order);
unsigned long mask = (~0UL) << order;
unsigned long flags;
+ int zone;

spin_lock_irqsave(&page_alloc_lock, flags);

@@ -160,6 +196,23 @@

map_nr &= mask;
nr_free_pages -= mask;
+ if (order < ZONE_ORDER) {
+ zone = zone_map[map_nr >> ZONE_ORDER];
+ if (zone < 0 || zone >= NR_ZONES)
+ panic ("free_pages_ok: map_nr=%lu zone=%d", map_nr, zone);
+ area = zone_free_area[zone] + order;
+ do {
+ if (!test_and_change_bit(index, area->map))
+ goto add_to_queue;
+ remove_mem_queue(list(map_nr ^ -mask));
+ mask <<= 1;
+ area++;
+ index >>= 1;
+ map_nr &= mask;
+ } while (mask + (1 << ZONE_ORDER));
+ order = ZONE_ORDER;
+ }
+ area = unified_free_area + (order - ZONE_ORDER);
while (mask + (1 << (NR_MEM_LISTS-1))) {
if (!test_and_change_bit(index, area->map))
break;
@@ -169,6 +222,7 @@
index >>= 1;
map_nr &= mask;
}
+ add_to_queue:
add_mem_queue(area, list(map_nr));

#undef list
@@ -214,30 +268,55 @@
change_bit((index) >> (1+(order)), (area)->map)
#define CAN_DMA(x) (PageDMA(x))
#define ADDRESS(x) (PAGE_OFFSET + ((x) << PAGE_SHIFT))
-#define RMQUEUE(order, maxorder, dma) \
-do { struct free_area_struct * area = free_area+order; \
- unsigned long new_order = order; \
- do { struct page *prev = memory_head(area), *ret = prev->next; \
- while (memory_head(area) != ret) { \
- if (new_order >= maxorder && ret->next == prev) \
- break; \
- if (!dma || CAN_DMA(ret)) { \
- unsigned long map_nr = ret->map_nr; \
- (prev->next = ret->next)->prev = prev; \
- MARK_USED(map_nr, new_order, area); \
- nr_free_pages -= 1 << order; \
- EXPAND(ret, map_nr, order, new_order, area); \
- spin_unlock_irqrestore(&page_alloc_lock, flags); \
- return ADDRESS(map_nr); \
- } \
- prev = ret; \
- ret = ret->next; \
- } \
- new_order++; area++; \
- } while (new_order < NR_MEM_LISTS); \
+#define RMQUEUE(zone, order, maxorder, dma) \
+do { \
+ struct free_area_struct * area; \
+ unsigned long new_order = order; \
+ if (order < ZONE_ORDER) { \
+ area = zone_free_area[zone] + order; \
+ do { \
+ struct page *prev = memory_head(area), *ret = prev->next; \
+ while (memory_head(area) != ret) { \
+ if (new_order >= maxorder && ret->next == prev) \
+ break; \
+ if (!dma || CAN_DMA(ret)) { \
+ unsigned long map_nr = ret->map_nr; \
+ (prev->next = ret->next)->prev = prev; \
+ MARK_USED(map_nr, new_order, area); \
+ nr_free_pages -= 1 << order; \
+ EXPAND_INSIDE_ZONE(ret, map_nr, order, new_order, area); \
+ spin_unlock_irqrestore(&page_alloc_lock, flags); \
+ return ADDRESS(map_nr); \
+ } \
+ prev = ret; \
+ ret = ret->next; \
+ } \
+ new_order++; area++; \
+ } while (new_order < ZONE_ORDER); \
+ } \
+ area = unified_free_area + (new_order - ZONE_ORDER); \
+ do { \
+ struct page *prev = memory_head(area), *ret = prev->next; \
+ while (memory_head(area) != ret) { \
+ if (new_order >= maxorder && ret->next == prev) \
+ break; \
+ if (!dma || CAN_DMA(ret)) { \
+ unsigned long map_nr = ret->map_nr; \
+ (prev->next = ret->next)->prev = prev; \
+ MARK_USED(map_nr, new_order, area); \
+ nr_free_pages -= 1 << order; \
+ EXPAND_OUTSIDE_ZONE(zone, ret, map_nr, order, new_order, area); \
+ spin_unlock_irqrestore(&page_alloc_lock, flags); \
+ return ADDRESS(map_nr); \
+ } \
+ prev = ret; \
+ ret = ret->next; \
+ } \
+ new_order++; area++; \
+ } while (new_order < NR_MEM_LISTS); \
} while (0)

-#define EXPAND(map,index,low,high,area) \
+#define EXPAND_INSIDE_ZONE(map,index,low,high,area) \
do { unsigned long size = 1 << high; \
while (high > low) { \
area--; high--; size >>= 1; \
@@ -250,9 +329,30 @@
map->age = PAGE_INITIAL_AGE; \
} while (0)

+/* Note: high is >= ZONE_ORDER, but low could be anything <= high */
+#define EXPAND_OUTSIDE_ZONE(zone,map,index,low,high,area) \
+do { \
+ unsigned long size = 1 << high; \
+ while (high > low) { \
+ area--; high--; size >>=1; \
+ if (high == ZONE_ORDER - 1) { \
+ /* at unified/zone allocator boundary */ \
+ zone_map[map->map_nr >> ZONE_ORDER] = zone; \
+ area = zone_free_area[zone] + high; \
+ } \
+ add_mem_queue(area, map); \
+ MARK_USED(index, high, area); \
+ index += size; \
+ map += size; \
+ } \
+ atomic_set(&map->count, 1); \
+ map->age = PAGE_INITIAL_AGE; \
+} while (0)
+
unsigned long __get_free_pages(int gfp_mask, unsigned long order)
{
unsigned long flags, maxorder;
+ int zone;

if (order >= NR_MEM_LISTS)
goto nopage;
@@ -275,8 +375,10 @@
}

for (;;) {
+ /* BOOKMARK -- look at signed/unsigned shifts */
+ zone = (gfp_mask & __GFP_ZONEMASK) >> __GFP_ZONESHIFT;
spin_lock_irqsave(&page_alloc_lock, flags);
- RMQUEUE(order, maxorder, (gfp_mask & GFP_DMA));
+ RMQUEUE(zone, order, maxorder, (gfp_mask & GFP_DMA));
spin_unlock_irqrestore(&page_alloc_lock, flags);
if (!(gfp_mask & __GFP_WAIT))
break;
@@ -284,6 +386,7 @@
break;
gfp_mask &= ~__GFP_WAIT; /* go through this only once */
maxorder = NR_MEM_LISTS; /* Allow anything this time */
+ /* NOTE: should also try other zones in an emergency */
}
nopage:
return 0;
@@ -294,24 +397,41 @@
* We also calculate the percentage fragmentation. We do this by counting the
* memory on each free list with the exception of the first item on the list.
*/
+
+static unsigned long show_free_chunks(struct free_area_struct * area, unsigned long order)
+{
+ struct page * tmp;
+ unsigned long nr = 0;
+ for (tmp = area->next ; tmp != memory_head(area) ; tmp = tmp->next) {
+ nr ++;
+ }
+ printk("%lu*%lukB ", nr, (unsigned long)((PAGE_SIZE>>10) << order));
+ return nr * ((PAGE_SIZE>>10) << order);
+}
+
void show_free_areas(void)
{
unsigned long order, flags;
- unsigned long total = 0;
+ unsigned long total;
+ int zone;

- printk("Free pages: %6dkB\n ( ",nr_free_pages<<(PAGE_SHIFT-10));
+ printk("Free pages: %6dkB\n",nr_free_pages<<(PAGE_SHIFT-10));
spin_lock_irqsave(&page_alloc_lock, flags);
- for (order=0 ; order < NR_MEM_LISTS; order++) {
- struct page * tmp;
- unsigned long nr = 0;
- for (tmp = free_area[order].next ; tmp != memory_head(free_area+order) ; tmp = tmp->next) {
- nr ++;
+ for (zone=0; zone < NR_ZONES; ++zone) {
+ printk("Zone %d: ", zone);
+ total = 0;
+ for (order=0; order < ZONE_ORDER; order++) {
+ total += show_free_chunks(zone_free_area[zone] + order, order);
}
- total += nr * ((PAGE_SIZE>>10) << order);
- printk("%lu*%lukB ", nr, (unsigned long)((PAGE_SIZE>>10) << order));
+ printk(" = %lukB\n", total);
+ }
+ printk("Unified: ");
+ total = 0;
+ for (order=ZONE_ORDER; order < NR_MEM_LISTS; order++) {
+ total += show_free_chunks(unified_free_area + order - ZONE_ORDER, order);
}
+ printk(" = %lukB\n", total);
spin_unlock_irqrestore(&page_alloc_lock, flags);
- printk("= %lukB)\n", total);
#ifdef SWAP_CACHE_INFO
show_swap_cache_info();
#endif
@@ -328,8 +448,9 @@
__initfunc(unsigned long free_area_init(unsigned long start_mem, unsigned long end_mem))
{
mem_map_t * p;
- unsigned long mask = PAGE_MASK;
- int i;
+ unsigned long mask;
+ int i, zone;
+ unsigned long size;

/*
* Select nr of pages we try to keep free for important stuff
@@ -356,17 +477,34 @@
p->map_nr = p - mem_map;
} while (p > mem_map);

- for (i = 0 ; i < NR_MEM_LISTS ; i++) {
+ /* BOOKMARK -- clean this crap up */
+ zone_map = (unsigned char *) start_mem;
+ mask = PAGE_MASK << ZONE_ORDER;
+ size = (((end_mem + ~mask) & mask) - PAGE_OFFSET) >> (PAGE_SHIFT + ZONE_ORDER);
+ memset((void *) zone_map, __GFP_ZONEMAIN >> __GFP_ZONESHIFT, size);
+ start_mem = LONG_ALIGN(start_mem + size);
+
+ mask = PAGE_MASK;
+ for (i = 0; i < NR_MEM_LISTS; i++) {
unsigned long bitmap_size;
- init_mem_queue(free_area+i);
mask += mask;
end_mem = (end_mem + ~mask) & mask;
bitmap_size = (end_mem - PAGE_OFFSET) >> (PAGE_SHIFT + i);
bitmap_size = (bitmap_size + 7) >> 3;
bitmap_size = LONG_ALIGN(bitmap_size);
- free_area[i].map = (unsigned int *) start_mem;
- memset((void *) start_mem, 0, bitmap_size);
- start_mem += bitmap_size;
+ if (i < ZONE_ORDER) {
+ for (zone = 0; zone < NR_ZONES; zone++) {
+ init_mem_queue(zone_free_area[zone]+i);
+ zone_free_area[zone][i].map = (unsigned int *) start_mem;
+ memset((void *) start_mem, 0, bitmap_size);
+ start_mem += bitmap_size;
+ }
+ } else {
+ init_mem_queue(unified_free_area+i-ZONE_ORDER);
+ unified_free_area[i-ZONE_ORDER].map = (unsigned int *) start_mem;
+ memset((void *) start_mem, 0, bitmap_size);
+ start_mem += bitmap_size;
+ }
}
return start_mem;
}
diff -u -r linux-raspberry/mm/slab.c linux-blackberry/mm/slab.c
--- linux-raspberry/mm/slab.c Sun Jun 21 15:07:07 1998
+++ linux-blackberry/mm/slab.c Mon Jul 20 10:06:46 1998
@@ -502,6 +502,8 @@
void *addr;

*dma = flags & SLAB_DMA;
+ /* assume single-page slabs will fragment badly */
+ if (!cachep->c_gfporder) flags |= GFP_FRAG;
addr = (void*) __get_free_pages(flags, cachep->c_gfporder);
/* Assume that now we have the pages no one else can legally
* messes with the 'struct page's.
--- linux-old-vanilla/include/linux/mm.h Sat Jun 20 20:38:26 1998
+++ linux-blackberry/include/linux/mm.h Thu Jul 16 13:07:15 1998
@@ -325,6 +325,14 @@
#define __GFP_MED 0x04
#define __GFP_HIGH 0x08

+/* piza zones -- see mm/page_alloc.c */
+#define __GFP_ZONEMASK 0x30
+#define __GFP_ZONESHIFT 4
+#define __GFP_ZONEMAIN 0x00 /* main zone */
+#define __GFP_ZONEFRAG 0x10 /* fragmentation zone */
+#define __GFP_ZONE2 0x20 /* test zones */
+#define __GFP_ZONE3 0x30 /* test zones */
+
#define __GFP_DMA 0x80

#define GFP_BUFFER (__GFP_LOW | __GFP_WAIT)
@@ -332,6 +340,11 @@
#define GFP_USER (__GFP_LOW | __GFP_WAIT | __GFP_IO)
#define GFP_KERNEL (__GFP_LOW | __GFP_WAIT | __GFP_IO)
#define GFP_NFS (__GFP_MED | __GFP_WAIT | __GFP_IO)
+
+/* indicates an allocation from the fragmentation zone, for allocations
+ with the potential to fragment badly */
+
+#define GFP_FRAG __GFP_ZONEFRAG

/* Flag - indicates that the buffer will be suitable for DMA. Ignored on some
platforms, used as appropriate on others */
