[patch] arbitrary size memory allocator, memarea-2.4.15-D6

From: Ingo Molnar (mingo@elte.hu)
Date: Mon Nov 12 2001 - 11:59:00 EST


in the past couple of years the buddy allocator has started to show
limitations that are hurting performance and flexibility.

eg. one of the main reasons why we keep MAX_ORDER at an almost obscenely
high level is the fact that we occasionally have to allocate big,
physically contiguous memory areas. We do not realistically expect to be
able to allocate such high-order pages after bootup, yet every page
allocation still carries the cost of it. And even with MAX_ORDER at 10,
large-RAM boxes have hit this limit and are hurting visibly - as
witnessed by Anton. Falling back to vmalloc() is not a high-quality
option, due to the TLB-miss overhead.

If we had an allocator that could handle large, rare but
performance-insensitive allocations, then we could decrease MAX_ORDER back
to 5 or 6, which would result in a smaller cache footprint and faster
operation of the page allocator.

the attached memarea-2.4.15-D6 patch does just this: it implements a new
'memarea' allocator which uses the buddy allocator data structures without
impacting buddy allocator performance. It has two main entry points:

        struct page * alloc_memarea(unsigned int gfp_mask, unsigned int pages);
        void free_memarea(struct page *area, unsigned int pages);
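
(to illustrate the intended usage - this is just a sketch against the two
prototypes above, not code from the patch; it assumes the area comes from
a lowmem zone, so that page_address() yields a directly usable kernel
address:)

        /*
         * sketch: allocate ~5 MB of physically contiguous memory, use it
         * as one flat buffer, then release it. The size does not have to
         * be a power of 2.
         */
        static int example_use_memarea(void)
        {
                unsigned int pages = (5 * 1024 * 1024) >> PAGE_SHIFT;
                struct page *area;
                void *buf;

                area = alloc_memarea(GFP_KERNEL, pages);
                if (!area)
                        return -ENOMEM;

                buf = page_address(area);
                memset(buf, 0, (unsigned long) pages << PAGE_SHIFT);

                free_memarea(area, pages);
                return 0;
        }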

the main properties of the memarea allocator are:

 - to be an 'unlimited size' allocator: it will find and allocate 100 GB
   of physically contiguous memory if that much RAM is available.

 - no alignment or size limitations either: the size does not have to be
   a power of 2 like for the buddy allocator, and the alignment will be
   whatever constellation the allocator happens to find. This property
   ensures that if there is a sufficiently sized, physically contiguous
   piece of RAM available, the allocator will find it. The buddy
   allocator can only hand out power-of-2 sized, power-of-2 aligned
   blocks of pages.

 - no impact on the performance of the page allocator. (The only (very
   small) change is the use of list_del_init() instead of list_del() when
   allocating pages. This is insignificant, as the extra initialization
   is two assembly instructions touching an already present and dirty
   cacheline.)
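
(for reference, the only difference between the two list primitives is
the re-initialization of the just unlinked entry - those two pointer
stores are the 'two assembly instructions' mentioned above. From
include/linux/list.h, roughly:)

        static inline void list_del(struct list_head *entry)
        {
                __list_del(entry->prev, entry->next);
        }

        static inline void list_del_init(struct list_head *entry)
        {
                __list_del(entry->prev, entry->next);
                INIT_LIST_HEAD(entry);  /* entry->next = entry->prev = entry */
        }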

Obviously, alloc_memarea() can be pretty slow if RAM is getting full, and
it does not guarantee allocation either, so for non-boot allocations other
backup mechanisms have to be used, such as vmalloc(). It is not a
replacement for the buddy allocator - it's not intended for frequent use.
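
(a non-boot user would thus wrap it in something like the sketch below -
not from the patch, big_buffer_alloc() is a made-up name; the matching
free path would check the flag and call free_memarea() or vfree()
accordingly:)

        /*
         * sketch: try to get 'pages' physically contiguous pages, fall
         * back to vmalloc() if alloc_memarea() cannot satisfy the
         * request. The caller remembers which variant succeeded so that
         * it can free the buffer with the matching primitive later.
         */
        static void *big_buffer_alloc(unsigned int pages, int *vmalloced)
        {
                struct page *area;

                area = alloc_memarea(GFP_KERNEL, pages);
                if (area) {
                        *vmalloced = 0;
                        return page_address(area);
                }
                *vmalloced = 1;
                return vmalloc((unsigned long) pages << PAGE_SHIFT);
        }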

right now the memarea allocator is used in one place: to allocate the
pagecache hash table at boot time. [ Anton, it would be nice if you could
check it out on your large-RAM box, does it improve the hash chain
situation? ]
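
(for readers without the patch at hand, the filemap.c change conceptually
boils down to something like the sketch below - this is not the literal
patch text; 'mempages' and 'page_hash_table' are the stock 2.4 names from
mm/filemap.c:)

        /*
         * sketch: size the pagecache hash table roughly as the stock
         * kernel does (one hash pointer per page of RAM), but allocate
         * it as one physically contiguous memarea instead of squeezing
         * it through __get_free_pages() and its MAX_ORDER limit.
         */
        unsigned long htable_size = mempages * sizeof(struct page *);
        unsigned int pages = (htable_size + PAGE_SIZE - 1) >> PAGE_SHIFT;
        struct page *area = alloc_memarea(GFP_ATOMIC, pages);

        if (area)
                page_hash_table = (struct page **) page_address(area);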

other candidates for alloc_memarea() usage are:

  - module code segment allocation, falling back to vmalloc() on failure.

  - swap map allocation, it uses vmalloc() now.

  - buffer, inode, dentry, TCP hash allocations. (in case we decrease
    MAX_ORDER, which the patch does not do yet.)

  - those funky PCI devices that need some big chunk of physical memory.

  - other uses?

alloc_memarea() tries to optimize away as much of the linear scanning of
zone mem-maps as possible, but the worst-case scenario is that it has to
iterate over all pages - which is ~256K iterations if eg. we search on a
1 GB box (262144 4 KB pages).

possible future improvements:

- alloc_memarea() could zap clean pagecache pages as well.

- if/once reverse pte mappings are added, alloc_memarea() could also
  initiate the swapout of anonymous & dirty pages. These modifications
  would make it pretty likely to succeed if the allocation size is
  realistic.

- possibly add 'alignment' and 'offset' arguments to __alloc_memarea()
  (a possible prototype is sketched after this list), so that a given
  alignment can be requested for the memarea - both to handle really
  broken hardware and possibly to get better page coloring as well.

- if we extended the buddy allocator to have a page-granularity bitmap as
  well, then alloc_memarea() could search for physically contiguous page
  areas *much* faster. But this creates a real runtime (and cache
  footprint) overhead in the buddy allocator.
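
(for the alignment/offset item above, the extended entry point could look
something like this - purely hypothetical, the current patch only has the
two entry points quoted at the top; 'align' and 'offset' would be
measured in pages, so a caller could eg. ask for an area that starts
'offset' pages past an 'align'-page boundary:)

        struct page * __alloc_memarea(unsigned int gfp_mask, unsigned int pages,
                                      unsigned int align, unsigned int offset);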

the patch also cleans up the buddy allocator code:

  - cleaned up the zone structure namespace

  - removed the memlist_ defines. (I originally added them to play
    with FIFO vs. LIFO allocation, but now we have settled on the latter.)

  - simplified code

  - ( fixed index to be unsigned long in rmqueue(). This enables 64-bit
    systems to have more than 32 TB of RAM in a single zone. [not quite
    realistic, yet, but hey.] )

NOTE: the memarea allocator pieces are in separate chunks and are
completely non-intrusive if the filemap.c change is omitted.

i've tested the patch pretty thoroughly on big and small RAM boxes. The
patch is against 2.4.15-pre3.

Reports, comments, suggestions welcome,

        Ingo


