Re: [PATCH 13/35] autonuma: add page structure fields

From: Andrea Arcangeli
Date: Tue Jun 05 2012 - 10:52:51 EST


Hi,

On Thu, May 31, 2012 at 08:18:59PM +0200, Peter Zijlstra wrote:
> On Wed, 2012-05-30 at 15:49 +0200, Andrea Arcangeli wrote:
> >
> > I'm thinking about it but probably reducing the page_autonuma to one
> > per pmd is going to be the simplest solution considering by default we
> > only track the pmd anyway.
>
> Do also consider that some archs have larger base page size. So their
> effective PMD size is increased as well.

With a larger PAGE_SIZE like 64k I doubt this would be a concern; it's
only with 4k pages that the overhead is too large.

I've now done a number of cleanups and already added a number of
comments. I'll write the badly needed documentation for the
autonuma_balance() function ASAP, but at least a number of cleanups
are already committed in the autonuma branch of my git tree.

From my side, the thing that annoys me the most at the moment is the
page_autonuma size.

So I gave the idea outlined above more thought, but I gave up after
less than a minute of thinking about what I could run into doing
that. The fact that we do pmd tracking in knuma_scand by default (it
can be disabled with sysfs) is irrelevant. Unless I were only going to
track THP pages, 1 page_autonuma per pmd won't work: when the pmd_numa
fault triggers, it's all nonlinear on whatever scattered 4k page is
pointed to by the pte, non-shared pagecache especially.

I kept thinking about it, and I believe I have now figured out how to
reduce the page_autonuma to 12 bytes per 4k page on both 32bit and
64bit without losing information (no code written yet, but this one
should work). I just couldn't shrink it below 12 bytes without going
into ridiculously high and worthless complexity.

After this change AutoNUMA will bail out if either of the two
conditions below is true:

1) MAX_NUMNODES >= 65536
2) any NUMA node pgdat.node_spanned_pages >= 16TB/PAGE_SIZE

That means AutoNUMA will disengage itself automatically at boot on x86
NUMA systems with more than 1152921504606846976 bytes of RAM; that's
60 bits of physical address space, and no x86 CPU even gets that far
in terms of physical address space.

Other archs requiring more memory than that will hopefully have a
PAGE_SIZE > 4KB (which doubles the per-node RAM limit with every
doubling of PAGE_SIZE, without having to grow the page_autonuma beyond
12 bytes even on 64bit).

A packed 12 bytes per page should be all I need (some arch with
alignment trouble may prefer to round it up to 16 bytes, but on x86
packed should work). So on x86 that's 0.29% of RAM used for AutoNUMA,
and it's only spent when booting on NUMA hardware (and trivial to get
rid of by passing "noautonuma" on the command line).

If I left the anti-false-sharing last_nid information in the page
structure plus a pointer to a dynamic structure, that would still be
about 12 bytes. So I'd rather spend those 12 bytes directly than point
to a dynamic object, which would in fact waste even more memory on top
of the 12 bytes of pointer+last_nid.

The details of the solution:

struct page_autonuma {
	short autonuma_last_nid;	/* anti false sharing: last nid to fault */
	short autonuma_migrate_nid;	/* destination node the page is queued for */
	unsigned int pfn_offset_next;	/* next entry, as pfn - node_start_pfn */
	unsigned int pfn_offset_prev;	/* prev entry, as pfn - node_start_pfn */
} __attribute__((packed));

A page_autonuma can only point to a page that belongs to the same node
(the page_autonuma is queued into
NODE_DATA(autonuma_migrate_nid)->autonuma_migrate_head[src_nid], where
src_nid is the source node the page_autonuma belongs to), so all pages
in the autonuma_migrate_head[src_nid] lru must come from the same
src_nid. The next page_autonuma in the list is then
lookup_page_autonuma(pfn_to_page(NODE_DATA(src_nid)->node_start_pfn +
page_autonuma->pfn_offset_next)) etc..

Of course all the list_add/del operations must be open coded specially
for this, but it's not a conceptually difficult solution; we just
can't use list.h and straight pointers anymore, and some conversion
must happen.