Re: NUMA API

From: Ulrich Drepper
Date: Mon May 03 2004 - 13:36:47 EST


Andi Kleen wrote:

> You mean numa_no_nodes et.al. ?
>
> This is essentially static data that never changes (like in6addr_any).
> numa_all_nodes could maybe in future change with node hotplug support,
> but even then it will be a global property.

And you don't see a problem with this? You are hardcoding variables of
a certain size and layout. This is against every good design principal.

There are other problems like not using a protected namespace.


> Everything else is thread local.

This, too, is not adequate. It requires working on the 1-on-1 model.
Using user-level contexts (setcontext etc) is made very hard and
expensive. There is not one state to change. Using any of the code (or
functionality using the state) in signal handlers, possibly recursive,
will throw things in disorder.

There is no reason to not make the state explicit.


> I believe it is good enough for current machines, at least
> until there is enough experience to really figure out what
> node discovery is needed..

That's the point. We cannot start using an inadequate API now since one
will _never_ be able to get rid of it again. We have accumulated
several examples of this in the past years.

The API design should be general enough to work for all the
architectures which are currently envisioned and must be extensible to
be extendable for future architectures. Your API does not allow to
write adequate code even on some/many of today's architectures.



> I have seen some proposals
> for complex graph based descriptions, but so far I have seen
> nothing that could really take advantage of something so fancy.

I am proposing no graph-based descriptions. And this is something where
you miss the point. This is only the lowest level interface. It
prorvides enough functionality to describe the machine architecture. If
some fancy alternative representations are needed this is something for
a higher-level interface.

What must be avoided at all costs is programs peeking into /sys and
/proc to determine the topology. First of all, this makes programs
architecture and even machine-specific. Second, the /sys and /proc
format will change over time. All these access must be hidden and the
NUMA library is just the place for that.

Saying

> If it should be really needed it can be added later.

just means we will get programs today which hardcode today's existing
and most probably inadequate representation of the topology in /sys and
/proc.


>>~ no inclusion of SMT/multicore in the cpu hierarchy
>
>
> Not sure why you would care about that.

These are two sides of the same coin. Today we already have problems
with programs running on machines with SMT processors. How can those
use pthread_setaffinity() to create theoptimal number of threads and
place them accordingly? It requires magic /proc parsing for each and
every architecture. The problem is exactly the same as with NUMA and
the interface extensions to cover MC/SMT as well are minimal.


> Well, I spent a lot of time talking to various users; and IMHO
> it matches the needs of a lot of them. I did not add all the features
> everybody wanted, but that was simply not possible and still comming
> up with a reasonable design.

And this means it should not be done?


> The per process state is needed for numactl though.
>
> I kept the support for this visible in libnuma to make it easier to convert
> old code to this (just wrap some code with a policy) For designed from
> scratch programs it is probably better to use the allocation functions
> with mbind directly.

The NUMA library interface should not be cluttered because of
considerations of legacy apps which need to be converted. These are
separate issues, the design of the API must not be influenced by this.
The problem always has been that in such cases the correct interfaces
are not being used but instead the "easier to use" legacy interfaces are
used.


>>Also, the concept of hard/soft sets for CPUs is useful. Likewise
>>"spilling" over to other memory nodes. Usually using NUMA means hinting
>>the desired configuration to the system. It'll be used whenever
>>possible. If it is not possible (for instance, if a given processor is
>>not available) it is mostly no good idea to completely fail the
>
>
> Agreed. That is why prefered and bind are different policies
> and you can switch between them in libnuma.

That is inadequate. Any process/thread state like this increases the
program cost since it means that the program at all times must remember
the current state and switch if necessary. Combine this with 3rd party
libraries using the functionality as well and you'll get explicit
switching before *every* memory allocation because one cannot assume
anything about the state. Even if the NUMA library keeps track of the
state internally, there is always the possibility that more than one
instance of the library is used at any one time (e.g., statically linked
into a DSO).

I repeast myself: global or thread-local states are bad. Always have
been, always will be.


--
â Ulrich Drepper â Red Hat, Inc. â 444 Castro St â Mountain View, CA â
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/