Re: [Lse-tech] Re: NUMA scheduler 2nd approach

From: Martin J. Bligh (mbligh@aracnet.com)
Date: Mon Jan 13 2003 - 23:56:22 EST


>> I played with this today on my 4 node (16 CPU) NUMAQ. Spent most
>> of the time working with the first three patches. What I found was
>> that rebalancing was happening too much between nodes. I tried a
>> few things to change this, but have not yet settled on the best
>> approach. A key item to work with is the check in find_busiest_node
>> to determine whether the found node is sufficiently busier to warrant stealing
>> from it. Currently the check is that the node has 125% of the load
>> of the current node. I think that, for my system at least, we need
>> to add in a constant to this equation. I tried using 4 and that
>> helped a little.
>
> Michael,
>
> in:
>
> +static int find_busiest_node(int this_node)
> +{
> +        int i, node = this_node, load, this_load, maxload;
> +
> +        this_load = maxload = atomic_read(&node_nr_running[this_node]);
> +        for (i = 0; i < numnodes; i++) {
> +                if (i == this_node)
> +                        continue;
> +                load = atomic_read(&node_nr_running[i]);
> +                if (load > maxload && (4*load > ((5*4*this_load)/4))) {
> +                        maxload = load;
> +                        node = i;
> +                }
> +        }
> +        return node;
> +}
>
> Did you change ((5*4*this_load)/4) to:
> (5*4*(this_load+4)/4)
> or
> (4 + (5*4*this_load)/4) ?
>
> We definitely need some constant to avoid low-load ping-pong, right?
>
>> Finally I added in the 04 patch, and that helped
>> a lot. Still, there is too much process movement between nodes.
>
> perhaps increase INTERNODE_LB?
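
To pin down that question about where the constant goes, here's the
arithmetic written out (my sketch; neither variant is from an actual
patch, and the function names are made up for illustration):

/*
 * The two constant placements, with the effective stealing
 * thresholds they produce shown in the comments.
 */
static int busier_v1(int load, int this_load)
{
        /* 4*load > 5*(this_load + 4), i.e. load > 1.25*this_load + 5 */
        return 4*load > 5*4*(this_load + 4)/4;
}

static int busier_v2(int load, int this_load)
{
        /* 4*load > 5*this_load + 4, i.e. load > 1.25*this_load + 1 */
        return 4*load > 4 + (5*4*this_load)/4;
}

So folding the constant into this_load demands roughly four more
tasks of imbalance before we steal than tacking it on afterwards:
against an otherwise idle node, v1 wants 6 tasks on the remote node
before stealing, v2 only 2.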

Before we tweak this too much, how about using the global load
average for this? I can envisage a situation where we have four
nodes: two with 8 tasks each, one with 12 tasks, and one with 4.
You really don't want the nodes with 8 tasks pulling stuff from
the one with 12 ... only for the least loaded node to start
pulling stuff later.

What about if we take the global load average, and multiply by
num_cpus_on_this_node / num_cpus_globally ... that'll give us
roughly what we should have on this node. If we're significantly
underloaded compared to that, we start pulling stuff from the
busiest node? And you get the damping over time for free.
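
Something like this, as a rough sketch (global_load_avg(),
cpus_on_node(), and total_cpus() are made-up helpers here, not
anything in the tree; node_nr_running is from the patch above):

static int node_is_underloaded(int this_node)
{
        /* This node's fair share of the global load. */
        int fair_share = global_load_avg() * cpus_on_node(this_node)
                                                / total_cpus();
        int this_load = atomic_read(&node_nr_running[this_node]);

        /*
         * Only start pulling from the busiest node when we are
         * well below our share -- say more than 25% under it --
         * so small transient imbalances don't bounce tasks around.
         */
        return 4 * this_load < 3 * fair_share;
}

And since the load average is already damped over time, a short
fork burst wouldn't trip this on its own.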

I think it'd be best if we stay fairly conservative here; all
we're trying to catch is the corner case where a bunch of stuff
has forked, but not execed, and we have a long-term imbalance.
Aggressive rebalancing might win a few benchmarks, but you'll just
cause inter-node task bounce on others.

M.
