Re: [PATCH v3 5/5] cpusets, suspend: Save and restore cpusets during suspend/resume

From: Nishanth Aravamudan
Date: Tue May 15 2012 - 00:45:48 EST


On 14.05.2012 [21:04:16 -0700], David Rientjes wrote:
> On Mon, 14 May 2012, Nishanth Aravamudan wrote:
>
> > > I see what you're doing with this and think it will fix the problem that
> > > you're trying to address, but I think it could become much more general
> > > to just the suspend case: if an admin sets a cpuset to have cpus 4-6, for
> > > example, and cpu 5 goes offline, then I believe the cpuset should once
> > > again become 4-6 if cpu 5 comes back online. So I think this should be
> > > implemented like mempolicies are which save the user intended nodemask
> > > that may become restricted by cpuset placement but will be rebound if the
> > > cpuset includes the intended nodes.
> >
> > Heh, please read the thread at
> > http://marc.info/?l=linux-kernel&m=133615922717112&w=2 ... subject is
> > "[PATCH v2 0/7] CPU hotplug, cpusets: Fix issues with cpusets handling
> > upon CPU hotplug". That was effectively the same solution Srivatsa
> > originally posted. But after lengthy discussions with PeterZ and others,
> > it was decided that suspend/resume is a special case where it makes
> > sense to save "policy" but that generally cpu/memory hotplug is a
> > destructive operation and nothing is required to be retained (that
> > certain policies are retained is unfortunately now expected, but isn't
> > guaranteed for cpusets, at least).
> >
>
> If you do set_mempolicy(MPOL_BIND, 2-3) to bind a thread to nodes 2-3
> > that is attached to a cpuset where cpuset.mems == 2-3, and then
> cpuset.mems changes to 0-1, what is the expected behavior? Do we
> immediately oom on the next allocation? If cpuset.mems is set again
> to 2-3, what's the desired behavior?

"expected [or desired] behavior" always makes me cringe. It's usually
some insane user-level expectations that don't really make sense :).
But I don't honestly know the answer here as I've not polled any
customers on it. `man cpuset` does provide some insight into the
implementation, though:

Cpusets are integrated with the sched_setaffinity(2) scheduling
affinity mechanism and the mbind(2) and set_mempolicy(2)
memory-placement mechanisms in the kernel. Neither of these
mechanisms let a process make use of a CPU or memory node that
is not allowed by that process's cpuset. If changes to a
process's cpuset placement conflict with these other mechanisms,
then cpuset placement is enforced even if it means overriding
these other mechanisms. The kernel accomplishes this overriding
by silently restricting the CPUs and memory nodes requested by
these other mechanisms to those allowed by the invoking
process's cpuset. This can result in these other calls
returning an error, if for example, such a call ends up
requesting an empty set of CPUs or memory nodes, after that
request is restricted to the invoking process's cpuset.

So no, it should not OOM, but instead the mempolicy is ignored.
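
To make that scenario concrete, here's a toy userspace example
(untested, just for illustration; it assumes libnuma's <numaif.h>
wrapper for set_mempolicy(2) and a single-word nodemask):

/*
 * Ask for MPOL_BIND to nodes 2-3.  If the caller's cpuset.mems excludes
 * those nodes, the kernel silently restricts the request (or returns an
 * error if the intersection would be empty), per the man page text
 * quoted above.  Build with: gcc -o bind bind.c -lnuma
 */
#include <numaif.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
        unsigned long nodemask = (1UL << 2) | (1UL << 3);  /* nodes 2-3 */

        /* maxnode must cover the highest node bit we set */
        if (set_mempolicy(MPOL_BIND, &nodemask, 8 * sizeof(nodemask)))
                printf("set_mempolicy: %s\n", strerror(errno));
        else
                printf("MPOL_BIND to nodes 2-3 installed\n");
        return 0;
}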

> I fixed this problem by introducing MPOL_F_* flags in set_mempolicy(2)
> by saving the user intended nodemask passed by set_mempolicy() and
> respecting it whenever allowed by cpusets.

So, if you read that thread, this is what (in essence) Srivatsa proposed
in v2. We store the user-defined cpumask and keep it regardless of
kernel decisions. We intersect the user-defined cpumask with the kernel
(which is really reflecting the administrator's hotplug decisions)
topology and run tasks in constrained cpusets on the result. We reflect
this decision in a new read-only file in each cpuset that indicates the
"actual" cpus that a task in a given cpuset may be scheduled on.

But PeterZ NAK-ed it, and his reasoning was sound -- CPU (and, I would
think, memory) hotplug is necessarily a destructive operation.

> Right now, the behavior of what happens for a cpuset where cpuset.cpus ==
> 2-3 and then cpus 2-3 go offline and then are brought back online is
> undefined.

Erm, no it's rather clearly defined by what actually happens. It may not
be "specified" in a formal document, but behavior is a heckuva thing.

What happens is that the offlining process pushes the tasks in that
constrained cpuset up into the parent cpuset (it actually moves them).
In the suspend case, since we're offlining all (non-boot) CPUs, this
results in all tasks being pushed up to the root cpuset.

I would also quote `man cpuset` here to actually say the behavior is
"specified", technically:

If hot-plug functionality is used to remove all the CPUs that
are currently assigned to a cpuset, then the kernel will
automatically update the cpus_allowed of all processes attached
to CPUs in that cpuset to allow all CPUs.

The fact that those CPUs are eventually (or immediately) brought back
online is not considered in the decision of how to handle tasks in the
constrained cpuset when the CPUs are taken offline. That seems to make
sense, since there isn't any guarantee that an offlined CPU will ever
return to online status in the future.
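
You can watch that happen from userspace, too; something like this
(again, just an illustration), run before and after offlining the
cpuset's CPUs, will show the task's mask being widened:

/* Print the calling task's current affinity mask. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
        cpu_set_t mask;

        if (sched_getaffinity(0, sizeof(mask), &mask) != 0) {
                perror("sched_getaffinity");
                return 1;
        }

        printf("cpus_allowed:");
        for (int cpu = 0; cpu < CPU_SETSIZE; cpu++)
                if (CPU_ISSET(cpu, &mask))
                        printf(" %d", cpu);
        printf("\n");
        return 0;
}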

> The same is true of cpuset.cpus during resume. So if you're going to
> add a cpumask to struct cpuset, then why not respect it for all
> offline events and get rid of all this specialized suspend-only stuff?
> It's very simple to make this consistent across all cpu hotplug events
> and build suspend on top of it from a cpuset perspective.

"simple" -- sure. Read v2 of the patchset, as I said. But then read all
the discussion that follows and I think you will see that this has been
hashed out before with similar reasoning on both sides, and that the
policy side of things is not obviously simply. The resulting decision
was to special-case suspend, but not "remember" state across other
hotplug actions, which is more of an "unintentional hotplug" (and from
what Paul McKenney mentions in that thread, sounds like tglx is working
on patches to remove the full hotplug usage from s/r).

Thanks,
Nish

--
Nishanth Aravamudan <nacc@xxxxxxxxxx>
IBM Linux Technology Center
