Re: [PATCH v2] Make transparent hugepages cpuset aware

From: Robin Holt
Date: Wed Jun 19 2013 - 22:27:49 EST


On Wed, Jun 19, 2013 at 02:24:07PM -0700, David Rientjes wrote:
> On Wed, 19 Jun 2013, Robin Holt wrote:
>
> > The convenience being that many batch schedulers have added cpuset
> > support. They create the cpusets and configure them as appropriate
> > for the job as determined by a mixture of input from the submitting
> > user but still under the control of the administrator. That seems like
> > a fairly significant convenience given that it took years to get the
> > batch schedulers to adopt cpusets in the first place. At this point,
> > expanding their use of cpusets is under the control of the system
> > administrator and would not require any additional development on
> > the batch scheduler developers' part.
> >
>
> You can't say the same for memcg?

I am not aware of batch scheduler support for memory controllers.
The request came from our benchmarking group.

> > Here are the entries in the cpuset:
> > cgroup.event_control mem_exclusive memory_pressure_enabled notify_on_release tasks
> > cgroup.procs mem_hardwall memory_spread_page release_agent
> > cpu_exclusive memory_migrate memory_spread_slab sched_load_balance
> > cpus memory_pressure mems sched_relax_domain_level
> >
> > There are scheduler, slab allocator, page_cache layout, etc. controls.
>
> I think this is mostly for historical reasons since cpusets were
> introduced before cgroups.
>
> > Why _NOT_ add a thp control to that nicely contained central location?
> > It is a concise set of controls for the job.
> >
>
> All of the above seem to be for cpusets' primary purpose, i.e. NUMA
> optimizations. It has nothing to do with transparent hugepages. (I'm not
> saying thp has anything to do with memcg either, but a "memory controller"
> seems more appropriate for controlling thp behavior.)

cpusets were not for NUMA. They have no preference for "nodes" or anything like
that. They were for splitting a machine into layered smaller groups. Usually,
we see one cpuset which contains the batch scheduler. The batch scheduler then
creates cpusets for jobs it starts. Has nothing to do with nodes. That is
more an administrator issue. They set the minimum grouping of resources
for scheduled jobs.

> > Maybe I am misunderstanding. Are you saying you want to put memcg
> > information into the cpuset or something like that?
> >
>
> I'm saying there's absolutely no reason to have thp controlled by a
> cpuset, or ANY cgroup for that matter, since you chose not to respond to
> the question I asked: why do you want to control thp behavior for certain
> static binaries and not others? Where is the performance regression or
> the downside? Is it because of max_ptes_none for certain jobs blowing up
> the rss? We need information, and even if it were justifiable then it
> wouldn't have anything to do with ANY cgroup but rather a per-process
> control. It has nothing to do with cpusets whatsoever.

It was a request from our benchmarking group that has found some jobs
benefit from thp, while others are harmed. Let me ask them for more
details.

> (And I'm very curious why you didn't even cc the cpusets maintainer on
> this patch in the first place who would probably say the same thing.)

I didn't know there was a cpuset maintainer. Paul Jackson (SGI retired)
had originally worked to get cpusets introduced and then converted to
use cgroups. I had never known there was a maintainer after him. Sorry
for that.

Robin