Re: [RFC PATCH 02/10] sched: Task placement for heterogeneous systems based on task load-tracking

From: Morten Rasmussen
Date: Tue Oct 09 2012 - 11:56:19 EST


Hi Viresh,

On Thu, Oct 04, 2012 at 07:02:03AM +0100, Viresh Kumar wrote:
> Hi Morten,
>
> On 22 September 2012 00:02, <morten.rasmussen@xxxxxxx> wrote:
> > From: Morten Rasmussen <morten.rasmussen@xxxxxxx>
> >
> > This patch introduces the basic SCHED_HMP infrastructure. Each class of
> > cpus is represented by a hmp_domain and tasks will only be moved between
> > these domains when their load profiles suggest it is beneficial.
> >
> > SCHED_HMP relies heavily on the task load-tracking introduced in Paul
> > Turners fair group scheduling patch set:
> >
> > <https://lkml.org/lkml/2012/8/23/267>
> >
> > SCHED_HMP requires that the platform implements arch_get_hmp_domains()
> > which should set up the platform specific list of hmp_domains. It is
> > also assumed that the platform disables SD_LOAD_BALANCE for the
> > appropriate sched_domains.
>
> An explanation of this requirement would be helpful here.
>

Yes. This is to prevent the load-balancer from moving tasks between
hmp_domains. Task migration between hmp_domains is instead handled
exclusively by SCHED_HMP, which implements a strict migration policy
and avoids changing the load-balancer behaviour. The load-balancer
still takes care of load-balancing within each hmp_domain.

> > Tasks placement takes place every time a task is to be inserted into
> > a runqueue based on its load history. The task placement decision is
> > based on load thresholds.
> >
> > There are no restrictions on the number of hmp_domains, however,
> > multiple (>2) has not been tested and the up/down migration policy is
> > rather simple.
> >
> > Signed-off-by: Morten Rasmussen <morten.rasmussen@xxxxxxx>
> > ---
> > arch/arm/Kconfig | 17 +++++
> > include/linux/sched.h | 6 ++
> > kernel/sched/fair.c | 168 +++++++++++++++++++++++++++++++++++++++++++++++++
> > kernel/sched/sched.h | 6 ++
> > 4 files changed, 197 insertions(+)
> >
> > diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
> > index f4a5d58..5b09684 100644
> > --- a/arch/arm/Kconfig
> > +++ b/arch/arm/Kconfig
> > @@ -1554,6 +1554,23 @@ config SCHED_SMT
> > MultiThreading at a cost of slightly increased overhead in some
> > places. If unsure say N here.
> >
> > +config DISABLE_CPU_SCHED_DOMAIN_BALANCE
> > + bool "(EXPERIMENTAL) Disable CPU level scheduler load-balancing"
> > + help
> > + Disables scheduler load-balancing at CPU sched domain level.
>
> Shouldn't this depend on EXPERIMENTAL?
>

It should. The ongoing discussion about CONFIG_EXPERIMENTAL that Amit is
referring to hasn't come to a conclusion yet.

> > +config SCHED_HMP
> > + bool "(EXPERIMENTAL) Heterogenous multiprocessor scheduling"
>
> ditto.
>
> > + depends on DISABLE_CPU_SCHED_DOMAIN_BALANCE && SCHED_MC && FAIR_GROUP_SCHED && !SCHED_AUTOGROUP
> > + help
> > + Experimental scheduler optimizations for heterogeneous platforms.
> > + Attempts to introspectively select task affinity to optimize power
> > + and performance. Basic support for multiple (>2) cpu types is in place,
> > + but it has only been tested with two types of cpus.
> > + There is currently no support for migration of task groups, hence
> > + !SCHED_AUTOGROUP. Furthermore, normal load-balancing must be disabled
> > + between cpus of different type (DISABLE_CPU_SCHED_DOMAIN_BALANCE).
> > +
> > config HAVE_ARM_SCU
> > bool
> > help
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index 81e4e82..df971a3 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -1039,6 +1039,12 @@ unsigned long default_scale_smt_power(struct sched_domain *sd, int cpu);
> >
> > bool cpus_share_cache(int this_cpu, int that_cpu);
> >
> > +#ifdef CONFIG_SCHED_HMP
> > +struct hmp_domain {
> > + struct cpumask cpus;
> > + struct list_head hmp_domains;
>
> Probably need a better name here. domain_list?
>

Yes. hmp_domain_list would be a better name and would stick with the
hmp_* naming convention.
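
That is, something like (member rename only; the list_entry() users
further down would change accordingly):

struct hmp_domain {
	struct cpumask cpus;
	struct list_head hmp_domain_list;
};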

> > +};
> > +#endif /* CONFIG_SCHED_HMP */
> > #else /* CONFIG_SMP */
> >
> > struct sched_domain_attr;
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 3e17dd5..d80de46 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -3077,6 +3077,125 @@ static int select_idle_sibling(struct task_struct *p, int target)
> > return target;
> > }
> >
> > +#ifdef CONFIG_SCHED_HMP
> > +/*
> > + * Heterogenous multiprocessor (HMP) optimizations
> > + *
> > + * The cpu types are distinguished using a list of hmp_domains
> > + * which each represent one cpu type using a cpumask.
> > + * The list is assumed ordered by compute capacity with the
> > + * fastest domain first.
> > + */
> > +DEFINE_PER_CPU(struct hmp_domain *, hmp_cpu_domain);
> > +
> > +extern void __init arch_get_hmp_domains(struct list_head *hmp_domains_list);
> > +
> > +/* Setup hmp_domains */
> > +static int __init hmp_cpu_mask_setup(void)
>
> How should we interpret its return value? Can you mention what 0 & 1 mean
> here?
>

Returns 0 if domain setup failed, i.e. the domain list is empty, and 1
otherwise.

> > +{
> > + char buf[64];
> > + struct hmp_domain *domain;
> > + struct list_head *pos;
> > + int dc, cpu;
> > +
> > + pr_debug("Initializing HMP scheduler:\n");
> > +
> > + /* Initialize hmp_domains using platform code */
> > + arch_get_hmp_domains(&hmp_domains);
> > + if (list_empty(&hmp_domains)) {
> > + pr_debug("HMP domain list is empty!\n");
> > + return 0;
> > + }
> > +
> > + /* Print hmp_domains */
> > + dc = 0;
>
> Should be done during definition of dc.
>
> > + list_for_each(pos, &hmp_domains) {
> > + domain = list_entry(pos, struct hmp_domain, hmp_domains);
> > + cpulist_scnprintf(buf, 64, &domain->cpus);
> > + pr_debug(" HMP domain %d: %s\n", dc, buf);
>
> Spaces before HMP are intentional?
>

Yes. It makes the boot log easier to read when the hmp_domain listing is
indented.
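
With the format strings above, a hypothetical two-domain system would
log something like this at boot (modulo the usual log prefixes):

Initializing HMP scheduler:
 HMP domain 0: 0-1
 HMP domain 1: 2-4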

> > +
> > + for_each_cpu_mask(cpu, domain->cpus) {
> > + per_cpu(hmp_cpu_domain, cpu) = domain;
> > + }
>
> Should use hmp_cpu_domain(cpu) here. Also, no need for {} around a
> single-line loop body.
>
> > + dc++;
>
> You aren't using it... Only for testing? Should we remove it from mainline
> patchset and keep it locally?
>

I'm using it in the pr_debug() line a little earlier to enumerate the
hmp_domains.

> > + }
> > +
> > + return 1;
> > +}
> > +
> > +/*
> > + * Migration thresholds should be in the range [0..1023]
> > + * hmp_up_threshold: min. load required for migrating tasks to a faster cpu
> > + * hmp_down_threshold: max. load allowed for tasks migrating to a slower cpu
> > + * The default values (512, 256) offer good responsiveness, but may need
> > + * tweaking to suit particular needs.
> > + */
> > +unsigned int hmp_up_threshold = 512;
> > +unsigned int hmp_down_threshold = 256;
>
> For default values, it is fine. But still we should get user preferred
> values via DT
> or CONFIG_*.
>

Yes, but for now getting the scheduler to do the right thing has higher
priority than proper integration with DT.
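
As a sketch of what a DT hook could eventually look like (the node
path and property names are made up for illustration):

static void __init hmp_threshold_init(void)
{
	struct device_node *np;
	u32 val;

	/* Hypothetical DT override; falls back to the defaults above */
	np = of_find_node_by_path("/cpus");
	if (!np)
		return;
	if (!of_property_read_u32(np, "hmp-up-threshold", &val))
		hmp_up_threshold = val;
	if (!of_property_read_u32(np, "hmp-down-threshold", &val))
		hmp_down_threshold = val;
	of_node_put(np);
}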

> > +static unsigned int hmp_up_migration(int cpu, struct sched_entity *se);
> > +static unsigned int hmp_down_migration(int cpu, struct sched_entity *se);
> > +
> > +/* Check if cpu is in fastest hmp_domain */
> > +static inline unsigned int hmp_cpu_is_fastest(int cpu)
> > +{
> > + struct list_head *pos;
> > +
> > + pos = &hmp_cpu_domain(cpu)->hmp_domains;
> > + return pos == hmp_domains.next;
>
> better create list_is_first() for this.
>

I had the same thought, but I see that as a cleanup that should be
submitted as a separate patch.
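
For reference, the helper would presumably mirror list_is_last() in
include/linux/list.h, i.e. something like:

/* Untested sketch of the suggested helper */
static inline int list_is_first(const struct list_head *list,
				const struct list_head *head)
{
	return list->prev == head;
}

which would turn the check above into list_is_first(pos, &hmp_domains).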

> > +}
> > +
> > +/* Check if cpu is in slowest hmp_domain */
> > +static inline unsigned int hmp_cpu_is_slowest(int cpu)
> > +{
> > + struct list_head *pos;
> > +
> > + pos = &hmp_cpu_domain(cpu)->hmp_domains;
> > + return list_is_last(pos, &hmp_domains);
> > +}
> > +
> > +/* Next (slower) hmp_domain relative to cpu */
> > +static inline struct hmp_domain *hmp_slower_domain(int cpu)
> > +{
> > + struct list_head *pos;
> > +
> > + pos = &hmp_cpu_domain(cpu)->hmp_domains;
> > + return list_entry(pos->next, struct hmp_domain, hmp_domains);
> > +}
> > +
> > +/* Previous (faster) hmp_domain relative to cpu */
> > +static inline struct hmp_domain *hmp_faster_domain(int cpu)
> > +{
> > + struct list_head *pos;
> > +
> > + pos = &hmp_cpu_domain(cpu)->hmp_domains;
> > + return list_entry(pos->prev, struct hmp_domain, hmp_domains);
> > +}
>
> For all four routines, the first two lines of the body can be merged. If you wish :)
>

I have kept these helper functions fairly generic on purpose. It might
be necessary for multi-domain platforms (>2) to modify these functions
to implement better multi-domain task migration policies. I don't know
of any such platform, so for now these functions are very simple.
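
To illustrate what such a modification might look like for >2 domains
(purely hypothetical, not part of this patch), a slower-domain lookup
could walk down the list until it finds a domain the task is allowed
to run in:

/* Hypothetical: first slower domain with cpus the task may use */
static struct hmp_domain *hmp_slower_allowed_domain(struct task_struct *tsk,
						    int cpu)
{
	struct list_head *pos;
	struct hmp_domain *domain;

	for (pos = hmp_cpu_domain(cpu)->hmp_domains.next;
	     pos != &hmp_domains; pos = pos->next) {
		domain = list_entry(pos, struct hmp_domain, hmp_domains);
		if (cpumask_intersects(&domain->cpus, tsk_cpus_allowed(tsk)))
			return domain;
	}
	return NULL;	/* no suitable slower domain */
}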

> > +
> > +/*
> > + * Selects a cpu in previous (faster) hmp_domain
> > + * Note that cpumask_any_and() returns the first cpu in the cpumask
> > + */
> > +static inline unsigned int hmp_select_faster_cpu(struct task_struct *tsk,
> > + int cpu)
> > +{
> > + return cpumask_any_and(&hmp_faster_domain(cpu)->cpus,
> > + tsk_cpus_allowed(tsk));
> > +}
> > +
> > +/*
> > + * Selects a cpu in next (slower) hmp_domain
> > + * Note that cpumask_any_and() returns the first cpu in the cpumask
> > + */
> > +static inline unsigned int hmp_select_slower_cpu(struct task_struct *tsk,
> > + int cpu)
> > +{
> > + return cpumask_any_and(&hmp_slower_domain(cpu)->cpus,
> > + tsk_cpus_allowed(tsk));
> > +}
> > +
> > +#endif /* CONFIG_SCHED_HMP */
> > +
> > /*
> > * sched_balance_self: balance the current task (running on cpu) in domains
> > * that have the 'flag' flag set. In practice, this is SD_BALANCE_FORK and
> > @@ -3203,6 +3322,16 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
> > unlock:
> > rcu_read_unlock();
> >
> > +#ifdef CONFIG_SCHED_HMP
> > + if (hmp_up_migration(prev_cpu, &p->se))
> > + return hmp_select_faster_cpu(p, prev_cpu);
> > + if (hmp_down_migration(prev_cpu, &p->se))
> > + return hmp_select_slower_cpu(p, prev_cpu);
> > + /* Make sure that the task stays in its previous hmp domain */
> > + if (!cpumask_test_cpu(new_cpu, &hmp_cpu_domain(prev_cpu)->cpus))
>
> Why is this tested?
>

I don't think it is needed. It is there as an extra guarantee that
select_task_rq_fair() doesn't pick a cpu outside the task's current
hmp_domain in cases where there is no up or down migration. Disabling
SD_LOAD_BALANCE for the appropriate domains should give that guarantee.
I just haven't completely convinced myself yet.

> > + return prev_cpu;
> > +#endif
> > +
> > return new_cpu;
> > }
> >
> > @@ -5354,6 +5483,41 @@ need_kick:
> > static void nohz_idle_balance(int this_cpu, enum cpu_idle_type idle) { }
> > #endif
> >
> > +#ifdef CONFIG_SCHED_HMP
> > +/* Check if task should migrate to a faster cpu */
> > +static unsigned int hmp_up_migration(int cpu, struct sched_entity *se)
> > +{
> > + struct task_struct *p = task_of(se);
> > +
> > + if (hmp_cpu_is_fastest(cpu))
> > + return 0;
> > +
> > + if (cpumask_intersects(&hmp_faster_domain(cpu)->cpus,
> > + tsk_cpus_allowed(p))
> > + && se->avg.load_avg_ratio > hmp_up_threshold) {
> > + return 1;
> > + }
>
> I know all these comparisons are not very costly, still I would prefer
>
> se->avg.load_avg_ratio > hmp_up_threshold
>
> as the first comparison in this routine.
>
> We should first check whether the task needs migration at all, rather
> than checking whether it can migrate to other cpus.
>

Agree.
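
Something like this (untested sketch of the reordered checks):

static unsigned int hmp_up_migration(int cpu, struct sched_entity *se)
{
	struct task_struct *p = task_of(se);

	/* Check whether the task needs to migrate at all... */
	if (se->avg.load_avg_ratio <= hmp_up_threshold)
		return 0;

	if (hmp_cpu_is_fastest(cpu))
		return 0;

	/* ...before checking whether it may run on a faster cpu */
	return cpumask_intersects(&hmp_faster_domain(cpu)->cpus,
				  tsk_cpus_allowed(p));
}

with the equivalent reordering in hmp_down_migration().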

> > + return 0;
> > +}
> > +
> > +/* Check if task should migrate to a slower cpu */
> > +static unsigned int hmp_down_migration(int cpu, struct sched_entity *se)
> > +{
> > + struct task_struct *p = task_of(se);
> > +
> > + if (hmp_cpu_is_slowest(cpu))
> > + return 0;
> > +
> > + if (cpumask_intersects(&hmp_slower_domain(cpu)->cpus,
> > + tsk_cpus_allowed(p))
> > + && se->avg.load_avg_ratio < hmp_down_threshold) {
> > + return 1;
> > + }
>
> same here.
>

Agree.

> > + return 0;
> > +}
> > +
> > +#endif /* CONFIG_SCHED_HMP */
> > +
> > /*
> > * run_rebalance_domains is triggered when needed from the scheduler tick.
> > * Also triggered for nohz idle balancing (with nohz_balancing_kick set).
> > @@ -5861,6 +6025,10 @@ __init void init_sched_fair_class(void)
> > zalloc_cpumask_var(&nohz.idle_cpus_mask, GFP_NOWAIT);
> > cpu_notifier(sched_ilb_notifier, 0);
> > #endif
> > +
> > +#ifdef CONFIG_SCHED_HMP
> > + hmp_cpu_mask_setup();
>
> Should we check the return value? If it is not required, should we
> make the function return void?
>

It can be changed to void if we don't add any error handling.

> > +#endif
> > #endif /* SMP */
> >
> > }
> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > index 81135f9..4990d9e 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -547,6 +547,12 @@ DECLARE_PER_CPU(int, sd_llc_id);
> >
> > extern int group_balance_cpu(struct sched_group *sg);
> >
> > +#ifdef CONFIG_SCHED_HMP
> > +static LIST_HEAD(hmp_domains);
> > +DECLARE_PER_CPU(struct hmp_domain *, hmp_cpu_domain);
> > +#define hmp_cpu_domain(cpu) (per_cpu(hmp_cpu_domain, (cpu)))
>
> can drop "()" around per_cpu().
>
> Both the per_cpu variable and the macro to get it have the same name. Can
> we try giving them better names? Or at least add an "_" before the per_cpu
> pointer's name?
>

Yes.
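
E.g. (hypothetical naming, just to show the idea; the DEFINE_PER_CPU
in fair.c would be renamed to match):

DECLARE_PER_CPU(struct hmp_domain *, _hmp_cpu_domain);
#define hmp_cpu_domain(cpu) per_cpu(_hmp_cpu_domain, (cpu))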

> --
> viresh
>
