Re: [RFC][PATCH RT 4/4 v2] sched/rt: Use IPI to trigger RT task push migration instead of pulling

From: John Kacur
Date: Wed Feb 13 2013 - 11:49:40 EST




On Thu, 13 Dec 2012, Steven Rostedt wrote:

> I didn't get a chance to test the latest IPI patch series on the 40 core
> box, and only had my 4 way box to test on. But I was able to test it
> last night and found some issues.
>
> The RT_PUSH_IPI feature didn't get automatically set because just doing
> sched_feat_enable() wasn't enough. Below is the corrected patch.
>
> Also, for some reason patch 3 caused the box to hang. Perhaps it
> requires RT_PUSH_IPI to be set, because it worked with the original patch
> series. But that series only did the push IPI. I removed it on the 40
> core box before noticing that RT_PUSH_IPI wasn't being automatically
> enabled.
>
> Here's an update of patch 4:
>
> sched/rt: Use IPI to trigger RT task push migration instead of pulling
>
> While debugging the latencies on a 40 core box, where we hit 300 to
> 500 microsecond latencies, I found huge contention on the
> runqueue locks.
>
> Investigating it further, running ftrace, I found that it was due to
> the pulling of RT tasks.
>
> The test that was run was the following:
>
> cyclictest --numa -p95 -m -d0 -i100
>
> This created a thread on each CPU that would set its wakeup in iterations
> of 100 microseconds. The -d0 means that all the threads had the same
> interval (100us). Each thread sleeps for 100us, then wakes up and measures
> its latencies.
>
> What happened was that another RT task would get scheduled on one of the
> CPUs running our test while the other CPUs' test threads went to sleep
> and scheduled idle. This caused the "pull" operation to execute on all
> of these CPUs. Each one of them saw the overloaded RT task on the CPU
> whose test was still running, and each one tried to grab that task in a
> thundering-herd fashion.
>
> To grab the task, each CPU would do a double rq lock grab, taking
> its own rq lock as well as that of the overloaded CPU. As the sched
> domains on this box were rather flat for its size, I saw up to 12 CPUs
> block on this lock at once. This caused a ripple effect with the
> rq locks. While these locks were contended, any wakeups or load balancing
> on these CPUs would also block on them, and the wait time escalated.
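[ To illustrate the pattern described above (this is not the actual
  pull_rt_task() code; pull_one_task() is a made-up name for
  illustration), each of those CPUs is effectively doing something like
  the following against the same overloaded runqueue: ]

	/*
	 * Rough sketch of the contended pull pattern: the caller already
	 * holds this_rq->lock, then also takes the overloaded CPU's rq
	 * lock.  With a dozen CPUs doing this at once, they all serialize
	 * on src_rq->lock inside double_lock_balance().
	 */
	static int pull_one_task(struct rq *this_rq, struct rq *src_rq)
	{
		int moved = 0;

		/* locks both runqueues; may drop and retake this_rq->lock */
		double_lock_balance(this_rq, src_rq);

		/* ... find a pushable RT task on src_rq and migrate it here ... */

		double_unlock_balance(this_rq, src_rq);
		return moved;
	}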
>
> I've tried various methods to lessen the load, but things like an
> atomic counter to only let one CPU grab the task won't work, because
> the task may have a limited affinity, and we may pick the wrong
> CPU to take that lock and do the pull, only to find out that the
> CPU we picked isn't in the task's affinity.
>
> Instead of doing the pull, I now have the CPUs that want to pull send
> an IPI to the overloaded CPU and let that CPU pick which CPU to push the
> task to. There is no more need to grab the remote rq lock, and the
> push/pull algorithm still works fine.
>
> With this patch, the latency dropped to just 150us over a 20 hour run.
> Without the patch, the huge latencies would trigger in seconds.
>
> Now, this issue only seems to apply to boxes with more than 16 CPUs.
> We noticed it on a 24 CPU box, and things got much worse on 40 (and
> presumably would get worse still with more CPUs). But running with 16
> CPUs and below, the lock contention caused by the pulling of RT tasks
> is not noticeable.
>
> I've created a new sched feature called RT_PUSH_IPI, which is disabled
> by default on machines with 16 or fewer CPUs and enabled on machines
> with 17 or more. That seems to be the heuristic limit where the pulling
> logic causes higher latencies than the IPIs do. Of course, as with all
> heuristics, things could be different on different architectures.
>
> When RT_PUSH_IPI is not enabled, the old method of grabbing the rq locks
> and having the pulling CPU do the work is used. When RT_PUSH_IPI
> is enabled, an IPI is sent to the overloaded CPU, which then does the push.
>
> To enable or disable this at run time:
>
> # mount -t debugfs nodev /sys/kernel/debug
> # echo RT_PUSH_IPI > /sys/kernel/debug/sched_features
> or
> # echo NO_RT_PUSH_IPI > /sys/kernel/debug/sched_features
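[ The current state can be checked by reading the same file; enabled
  features are listed by name, and disabled ones show up with a NO_
  prefix: ]

 # cat /sys/kernel/debug/sched_features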
>
> Signed-off-by: Steven Rostedt <rostedt@xxxxxxxxxxx>
>
> Index: rt-linux.git/kernel/sched/core.c
> ===================================================================
> --- rt-linux.git.orig/kernel/sched/core.c
> +++ rt-linux.git/kernel/sched/core.c
> @@ -1538,6 +1538,9 @@ static void sched_ttwu_pending(void)
>
>  void scheduler_ipi(void)
>  {
> +	if (sched_feat(RT_PUSH_IPI))
> +		sched_rt_push_check();
> +
>  	if (llist_empty(&this_rq()->wake_list) && !got_nohz_idle_kick())
>  		return;
>
> @@ -7541,6 +7544,21 @@ void __init sched_init_smp(void)
>  	free_cpumask_var(non_isolated_cpus);
>
>  	init_sched_rt_class();
> +
> +	/*
> +	 * To avoid heavy contention on large CPU boxes,
> +	 * when there is an RT overloaded CPU (two or more RT tasks
> +	 * queued to run on a CPU and one of the waiting RT tasks
> +	 * can migrate) and another CPU lowers its priority, instead
> +	 * of grabbing both rq locks of the CPUs (as many CPUs lowering
> +	 * their priority at the same time may create large latencies)
> +	 * send an IPI to the CPU that is overloaded so that it can
> +	 * do an efficient push.
> +	 */
> +	if (num_possible_cpus() > 16) {
> +		sched_feat_enable(__SCHED_FEAT_RT_PUSH_IPI);
> +		sysctl_sched_features |= (1UL << __SCHED_FEAT_RT_PUSH_IPI);
> +	}
>  }
>  #else
>  void __init sched_init_smp(void)
> Index: rt-linux.git/kernel/sched/rt.c
> ===================================================================
> --- rt-linux.git.orig/kernel/sched/rt.c
> +++ rt-linux.git/kernel/sched/rt.c
> @@ -1723,6 +1723,31 @@ static void push_rt_tasks(struct rq *rq)
>  		;
>  }
>
> +/**
> + * sched_rt_push_check - check if we can push waiting RT tasks
> + *
> + * Called from sched IPI when sched feature RT_PUSH_IPI is enabled.
> + *
> + * Checks if there is an RT task that can migrate and there exists
> + * a CPU in its affinity that only has tasks lower in priority than
> + * the waiting RT task. If so, then it will push the task off to that
> + * CPU.
> + */
> +void sched_rt_push_check(void)
> +{
> +	struct rq *rq = cpu_rq(smp_processor_id());
> +
> +	if (WARN_ON_ONCE(!irqs_disabled()))
> +		return;
> +
> +	if (!has_pushable_tasks(rq))
> +		return;
> +
> +	raw_spin_lock(&rq->lock);
> +	push_rt_tasks(rq);
> +	raw_spin_unlock(&rq->lock);
> +}
> +
> static int pull_rt_task(struct rq *this_rq)
> {
> int this_cpu = this_rq->cpu, ret = 0, cpu;
> @@ -1750,6 +1775,18 @@ static int pull_rt_task(struct rq *this_
>  			continue;
>
>  		/*
> +		 * When the RT_PUSH_IPI sched feature is enabled, instead
> +		 * of trying to grab the rq lock of the RT overloaded CPU
> +		 * send an IPI to that CPU instead. This prevents heavy
> +		 * contention from several CPUs lowering their priority
> +		 * and all trying to grab the rq lock of that overloaded CPU.
> +		 */
> +		if (sched_feat(RT_PUSH_IPI)) {
> +			smp_send_reschedule(cpu);
> +			continue;
> +		}
> +
> +		/*
>  		 * We can potentially drop this_rq's lock in
>  		 * double_lock_balance, and another CPU could
>  		 * alter this_rq
> Index: rt-linux.git/kernel/sched/sched.h
> ===================================================================
> --- rt-linux.git.orig/kernel/sched/sched.h
> +++ rt-linux.git/kernel/sched/sched.h
> @@ -1111,6 +1111,8 @@ static inline void double_rq_unlock(stru
>  	__release(rq2->lock);
>  }
>
> +void sched_rt_push_check(void);
> +
>  #else /* CONFIG_SMP */
>
>  /*
> @@ -1144,6 +1146,9 @@ static inline void double_rq_unlock(stru
>  	__release(rq2->lock);
>  }
>
> +static inline void sched_rt_push_check(void)
> +{
> +}
>  #endif
>
>  extern struct sched_entity *__pick_first_entity(struct cfs_rq *cfs_rq);
> Index: rt-linux.git/kernel/sched/features.h
> ===================================================================
> --- rt-linux.git.orig/kernel/sched/features.h
> +++ rt-linux.git/kernel/sched/features.h
> @@ -73,6 +73,20 @@ SCHED_FEAT(PREEMPT_LAZY, true)
>  # endif
>  #endif
>
> +/*
> + * In order to avoid a thundering herd of CPUs that are all
> + * lowering their priorities at the same time, while a single
> + * CPU has an RT task that can migrate and is waiting to run,
> + * where the other CPUs would all try to take that CPU's rq lock
> + * and possibly create large contention, sending an IPI to that
> + * CPU and letting that CPU push the RT task to where it should
> + * go may be a better approach.
> + *
> + * This is default off for machines with <= 16 CPUs, and will
> + * be turned on at boot up for machines with > 16 CPUs.
> + */
> +SCHED_FEAT(RT_PUSH_IPI, false)
> +
>  SCHED_FEAT(FORCE_SD_OVERLAP, false)
>  SCHED_FEAT(RT_RUNTIME_SHARE, true)
>  SCHED_FEAT(LB_MIN, false)
>

FWIW: Applying this to our latest test queue.

Thanks

John