Re: v2.6.21.4-rt11

From: Paul E. McKenney
Date: Tue Jun 19 2007 - 11:08:21 EST


On Tue, Jun 19, 2007 at 11:04:30AM +0200, Ingo Molnar wrote:
>
> * Srivatsa Vaddagiri <vatsa@xxxxxxxxxxxxxxxxxx> wrote:
>
> > I believe the patch below is correct. With the patch applied, I could
> > not recreate the imbalance with rcutorture. Let me know whether you
> > still see the problem with this patch applied on any other machine.
>
> thanks for tracking this down! I've applied Christoph's patch (with your
> suggested modification plus a few small cleanups).
>
> I'm wondering, why did this trigger under CFS and not on mainline?
> Mainline seems to have a similar problem in idle_balance() too, or am i
> misreading it?

It did in fact trigger under all three of mainline, CFS, and -rt (which
includes CFS) -- see below for a couple of emails from last Friday giving
results for these three on the AMD box (where the problem appeared) and
on a single-quad NUMA-Q system (where it did not, at least not with such
severity).

That said, there certainly was a time when neither mainline nor -rt
acted this way!

Thanx, Paul

------------------------------------------------------------------------

Date: Fri, 15 Jun 2007 13:06:17 -0700
From: "Paul E. McKenney" <paulmck@xxxxxxxxxxxxxxxxxx>
To: Ingo Molnar <mingo@xxxxxxx>
Cc: Thomas Gleixner <tglx@xxxxxxxxxxxxx>,
Dinakar Guniguntala <dino@xxxxxxxxxx>
Subject: Re: v2.6.21.4-rt11

On Fri, Jun 15, 2007 at 08:14:52AM -0700, Paul E. McKenney wrote:
> On Fri, Jun 15, 2007 at 04:45:35PM +0200, Ingo Molnar wrote:
> >
> > Paul,
> >
> > do you still see the load-distribution problem with -rt14? (which
> > includes cfsv17) Or rather ... could you try vanilla cfsv17 instead:
> >
> > http://people.redhat.com/mingo/cfs-scheduler/
> >
> > to make sure it's not some effect in -rt causing this. v17 has an
> > updated load balancing code. (which might or might not affect the
> > rcutorture problem.)

No joy, see below. The problem is strangely hardware-dependent. My next
step, left to myself, would be to patch rcutorture.c to cause the readers
to dump the CFS state information every ten seconds or so. My guess is
that the important per-task stuff is:

current->sched_info.pcnt
current->sched_info.cpu_time
current->sched_info.run_delay
current->sched_info.last_arrival
current->sched_info.last_queued

And maybe also the runqueue info dumped out by show_schedstat(), this
last collected via new per-CPU tasks.
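
As a stopgap short of patching rcutorture.c itself, something like the
following rough sketch might capture most of that state from userspace
(assuming CONFIG_SCHEDSTATS is enabled, so that /proc/<pid>/schedstat
exports the cpu_time, run_delay, and pcnt fields and /proc/schedstat
carries the show_schedstat() runqueue output; last_arrival and
last_queued would still need the in-kernel patch):

    # Poll the rcutorture readers' schedstats every ten seconds.
    while true
    do
            for p in $(pgrep rcu_torture_rea)
            do
                    echo "$p: $(cat /proc/$p/schedstat)"
            done
            cat /proc/schedstat    # per-CPU/runqueue stats
            sleep 10
    done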

Other thoughts?

Thanx, Paul

> Good point! I will try the following:
>
> 1. Stock 2.6.21.5. 64-bit kernel on AMD Opterons.

All eight readers end up on the same CPU, CPU 1 in this case. And they
stay there (ten minutes).

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
3058 root 39 19 0 0 0 R 12.7 0.0 0:06.91 rcu_torture_rea
3059 root 39 19 0 0 0 R 12.7 0.0 0:06.91 rcu_torture_rea
3060 root 39 19 0 0 0 R 12.7 0.0 0:06.91 rcu_torture_rea
3061 root 39 19 0 0 0 R 12.7 0.0 0:06.91 rcu_torture_rea
3062 root 39 19 0 0 0 R 12.7 0.0 0:06.91 rcu_torture_rea
3063 root 39 19 0 0 0 R 12.7 0.0 0:06.91 rcu_torture_rea
3057 root 39 19 0 0 0 R 12.3 0.0 0:06.91 rcu_torture_rea
3064 root 39 19 0 0 0 R 12.3 0.0 0:06.91 rcu_torture_rea

> 1. Stock 2.6.21.5. 32-bit kernel on NUMA-Q.

Works just fine(!).

> 2. 2.6.21-rt14. 64-bit kernel on AMD Opterons.

All eight readers are spread, but over only two CPUs (0 and 3, in this
case). This persists, usually with a 4/4 split, but sometimes with five
tasks on one CPU and three on the other.

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
3111 root 39 19 0 0 0 R 23.9 0.0 0:27.27 rcu_torture_rea
3114 root 39 19 0 0 0 R 23.9 0.0 0:28.58 rcu_torture_rea
3117 root 39 19 0 0 0 R 23.9 0.0 0:32.40 rcu_torture_rea
3112 root 39 19 0 0 0 R 23.6 0.0 0:28.41 rcu_torture_rea
3110 root 39 19 0 0 0 R 22.9 0.0 0:43.46 rcu_torture_rea
3113 root 39 19 0 0 0 R 22.9 0.0 0:27.28 rcu_torture_rea
3115 root 39 19 0 0 0 R 22.9 0.0 0:33.08 rcu_torture_rea
3116 root 39 19 0 0 0 R 22.6 0.0 0:28.10 rcu_torture_rea

elm3b6:~# for ((i=3110;i<=3117;i++)); do cat /proc/$i/stat | awk '{print $(NF-3)}'; done
3 3 0 3 0 0 0 3
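
(For reference, the fourth-from-last field of /proc/<pid>/stat on these
kernels is the "processor" column -- the CPU the task last ran on --
which is what the $(NF-3) above picks out; so the line above says four
readers on CPU 3 and four on CPU 0. Something like "ps -o pid,psr,comm
-p 3110" should show the same thing.)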

> 2. 2.6.21-rt14. 32-bit kernel on NUMA-Q.

Works just fine.

> 3. 2.6.21.5 + sched-cfs-v2.6.21.5-v17.patch on 64-bit kernel on
> AMD Opteron.

All eight readers end up on the same CPU, CPU 2 in this case. And they
stay there (ten minutes).

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
3081 root 39 19 0 0 0 R 11.3 0.0 1:31.77 rcu_torture_rea
3082 root 39 19 0 0 0 R 11.3 0.0 1:31.77 rcu_torture_rea
3085 root 39 19 0 0 0 R 11.3 0.0 1:31.78 rcu_torture_rea
3079 root 39 19 0 0 0 R 11.0 0.0 1:31.72 rcu_torture_rea
3080 root 39 19 0 0 0 R 11.0 0.0 1:31.76 rcu_torture_rea
3083 root 39 19 0 0 0 R 11.0 0.0 1:31.76 rcu_torture_rea
3084 root 39 19 0 0 0 R 11.0 0.0 1:31.77 rcu_torture_rea
3086 root 39 19 0 0 0 R 11.0 0.0 1:31.75 rcu_torture_rea

Using "taskset" to pin each process to a pair of CPUs (masks 0x3, 0x6,
0xc, and 0x9) forces them to CPUs 0 and 2 -- previously this had spread
them nicely. So I kept pinning tasks to single CPUs (which defeats
some rcutorture testing) until they did spread, getting the following:

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
3079 root 39 19 0 0 0 R 49.9 0.0 3:18.91 rcu_torture_rea
3080 root 39 19 0 0 0 R 49.6 0.0 3:15.76 rcu_torture_rea
3086 root 39 19 0 0 0 R 49.6 0.0 3:48.82 rcu_torture_rea
3083 root 39 19 0 0 0 R 49.3 0.0 2:58.02 rcu_torture_rea
3084 root 39 19 0 0 0 R 48.6 0.0 3:00.54 rcu_torture_rea
3081 root 39 19 0 0 0 R 47.9 0.0 3:00.55 rcu_torture_rea
3082 root 39 19 0 0 0 R 44.6 0.0 3:18.89 rcu_torture_rea
3085 root 39 19 0 0 0 R 44.3 0.0 3:07.11 rcu_torture_rea

elm3b6:~# for ((i=3079;i<=3086;i++)); do cat /proc/$i/stat | awk '{print $(NF-3)}'; done
0 0 2 1 3 2 1 3
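
For reference, the pinning described above was done with commands along
these lines (a sketch only -- the actual PID-to-mask assignments were
not recorded):

    taskset -p 0x3 3079    # CPUs 0 and 1
    taskset -p 0x6 3080    # CPUs 1 and 2
    taskset -p 0xc 3081    # CPUs 2 and 3
    taskset -p 0x9 3082    # CPUs 3 and 0
    # ...and, when the pairs collapsed onto CPUs 0 and 2, single CPUs:
    taskset -p 0x1 3083    # CPU 0 only
    taskset -p 0x2 3084    # CPU 1 only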

> 3. 2.6.21.5 + sched-cfs-v2.6.21.5-v17.patch on 32-bit kernel on
> NUMA-Q.

Some imbalance:

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2263 root 39 19 0 0 0 R 92.4 0.0 2:19.69 rcu_torture_rea
2265 root 39 19 0 0 0 R 49.8 0.0 1:41.84 rcu_torture_rea
2264 root 39 19 0 0 0 R 49.5 0.0 2:11.69 rcu_torture_rea
2261 root 39 19 0 0 0 R 48.8 0.0 2:09.95 rcu_torture_rea
2262 root 39 19 0 0 0 R 48.8 0.0 3:01.42 rcu_torture_rea
2266 root 39 19 0 0 0 R 30.1 0.0 1:47.02 rcu_torture_rea
2260 root 39 19 0 0 0 R 29.8 0.0 2:10.07 rcu_torture_rea
2267 root 39 19 0 0 0 R 29.8 0.0 1:57.34 rcu_torture_rea

elm3b132:~# for ((i=2260;i<=2267;i++)); do cat /proc/$i/stat | awk '{print $(NF-3)}'; done
0 1 3 1 2 2 2 0

Has persisted (with some shuffling of CPUs, see below) for about five
minutes, will let it run for an hour or so to see if it is really serious
about this.

3 0 0 2 1 0 2 3

------------------------------------------------------------------------

Date: Fri, 15 Jun 2007 15:00:17 -0700
From: "Paul E. McKenney" <paulmck@xxxxxxxxxxxxxxxxxx>
To: Ingo Molnar <mingo@xxxxxxx>
Cc: Thomas Gleixner <tglx@xxxxxxxxxxxxx>,
Dinakar Guniguntala <dino@xxxxxxxxxx>,
Srivatsa Vaddagiri <vatsa@xxxxxxxxxxxxxxxxxx>,
Dmitry Adamushko <dmitry.adamushko@xxxxxxxxx>
Subject: Re: v2.6.21.4-rt11

On Fri, Jun 15, 2007 at 10:35:39PM +0200, Ingo Molnar wrote:
>
> (forwarding Paul's mail below to other CFS hackers too.)
>
> ------------>

[ . . . ]

> > 3. 2.6.21.5 + sched-cfs-v2.6.21.5-v17.patch on 32-bit kernel on
> > NUMA-Q.
>
> Some imbalance:
>
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 2263 root 39 19 0 0 0 R 92.4 0.0 2:19.69 rcu_torture_rea
> 2265 root 39 19 0 0 0 R 49.8 0.0 1:41.84 rcu_torture_rea
> 2264 root 39 19 0 0 0 R 49.5 0.0 2:11.69 rcu_torture_rea
> 2261 root 39 19 0 0 0 R 48.8 0.0 2:09.95 rcu_torture_rea
> 2262 root 39 19 0 0 0 R 48.8 0.0 3:01.42 rcu_torture_rea
> 2266 root 39 19 0 0 0 R 30.1 0.0 1:47.02 rcu_torture_rea
> 2260 root 39 19 0 0 0 R 29.8 0.0 2:10.07 rcu_torture_rea
> 2267 root 39 19 0 0 0 R 29.8 0.0 1:57.34 rcu_torture_rea
>
> elm3b132:~# for ((i=2260;i<=2267;i++)); do cat /proc/$i/stat | awk '{print $(NF-3)}'; done
> 0 1 3 1 2 2 2 0
>
> Has persisted (with some shuffling of CPUs, see below) for about five
> minutes, will let it run for an hour or so to see if it is really serious
> about this.
>
> 3 0 0 2 1 0 2 3

And when I returned after an hour, it had straightened itself out:
1 3 1 2 2 0 3 0

The 64-bit AMD 4-CPU machines have not straightened themselves out
in the past, but I will try an extended run over the weekend to see
whether load balancing is just a bit on the slow side. ;-)

But I got distracted for an additional hour, and it is imbalanced again:

2 1 2 0 1 1 3 3

Strange...

Thanx, Paul