Re: [performance regression, bisected] scheduler:should_we_balance() kills filesystem performance

From: Joonsoo Kim
Date: Tue Sep 10 2013 - 00:47:46 EST


On Tue, Sep 10, 2013 at 02:02:54PM +1000, Dave Chinner wrote:
> Hi folks,
>
> I just updated my performance test VM to the current 3.12-git
> tree after the XFS dev branch was merged. The first test I ran
> which was a 16-way concurrent fsmark test to create lots of files
> gave me a number about 30% lower than I expected - ~180k files/s
> when I was expecting somewhere around 250k files/s.
>
> I did a bisect, and the bisect landed on this commit:
>
> commit 23f0d2093c789e612185180c468fa09063834e87
> Author: Joonsoo Kim <iamjoonsoo.kim@xxxxxxx>
> Date: Tue Aug 6 17:36:42 2013 +0900
>
> sched: Factor out code to should_we_balance()
>
> Now checking whether this cpu is appropriate to balance or not
> is embedded into update_sg_lb_stats() and this checking has no direct
> relationship to this function. There is not enough reason to place
> this checking at update_sg_lb_stats(), except saving one iteration
> for sched_group_cpus.
> ....
>
> Now, i couldn't revert that patch by itself, but I reverted the
> series of about 10 scheduler patches in that series total from a
> current TOT and the regression went away. Hence I'm pretty confident
> that the this is the patch causing the issue as i've verified it in
> more than one way and the difference between "good" and "bad" was
> signficantlt greater than the variance of the test (1.5-2 stddev
> difference).
>
> In more detail:
>
> v4 filesystem v5 filesystem
> 3.11+xfsdev: 220k files/s 225k files/s
> 3.12-git 180k files/s 185k files/s
> 3.12-git-revert 245k files/s 247k files/s
>
> The test vm is a 16p/16GB RAM VM, with a sparse 100TB filesystem
> image sitting on a 4-way RAID0 SSD array formatted with XFS and the
> image file is accessed by virtio+direct IO. The fsmark command line
> is:
>
> time ./fs_mark -D 10000 -S0 -n 100000 -s 0 -L 32 \
> -d /mnt/scratch/0 -d /mnt/scratch/1 \
> -d /mnt/scratch/2 -d /mnt/scratch/3 \
> -d /mnt/scratch/4 -d /mnt/scratch/5 \
> -d /mnt/scratch/6 -d /mnt/scratch/7 \
> -d /mnt/scratch/8 -d /mnt/scratch/9 \
> -d /mnt/scratch/10 -d /mnt/scratch/11 \
> -d /mnt/scratch/12 -d /mnt/scratch/13 \
> -d /mnt/scratch/14 -d /mnt/scratch/15 \
> | tee >(stats --trim-outliers | tail -1 1>&2)
>
> The workload on XFS runs to almost being CPU bound - the effect of
> the above patch was that there was a lot of idle time left in the
> system. The workload consumed the same amount of user and system
> CPU, just instantaneous CPU usage was reduced by 20-30% and the
> elaspsed time was increased by 20-30%.

Hello, Dave.

Now, I look again this patch and find one mistake.
If we find that we are appropriate cpu for balancing, should_we_balance()
should return 1. But current code doesn't do so. This correspond with
your observation that a lot of idle time left.

Could you re-test your benchmark with below?

Thanks.

------------------->8-------------------------
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7f0a5e6..9b3fe1c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5151,7 +5151,7 @@ static int should_we_balance(struct lb_env *env)
* First idle cpu or the first cpu(busiest) in this sched group
* is eligible for doing load balancing at this and above domains.
*/
- return balance_cpu != env->dst_cpu;
+ return balance_cpu == env->dst_cpu;
}

/*
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/