CFS Bandwidth Control - Test results of cgroups tasks pinned vs unpinned

From: Kamalesh Babulal
Date: Tue Jun 07 2011 - 11:46:05 EST


Hi All,

While testing the CFS Bandwidth V6 patch set on top of 55922c9d1b84 in our
test environment, we observed that the CPUs are idle 30% to 40% of the time
when running a CPU bound test with the cgroup tasks not pinned to CPUs.
In the inverse case, where the cgroup tasks are pinned to CPUs, the idle
time seen is nearly zero.

Test Scenario
--------------
- 5 cgroups are created, and the groups are assigned 2, 2, 4, 8 and 16 tasks respectively.
- Each cgroup has N sub-cgroups created under it, where N is the NR_TASKS the cgroup
is assigned. e.g. cgroup1 has two sub-cgroups created under it, with one task
assigned to each sub-cgroup.
                  ------------
                  | cgroup 1 |
                  ------------
                    /      \
                   /        \
        --------------    --------------
        |sub-cgroup 1|    |sub-cgroup 2|
        |  (task 1)  |    |  (task 2)  |
        --------------    --------------

- Each top cgroup is given unlimited quota (cpu.cfs_quota_us = -1) and a period of 500ms
(cpu.cfs_period_us = 500000), whereas each sub-cgroup is given 250ms of quota
(cpu.cfs_quota_us = 250000) and a period of 500ms. i.e. the top cgroups have
unlimited bandwidth, whereas each sub-cgroup is throttled once it has used 250ms of
CPU time within a 500ms period.

- Additionally, if required, proportional CPU shares can be assigned through cpu.shares
as NR_TASKS * 1024. e.g. cgroup1 has 2 tasks, so its cpu.shares is 2 * 1024 = 2048.
(In the test results published below, all cgroups and sub-cgroups are given an
equal share of 1024.)

- One CPU bound while(1) task is attached to each sub-cgroup.

- sum-exec time for each cgroup/sub-cgroup is captured from /proc/sched_debug after
60 seconds and used to work out the run time of each task, i.e. of each sub-cgroup.
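
As an illustration of the last step, here is a minimal sketch of pulling the
sum-exec time of one sub-cgroup out of a saved /proc/sched_debug snapshot. It
assumes, as the attached script does, that sum-exec is the 7th field of a task
line once the leading "R" marker of a running task is dropped, and additionally
that the cgroup path is the last field of the line. sum_exec is a hypothetical
helper name, not part of the benchmark.

```bash
# sum_exec FILE CGROUP_PATH
# Sum the sum-exec column of the while1 tasks whose cgroup path (assumed to be
# the last field) equals CGROUP_PATH exactly, so /5/1 does not also pick up
# /5/10 to /5/16.
sum_exec() {
    grep "while1" "$1" |
        sed 's/^R/ /' |      # blank out the running-task marker
        awk -v path="$2" '$NF == path { sum += $7 }
                          END { printf "%.2f\n", sum }'
}
```

Usage would look like "sum_exec sched_log /5/1" on a snapshot taken with
"cat /proc/sched_debug > sched_log".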

How is the idle CPU time measured ?
------------------------------------
- vmstat output is logged every 2 seconds, from the moment the last while1 task is
attached to the 16th sub-cgroup of cgroup 5 until the 60 sec run is over. After the
run, the idle% is calculated by summing the idle column of the vmstat log and
dividing it by the number of samples collected, after discarding the first record
of the log.
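
The calculation can be sketched as below. This is only an illustration: it
assumes the usual vmstat layout in which "id" is the 15th column and the two
header lines contain "system" and "swpd", and avg_idle is a hypothetical
helper name.

```bash
# avg_idle LOGFILE
# Mean of the vmstat idle column, ignoring the header lines and the first
# data record.
avg_idle() {
    grep -iv "system" "$1" | grep -iv "swpd" |
        awk 'NR > 1 { sum += $15; n++ }
             END    { if (n > 0) printf "%.1f\n", sum / n }'
}
```

Note that the divisor here is the number of records actually summed, so the
discarded first record does not pull the average down.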

How are the tasks pinned to the CPU ?
-------------------------------------
- The cgroup hierarchy is mounted with the cpuset and cpu controllers, and one
physical CPU is allocated for every 2 sub-cgroups. i.e. CPU 0 is shared by 1/1 and
1/2 (Group 1, sub-cgroup 1 and sub-cgroup 2). Similarly, CPUs 8 to 15 are allocated
to 5/1 to 5/16 (Group 5, sub-cgroups 1 to 16). Note that the test machine has
16 CPUs.
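
The resulting layout can be listed with a small sketch that mirrors the counting
in the script's pin_tasks loop: two consecutive sub-cgroups per CPU, with CPUs
numbered from 0. pin_map is a hypothetical helper that only prints the mapping
and does not touch any cpuset.

```bash
# pin_map
# Print "group/sub-cgroup cpu" for the 2, 2, 4, 8, 16 task layout.
pin_map() {
    cpu=0
    count=0
    group=1
    for ntasks in 2 2 4 8 16; do
        sub=1
        while [ "$sub" -le "$ntasks" ]; do
            # Move to the next CPU once two sub-cgroups share the current one.
            if [ "$count" -eq 2 ]; then
                cpu=$((cpu + 1))
                count=0
            fi
            echo "$group/$sub $cpu"
            count=$((count + 1))
            sub=$((sub + 1))
        done
        group=$((group + 1))
    done
}

pin_map
```

With 32 sub-cgroups in total this uses exactly the 16 CPUs of the test machine.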

Result for non-pinning case
---------------------------
Only the hierarchy is created as stated above and cpusets are not assigned per cgroup.

Average CPU Idle percentage 34.8% (as explained above in the Idle time measured)
Bandwidth shared with remaining non-Idle 65.2%

* Note: To keep the rounded values readable, the numbers are multiplied by 100.

In the result below, the 9.2500 for cgroup1 is the share (in %) of the total sum-exec
time, captured from /proc/sched_debug, that went to cgroup 1 tasks (including
sub-cgroups 1 and 2). That in turn is 6.03% of the overall CPU time, i.e. cgroup 1's
share of the 65.2% non-Idle time (derived as 9.2500 * 65.2 / 100).
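
The same derivation applies one level down: a sub-cgroup's figure is its share of
its group's sum-exec time, scaled by the group's own figure. A small sketch of the
arithmetic, using the Group 1 numbers from the table below (share_of_total is a
hypothetical helper name):

```bash
# share_of_total PCT BASE
# Scale a percentage share PCT by the percentage BASE it is a share of.
share_of_total() {
    awk -v pct="$1" -v base="$2" 'BEGIN { printf "%.2f\n", pct * base / 100 }'
}

share_of_total 9.25 65.2     # Group 1: prints 6.03
share_of_total 48.78 6.03    # subgroup 1/1 within Group 1: prints 2.94
```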

Bandwidth of Group 1 = 9.2500 i.e = 6.0300% of non-Idle CPU time 65.2%
|...... subgroup 1/1 = 48.7800 i.e = 2.9400% of 6.0300% Groups non-Idle CPU time
|...... subgroup 1/2 = 51.2100 i.e = 3.0800% of 6.0300% Groups non-Idle CPU time


Bandwidth of Group 2 = 9.0400 i.e = 5.8900% of non-Idle CPU time 65.2%
|...... subgroup 2/1 = 51.0200 i.e = 3.0000% of 5.8900% Groups non-Idle CPU time
|...... subgroup 2/2 = 48.9700 i.e = 2.8800% of 5.8900% Groups non-Idle CPU time


Bandwidth of Group 3 = 16.9300 i.e = 11.0300% of non-Idle CPU time 65.2%
|...... subgroup 3/1 = 26.0300 i.e = 2.8700% of 11.0300% Groups non-Idle CPU time
|...... subgroup 3/2 = 25.8800 i.e = 2.8500% of 11.0300% Groups non-Idle CPU time
|...... subgroup 3/3 = 22.7800 i.e = 2.5100% of 11.0300% Groups non-Idle CPU time
|...... subgroup 3/4 = 25.2900 i.e = 2.7800% of 11.0300% Groups non-Idle CPU time


Bandwidth of Group 4 = 27.9300 i.e = 18.2100% of non-Idle CPU time 65.2%
|...... subgroup 4/1 = 16.6000 i.e = 3.0200% of 18.2100% Groups non-Idle CPU time
|...... subgroup 4/2 = 8.0000 i.e = 1.4500% of 18.2100% Groups non-Idle CPU time
|...... subgroup 4/3 = 9.0000 i.e = 1.6300% of 18.2100% Groups non-Idle CPU time
|...... subgroup 4/4 = 7.9600 i.e = 1.4400% of 18.2100% Groups non-Idle CPU time
|...... subgroup 4/5 = 12.3500 i.e = 2.2400% of 18.2100% Groups non-Idle CPU time
|...... subgroup 4/6 = 16.2500 i.e = 2.9500% of 18.2100% Groups non-Idle CPU time
|...... subgroup 4/7 = 12.6100 i.e = 2.2900% of 18.2100% Groups non-Idle CPU time
|...... subgroup 4/8 = 17.1900 i.e = 3.1300% of 18.2100% Groups non-Idle CPU time


Bandwidth of Group 5 = 36.8300 i.e = 24.0100% of non-Idle CPU time 65.2%
|...... subgroup 5/1 = 56.6900 i.e = 13.6100% of 24.0100% Groups non-Idle CPU time
|...... subgroup 5/2 = 8.8600 i.e = 2.1200% of 24.0100% Groups non-Idle CPU time
|...... subgroup 5/3 = 5.5100 i.e = 1.3200% of 24.0100% Groups non-Idle CPU time
|...... subgroup 5/4 = 4.5700 i.e = 1.0900% of 24.0100% Groups non-Idle CPU time
|...... subgroup 5/5 = 7.9500 i.e = 1.9000% of 24.0100% Groups non-Idle CPU time
|...... subgroup 5/6 = 2.1600 i.e = .5100% of 24.0100% Groups non-Idle CPU time
|...... subgroup 5/7 = 2.3400 i.e = .5600% of 24.0100% Groups non-Idle CPU time
|...... subgroup 5/8 = 2.1500 i.e = .5100% of 24.0100% Groups non-Idle CPU time
|...... subgroup 5/9 = 9.7200 i.e = 2.3300% of 24.0100% Groups non-Idle CPU time
|...... subgroup 5/10 = 5.0600 i.e = 1.2100% of 24.0100% Groups non-Idle CPU time
|...... subgroup 5/11 = 4.6900 i.e = 1.1200% of 24.0100% Groups non-Idle CPU time
|...... subgroup 5/12 = 8.9700 i.e = 2.1500% of 24.0100% Groups non-Idle CPU time
|...... subgroup 5/13 = 8.4600 i.e = 2.0300% of 24.0100% Groups non-Idle CPU time
|...... subgroup 5/14 = 11.8400 i.e = 2.8400% of 24.0100% Groups non-Idle CPU time
|...... subgroup 5/15 = 6.3400 i.e = 1.5200% of 24.0100% Groups non-Idle CPU time
|...... subgroup 5/16 = 5.1500 i.e = 1.2300% of 24.0100% Groups non-Idle CPU time

Pinned case
--------------
CPU hierarchy is created and cpusets are allocated.

Average CPU Idle percentage 0%
Bandwidth shared with remaining non-Idle 100%

Bandwidth of Group 1 = 6.3400 i.e = 6.3400% of non-Idle CPU time 100%
|...... subgroup 1/1 = 50.0400 i.e = 3.1700% of 6.3400% Groups non-Idle CPU time
|...... subgroup 1/2 = 49.9500 i.e = 3.1600% of 6.3400% Groups non-Idle CPU time


Bandwidth of Group 2 = 6.3200 i.e = 6.3200% of non-Idle CPU time 100%
|...... subgroup 2/1 = 50.0400 i.e = 3.1600% of 6.3200% Groups non-Idle CPU time
|...... subgroup 2/2 = 49.9500 i.e = 3.1500% of 6.3200% Groups non-Idle CPU time


Bandwidth of Group 3 = 12.6300 i.e = 12.6300% of non-Idle CPU time 100%
|...... subgroup 3/1 = 25.0300 i.e = 3.1600% of 12.6300% Groups non-Idle CPU time
|...... subgroup 3/2 = 25.0100 i.e = 3.1500% of 12.6300% Groups non-Idle CPU time
|...... subgroup 3/3 = 25.0000 i.e = 3.1500% of 12.6300% Groups non-Idle CPU time
|...... subgroup 3/4 = 24.9400 i.e = 3.1400% of 12.6300% Groups non-Idle CPU time


Bandwidth of Group 4 = 25.1000 i.e = 25.1000% of non-Idle CPU time 100%
|...... subgroup 4/1 = 12.5400 i.e = 3.1400% of 25.1000% Groups non-Idle CPU time
|...... subgroup 4/2 = 12.5100 i.e = 3.1400% of 25.1000% Groups non-Idle CPU time
|...... subgroup 4/3 = 12.5300 i.e = 3.1400% of 25.1000% Groups non-Idle CPU time
|...... subgroup 4/4 = 12.5000 i.e = 3.1300% of 25.1000% Groups non-Idle CPU time
|...... subgroup 4/5 = 12.4900 i.e = 3.1300% of 25.1000% Groups non-Idle CPU time
|...... subgroup 4/6 = 12.4700 i.e = 3.1200% of 25.1000% Groups non-Idle CPU time
|...... subgroup 4/7 = 12.4700 i.e = 3.1200% of 25.1000% Groups non-Idle CPU time
|...... subgroup 4/8 = 12.4500 i.e = 3.1200% of 25.1000% Groups non-Idle CPU time


Bandwidth of Group 5 = 49.5700 i.e = 49.5700% of non-Idle CPU time 100%
|...... subgroup 5/1 = 49.8500 i.e = 24.7100% of 49.5700% Groups non-Idle CPU time
|...... subgroup 5/2 = 6.2900 i.e = 3.1100% of 49.5700% Groups non-Idle CPU time
|...... subgroup 5/3 = 6.2800 i.e = 3.1100% of 49.5700% Groups non-Idle CPU time
|...... subgroup 5/4 = 6.2700 i.e = 3.1000% of 49.5700% Groups non-Idle CPU time
|...... subgroup 5/5 = 6.2700 i.e = 3.1000% of 49.5700% Groups non-Idle CPU time
|...... subgroup 5/6 = 6.2600 i.e = 3.1000% of 49.5700% Groups non-Idle CPU time
|...... subgroup 5/7 = 6.2500 i.e = 3.0900% of 49.5700% Groups non-Idle CPU time
|...... subgroup 5/8 = 6.2400 i.e = 3.0900% of 49.5700% Groups non-Idle CPU time
|...... subgroup 5/9 = 6.2400 i.e = 3.0900% of 49.5700% Groups non-Idle CPU time
|...... subgroup 5/10 = 6.2300 i.e = 3.0800% of 49.5700% Groups non-Idle CPU time
|...... subgroup 5/11 = 6.2300 i.e = 3.0800% of 49.5700% Groups non-Idle CPU time
|...... subgroup 5/12 = 6.2200 i.e = 3.0800% of 49.5700% Groups non-Idle CPU time
|...... subgroup 5/13 = 6.2100 i.e = 3.0700% of 49.5700% Groups non-Idle CPU time
|...... subgroup 5/14 = 6.2100 i.e = 3.0700% of 49.5700% Groups non-Idle CPU time
|...... subgroup 5/15 = 6.2100 i.e = 3.0700% of 49.5700% Groups non-Idle CPU time
|...... subgroup 5/16 = 6.2100 i.e = 3.0700% of 49.5700% Groups non-Idle CPU time

With equal cpu shares allocated to all the groups/sub-cgroups and CFS bandwidth configured
to allow 100% CPU utilization, we still see CPU idle time in the un-pinned case.
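
To spell out why the configuration allows 100% utilization: each of the 32
sub-cgroups may use 250ms per 500ms period, i.e. half a CPU, and 32 * 0.5 is 16
CPUs' worth of runtime on a 16 CPU machine. A minimal sketch of that arithmetic
(max_utilization is a hypothetical helper name):

```bash
# max_utilization QUOTA_US PERIOD_US NGROUPS NCPUS
# Upper bound (in %) on machine utilization when NGROUPS single-task groups
# each get QUOTA_US of CPU time per PERIOD_US on NCPUS CPUs.
max_utilization() {
    awk -v q="$1" -v p="$2" -v n="$3" -v c="$4" 'BEGIN {
        cpus = n * q / p            # CPUs worth of runtime the quotas allow
        if (cpus > c) cpus = c      # cannot exceed the machine
        printf "%.0f\n", 100 * cpus / c
    }'
}

max_utilization 250000 500000 32 16    # prints 100
```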

The benchmark used to reproduce the issue is attached. Just executing the script should
report similar numbers.

#!/bin/bash

NR_TASKS1=2
NR_TASKS2=2
NR_TASKS3=4
NR_TASKS4=8
NR_TASKS5=16

BANDWIDTH=1
SUBGROUP=1
PRO_SHARES=0
MOUNT=/cgroup/
LOAD=/root/while1

usage()
{
    echo "Usage $0: [-b 0|1] [-s 0|1] [-p 0|1]"
    echo "-b 1|0 set/unset Cgroups bandwidth control (default set)"
    echo "-s Create sub-groups for every task (default creates sub-group)"
    echo "-p create proportional shares based on cpus"
    exit
}

# getopts advances OPTIND itself, so no explicit shift is needed.
while getopts ":b:s:p:" arg
do
    case $arg in
    b)
        BANDWIDTH=$OPTARG
        # Only 0 or 1 is accepted.
        if [ "$BANDWIDTH" -gt 1 ] || [ "$BANDWIDTH" -lt 0 ]
        then
            usage
        fi
        ;;
    s)
        SUBGROUP=$OPTARG
        if [ "$SUBGROUP" -gt 1 ] || [ "$SUBGROUP" -lt 0 ]
        then
            usage
        fi
        ;;
    p)
        PRO_SHARES=$OPTARG
        if [ "$PRO_SHARES" -gt 1 ] || [ "$PRO_SHARES" -lt 0 ]
        then
            usage
        fi
        ;;
    *)
        # Unknown option or missing argument.
        usage
        ;;
    esac
done

if [ ! -d $MOUNT ]
then
    mkdir -p $MOUNT
fi

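# Print an Ok/Failed marker for the exit status passed in $1 and abort the
# script on failure. Named check_status so that it does not shadow the
# shell's built-in test command.
check_status()
{
    echo -n "[ "
    if [ $1 -eq 0 ]
    then
        echo -ne '\E[42;40mOk'
    else
        echo -ne '\E[31;40mFailed'
        tput sgr0
        echo " ]"
        exit 1
    fi
    tput sgr0
    echo " ]"
}

mount_cgrp()
{
    echo -n "Mounting root cgroup "
    mount -t cgroup -ocpu,cpuset,cpuacct none $MOUNT &> /dev/null
    check_status $?
}

umount_cgrp()
{
    echo -n "Unmounting root cgroup "
    cd /root/
    umount $MOUNT
    check_status $?
}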

create_hierarchy()
{
    mount_cgrp
    # Seed every new cpuset with the root's mems and cpus.
    cpuset_mem=$(cat $MOUNT/cpuset.mems)
    cpuset_cpu=$(cat $MOUNT/cpuset.cpus)
    echo -n "creating groups/sub-groups ..."
    for (( i=1; i<=5; i++ ))
    do
        mkdir $MOUNT/$i
        echo $cpuset_mem > $MOUNT/$i/cpuset.mems
        echo $cpuset_cpu > $MOUNT/$i/cpuset.cpus
        echo -n ".."
        if [ $SUBGROUP -eq 1 ]
        then
            # NR_TASKS<i> sub-groups, one per task.
            jj=$(eval echo "\$NR_TASKS$i")
            for (( j=1; j<=$jj; j++ ))
            do
                mkdir -p $MOUNT/$i/$j
                echo $cpuset_mem > $MOUNT/$i/$j/cpuset.mems
                echo $cpuset_cpu > $MOUNT/$i/$j/cpuset.cpus
                echo -n ".."
            done
        fi
    done
    echo "."
}

cleanup()
{
    pkill -9 while1 &> /dev/null
    sleep 10
    echo -n "Umount groups/sub-groups .."
    for (( i=1; i<=5; i++ ))
    do
        if [ $SUBGROUP -eq 1 ]
        then
            jj=$(eval echo "\$NR_TASKS$i")
            for (( j=1; j<=$jj; j++ ))
            do
                rmdir $MOUNT/$i/$j
                echo -n ".."
            done
        fi
        rmdir $MOUNT/$i
        echo -n ".."
    done
    echo " "
    umount_cgrp
}

load_tasks()
{
    for (( i=1; i<=5; i++ ))
    do
        jj=$(eval echo "\$NR_TASKS$i")
        shares="1024"
        if [ $PRO_SHARES -eq 1 ]
        then
            # Proportional shares: NR_TASKS * 1024 for the group.
            shares=$((jj * 1024))
        fi
        echo $shares > $MOUNT/$i/cpu.shares
        for (( j=1; j<=$jj; j++ ))
        do
            # Top group: unlimited quota, 500ms period.
            echo "-1" > $MOUNT/$i/cpu.cfs_quota_us
            echo "500000" > $MOUNT/$i/cpu.cfs_period_us
            if [ $SUBGROUP -eq 1 ]
            then
                $LOAD &
                echo $! > $MOUNT/$i/$j/tasks
                echo "1024" > $MOUNT/$i/$j/cpu.shares

                if [ $BANDWIDTH -eq 1 ]
                then
                    # Sub-group: 250ms of quota per 500ms period.
                    echo "500000" > $MOUNT/$i/$j/cpu.cfs_period_us
                    echo "250000" > $MOUNT/$i/$j/cpu.cfs_quota_us
                fi
            else
                $LOAD &
                echo $! > $MOUNT/$i/tasks
                echo $shares > $MOUNT/$i/cpu.shares

                if [ $BANDWIDTH -eq 1 ]
                then
                    echo "500000" > $MOUNT/$i/cpu.cfs_period_us
                    echo "250000" > $MOUNT/$i/cpu.cfs_quota_us
                fi
            fi
        done
    done
    echo "Capturing idle cpu time with vmstat...."
    vmstat 2 100 &> vmstat_log &
}

pin_tasks()
{
    cpu=0
    count=1
    for (( i=1; i<=5; i++ ))
    do
        if [ $SUBGROUP -eq 1 ]
        then
            jj=$(eval echo "\$NR_TASKS$i")
            for (( j=1; j<=$jj; j++ ))
            do
                # Two sub-groups share a CPU, then move on to the next CPU.
                if [ $count -gt 2 ]
                then
                    cpu=$((cpu+1))
                    count=1
                fi
                echo $cpu > $MOUNT/$i/$j/cpuset.cpus
                count=$((count+1))
            done
        else
            case $i in
            1)
                echo 0 > $MOUNT/$i/cpuset.cpus;;
            2)
                echo 1 > $MOUNT/$i/cpuset.cpus;;
            3)
                echo "2-3" > $MOUNT/$i/cpuset.cpus;;
            4)
                echo "4-6" > $MOUNT/$i/cpuset.cpus;;
            5)
                echo "7-15" > $MOUNT/$i/cpuset.cpus;;
            esac
        fi
    done
}

print_results()
{
    # $1 is the non-idle CPU percentage. Field 7 of a sched_debug task line
    # is sum-exec once the "R" marker of running tasks is stripped.
    gtot=$(cat sched_log|grep -i while|sed 's/R//g'|awk '{gtot+=$7};END{printf "%f", gtot}')
    for (( i=1; i<=5; i++ ))
    do
        temp=$(cat sched_log_$i|sed 's/R//g'| awk '{gtot+=$7};END{printf "%f",gtot}')
        tavg=$(echo "scale=4;(($temp / $gtot) * $1)/100 " | bc)
        avg=$(echo "scale=4;($temp / $gtot) * 100" | bc)
        pretty_tavg=$( echo "scale=4; $tavg * 100"| bc) # For pretty format
        echo "Bandwidth of Group $i = $avg i.e = $pretty_tavg% of non-Idle CPU time $1%"
        if [ $SUBGROUP -eq 1 ]
        then
            jj=$(eval echo "\$NR_TASKS$i")
            for (( j=1; j<=$jj; j++ ))
            do
                tmp=$(cat sched_log_$i-$j|sed 's/R//g'| awk '{gtot+=$7};END{printf "%f",gtot}')
                stavg=$(echo "scale=4;($tmp / $temp) * 100" | bc)
                pretty_stavg=$(echo "scale=4;(($tmp / $temp) * $tavg) * 100" | bc)
                echo -n "|"
                echo -e "...... subgroup $i/$j\t= $stavg\ti.e = $pretty_stavg% of $pretty_tavg% Groups non-Idle CPU time"
            done
        fi
        echo " "
        echo " "
    done
}

capture_results()
{
    cat /proc/sched_debug > sched_log
    pkill -9 vmstat -c
    # Average the idle column, skipping the first record and dividing by the
    # number of records actually summed.
    avg=$(cat vmstat_log |grep -iv "system"|grep -iv "swpd"|awk '{ if (NR != 1) { id+=$15; n++ } } END { if (n > 0) print (id/n) }')

    rem=$(echo "scale=2; 100 - $avg" |bc)
    echo "Average CPU Idle percentage $avg%"
    echo "Bandwidth shared with remaining non-Idle $rem%"
    for (( i=1; i<=5; i++ ))
    do
        cat sched_log |grep -i while1|grep -i " /$i" > sched_log_$i
        if [ $SUBGROUP -eq 1 ]
        then
            jj=$(eval echo "\$NR_TASKS$i")
            for (( j=1; j<=$jj; j++ ))
            do
                # Match the whole path component; a bare prefix match on
                # /5/1 would also pick up /5/10 to /5/16.
                cat sched_log |grep -i while1|grep -E " /$i/$j( |\$)" > sched_log_$i-$j
            done
        fi
    done
    print_results $rem
}

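create_hierarchy
# pin_tasks confines each (sub-)group to its CPUs. Skip this call to run the
# un-pinned case, where the cpusets keep the root's full CPU list.
pin_tasks

load_tasks
sleep 60
capture_results
cleanup
exit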

Thanks,
Kamalesh.