Re: [RFT][patch 17/18] sched: use jump labels to reduce overhead when bandwidth control is inactive

From: Jason Baron
Date: Wed Jul 27 2011 - 17:58:43 EST

On Thu, Jul 21, 2011 at 06:38:01PM -0700, Paul Turner wrote:
> On Thu, Jul 21, 2011 at 6:17 PM, Jason Baron <jbaron@xxxxxxxxxx> wrote:
> > On Thu, Jul 21, 2011 at 05:57:31PM -0700, Paul Turner wrote:
> >> On Thu, Jul 21, 2011 at 5:32 PM, Jason Baron <jbaron@xxxxxxxxxx> wrote:
> >> > rth@xxxxxxxxxx
> >> > Bcc:
> >> > Subject: Re: [RFT][patch 17/18] sched: use jump labels to reduce overhead
> >> > when bandwidth control is inactive
> >> > Reply-To:
> >> > In-Reply-To: <20110721184758.403388616@xxxxxxxxxx>
> >> >
> >> > On Thu, Jul 21, 2011 at 09:43:42AM -0700, Paul Turner wrote:
> >> >> So I'm seeing some strange costs associated with jump_labels; while on paper
> >> >> the branches and instructions retired improves (as expected) we're taking an
> >> >> unexpected hit in IPC.
> >> >>
> >> >> [From the initial mail we have workloads:
> >> >> mkdir -p /cgroup/cpu/test
> >> >> echo $$ > /dev/cgroup/cpu/test (only cpu,cpuacct mounted)
> >> >> (W1) taskset -c 0 perf stat --repeat 50 -e instructions,cycles,branches bash -c "for ((i=0;i<5;i++)); do $(dirname $0)/pipe-test 20000; done"
> >> >> (W2)taskset -c 0 perf stat --repeat 50 -e instructions,cycles,branches bash -c "$(dirname $0)/pipe-test 100000;true"
> >> >> (W3)taskset -c 0 perf stat --repeat 50 -e instructions,cycles,branches bash -c "$(dirname $0)/pipe-test 100000;"
> >> >> ]
> >> >>
> >> >> To make some of the figures more clear:
> >> >>
> >> >> Legend:
> >> >> !BWC = tip + bwc, BWC compiled out
> >> >> BWC = tip + bwc
> >> >> BWC_JL = tip + bwc + jump label (this patch)
> >> >>
> >> >>
> >> >> Now, comparing under W1 we see:
> >> >> W1: BWC vs BWC_JL
> >> >>                         instructions          cycles                branches              elapsed
> >> >> ---------------------------------------------------------------------------------------------------------------------
> >> >> clovertown [BWC]        845934117             974222228             152715407             0.419014188           [baseline]
> >> >> +unconstrained          857963815 (+1.42)     1007152750 (+3.38)    153140328 (+0.28)     0.433186926 (+3.38)   [rel]
> >> >> +10000000000/1000:      876937753 (+2.55)     1033978705 (+5.65)    160038434 (+3.59)     0.443638365 (+5.66)   [rel]
> >> >> +10000000000/1000000:   880276838 (+3.08)     1036176245 (+6.13)    160683878 (+4.15)     0.444577244 (+6.14)   [rel]
> >> >>
> >> >> barcelona [BWC]         820573353             748178486             148161233             0.342122850           [baseline]
> >> >> +unconstrained          817011602 (-0.43)     759838181 (+1.56)     145951513 (-1.49)     0.347462571 (+1.56)   [rel]
> >> >> +10000000000/1000:      830109086 (+0.26)     770451537 (+1.67)     151228902 (+1.08)     0.350824677 (+1.65)   [rel]
> >> >> +10000000000/1000000:   830196206 (+0.30)     770704213 (+2.27)     151250413 (+1.12)     0.350962182 (+2.28)   [rel]
> >> >>
> >> >> westmere [BWC]          802533191             694415157             146071233             0.194428018           [baseline]
> >> >> +unconstrained          799057936 (-0.43)     751384496 (+8.20)     143875513 (-1.50)     0.211182620 (+8.62)   [rel]
> >> >> +10000000000/1000:      812033785 (+0.27)     761469084 (+8.51)     149134146 (+1.09)     0.212149229 (+8.28)   [rel]
> >> >> +10000000000/1000000:   811912834 (+0.27)     757842988 (+7.45)     149113291 (+1.09)     0.211364804 (+7.30)   [rel]
> >> >>
> >> >> e.g. Barcelona issues ~0.43% fewer instructions, for a total of 817011602, in
> >> >> the unconstrained case with BWC.
> >> >>
> >> >>
> >> >> Where "unconstrained, 10000000000/1000, 10000000000/1000000" are the
> >> >> measurements for BWC_JL, with (%d) being the relative difference to their
> >> >> BWC counterparts.
> >> >>
> >> >> W2: BWC vs BWC_JL is very similar.
> >> >> BWC vs BWC_JL
> >> >>                         instructions          cycles                branches              elapsed
> >> >> ---------------------------------------------------------------------------------------------------------------------
> >> >> clovertown [BWC]        985732031             1283113452            175621212             1.375905653
> >> >> +unconstrained          979242938 (-0.66)     1288971141 (+0.46)    172122546 (-1.99)     1.389795165 (+1.01)   [rel]
> >> >> +10000000000/1000:      999886468 (+0.33)     1296597143 (+1.13)    180554004 (+1.62)     1.392576770 (+1.18)   [rel]
> >> >> +10000000000/1000000:   999034223 (+0.11)     1293925500 (+0.57)    180413829 (+1.39)     1.391041338 (+0.94)   [rel]
> >> >>
> >> >> barcelona [BWC]         982139920             1078757792            175417574             1.069537049
> >> >> +unconstrained          965443672 (-1.70)     1075377223 (-0.31)    170215844 (-2.97)     1.045595065 (-2.24)   [rel]
> >> >> +10000000000/1000:      989104943 (+0.05)     1100836668 (+0.52)    178837754 (+1.22)     1.058730316 (-1.77)   [rel]
> >> >> +10000000000/1000000:   987627489 (-0.32)     1095843758 (-0.17)    178567411 (+0.84)     1.056100899 (-2.28)   [rel]
> >> >>
> >> >> westmere [BWC]          918633403             896047900             166496917             0.754629182
> >> >> +unconstrained          914740541 (-0.42)     903906801 (+0.88)     163652848 (-1.71)     0.758050332 (+0.45)   [rel]
> >> >> +10000000000/1000:      927517377 (-0.41)     952579771 (+5.67)     170173060 (+0.75)     0.771193786 (+2.43)   [rel]
> >> >> +10000000000/1000000:   914676985 (-0.89)     936106277 (+3.81)     167683288 (+0.22)     0.764973632 (+1.38)   [rel]
> >> >>
> >> >> Now this is rather odd, almost across the board we're seeing the expected
> >> >> drops in instructions and branches, yet we appear to be paying a heavy IPC
> >> >> price. The fact that wall-time has scaled equivalently with cycles roughly
> >> >> rules out the cycles counter being off.
> >> >>
> >
> > If I understand your results, for Barcelona you did see an improvement
> > in cycles and elapsed time with jump labels for unconstrained?
> >
>
> Under W2, yes.
>
> >> >> We are seeing the expected behavior in the bandwidth enabled case;
> >> >> specifically the <jl=jmp><ret><cond><ret> blocks are taking an extra branch
> >> >> and instruction which shows up on all the numbers above.
> >> >>
> >> >> With respect to compiler mangling the text is essentially unchanged in size.
> >> >> One lurking suspicion is whether the inserted nops have perturbed some of the
> >> >> jmp/branch alignments?
> >
> > hmmmm....not sure, I'm adding Richard Henderson, to the 'cc list, who
> > worked on the 'asm goto' in gcc.
> >
> >> >>
> >> >> text data bss dec hex filename
> >> >> 7277206 2827256 2125824 12230286 ba9e8e vmlinux.jump_label
> >> >> 7276886 2826744 2125824 12229454 ba9b4e vmlinux.no_jump_label
> >> >>
> >
> > the other thing here is that vmlinux.jump_label includes the extra
> > kernel/jump_label.o file, so you can sort of subtract the text size of
> > that file to do a fair comparison.
>
> Even without doing that the text size only grows by a factor of 1.00004 (about 0.004%).
>
> I was just making the inference that if it's gcc mangling it's likely
> in the layout/alignment.
>
> >
> > Also, I would have expected the data section to have increased more with
> > jump labels enabled. Are tracepoints disabled (a current user of jump
> > labels)?
>
> Yeah -- Tracing is enabled so the BWC build should have labels
> already; this likely accounts for the small increase noted above.
>
> >
> >> >> I have checked to make sure that the right instructions are being patched in
> >> >> at run-time. I've also pulled a fully patched jump_label out of the kernel
> >> >> into a userspace test (and benchmarked it directly under perf). The results
> >> >> here are also exactly as expected.
> >> >>
> >> >> e.g.
> >> >> Performance counter stats for './jump_test':
> >> >> 1,500,839,002 instructions, 300,147,081 branches 702,468,404 cycles
> >> >> Performance counter stats for './jump_test 1':
> >> >> 2,001,014,609 instructions, 400,177,192 branches 901,758,219 cycles
> >> >>
> >
> > what no-op did you use in userspace? I wouldn't think the no-op choice
> > would make any difference though...At compile time we use a 'jmp 0', and
> > then at boot we dynamically patch the 'jmp 0' with the no-op we think works
> > best...
> >
>
> Sorry -- what I meant here is I pulled the run-time chosen "best" nop
> out of /proc/kcore and tested a tight loop around a <JL><RET><COND><RET>
> sequence (e.g. cfs_rq_throttled()) with JL being the nop and jmp
> respectively.
>
> Specifically for Westmere this ends up being K8_NOP5 -- 0x666666D0
>
> > thanks,
> >
> > -Jason
> >
> >> >> Overall if we can fix the IPC the benefit in the globally unconstrained case
> >> >> looks really good.
> >> >>
> >> >> Any thoughts Jason?
> >> >>
> >> >
> >> > Do you have CONFIG_CC_OPTIMIZE_FOR_SIZE set? I know that when
> >> > CONFIG_CC_OPTIMIZE_FOR_SIZE is not set, the compiler can make the code
> >> > more optimal.
> >> >
> >>
> >> Ah I should have mentioned that was one of the holes I stared down:
> >>
> >> Builds were -O2 (gcc-4.6.1) and
> >> $ zcat /proc/config.gz | grep CONFIG_CC_OPTIMIZE_FOR_SIZE
> >> # CONFIG_CC_OPTIMIZE_FOR_SIZE is not set
> >>
> >> Same kernel image across all platforms.
> >>
> >>

Hi Paul,

Ok, I think I finally tracked this down. It may seem a bit crazy, but
when we are getting down to cycle counting like this, it seems that the
link order in the kernel/Makefile can make a difference. I had
jump_label.o listed right after the core files, whereas all the code in
jump_label.o is really slow path code (used when toggling branch
values). The patch below moves it later in the link order:

--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -10,7 +10,7 @@ obj-y = sched.o fork.o exec_domain.o panic.o printk.o \
kthread.o wait.o kfifo.o sys_ni.o posix-cpu-timers.o mutex.o \
hrtimer.o rwsem.o nsproxy.o srcu.o semaphore.o \
notifier.o ksysfs.o pm_qos_params.o sched_clock.o cred.o \
- async.o range.o jump_label.o
+ async.o range.o
obj-y += groups.o

ifdef CONFIG_FUNCTION_TRACER
@@ -107,6 +107,7 @@ obj-$(CONFIG_PERF_EVENTS) += events/
obj-$(CONFIG_USER_RETURN_NOTIFIER) += user-return-notifier.o
obj-$(CONFIG_PADATA) += padata.o
obj-$(CONFIG_CRASH_DUMP) += crash_dump.o
+obj-$(CONFIG_JUMP_LABEL) += jump_label.o

ifneq ($(CONFIG_SCHED_OMIT_FRAME_POINTER),y)
# According to Alan Modra <alan@xxxxxxxxxxxxxxxx>, the -fno-omit-frame-pointer is

I've tested the patch using a single 'static_branch()' in the getppid() path,
and basically running tight loops of calls to getppid(). Before the
patch, I was seeing results similar to what you reported; after the
patch, things improved for all metrics. Here are my results for the
branch disabled case:

With jump labels turned on (CONFIG_JUMP_LABEL), branch disabled:

Performance counter stats for 'bash -c /tmp/timing;true' (50 runs):

3,969,510,217 instructions # 0.864 IPC ( +-0.000% )
4,592,334,954 cycles ( +- 0.046% )
751,634,470 branches ( +- 0.000% )

1.722635797 seconds time elapsed ( +- 0.046% )

Jump labels turned off (CONFIG_JUMP_LABEL not set), branch disabled:

Performance counter stats for 'bash -c /tmp/timing;true' (50 runs):

4,009,611,846 instructions # 0.867 IPC ( +-0.000% )
4,622,210,580 cycles ( +- 0.012% )
771,662,904 branches ( +- 0.000% )

1.734341454 seconds time elapsed ( +- 0.022% )

So all of the measured metrics improved in the jump labels case, by
roughly 0.5% to 2.5%.
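
For anyone who wants to re-derive those percentages, here is a minimal
sketch that recomputes them from the two perf stat runs above. The counter
values are copied from that output; the dictionary names and the
percent_reduction() helper are only illustrative and not part of any tool.
With these inputs the reductions come out at roughly 0.65% (cycles) up to
about 2.6% (branches), and the IPC values match the perf annotations.

```python
# Counter values copied from the two perf stat runs above (branch disabled).
jump_label_on = {
    "instructions": 3_969_510_217,
    "cycles": 4_592_334_954,
    "branches": 751_634_470,
    "elapsed_s": 1.722635797,
}
jump_label_off = {
    "instructions": 4_009_611_846,
    "cycles": 4_622_210_580,
    "branches": 771_662_904,
    "elapsed_s": 1.734341454,
}


def percent_reduction(off, on):
    """Percentage by which `on` is lower than `off` (hypothetical helper)."""
    return (off - on) / off * 100.0


# Per-metric reduction with CONFIG_JUMP_LABEL relative to without it.
reductions = {
    name: percent_reduction(jump_label_off[name], jump_label_on[name])
    for name in jump_label_on
}

# Instructions per cycle, the quantity perf prints as "# ... IPC".
ipc_on = jump_label_on["instructions"] / jump_label_on["cycles"]
ipc_off = jump_label_off["instructions"] / jump_label_off["cycles"]

for name, pct in reductions.items():
    print(f"{name:>12}: {pct:.2f}% lower with jump labels")
print(f"IPC: {ipc_on:.3f} (jump labels on) vs {ipc_off:.3f} (off)")
```

Note that IPC is marginally lower with jump labels on even though every
raw counter improved, since instructions dropped by more than cycles did.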

I'm curious to see what you find with this patch.

Thanks,

-Jason

