Re: + lib-vsprintfc-even-faster-decimal-conversion.patch added to -mm tree

From: Alexey Dobriyan
Date: Fri Mar 13 2015 - 08:49:54 EST


On Thu, Mar 12, 2015 at 12:54 AM, <akpm@xxxxxxxxxxxxxxxxxxxx> wrote:
> Subject: lib/vsprintf.c: even faster binary to decimal conversion

I spent some time to microbenchmark changes in userspace (audience: fool!).
Results are below.

Legend is "number avg+-1sigma min-max". Every number is CPU cycles.
Great care was taken to remove interrupt noise.

Number of measurements is 100 millions per line.
CPU is Intel Core 2 Duo E6550 in 64-bit mode.

3.19.1:

0 98.015369 +- 0.512937 91-616
42 116.000193 +- 3.523826 112-868
27182 137.009008 +- 3.515725 133-1043
65535 137.008262 +- 3.521761 133-840
4294967295 201.019966 +- 3.278608 196-1050
3141592653589793238 289.996882 +- 3.489376 287-1148
18446744073709551615 295.065274 +- 2.860187 287-1029
-----------------------------------------------------
3.19.1+patch
0 94.444063 +- 3.518922 84-630
42 116.428533 +- 18.539093 105-1036
42 116.316904 +- 18.234484 105-833
27182 136.172398 +- 3.737113 133-980
65535 136.014742 +- 3.537882 133-714
4294967295 172.009618 +- 3.507473 168-826
3141592653589793238 207.001114 +- 3.492724 196-1120
18446744073709551615 208.018154 +- 3.220185 203-1246
-----------------------------------------------------

New code is somewhat faster for huge numbers.
But top and ps don't show huge numbers normally --
it is either PIDs (2^16) or moderately high numbers in a range of millions
(see /proc/stat)

* variance for new code is bigger
I even tried N=42 twice because I thought 18.5 variance is a glitch
but it is not.

New code uses lookup table which implies cache misses.
Current code is purely code.

So I don't think new printing code will change anything really.

> On a larger scale, perf shows that top, one of the big consumers of /proc
> data, uses 0.5-1.0% fewer cpu cycles.

perf(1) also shows variance next to average, what was it?
In my experience everything perf measures has single digit percent variance
(and this is just 1sigma!) so you can't say new code is faster.
Also average can vary between runs more than variance (yuck!)

First number printing improvement patch was measuring ~30% speedups:
commit 4277eedd7908a0ca8b66fad46ee76b0ad96e6ef2
vsprintf.c: optimizing, part 2: base 10 conversion speedup, v2

Now it is 1%.

> Microbenchmarking shows improvements
> ranging from -50% (for numbers uniformly distributed in [0, 2^64-1]) to
> -25% (for numbers heavily biased toward the smaller end, a more realistic
> distribution).

-25%? Mmm, no, see table above.

I think any further improvements to number printing code should be rejected
on philosophical grounds:

Kernel should ship numbers to ps(1) and top(1) in BINARY,
so it would take exactly 1 MOV instruction which takes exactly 1 cycle
to execute.
Currently it is 1) kernel converts binary to text, 2) usespace
converts text to binary,
3) userspace converts binary to text and shows the user. 4) people optimizing #1

But only final conversion is needed to happen because it is communication
between human and program. Programs can very well talk in binary.

So, if you really want to speed up top(1), design binary interface for
shipping numbers
and enjoy 1000% speedups, leave text files in /proc for those
undemanding shell scripts.

Alexey
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/