Re: [PATCH 0/4] -ffreestanding/-fno-builtin-* patches

From: Clement Courbet
Date: Wed Aug 19 2020 - 08:19:30 EST


On Tue, Aug 18, 2020 at 9:58 PM Nick Desaulniers <ndesaulniers@xxxxxxxxxx> wrote:
On Tue, Aug 18, 2020 at 12:25 PM Nick Desaulniers
<ndesaulniers@xxxxxxxxxx> wrote:
>
> On Tue, Aug 18, 2020 at 12:19 PM Linus Torvalds
> <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
> >
> > And honestly, a compiler that uses 'bcmp' is just broken. WTH? It's
> > the year 2020, we don't use bcmp. It's that simple. Fix your damn
> > broken compiler and use memcmp. The argument that memcmp is more
> > expensive than bcmp is garbage legacy thinking from four decades ago.
> >
> > It's likely the other way around, where people have actually spent
> > time on memcmp, but not on bcmp.
> >
> >                Linus
>
> You'll have to ask Clement about that.  I'm not sure I ever saw the
> "faster bcmp than memcmp" implementation, but I was told "it exists"
> when I asked for a revert when all of our kernel builds went red.

If **is** possible to make bcmp much faster then memcmp. We have one
such implementation internally (it's scheduled to be released as part of
llvm-libc some time this year), but most libc implementations just alias to
memcmp.

Below is a graph showing the impact of releasing this compiler optimization
with our optimized bcmp on the google fleet (the cumulative memcmp+bcmp usage
of all programs running on google datacenters, including the kernel). Scale and
dates have been redacted for obvious reasons, but note that the graph starts at
y=0, so you can compare the values relative to each other. Note how as memcmp
is progressively being replaced by bcmp (more and more programs being
recompiled with the compiler patch), the cumulative usage of memory
comparison drops significantly.
 
https://drive.google.com/file/d/1p8z1ilw2xaAJEnx_5eu-vflp3tEOv0qY/view?usp=sharing

The reasons why bcmp can be faster are:
 - typical libc implementations use the hardware to its full capacity, e.g. for
bcmp we can use vector loads and compares, which can process up to 64 bytes
(avx512) in one instruction. It's harder to implement memcmp with these for
little-endian architectures as there is no vector bswap. Because the kernel
only uses GPRs I can see how that might not perfectly fit the kernel use case.
But the kernel really is a special case, the compiler is written for most
programs, not specifically for the kernel, and most programs should benefit from
this optimization.
 - bcmp() does not have to look at the bytes in order, e.g. it can look at the
first and last . This is useful when comparing buffers that have common
prefixes (as happens in mostly sorted containers, and we have data that shows
that this is a quite common instance).
 

> Also, to Clement's credit, every patch I've ever seen from Clement is
> backed up by data; typically fleetwide profiles at Google.  "we spend
> a lot of time in memcmp, particularly comparing the result against
> zero and no other value; hmm...how do we spend less time in
> memcmp...oh, well there's another library function with slightly
> different semantics we can call instead."  I don't think anyone would
> consider the optimization batshit crazy given the number of cycles
> saved across the fleet.  That an embedded project didn't provide an
> implementation, is a footnote that can be fixed in the embedded
> project, either by using -ffreestanding or -fno-builtin-bcmp, which is
> what this series proposes to do.