Re: [PATCH] lib/int_sqrt.c: Optimize square root function

From: Linus Torvalds
Date: Thu Jul 20 2017 - 14:31:44 EST


How did this two-year-old thread get resurrected?

Anyway, it got resurrected without even answering one core question:

On Thu, Jul 20, 2017 at 4:24 AM, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
> On Mon, Feb 02, 2015 at 11:13:44AM -0800, Linus Torvalds wrote:
>> On Mon, Feb 2, 2015 at 11:00 AM, Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
>> >
>> > (I'm also not entirely sure what uses int_sqrt() that ends up being so
>> > performance-critical, so it would be good to document that too, since
>> > that probably also matters for the "what's the normal argument range"
>> > question..)

This is still the case. Which of the (very few) users really _care_?
And what are the normal argument values for them?

For example, the 802.11 minstrel code does a "MINSTREL_TRUNC()" on an
"unsigned int" value that is in the millions. It's not even "unsigned
long", so we know it's not many thousands of millions, and
MINSTREL_TRUNC shifts it down by 12 bits.

So we know we have at most a 20-bit argument.
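
As a sketch of that arithmetic (the shift of 12 is the one described
above; the macro spelling here is just illustrative):

  /* Illustrative: minstrel keeps its values in a fixed-point format
   * and MINSTREL_TRUNC() drops the 12 bits of scale: */
  #define MINSTREL_SCALE  12
  #define MINSTREL_TRUNC(val)  ((val) >> MINSTREL_SCALE)

  /* A 32-bit "unsigned int" input therefore has at most
   * 32 - 12 = 20 significant bits left after the shift, so
   * int_sqrt() sees values below 2^20 (about a million) here. */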

The one case that uses an actual "unsigned long" seems to be
"slow_is_prime_number()", and honestly, the sqrt() is the *least* of
our problems there.
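
To be clear about why: a trial-division primality test has roughly
this shape (a sketch, not the actual lib/ code; the real one bounds
the loop with int_sqrt()):

  #include <stdbool.h>
  #include <stdio.h>

  /* Sketch: the square root bound is computed once, but the modulo
   * loop runs on the order of sqrt(x) times, so the divisions dwarf
   * whatever that one square root costs. */
  static bool slow_is_prime(unsigned long x)
  {
          unsigned long i;

          if (x < 2)
                  return false;
          for (i = 2; i * i <= x; i++)    /* loop bound ~ sqrt(x) */
                  if (x % i == 0)
                          return false;
          return true;
  }

  int main(void)
  {
          printf("%d %d\n", slow_is_prime(97), slow_is_prime(91));  /* 1 0 */
          return 0;
  }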

There's a few drivers and filesystems that use it. I do not believe
performance matters in those cases. Performance would only start to
matter if you did an "int_sqrt()" per network packet on some
high-performance network, and none of them look anything like that.

And there's a couple of VM users. They don't look particularly critical either.

So why do you care? Because honestly, calling int_sqrt() once in a
blue moon with caches cold and no branch prediction information tends
to have very different performance characteristics from calling it in
a loop with very predictable input.

So I think your "benchmark" is just garbage, in that it's testing
something entirely different than the actual load.
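
Concretely, the kind of harness I mean (hypothetical, just to show
the shape of what such a benchmark measures; int_sqrt() would have to
be linked in):

  #include <stdio.h>

  unsigned long int_sqrt(unsigned long x);  /* routine under test */

  int main(void)
  {
          volatile unsigned long in = 1000000;  /* one fixed input */
          unsigned long i, sum = 0;

          /* After a few iterations the branch history is trained and
           * the code sits hot in the I$: exactly the two things a
           * once-in-a-blue-moon call site never gets. */
          for (i = 0; i < 10000000; i++)
                  sum += int_sqrt(in);

          printf("%lu\n", sum);  /* keep the loop from being elided */
          return 0;
  }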

Also, since this is a generic library routine, no way can we depend on
fls() being fast.

But we could certainly improve on the initial value a lot. It's just
that we should probably strive to improve on it without adding extra
branch mispredictions or I$ misses, both things that your benchmark
isn't actually testing at all, since it does the exact opposite by
basically preloading both.
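
For illustration, here's the usual shift-and-subtract loop with an
fls()-style starting point (a sketch of the idea, not a drop-in
patch; __builtin_clzl stands in for the kernel's __fls()):

  #include <stdio.h>

  /* Sketch: the usual shift-and-subtract square root, but with m
   * started at the highest power of four <= x instead of at the top
   * of the word, so small inputs do far fewer loop iterations.
   * __builtin_clzl needs a nonzero x; the x <= 1 check covers that. */
  static unsigned long int_sqrt_sketch(unsigned long x)
  {
          unsigned long b, m, y = 0;

          if (x <= 1)
                  return x;

          /* Index of the top set bit, rounded down to an even
           * position, gives the highest power of four <= x. */
          m = 1UL << ((sizeof(long) * 8 - 1 - __builtin_clzl(x)) & ~1UL);

          while (m != 0) {
                  b = y + m;
                  y >>= 1;
                  if (x >= b) {
                          x -= b;
                          y += m;
                  }
                  m >>= 2;
          }
          return y;
  }

  int main(void)
  {
          printf("%lu\n", int_sqrt_sketch(1000000UL));  /* prints 1000 */
          return 0;
  }

The starting value is the only thing that changes; the loop body
stays identical, which is what keeps the I$ footprint and the branch
behavior flat.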

And the *most* important question is that first one:

"Why does this matter, and what is the range it matters for?"

Linus