RE: objtool clac/stac handling change..

From: David Laight
Date: Tue Jul 07 2020 - 08:35:28 EST


From: Al Viro
> Sent: 04 July 2020 03:12
...
> BTW, looking at csum_and_copy_{to,from}_user() callers (all 3 of them,
> all in lib/iov_iter.c) we have this:
> 1) len is never 0
> 2) sum (initial value of csum) is always 0
> 3) failure (reported via *err_ptr) is always treated as "discard
> the entire iovec segment (and possibly the entire iovec)". Exact value
> put into *err_ptr doesn't matter (it's only compared to 0) and in case of
> error the return value is ignored.
>
> Now, using ~0U instead of 0 for initial sum would yield an equivalent csum
> (comparable modulo 2^16-1) *AND* never yield 0 (recall how csum addition works).
>
> IOW, we could simply return 0 to indicate an error. Which gives much saner
> calling conventions:
> __wsum csum_and_copy_from_user(const void __user *src, void *dst, int len)
> copying the damn thing and returning 0 on error or a non-zero value comparable
> to csum of the data modulo 2^16-1 on success. Same for csum_and_copy_to_user()
> (modulo const and __user being on the other argument).
>
> For x86 it simplifies the instances (both the inline wrappers and asm parts);
> I hadn't checked the other architectures yet, but it looks like that should
> be doable for all architectures. And it does simplify the callers...
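
A quick user-space sketch of why the ~0U seed works (not kernel code;
demo_add32() and demo_csum16() are made-up names, and the byte pairing
is fixed little-endian purely for illustration). End-around-carry
addition only gives 0 when both inputs are 0, so a non-zero seed can
never collapse to 0. Since 2^32-1 is a multiple of 2^16-1, both seeds
give the same value modulo 2^16-1:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 32-bit add with end-around carry (one's complement addition). */
static uint32_t demo_add32(uint32_t a, uint32_t b)
{
	uint64_t s = (uint64_t)a + b;

	/* If the add carried out, the low half is at most 2^32-2, so +1 is safe. */
	return (uint32_t)s + (uint32_t)(s >> 32);
}

/* Sum the buffer as 16-bit little-endian words, starting from 'sum'. */
static uint32_t demo_csum16(const unsigned char *p, size_t len, uint32_t sum)
{
	size_t i;

	assert(p != NULL || len == 0);
	for (i = 0; i + 1 < len; i += 2)
		sum = demo_add32(sum, (uint32_t)p[i] | ((uint32_t)p[i + 1] << 8));
	if (i < len)		/* odd trailing byte, padded with zero */
		sum = demo_add32(sum, p[i]);
	return sum;
}
```

With a seed of 0 an all-zero buffer sums to 0; with a seed of ~0U the
result is non-zero for any buffer, which is what frees up 0 as the
error return.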

All the csum functions should let the caller pass in a small value
to be added in, since that is normally 'free' in the algorithm.
It is certainly better than adding it in at the end, which is what
the current x86 code does.
(On 64-bit systems the 'small' value can exceed 2^32.)
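
Roughly what I mean, as a user-space sketch rather than the real x86
code (all the demo_*() names are invented for illustration). The
caller's value just becomes the starting value of the 64-bit
accumulator, so it costs nothing extra in the loop; the 'appended'
variant needs another add and fold after the loop:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* 64-bit add with end-around carry. */
static uint64_t demo_add64(uint64_t a, uint64_t b)
{
	uint64_t s = a + b;

	return s + (s < b);	/* s < b means the add wrapped */
}

/* Fold a 64-bit one's complement sum down to 32 bits. */
static uint32_t demo_fold64(uint64_t v)
{
	v = (v & 0xffffffffu) + (v >> 32);	/* at most 2^33-2 */
	v = (v & 0xffffffffu) + (v >> 32);	/* now fits in 32 bits */
	return (uint32_t)v;
}

/* Caller's value seeds the accumulator. */
static uint32_t demo_csum_seeded(const void *buf, size_t len, uint64_t init)
{
	const unsigned char *p = buf;
	uint64_t acc = init;
	uint64_t w;

	assert(buf != NULL || len == 0);
	for (; len >= 8; p += 8, len -= 8) {
		memcpy(&w, p, 8);
		acc = demo_add64(acc, w);
	}
	if (len) {		/* tail, zero padded to 8 bytes */
		w = 0;
		memcpy(&w, p, len);
		acc = demo_add64(acc, w);
	}
	return demo_fold64(acc);
}

/* Caller's value added after the data has been summed and folded. */
static uint32_t demo_csum_appended(const void *buf, size_t len, uint64_t init)
{
	uint32_t data = demo_csum_seeded(buf, len, 0);

	return demo_fold64(demo_add64(data, init));
}
```

Both variants return the same 32-bit value; the seeded one just avoids
the extra work at the end.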

I also wonder if the csum_and_copy() functions are actually worthwhile on x86.
The csum code can run at 8 bytes/clock on all Intel CPUs since Ivy Bridge.
(The current code doesn't; it only manages 4 bytes/clock until (IIRC) Haswell [1].)
On CPUs that support ADCX/ADOX you may do better - probably 12 bytes/clock,
I think 16 bytes/clock is wishful thinking.
But there is no leeway for any extra uops in either case.

However trying to get a memory read, memory write, adc and bits of loop
control scheduled in one clock is probably impossible - even though
it might not exceed the number of uops the execution pipelines can process.
ISTR that just separating the memory read from the adc slows
things down too much - probably issues with retiring instructions.
So I don't think it can get near 8 bytes/clock.

OTOH a copy operation trivially does 8 bytes/clock.
I even think 'rep movsq' is faster - never mind the fast 'rep movsb'.

So separate copy and checksum passes should easily exceed 4 bytes/clock,
but I suspect that doing them together never does.
(Unless the buffer is too big for the L1 cache.)

[1] The underlying problem is that a uop can only have 2 inputs.
ADC needs three (two values and the carry flag).
So the ADC instruction takes two clocks.
From Ivy Bridge (Sandy?) the carry flag is available early,
so adding to alternate registers lets you do 1 per clock.
So the existing csum function is rather slower than adding
32bit values to a 64bit register on most older cpus.
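
For reference, 'adding 32bit values to a 64bit register' looks
something like the sketch below. This is user-space C with invented
demo_*() names, not the kernel's csum_partial(), and the use of two
accumulators is just my illustration. No carry flag is involved
because a 64-bit register has plenty of headroom for 32-bit addends;
it assumes len is far below 2^32 bytes so nothing overflows:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Fold any one's complement sum down to 16 bits. */
static uint16_t demo_fold16(uint64_t v)
{
	while (v >> 16)
		v = (v & 0xffff) + (v >> 16);
	return (uint16_t)v;
}

/* Plain adds of 32-bit words into two independent 64-bit accumulators. */
static uint64_t demo_csum_add32(const void *buf, size_t len)
{
	const unsigned char *p = buf;
	uint64_t acc0 = 0, acc1 = 0;
	uint32_t w0, w1;

	assert(buf != NULL || len == 0);
	for (; len >= 8; p += 8, len -= 8) {
		memcpy(&w0, p, 4);
		memcpy(&w1, p + 4, 4);
		acc0 += w0;	/* no carry chain between the two adds */
		acc1 += w1;
	}
	if (len >= 4) {
		memcpy(&w0, p, 4);
		acc0 += w0;
		p += 4;
		len -= 4;
	}
	if (len) {		/* 1 to 3 trailing bytes, zero padded */
		w0 = 0;
		memcpy(&w0, p, len);
		acc1 += w0;
	}
	return acc0 + acc1;
}

/* Straightforward 16-bit-at-a-time reference for comparison. */
static uint16_t demo_csum_ref16(const void *buf, size_t len)
{
	const unsigned char *p = buf;
	uint64_t acc = 0;
	uint16_t h;

	assert(buf != NULL || len == 0);
	for (; len >= 2; p += 2, len -= 2) {
		memcpy(&h, p, 2);
		acc += h;
	}
	if (len) {		/* odd trailing byte, zero padded */
		h = 0;
		memcpy(&h, p, 1);
		acc += h;
	}
	return demo_fold16(acc);
}
```

Folding the result of demo_csum_add32() with demo_fold16() gives the
same 16-bit value as the reference loop.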

David
