[RFC] Design for flag bit outputs from asms

From: Richard Henderson
Date: Mon May 04 2015 - 15:33:56 EST

Next message: YesGrowth Loans: "lening"
Previous message: Stephane Eranian: "Re: perf: fuzzer triggers NULL pointer derefreence in x86_schedule_events"
In reply to: Richard Henderson: "Re: [PATCH] x86: Optimize variable_test_bit()"
Next in thread: H. Peter Anvin: "Re: [RFC] Design for flag bit outputs from asms"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On 05/02/2015 05:39 AM, Peter Zijlstra wrote:
> static inline bool __test_and_clear_bit(long nr, volatile unsigned long *addr)
> {
>         bool oldbit;
>
>         asm volatile ("btr %2, %1"
>                       : "CF" (oldbit), "+m" (*addr)
>                       : "Ir" (nr));
>
>         return oldbit;
> }
>
> Be the far better solution for this? Bug 59615 comment 7 states that
> they actually modeled the flags in the .md file, so the above should be
> possible to implement.
>
> Now GCC can decide to use "sbb %0, %0" to convert CF into a register
> value or use "jnc" / "jc" for branches, depending on what
> __test_and_clear_bit() was used for.
>
> We don't have to (ab)use asm goto for these things anymore; furthermore
> I think the above will naturally work with our __builtin_expect() hints,
> whereas the asm goto stuff has a hard time with that (afaik).
>
> That's not to say output operands for asm goto would not still be useful
> for other things (like your EXTABLE example).
>

(0) The C level output variable should be an integral type, from bool on up.

The flags are a scarce resource, easily clobbered. We cannot allow user code
to keep data in the flags. While x86 does have lahf/sahf, they don't exactly
perform well. And other targets like arm don't even have that bad option.

Therefore, the language level semantics are that the output is a boolean store
into the variable with a condition specified by a magic constraint.
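
To make "boolean store" concrete, here is a minimal sketch of the explicit
form one has to write without such a constraint, assuming GCC-style extended
asm on x86. The helper name is hypothetical and this is not the kernel's
implementation; the non-x86 branch is a plain C fallback added only so the
sketch builds elsewhere.

```c
#include <stdbool.h>

/* Hypothetical helper, not kernel code: clear bit nr of *word and
   return the bit's previous value.  Assumes nr is less than the
   number of bits in unsigned long. */
static bool test_and_clear_bit_sketch(unsigned long *word, unsigned long nr)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    unsigned char oldbit;

    /* btr copies the selected bit into CF and clears it in the word;
       setc then stores CF into a byte register.  The hard-coded setc
       is the part a flag output constraint would leave to the
       compiler, which could instead branch on CF directly. */
    __asm__ volatile ("btr %2, %1\n\t"
                      "setc %0"
                      : "=q" (oldbit), "+r" (*word)
                      : "r" (nr)
                      : "cc");
    return oldbit != 0;
#else
    /* Portable fallback with the same observable result. */
    unsigned long mask = 1UL << nr;
    bool oldbit = (*word & mask) != 0;

    *word &= ~mask;
    return oldbit;
#endif
}
```

Starting from a word holding 0x5, a call with nr == 2 returns true and leaves
0x1 behind; a second call with the same nr returns false.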

That said, just like the compiler should be able to optimize

void bar(int y)
{
        int x = (y <= 0);
        if (x) foo();
}

such that we only use a single compare against y, the expectation is that
within a similarly constrained context the compiler will not require two tests
for these boolean outputs.

Therefore:

(1) Each target defines a set of constraint strings,

E.g. for x86, wherein we're almost out of constraint letters,

    ja      aux carry flag
    jc      carry flag
    jo      overflow flag
    jp      parity flag
    js      sign flag
    jz      zero flag

E.g. for arm/aarch64 (using "j" here, but other possibilities exist):

    jn      negative flag
    jc      carry flag
    jz      zero flag
    jv      overflow flag

E.g. for s390x (I've thought less about what's useful here)

    j<m>    where m is a hex digit, and is the mask of CC values
            for which the condition is true; exactly corresponding
            to the M1 field in the branch on condition instruction.
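
As a rough sketch of how such a mask reads, assuming the usual branch on
condition encoding in which mask value 8 selects CC0, 4 selects CC1, 2
selects CC2 and 1 selects CC3. Both helper names are made up for
illustration, and the "j<m>" spelling is only the one floated above.

```c
#include <stdbool.h>

/* Hypothetical: parse "j<m>" and return the mask 0..15, or -1 if the
   string is not 'j' followed by exactly one hex digit. */
static int parse_j_mask(const char *s)
{
    char d;

    if (s[0] != 'j' || s[1] == '\0' || s[2] != '\0')
        return -1;
    d = s[1];
    if (d >= '0' && d <= '9')
        return d - '0';
    if (d >= 'a' && d <= 'f')
        return d - 'a' + 10;
    if (d >= 'A' && d <= 'F')
        return d - 'A' + 10;
    return -1;
}

/* Hypothetical: is the condition true for condition code cc (0..3)?
   The leftmost mask bit corresponds to CC0, the rightmost to CC3. */
static bool s390_mask_selects(unsigned mask, unsigned cc)
{
    return ((mask >> (3u - cc)) & 1u) != 0;
}
```

So "j8" would be true only for CC0, and "j7" for any of CC1, CC2 or CC3.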

(2) A new target hook post-processes the asm_insn, looking for the
new constraint strings. The hook expands the condition prescribed
by the string, adjusting the asm_insn as required.

E.g.

bool x, y, z;
asm ("xyzzy" : "=jc"(x), "=jp"(y), "=jo"(z) : : );

originally

(parallel [
    (set (reg:QI 83 [ x ])
        (asm_operands/v:QI ("xyzzy") ("=jc") 0 []
             []
             [] z.c:4))
    (set (reg:QI 84 [ y ])
        (asm_operands/v:QI ("xyzzy") ("=jp") 1 []
             []
             [] z.c:4))
    (set (reg:QI 85 [ z ])
        (asm_operands/v:QI ("xyzzy") ("=jo") 2 []
             []
             [] z.c:4))
    (clobber (reg:QI 18 fpsr))
    (clobber (reg:QI 17 flags))
])

becomes

(parallel [
    (set (reg:CC 17 flags)
        (asm_operands/v:CC ("xyzzy") ("=j_") 0 []
             []
             [] z.c:4))
    (clobber (reg:QI 18 fpsr))
])
(set (reg:QI 83 [ x ])
    (ne:QI (reg:CCC 17 flags) (const_int 0)))
(set (reg:QI 84 [ y ])
    (ne:QI (reg:CCP 17 flags) (const_int 0)))
(set (reg:QI 85 [ z ])
    (ne:QI (reg:CCO 17 flags) (const_int 0)))

which ought to assemble to something like

xyzzy
setc %dl
setp %cl
seto %r15l

Note that rtl level data flow is preserved via the flags hard register,
and the lifetime of flags would not be extended any further than we would
for a normal cstore pattern.

Note that the output constraints are adjusted to a single internal "=j_"
which would match the flags register in any mode. We can collapse
several output flags to a single set of the flags hard register.
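
For a feel of what the expansion amounts to, here is a hand-written sketch
with a real instruction in place of "xyzzy", assuming GCC-style extended asm
on x86. The setc/setp/seto are spelled out inside the asm because the "j"
constraints do not exist; under the proposal the compiler would emit them
(or branches) itself. The struct and function names are hypothetical, and
the non-x86 branch computes the same three flags in plain C.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical container for the three flags read after the add. */
struct add_flags {
    bool carry;
    bool parity;
    bool overflow;
};

/* Hypothetical helper: return a + b and report CF, PF and OF as the
   x86 "add" instruction would set them for 32-bit operands. */
static uint32_t add_with_flags(uint32_t a, uint32_t b, struct add_flags *f)
{
    uint32_t sum;
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    unsigned char c, p, o;

    sum = a;
    /* One flag-setting instruction followed by three byte stores;
       setcc does not modify the flags, so all three see the add. */
    __asm__ ("addl %4, %0\n\t"
             "setc %1\n\t"
             "setp %2\n\t"
             "seto %3"
             : "+r" (sum), "=q" (c), "=q" (p), "=q" (o)
             : "r" (b)
             : "cc");
    f->carry = c != 0;
    f->parity = p != 0;
    f->overflow = o != 0;
#else
    uint32_t x;

    sum = a + b;
    /* Unsigned wraparound means a carry out of bit 31. */
    f->carry = sum < a;
    /* PF is set when the low byte of the result has an even
       number of one bits. */
    x = sum & 0xFFu;
    x ^= x >> 4;
    x ^= x >> 2;
    x ^= x >> 1;
    f->parity = (x & 1u) == 0;
    /* Signed overflow: both inputs differ in sign from the result. */
    f->overflow = (((a ^ sum) & (b ^ sum)) >> 31) != 0;
#endif
    return sum;
}
```

Adding 1 to 0xFFFFFFFF gives 0 with carry and parity set and overflow clear;
adding 1 to 0x7FFFFFFF gives 0x80000000 with parity and overflow set and
carry clear.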

(3) Note that ppc is both easier and more complicated.

There we have 8 4-bit registers, although most of the integer
non-comparisons only write to CR0. And the vector non-comparisons
only write to CR1, though of course that's of less interest in the
context of kernel code.
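
For readers less familiar with the layout, a small sketch of the condition
register as a 32-bit value holding eight 4-bit fields, with CR0 in the most
significant nibble and each field holding LT, GT, EQ, SO from its high bit
down. The names below are made up for illustration and are not from any
header.

```c
#include <stdint.h>

/* Bit values within one 4-bit CR field (hypothetical names). */
enum {
    CR_LT = 8,  /* less than / negative */
    CR_GT = 4,  /* greater than / positive */
    CR_EQ = 2,  /* equal / zero */
    CR_SO = 1   /* summary overflow */
};

/* Hypothetical helper: extract field n (0..7) from a 32-bit CR image.
   CR0 is the most significant nibble, CR7 the least significant. */
static unsigned cr_field(uint32_t cr, unsigned n)
{
    return (cr >> (28u - 4u * n)) & 0xFu;
}
```

A "zero flag" style output for an integer dot insn would then amount to
testing CR_EQ in cr_field(cr, 0).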

For the purposes of cr0, the same scheme could certainly work, although
the hook would not insert a hard register use, but rather a pseudo to
be allocated to cr0 (constraint "x").

That said, it's my understanding that "dot insns", which set cr0, are
expensive in current processor generations. There's also a lot less
of the x86-style "operate and set a flag based on something useful".

Can anyone think of any drawbacks, pitfalls, or portability issues to less
popular targets that I haven't considered?

r~
