Re: [PATCH] static_call,x86: Robustify trampoline patching

From: Kees Cook
Date: Tue Nov 02 2021 - 14:10:15 EST


On Tue, Nov 02, 2021 at 01:57:44PM +0100, Peter Zijlstra wrote:
> On Mon, Nov 01, 2021 at 03:14:41PM +0100, Ard Biesheuvel wrote:
> > On Mon, 1 Nov 2021 at 10:05, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>
> > > How is that not true for the jump table approach? Like I showed earlier,
> > > it is *trivial* to reconstruct the actual function pointer from a
> > > jump-table entry pointer.
> > >
> >
> > That is not the point. The point is that Clang instruments every
> > indirect call that it emits, to check whether the type of the jump
> > table entry it is about to call matches the type of the caller. IOW,
> > the indirect calls can only branch into jump tables, and all jump
> > table entries in a table each branch to the start of some function of
> > the same type.
> >
> > So the only thing you could achieve by adding or subtracting a
> > constant value from the indirect call address is either calling
> > another function of the same type (if you are hitting another entry in
> > the same table), or failing the CFI type check.
>
> Ah, I see, so the call-site needs to have a branch around the indirect
> call instruction.
>
> > Instrumenting the callee only needs something like BTI, and a
> > consistent use of the landing pads to ensure that you cannot trivially
> > omit the check by landing right after it.
>
> That does bring up another point tho; how are we going to do a kernel
> that's optimal for both software CFI and hardware aided CFI?
>
> All questions that need answering I think.

I'm totally fine with designing a new CFI for a future option,
but blocking the existing (working) one does not best serve our end
users. There are already people waiting on x86 CFI because having the
extra layer of defense is valuable for them. No, it's not perfect,
but it works right now, and evidence from Android shows that it has
significant real-world defensive value. Some of the more adventurous
are actually patching their kernels with the CFI support already, and
happily running their workloads, etc.

Supporting Clang CFI means we actually have something to evolve
from, whereas starting completely over means (likely significant)
delays, leaving folks without the option available at all. I think the
various compiler and kernel tweaks needed to improve kernel support
are reasonable, but building a totally new CFI implementation is not:
Clang CFI _does_ work today on x86. Yes, it's weird, but not
outrageously so.
(And just to state the obvious, CFI is an _optional_ CONFIG: not
everyone wants CFI, so it's okay if there are some sharp edges under
some CONFIG combinations.)

Regardless, speaking to a new CFI design below:

> So how insane is something like this, have each function:
>
> foo.cfi:
> endbr64
> xorl $0xdeadbeef, %r10d
> jz foo
> ud2
> nop # make it 16 bytes
> foo:
> # actual function text goes here
>
>
> And for each hash have two thunks:
>
>
> # arg: r11
> # clobbers: r10, r11
> __x86_indirect_cfi_deadbeef:
> movl -9(%r11), %r10 # immediate in foo.cfi

This requires the text be readable. I have been hoping to avoid this for
a CFI implementation so we could gain the benefit of execute-only
memory (available soon on arm64, and possible today on x86 under a
hypervisor).

But, yes, FWIW, this is very similar to what PaX RAP CFI does: the
caller reads "$dest - offset" for a hash, and compares it against the
hard-coded hash at the call site, before "call $dest".

> xorl $0xdeadbeef, %r10 # our immediate
> jz 1f
> ud2
> 1: ALTERNATIVE_2 "jmp *%r11",
> "jmp __x86_indirect_thunk_r11", X86_FEATURE_RETPOLINE
> "lfence; jmp *%r11", X86_FEATURE_RETPOLINE_AMD
>
>
>
> # arg: r11
> # clobbers: r10, r11
> __x86_indirect_ibt_deadbeef:
> movl $0xdeadbeef, %r10
> subq $0x10, %r11
> ALTERNATIVE "", "lfence", X86_FEATURE_RETPOLINE
> jmp *%r11
>
>
>
> And have the actual indirect callsite look like:
>
> # r11 - &foo
> ALTERNATIVE_2 "cs call __x86_indirect_thunk_r11",
> "cs call __x86_indirect_cfi_deadbeef", X86_FEATURE_CFI
> "cs call __x86_indirect_ibt_deadbeef", X86_FEATURE_IBT
>
> Although if the compiler were to emit:
>
> cs call __x86_indirect_cfi_deadbeef
>
> we could probably fix it up from there.

It seems like this could work for any CFI implementation, though, if
the CFI implementation always performed a call, or if the bounds of the
inline checking were known? i.e. objtool could find the inline checks
just as well as it could find the call?

> Then we can at runtime decide between:
>
> {!cfi, cfi, ibt} x {!retpoline, retpoline, retpoline-amd}

This does look pretty powerful, but I still don't think it precludes
using the existing Clang CFI. I don't want perfect to be the enemy of
good. :)

--
Kees Cook