Re: [PATCH v3 0/6] Static calls

From: Nadav Amit
Date: Fri Jan 11 2019 - 14:33:42 EST


> On Jan 11, 2019, at 11:23 AM, hpa@xxxxxxxxx wrote:
>
> On January 11, 2019 11:03:30 AM PST, Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
>> On Fri, Jan 11, 2019 at 7:15 AM Josh Poimboeuf <jpoimboe@xxxxxxxxxx>
>> wrote:
>>>> Now, in the int3 handler can you take the faulting RIP and search
>> for it in
>>>> the âstatic-callsâ table, writing the RIP+5 (offset) into R10
>> (return
>>>> address) and the target into R11. You make the int3 handler to
>> divert the
>>>> code execution by changing pt_regs->rip to point to a new function
>> that does:
>>>> push R10
>>>> jmp __x86_indirect_thunk_r11
>>>>
>>>> And then you are done. No?
>>>
>>> IIUC, that sounds pretty much like what Steven proposed:
>> https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Flkml.kernel.org%2Fr%2F20181129122000.7fb4fb04%40gandalf.local.home&amp;data=02%7C01%7Cnamit%40vmware.com%7Ca665a074940b4630e3fc08d677fa753b%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C636828314949874019&amp;sdata=fs2uo%2BjL%2FV3rpzIHJ%2B3QoyHg4KhV%2B%2FUPmUOLpy8S8p8%3D&amp;reserved=0
>>> I liked the idea, BUT, how would it work for callee-saved PV ops? In
>>> that case there's only one clobbered register to work with (rax).
>>
>> Actually, there's a much simpler model now that I think about it.
>>
>> The BP fixup just fixes up %rip to to point to "bp_int3_handler".
>>
>> And that's just a random text address set up by "text_poke_bp()".
>>
>> So how about the static call rewriting simply do this:
>>
>> - for each static call:
>>
>> 1) create a fixup code stub that does
>>
>> push $returnaddressforTHIScall
>> jmp targetforTHIScall
>>
>> 2) do
>>
>> on_each_cpu(do_sync_core, NULL, 1);
>>
>> to make sure all CPU's see this generated code
>>
>> 3) do
>>
>> text_poke_bp(addr, newcode, newlen, generatedcode);
>>
>> Ta-daa! Done.
>>
>> In fact, it turns out that even the extra "do_sync_core()" in #2 is
>> unnecessary, because taking the BP will be serializing on the CPU that
>> takes it, so we can skip it.
>>
>> End result: the text_poke_bp() function will do the two do_sync_core
>> IPI's that guarantee that by the time it returns, no other CPU is
>> using the generated code any more, so it can be re-used for the next
>> static call fixup.
>>
>> Notice? No odd emulation, no need to adjust the stack in the BP
>> handler, just the regular "return to a different IP".
>>
>> Now, there is a nasty special case with that stub, though.
>>
>> So nasty thing with the whole "generate a stub for each call" case:
>> because it's dynamic and because of the re-use of the stub, you could
>> be in the situation where:
>>
>> CPU1 CPU2
>> ---- ----
>>
>> generate a stub
>> on_each_cpu(do_sync_core..)
>> text_poke_bp()
>> ...
>>
>> rewrite to BP
>> trigger the BP
>> return to the stub
>> fun the first instruction of the stub
>> *INTERRUPT causes rescheduling*
>>
>> on_each_cpu(do_sync_core..)
>> rewrite to good instruction
>> on_each_cpu(do_sync_core..)
>>
>> free or re-generate the stub
>>
>> !! The stub is still in use !!
>>
>> So that simple "just generate the stub dynamically" isn't so simple
>> after all.
>>
>> But it turns out that that is really simple to handle too. How do we do
>> that?
>>
>> We do that by giving the BP handler *two* code sequences, and we make
>> the BP handler pick one depending on whether it is returning to a
>> "interrupts disabled" or "interrupts enabled" case.
>>
>> So the BP handler does this:
>>
>> - if we're returning with interrupts disabled, pick the simple stub
>>
>> - if we're returning with interrupts enabled, clkear IF in the return
>> %rflags, and pick a *slightly* more complex stub:
>>
>> push $returnaddressforTHIScall
>> sti
>> jmp targetforTHIScall
>>
>> and now the STI shadow will mean that this sequence is uninterruptible.
>>
>> So we'd not do complex emulation of the call instruction at BP time,
>> but we'd do that *trivial* change at BP time.
>>
>> This seems simple, doesn't need any temporary registers at all, and
>> doesn't need any extra stack magic. It literally needs just a trivial
>> sequence in poke_int3_handler().
>>
>> The we'd change the end of poke_int3_handler() to do something like
>> this instead:
>>
>> void *newip = bp_int3_handler;
>> ..
>> if (new == magic_static_call_bp_int3_handler) {
>> if (regs->flags &X86_FLAGS_IF) {
>> newip = magic_static_call_bp_int3_handler_sti;
>> regs->flags &= ~X86_FLAGS_IF;
>> }
>> regs->ip = (unsigned long) newip;
>> return 1;
>>
>> AAND now we're *really* done.
>>
>> Does anybody see any issues in this?
>>
>> Linus
>
> I still don't see why can't simply spin in the #BP handler until the patch is complete.
>
> We can't have the #BP handler do any additional patching, as previously discussed, but spinning should be perfectly safe. The simplest way to spin it to just IRET; that both serializes and will re-take the exception if the patch is still in progress.
>
> It requires exactly *no* awareness in the #BP handler, allows for the call to be replaced with inline code or a simple NOP if desired (or vice versa, as long as it is a single instruction.)
>
> If I'm missing something, then please educate me or point me to previous discussion; I would greatly appreciate it.

One thing that comes to mind is that text_poke_bp() runs after patching the
int3 and before patching in the instruction:

on_each_cpu(do_sync_core, NULL, 1);

If IRQs are disabled when the BP is hit, spinning can cause the system to
hang.