Re: [RFC GIT PULL] softirq: Consolidation and stack overrun fix

From: Benjamin Herrenschmidt
Date: Sun Sep 22 2013 - 18:39:25 EST


On Sun, 2013-09-22 at 15:22 -0700, Linus Torvalds wrote:
> On Sun, Sep 22, 2013 at 2:56 PM, Benjamin Herrenschmidt
> <benh@xxxxxxxxxxxxxxxxxxx> wrote:
> > On Sun, 2013-09-22 at 18:24 +0200, Peter Zijlstra wrote:
> >>
> >> We use a segment offset. Something like:
> >>
> >> inc %gs:var;
> >>
> >
> > And gcc makes no stupid assumptions that this gs doesn't change ? That's
> > the main problem we have with using r13 for PACA.
>
> Since gcc doesn't really know about segment registers at all (modulo
> %fs as TLS on x86), we do everything like that using inline asm.
>
> It's not *too* painful if you have a number of macro helpers to build
> up all the different versions.
>
> And r13 isn't volatile if you are preempt-safe, so I'm wondering if
> you could just make the preempt disable code mark %r13 as modified
> ("+r"). Then gcc won't ever cache r13 across one of those. And if you
> don't have preemption disabled, then you cannot do multiple ops using
> %r13 anyway, since on a load-store architecture it might change even
> between the load and store, so a per-cpu "add" operation *has* to
> cache the %r13 value in *another* register anyway, because using
> memory ops with just an offset off %r13 would be buggy.

The only other one is our soft-disable of irqs which is a byte off r13,
so technically a simple:

li r0,0
stb r0,PACASOFTIRQEN(r13)

Is a perfectly valid implementation of local_irq_disable() ;-) But that
sort of special case I think we can handle with special cases.

Right, using an r13 clobber in a dummy asm in things like
preempt_disable and local_irq_disable is so far the most interesting
option left on our table.

> So I don't think this is a gcc issue. gcc can't fix those kinds of problems.

Well, it would help if we had a way to tell gcc that a global register
variable is volatile. But it doesn't understand that concept.

> Personally, I'd suggest something like:
>
> - the paca stuff is just insane. Try to get rid of it.

I wish I could... it dig deep. The problem is fundamentally because with
MMU off (aka "real" mode), under a hypervisor, we have only access to a
portion of the partition memory (it's a fake real mode if you want, and
on older systems it's implemented as a physically contiguous chunk of
memory reserved by the hypervisor for the partition, and as such cannot
be arbitrarily large, newer CPUs support something smarter but with
still with some limitations).

That combined with the fact that all interrupts are delivered in real
mode, and you get the picture. KVM "userspace" (the variant that
emulates the guest as a user process) makes it worse as it needs to
maintain much more state in a place that is "real mode accessible".

We *might* be bale to repurpose r13 as our per-cpu offset and move a
chunk of stuff from the paca to per-cpu variables, but we will still
need "something" akin to the PACA for the guts of the exception
entry/exit and KVM.

> - use %r13 for the per-thread thread-info pointer instead. A
> per-thread pointer is *not* volatile like the per-cpu base is.

Yes, use it as a TLS, that would be an option I've been considering. It
could also be the task struct but thread_info might be slightly hotter.

That's what I had in mind when I mentioned using it as a TLS for
"current".

> - Now you can make the per-cpu offset be loaded off the per-thread
> pointer (update it at context switch). gcc knows to not cache it
> across function calls, since it's a memory access. Use ACCESS_ONCE()
> or something to make sure it's only loaded once for the cpu offset
> ops.

I have seen gcc completely disregard ACCESS_ONCE in the past though, at
least in the context of PACA relative accesses...

> Alternatively, make %r13 point to the percpu side, but make sure that
> you always use an asm accessor to fetch the value. In particular, I
> think you need to make __my_cpu_offset be an inline asm that fetches
> %r13 into some other register. Otherwise you can never get it right.

Yes, that's my plan B, at least for PACA but it would work for per-cpu
as well. Use a get_paca/put_paca style construct which fetches is in
another register (and does the preempt enable/disable while at it) and
for the handful of cases where we do want direct and faster access such
as manipulating the soft-irq flag, an inline asm load or store which is
what we do today.

Also with LE, we still have a bit of potential for tweaking the ABI and
the compiler, so that's something I want to put on the table.

Cheers,
Ben.


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/