Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry

From: Ingo Molnar
Date: Fri May 01 2015 - 14:40:39 EST



* Rik van Riel <riel@xxxxxxxxxx> wrote:

> On 05/01/2015 12:34 PM, Ingo Molnar wrote:
> >
> > * Rik van Riel <riel@xxxxxxxxxx> wrote:
> >
> >>> I can understand people running hard-RT workloads not wanting to
> >>> see the overhead of a timer tick or a scheduler tick with variable
> >>> (and occasionally heavy) work done in IRQ context, but the jitter
> >>> caused by a single trivial IPI with constant work should be very,
> >>> very low and constant.
> >>
> >> Not if the realtime workload is running inside a KVM guest.
> >
> > I don't buy this:
> >
> >> At that point an IPI, either on the host or in the guest, involves a
> >> full VMEXIT & VMENTER cycle.
> >
> > So a full VMEXIT/VMENTER costs how much, 2000 cycles? That's around 1
> > usec on recent hardware, and I bet it will get better with time.
> >
> > I'm not aware of any hard-RT workload that cannot take 1 usec
> > latencies.
>
> Now think about doing this kind of IPI from inside a guest, to
> another VCPU on the same guest.
>
> Now you are looking at VMEXIT/VMENTER on the first VCPU,

Does it matter? It's not the hard-RT CPU, and this is a slowpath of
synchronize_rcu().

> plus the cost of the IPI on the host, plus the cost of the emulation
> layer, plus VMEXIT/VMENTER on the second VCPU to trigger the IPI
> work, and possibly a second VMEXIT/VMENTER for IPI completion.

Only the VMEXIT/VMENTER on the second VCPU matters to RT latencies.

> I suspect it would be better to do RCU callback offload in some
> other way.

Well, it's not just about callback offload, but it's about the basic
synchronization guarantee of synchronize_rcu(): that all RCU read-side
critical sections have finished executing after the call returns.

So even if a nohz-full CPU never actually queues a callback, it needs
to stop using resources that a synchronize_rcu() caller expects it to
stop using.

We can do that only if we know, in an SMP-coherent way, that the
remote CPU is not in an rcu_read_lock() section.

Sending an IPI is one way to achieve that.

Or we could do that in the syscall path with a single store of a
constant flag to a location in the task struct. We have a number of
natural flags that get written on syscall entry, such as:

	pushq_cfi $__USER_DS			/* pt_regs->ss */

That goes to a constant location on the kernel stack. On return from
system calls we could write 0 to that location.

So the CPU running synchronize_rcu() would have to do a read of this
location on the nohz-full CPU. There are two cases:

 - If it's 0, then it has observed a quiescent state on that CPU. (It
   does not have to use atomics, as we'd only observe the value
   and MESI coherency takes care of it.)

 - If it's not 0 then the nohz-full CPU is not executing user-space
   code and we can install (remotely) a TIF_NOHZ flag in it and expect
   it to process it either on return to user-space or on a context
   switch.

Unless I'm missing something, this reduces the overhead to a single
store to a hot cacheline on return-to-userspace - an instruction
that, if we place it well, might be close to zero cost. No syscall
entry cost. Slow-return cost only in the (rare) case of someone using
synchronize_rcu().
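
As a rough illustration of the two cases above, here is a user-space
C11 sketch, not kernel code. Everything in it is hypothetical: the
struct and function names, the USER_DS_MARKER value, and the
need_rcu_report field standing in for a remotely installed TIF_NOHZ.
It also uses sequentially consistent atomics and a re-check after
setting the flag, which is an assumption of the sketch to close the
window where the target stores 0 between the read and the flag write;
the scheme as described relies on plain loads/stores and cache
coherency, and the real ordering requirements would need separate
analysis.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for __USER_DS: any non-zero constant works. */
#define USER_DS_MARKER 0x2bUL

/* Hypothetical per-CPU state; the kernel has no such struct. */
struct cpu_state {
	atomic_ulong in_kernel;        /* models the pt_regs->ss slot */
	atomic_bool need_rcu_report;   /* models a remotely set TIF_NOHZ */
	unsigned long quiescent_reports; /* written by the owning CPU only */
};

static void cpu_state_init(struct cpu_state *cpu)
{
	atomic_init(&cpu->in_kernel, 0);   /* starts out in user space */
	atomic_init(&cpu->need_rcu_report, false);
	cpu->quiescent_reports = 0;
}

/* Syscall entry: the store that already happens when pushing ss. */
static void syscall_entry(struct cpu_state *cpu)
{
	atomic_store(&cpu->in_kernel, USER_DS_MARKER);
}

/* Return to user space: one store of 0, then the usual work-flag check. */
static void return_to_user(struct cpu_state *cpu)
{
	atomic_store(&cpu->in_kernel, 0);
	if (atomic_exchange(&cpu->need_rcu_report, false))
		cpu->quiescent_reports++;  /* slow path: report quiescent state */
}

/*
 * Called by the synchronize_rcu() side. Returns true if the target was
 * observed in user space (quiescent). Otherwise it leaves a request
 * for the target to process on its way out of the kernel. A request
 * left behind after a successful re-check only causes one spurious
 * report later, which is harmless in this sketch.
 */
static bool remote_sees_quiescent(struct cpu_state *cpu)
{
	if (atomic_load(&cpu->in_kernel) == 0)
		return true;
	atomic_store(&cpu->need_rcu_report, true);
	return atomic_load(&cpu->in_kernel) == 0;
}
```

In this model the common return path pays for the store of 0 plus a
flag check that the exit path performs anyway; the extra work only
runs when a remote CPU has actually requested a report.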

Hm?

Thanks,

Ingo