Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry

From: Ingo Molnar
Date: Fri May 01 2015 - 11:59:41 EST



* Rik van Riel <riel@xxxxxxxxxx> wrote:

> > I.e. what's the baseline we are talking about?
>
> It's an astounding difference. This is not a kernel without
> nohz_full, just a CPU without nohz_full running the same kernel I
> tested with yesterday:
>
>                   run time    system time
> vanilla            5.49s       2.08s
> __acct patch       5.21s       1.92s
> both patches       4.88s       1.71s
> CPU w/o nohz       3.12s       1.63s    <-- your numbers, mostly
>
> What is even more interesting is that the majority of the time
> difference seems to come from _user_ time, which has gone down from
> around 3.4 seconds in the vanilla kernel to around 1.5 seconds on
> the CPU without nohz_full enabled...
>
> At syscall entry time, the nohz_full context tracking code is very
> straightforward. We check thread_info->flags &
> _TIF_WORK_SYSCALL_ENTRY, and call syscall_trace_enter_phase1, which
> handles USER -> KERNEL context transition.
>
> Syscall exit time is a convoluted mess. Both do_notify_resume and
> syscall_trace_leave call user_exit() on entry and user_enter() on
> exit, leaving the time spent looping around between int_with_check
> and syscall_return: in entry_64.S accounted as user time.
>
> I sent an email about this last night, it may be useful to add a
> third test & function call point to the syscall return code, where
> we can call user_enter() just ONCE, and remove the other context
> tracking calls from that loop.
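
The user-time figures quoted above can be recovered from the table. A small sketch, assuming run time is simply user time plus system time (an assumption; the quoted mail does not say how run time was measured), with the dictionary names being illustrative only:

```python
# Timings copied from the quoted table: (run time, system time) in seconds.
timings = {
    "vanilla":      (5.49, 2.08),
    "__acct patch": (5.21, 1.92),
    "both patches": (4.88, 1.71),
    "CPU w/o nohz": (3.12, 1.63),
}

# Assumption: run time = user time + system time, so user = run - system.
user_time = {name: round(run - system, 2)
             for name, (run, system) in timings.items()}

for name, user in user_time.items():
    print(f"{name:<14} user ~ {user:.2f}s")
```

Under that assumption the vanilla kernel comes out at roughly 3.4s of user time and the CPU without nohz_full at roughly 1.5s, matching the figures in the quoted text.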

So what I'm wondering about is the big picture:

- This is crazy big overhead in something as fundamental as system
calls!

- We don't even have the excuse of the syscall auditing code, which
kind of has to run for every syscall if it wants to do its job!

- [ The 'precise vtime' stuff that is driven from syscall entry/exit
is crazy, and I hope not enabled in any distro. ]

- So why are we doing this on every syscall at all?

Basically the whole point of user-context tracking is to be able to
flush pending RCU callbacks. But that's crazy, we can surely defer a
few kfree()s on this CPU, even indefinitely!

If some other CPU does a synchronize_rcu(), then it can very well pluck those
callbacks from this super low latency CPU's RCU lists (with due care)
and go and free stuff itself ... There's no need to disturb this CPU
for that!
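
A toy model of that idea, in Python rather than kernel C. Every name here (Cpu, call_rcu as a method, drain_remote, the interrupted counter) is hypothetical, grace-period tracking and the locking implied by "with due care" are left out entirely, and none of this reflects how the kernel's RCU code is actually structured:

```python
from collections import deque


class Cpu:
    """Toy stand-in for a CPU with its own list of deferred callbacks."""

    def __init__(self, cpu_id, in_userspace=False):
        self.cpu_id = cpu_id
        self.in_userspace = in_userspace
        self.callbacks = deque()   # pending deferred frees
        self.interrupted = 0       # times this CPU was disturbed

    def call_rcu(self, fn):
        # Queue a callback locally; nothing runs yet.
        self.callbacks.append(fn)


def drain_remote(caller, cpus):
    """Run callbacks queued on CPUs sitting in user space, on `caller`.

    The remote CPU is never interrupted: its list is emptied by the caller.
    """
    ran = 0
    for cpu in cpus:
        if cpu is caller or not cpu.in_userspace:
            continue
        while cpu.callbacks:
            cpu.callbacks.popleft()()   # executed by `caller`, not by `cpu`
            ran += 1
    return ran


freed = []
housekeeping = Cpu(0)
isolated = Cpu(3, in_userspace=True)
isolated.call_rcu(lambda: freed.append("obj_a"))
isolated.call_rcu(lambda: freed.append("obj_b"))
ran = drain_remote(housekeeping, [housekeeping, isolated])
```

In this sketch the isolated CPU's list ends up empty and its interrupted counter stays at zero, which is the property being argued for.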

If user-space does not do anything kernel-ish then there won't be any
new RCU callbacks piled up, so it's not like it's a resource leak
issue either.

So what's the point? Why not remove this big source of overhead
altogether?

Thanks,

Ingo