Re: [patch 0/7] timers: Footprint diet and NOHZ overhead mitigation

From: Eric Dumazet
Date: Wed May 27 2015 - 10:53:39 EST


On Tue, 2015-05-26 at 22:50 +0000, Thomas Gleixner wrote:
> I still have a couple of patches against the timer code in my review
> list, but the more I look at them the more horrible I find them.
>
> All these patches are related to the NOHZ stuff and try to mitigate
> shortcomings with even more bandaids. And of course these bandaids
> cost cycles and are a burden for timer heavy use cases like
> networking.
>
> Sadly enough the NOHZ crowd is happy to throw more and more crap at
> the kernel and none of these people is even thinking about whether
> this stuff could be solved in a different way.
>
> After Eric mentioned in one of the recent discussions that the
> timer_migration sysctl does not buy a lot, I tried to mitigate at
> least that issue. That caused me to look deeper and I came up with the
> following patch series which:
>
> - Clean up the sloppy catchup timer jiffies stuff
>
> - Replace the hash bucket lists by hlists, which shrinks the wheel
> footprint by 50% (a sketch follows after this list)
>
> - Replace the timer base pointer in struct timer_list with a cpu
> index, which shrinks struct timer_list by 4 - 8 bytes depending on
> alignment and architecture.
>
> - Cache nohz_active and timer_migration in the per cpu timer bases,
> so we can avoid jumping through hoops to figure that out.
>
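As a quick illustration of the size argument, here is a rough,
userspace-compilable sketch; the structs below are simplified stand-ins
with invented layouts, not the actual kernel definitions and not the
actual patch:

    /* Simplified stand-ins, only meant to show where the savings come from. */
    #include <stdio.h>
    #include <stdint.h>
    #include <stdbool.h>

    struct list_head  { struct list_head *next, *prev; };  /* 2 pointers per bucket */
    struct hlist_node { struct hlist_node *next, **pprev; };
    struct hlist_head { struct hlist_node *first; };        /* 1 pointer per bucket  */

    /* Stand-in for the per cpu base: caching the two flags here is what
     * the last item in the list above refers to. */
    struct tvec_base {
            /* ... hash buckets would live here ... */
            bool nohz_active;
            bool migration_enabled;
    };

    struct timer_old {                      /* base pointer variant */
            struct list_head entry;
            unsigned long expires;
            struct tvec_base *base;
            void (*function)(unsigned long);
            unsigned long data;
            int slack;
    };

    struct timer_new {                      /* cpu index variant */
            struct hlist_node entry;
            unsigned long expires;
            void (*function)(unsigned long);
            unsigned long data;
            uint32_t flags;                 /* low bits hold the cpu number */
            int slack;
    };

    int main(void)
    {
            printf("bucket head: list_head %zu vs hlist_head %zu bytes\n",
                   sizeof(struct list_head), sizeof(struct hlist_head));
            printf("timer: old %zu vs new %zu bytes\n",
                   sizeof(struct timer_old), sizeof(struct timer_new));
            return 0;
    }

On a 64bit build this prints a 16 vs 8 byte bucket head and a 56 vs 48
byte timer, which is where the 50% wheel shrink and the 4 - 8 byte
per-timer saving come from.
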
> This series applies against
>
> git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git timer/core
>
> With a pretty naive timer test module and a SCHED_OTHER cpu hog on an
> isolated core I verified that I did not wreck anything. The
> following table shows the resulting CPU time of the hog for the
> various scenarios.
>
>               nohz=on             nohz=on              nohz=off
>               timer_migration=on  timer_migration=off
>
> Unpatched:    46.57%              46.87%               46.64%
>
> Patched:      47.76%              48.20%               48.73%
>
> That said, I do not trust my own numbers, so I really hope that Eric
> or any other power user of the timer wheel can throw a bunch of tests
> at it.
>
>
> Now some more rant about the whole NOHZ issues. The timer wheel is
> fundamentally unfriendly for NOHZ because:
>
> - it's impossible to keep reliably track of the next expiring timer
>
> - finding the next expiring timer is in the worst case O(N) and
> involves touching a gazillion cache lines
>
> The other disaster area is the nohz timer migration mechanism. I think
> it's fundamentally broken.
>
> - We decide at timer enqueue time on which CPU we queue the timer,
> based on cpu idle heuristics which even fail to recognize that a
> cpu is really busy with interrupt and softirq processing (reported
> by Eric).
>
> - When moving a timer to some "non-idle" CPU we speculate about the
> system situation in the future (at timer expiry time). This kinda
> works for limited scenarios (NOHZ_FULL and constrained power saving
> on mobile devices). But it is pretty much nonsensical for
> everything else. For network heavy workloads it can even be
> outright wrong and dangerous, as Eric explained.
>
> So we really need to put a full stop to all this NOHZ tinkering and
> think about proper solutions. I'm not going to merge any NOHZ related
> features^Whacks before the existing problems are solved. In
> hindsight I should have done that earlier, but it's never too late.
>
> So we need to address two issues:
>
> 1) Keeping track of the first expiring timer in the wheel.
>
> Don't even think about rbtrees or such, it just won't work, but I'm
> willing to accept proof to the contrary.
>
> One of the ideas I had a long time ago is a wheel implementation
> which never cascades and therefore provides a different granularity
> per wheel level.
>
> LVL0   1 jiffy granularity
> LVL1   8 jiffies granularity
> LVL2  64 jiffies granularity
> ....
>
> This requires more levels than the classic timer wheel, so it's not
> feasible from a memory consumption POV.
>
> But we can have a compromise and put all 'soon' expiring timers
> into these fancy new wheels and stick all 'far out' timers into the
> last level of the wheel and cascade them occasionally.
>
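To make the granularity idea concrete, here is a hypothetical toy
model; the constants and the calc_index() helper are invented for
illustration and are not taken from the WIP patch. Each level has 64
buckets whose width grows by a factor of 8 per level, and anything
beyond the range of the last level simply lands in the last level:

    #include <stdio.h>

    #define LVL_CLK_SHIFT   3               /* granularity grows 8x per level */
    #define LVL_BITS        6               /* 64 buckets per level */
    #define LVL_SIZE        (1UL << LVL_BITS)
    #define LVL_DEPTH       5               /* pick a sensible upper bound */

    /* Map an expiry delta (in jiffies) to a flat bucket index. */
    static unsigned int calc_index(unsigned long delta)
    {
            unsigned int lvl = 0;

            /* Walk up the levels until the delta fits into the level's
             * range; anything larger than the last level's range stays
             * there and would be recascaded later. */
            while (lvl < LVL_DEPTH - 1 &&
                   delta >= (LVL_SIZE << (lvl * LVL_CLK_SHIFT)))
                    lvl++;

            /* Rounding to the level granularity gives natural batching. */
            return lvl * LVL_SIZE +
                   ((delta >> (lvl * LVL_CLK_SHIFT)) & (LVL_SIZE - 1));
    }

    int main(void)
    {
            unsigned long deltas[] = { 1, 50, 500, 5000, 500000 };

            for (unsigned int i = 0; i < 5; i++)
                    printf("delta %6lu jiffies -> bucket %3u\n",
                           deltas[i], calc_index(deltas[i]));
            return 0;
    }

Running it shows how larger deltas get rounded into coarser buckets,
which is the natural batching mentioned below.
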
> That should work because:
>
> - Most timers are short term expiry (< 1h)
> - Most timers are canceled before expiry
>
> So we need a sensible upper bound on the number of levels, which
> gives us the following properties:
>
> - Natural batching (no more slack magic). This might also help
> networking avoid rearming timers.
>
> - Far out timers are queued in the slowest wheel level and are
> cascaded occasionally, with the granularity of that level.
>
> - Tracking the next expiring timer can be done with a bitmap,
> so 'get next expiring timer' becomes O(1) without touching
> anything other than the bitmap, if we accept that the upper
> limit of what we can retrieve in O(1) is the granularity of
> the last level, i.e. we treat timers which need recascading
> simply as normal inhabitants of the last wheel level. (A
> sketch of the bitmap lookup follows after this list.)
>
> - The expiry code (run_timers) gets more complicated as it
> needs to walk through the different levels more often, but
> with the bitmap that shouldn't be too big an issue.
>
> - Separate storage for non-deferrable and deferrable timers.
>
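Again purely illustrative and not the WIP code (array size and helper
names are made up to match the toy model above): one pending bit per
bucket, so finding the next expiring bucket only reads the small
bitmap, never the bucket lists themselves.

    #include <stdio.h>
    #include <limits.h>

    #define WHEEL_BUCKETS   320     /* LVL_DEPTH * LVL_SIZE from the toy model */
    #define LONG_BITS       (CHAR_BIT * sizeof(unsigned long))
    #define MAP_WORDS       ((WHEEL_BUCKETS + LONG_BITS - 1) / LONG_BITS)

    static unsigned long pending_map[MAP_WORDS];

    static void mark_pending(unsigned int idx)
    {
            pending_map[idx / LONG_BITS] |= 1UL << (idx % LONG_BITS);
    }

    /* Return the first pending bucket, or -1 if the wheel is empty.
     * Cost is a fixed handful of word reads plus a find-first-set,
     * independent of how many timers are queued. */
    static int next_pending_bucket(void)
    {
            for (unsigned int i = 0; i < MAP_WORDS; i++)
                    if (pending_map[i])
                            return (int)(i * LONG_BITS) +
                                   __builtin_ctzl(pending_map[i]);
            return -1;
    }

    int main(void)
    {
            mark_pending(201);      /* e.g. a timer ~5000 jiffies out */
            mark_pending(314);      /* and one sitting in the last level */
            printf("next pending bucket: %d\n", next_pending_bucket());
            return 0;
    }
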
> I spent some time coding that up. It barely boots and certainly has
> a few bugs left and right, but I will send it out nevertheless as a
> reply to this mail to get the discussion going.

Thanks a lot, Thomas!

I finally got back two hosts with 72 cpus where I can try your patches.

I have one comment about TCP socket timers: they are re-armed very
often, but they are never canceled (unless the socket is dismantled).

I am too lazy to search for the commit that added this behavior (it
predates git), but I guess it helped RPC sessions a lot: no need to
cancel a timer if we're likely to rearm it within the next 10 usec ;)

$ git grep -n INET_CSK_CLEAR_TIMERS
include/net/inet_connection_sock.h:29:#undef INET_CSK_CLEAR_TIMERS
include/net/inet_connection_sock.h:199:#ifdef INET_CSK_CLEAR_TIMERS
include/net/inet_connection_sock.h:204:#ifdef INET_CSK_CLEAR_TIMERS
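
For context, those #ifdefs guard the actual timer deletion inside the
clearing helper, so with the symbol #undef'ed the pending flag is
cleared but the timer itself stays queued and simply gets re-armed in
place. A paraphrased illustration of the pattern, reconstructed from
memory rather than copied verbatim from the tree:

    /* Paraphrased sketch of the pattern in
     * include/net/inet_connection_sock.h -- field names from memory,
     * not the exact source. */
    static inline void inet_csk_clear_xmit_timer(struct sock *sk, const int what)
    {
            struct inet_connection_sock *icsk = inet_csk(sk);

            if (what == ICSK_TIME_RETRANS) {
                    icsk->icsk_pending = 0;   /* logically disarmed ... */
    #ifdef INET_CSK_CLEAR_TIMERS
                    /* ... but the timer is only detached when this
                     * (normally #undef'ed) symbol is defined. */
                    sk_stop_timer(sk, &icsk->icsk_retransmit_timer);
    #endif
            }
            /* the delayed-ack case (the second #ifdef in the grep above)
             * follows the same scheme */
    }

So under normal builds the only thing that ever detaches these timers
is socket teardown, which matches the "rearmed often, canceled never"
behavior described above.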

