[RFD] timer: Proof of concept wheel replacement

From: Thomas Gleixner
Date: Tue May 26 2015 - 19:01:24 EST


On Tue, 26 May 2015, Thomas Gleixner wrote:
> So we need to address two issues:
>
> 1) Keeping track of the first expiring timer in the wheel.
>
> Don't even think about rbtrees or such, it just won't work, but I'm
> willing to accept proof of the contrary.
>
> One of the ideas I had a long time ago is to have a wheel
> implementation, which does never cascade and therefor provides
> different levels of granularity per wheel level.
>
> LVL0 1 jiffy granularity
> LVL1 8 jiffies granularity
> LVL2 64 jiffies granularity
> ....
>
> This requires more levels than the classic timer wheel, so it's not
> feasible from a memory consumption POV.
>
> But we can have a compromise and put all 'soon' expiring timers
> into these fancy new wheels and stick all 'far out' timers into the
> last level of the wheel and cascade them occasionally.
>
> That should work because:
>
> - Most timers have a short term expiry (< 1h)
> - Most timers are canceled before expiry
>
> So we need a sensible upper bound of levels and get the following
> properties:
>
> - Natural batching (no more slack magic). This might also help
> networking to avoid rearming of timers.
>
> - Long out timers are queued in the slowest wheel level and
> occasionally cascaded with the granularity of the
> slowest level
>
> - Tracking the next expiring timer can be done with a bitmap,
> so the 'get next expiring timer' becomes O(1) without
> touching anything other than the bitmap, if we accept that
> the upper limit of what we can retrieve in O(1) is the
> granularity of the last level, i.e. we treat timers which
> need recascading simply as normal inhabitants of the last
> wheel level.
>
> - The expiry code (run_timers) gets more complicated as it
> needs to walk through the different levels more often, but
> with the bitmap that shouldn't be too big an issue.
>
> - Separate storage for non-deferrable and deferrable timers.
>
> I spent some time coding that up. It barely boots and certainly
> has a few bugs left and right, but I will send it out
> nevertheless as a reply to this mail to get the discussion going.
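
To make the quoted level/bucket scheme a bit more concrete, here is a
small stand-alone user space sketch of how the level and bucket
selection could look. This is not the patch code: the helper name is
made up, the constants just mirror the LVL_* defines below, and the
real __internal_add_timer() uses unrolled per level comparisons with
slightly different thresholds:

#include <stdio.h>

#define LVL_BITS	5			/* 32 buckets per level */
#define LVL_SIZE	(1 << LVL_BITS)
#define LVL_MASK	(LVL_SIZE - 1)
#define LVL_CLK_SHIFT	3			/* granularity * 8 per level */
#define LVL_DEPTH	8

/* Hash index (level * LVL_SIZE + bucket) for a given expiry time */
static unsigned int calc_hash_idx(unsigned long expires, unsigned long clk)
{
	unsigned long delta = expires - clk;
	unsigned int lvl;

	/* Find the first level whose range still covers the delta */
	for (lvl = 0; lvl < LVL_DEPTH - 1; lvl++) {
		if (delta < (unsigned long)LVL_SIZE << (lvl * LVL_CLK_SHIFT))
			break;
	}
	/* Select the bucket with the granularity of that level */
	return lvl * LVL_SIZE +
		((expires >> (lvl * LVL_CLK_SHIFT)) & LVL_MASK);
}

int main(void)
{
	unsigned long now = 1000000;

	/* Short delta lands in level 0, long delta in a coarse level */
	printf("%u\n", calc_hash_idx(now + 5, now));
	printf("%u\n", calc_hash_idx(now + 100000, now));
	return 0;
}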

Here is the proof of concept implementation for the above.

Be aware that it might eat your disk, kill your cat and make your kids
miss the bus, but with some care one should be able to tame the
beast and get some real testing done.

I have no idea how that behaves performance-wise as I was trying to
shake out the bugs.
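
For reference, the granularity/range tables in the comment block of
the patch can be reproduced from the LVL_SIZE / LVL_CLK_SHIFT
constants. Something like the following throwaway user space snippet
prints them (illustrative only; the ms conversion simply assumes the
chosen HZ):

#include <stdio.h>

#define LVL_SIZE	32
#define LVL_CLK_SHIFT	3
#define LVL_DEPTH	8

int main(void)
{
	unsigned long long hz = 1000;		/* also try 250 and 100 */
	unsigned int lvl;

	for (lvl = 0; lvl < LVL_DEPTH; lvl++) {
		/* Granularity and range of this level in jiffies */
		unsigned long long gran  = 1ULL << (lvl * LVL_CLK_SHIFT);
		unsigned long long start = lvl ? LVL_SIZE * (gran >> LVL_CLK_SHIFT) : 0;
		unsigned long long end   = LVL_SIZE * gran - 1;

		printf("%u: %8llu ms  %10llu ms - %10llu ms\n", lvl,
		       gran * 1000 / hz, start * 1000 / hz, end * 1000 / hz);
	}
	return 0;
}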

Not-Signed-off-by: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
---
include/linux/list.h | 10
include/linux/timer.h | 2
include/trace/events/timer.h | 11
kernel/time/timer.c | 657 ++++++++++++++++++++++---------------------
4 files changed, 369 insertions(+), 311 deletions(-)

Index: tip/include/linux/list.h
===================================================================
--- tip.orig/include/linux/list.h
+++ tip/include/linux/list.h
@@ -673,6 +673,16 @@ static inline void hlist_add_fake(struct
}

/*
+ * Check whether the node is the only node (i.e. first and last) of
+ * the head without accessing head.
+ */
+static inline bool hlist_is_last_node(struct hlist_node *n,
+ struct hlist_head *h)
+{
+ return !n->next && n->pprev == &h->first;
+}
+
+/*
* Move a list from one list head to another. Fixup the pprev
* reference of the first entry if it exists.
*/
Index: tip/include/linux/timer.h
===================================================================
--- tip.orig/include/linux/timer.h
+++ tip/include/linux/timer.h
@@ -63,6 +63,8 @@ struct timer_list {
#define TIMER_BASEMASK (TIMER_CPUMASK | TIMER_MIGRATING)
#define TIMER_DEFERRABLE 0x00100000
#define TIMER_IRQSAFE 0x00200000
+#define TIMER_HASHSHIFT 23
+#define TIMER_HASHMASK 0xFF800000

#define __TIMER_INITIALIZER(_function, _expires, _data, _flags) { \
.entry = { .next = TIMER_ENTRY_STATIC }, \
Index: tip/include/trace/events/timer.h
===================================================================
--- tip.orig/include/trace/events/timer.h
+++ tip/include/trace/events/timer.h
@@ -126,6 +126,17 @@ DEFINE_EVENT(timer_class, timer_cancel,
);

/**
+ * timer_cascade - called when the timer is cascaded
+ * @timer: pointer to struct timer_list
+ */
+DEFINE_EVENT(timer_class, timer_cascade,
+
+ TP_PROTO(struct timer_list *timer),
+
+ TP_ARGS(timer)
+);
+
+/**
* hrtimer_init - called when the hrtimer is initialized
* @hrtimer: pointer to struct hrtimer
* @clockid: the hrtimers clock
Index: tip/kernel/time/timer.c
===================================================================
--- tip.orig/kernel/time/timer.c
+++ tip/kernel/time/timer.c
@@ -59,42 +59,150 @@ __visible u64 jiffies_64 __cacheline_ali
EXPORT_SYMBOL(jiffies_64);

/*
- * per-CPU timer vector definitions:
+ * The timer wheel has LVL_DEPTH hash levels. Each level provides a
+ * hash of LVL_SIZE hash buckets. Each level is driven by its own
+ * clock and therefore each level has a different granularity.
+ *
+ * The level granularity is: LVL_CLK_DIV ^ level (in jiffies)
+ * The level clock frequency is: HZ / (LVL_CLK_DIV ^ level)
+ *
+ * The hash level of a newly armed timer depends on the relative
+ * expiry time. The farther away the expiry time is, the higher the
+ * hash level and therefore the coarser the granularity becomes.
+ *
+ * Contrary to the original timer wheel implementation, which aims for
+ * 'exact' expiry of the timers, this implementation mostly removes
+ * the need for recascading the timers into the lower hash levels. The
+ * previous 'classic' timer wheel implementation of the kernel already
+ * violated the 'exact' expiry by adding slack to the expiry time to
+ * provide batched expiries. The granularity levels provide implicit
+ * batching.
+ *
+ * This is an optimization of the original timer wheel implementation
+ * for the majority of the timer wheel use cases: timeouts. The vast
+ * majority of timeout timers (networking, disk I/O ...) are canceled
+ * before expiry. If the timeout expires it indicates that normal
+ * operation is disturbed, so it does not matter much whether the
+ * timeout comes with a slight delay.
+ *
+ * The currently chosen hash constants are a good compromise
+ * between hash size and granularity. This results in the following
+ * granularity and range levels:
+ *
+ * HZ 1000
+ * Level Offset Granularity Range
+ * 0 0 1 ms 0 ms - 31 ms
+ * 1 32 8 ms 32 ms - 255 ms
+ * 2 64 64 ms 256 ms - 2047 ms (256ms - ~2s)
+ * 3 96 512 ms 2048 ms - 16383 ms (~2s - ~16s)
+ * 4 128 4096 ms (~4s) 16384 ms - 131071 ms (~16s - ~2m)
+ * 5 160 32768 ms (~32s) 131072 ms - 1048575 ms (~2m - ~17m)
+ * 6 192 262144 ms (~4m) 1048576 ms - 8388607 ms (~17m - ~2h)
+ * 7 224 2097152 ms (~34m) 8388608 ms - 67108863 ms (~2h - ~18h)
+ *
+ * HZ 250
+ * Level Offset Granularity Range
+ * 0 0 4 ms 0 ms - 124 ms
+ * 1 32 32 ms 128 ms - 1020 ms (128ms - ~1s)
+ * 2 64 256 ms 1024 ms - 8188 ms (~1s - ~8s)
+ * 3 96 2048 ms (~2s) 8192 ms - 65532 ms (~8s - ~1m)
+ * 4 128 16384 ms (~16s) 65536 ms - 524284 ms (~1m - ~8m)
+ * 5 160 131072 ms (~2m) 524288 ms - 4194300 ms (~8m - ~1h)
+ * 6 192 1048576 ms (~17m) 4194304 ms - 33554428 ms (~1h - ~9h)
+ * 7 224 8388608 ms (~2h) 33554432 ms - 268435452 ms (~9h - ~3d)
+ *
+ * HZ 100
+ * Level Offset Granularity Range
+ * 0 0 10 ms 0 ms - 310 ms
+ * 1 32 80 ms 320 ms - 2550 ms (320ms - ~2s)
+ * 2 64 640 ms 2560 ms - 20470 ms (~2s - ~20s)
+ * 3 96 5120 ms (~5s) 20480 ms - 163830 ms (~20s - ~2m)
+ * 4 128 40960 ms (~40s) 163840 ms - 1310710 ms (~2m - ~21m)
+ * 5 160 327680 ms (~5m) 1310720 ms - 10485750 ms (~21m - ~2h)
+ * 6 192 2621440 ms (~43m) 10485760 ms - 83886070 ms (~2h - ~23h)
+ * 7 224 20971520 ms (~5h) 83886080 ms - 671088630 ms (~23h - ~7d)
+ */
+
+/* Size of each clock level */
+#define LVL_SIZE 32
+#define LVL_SHIFT 5
+#define LVL_MASK (LVL_SIZE - 1)
+
+/* Level depth */
+#define LVL_DEPTH 8
+
+/*
+ * The resulting hash size. If NOHZ is configured we allocate two
+ * hashes so we have separate storage for the deferrable timers.
*/
-#define TVN_BITS (CONFIG_BASE_SMALL ? 4 : 6)
-#define TVR_BITS (CONFIG_BASE_SMALL ? 6 : 8)
-#define TVN_SIZE (1 << TVN_BITS)
-#define TVR_SIZE (1 << TVR_BITS)
-#define TVN_MASK (TVN_SIZE - 1)
-#define TVR_MASK (TVR_SIZE - 1)
-#define MAX_TVAL ((unsigned long)((1ULL << (TVR_BITS + 4*TVN_BITS)) - 1))
+#define HASH_SIZE (LVL_SIZE * LVL_DEPTH)
+#ifdef CONFIG_NO_HZ_COMMON
+# define TOT_LVL_DEPTH (LVL_DEPTH * 2)
+# define TOT_HASH_SIZE (HASH_SIZE * 2)
+# define HASH_DEFERRABLE_OFFSET (HASH_SIZE)
+#else
+# define TOT_LVL_DEPTH (LVL_DEPTH)
+# define TOT_HASH_SIZE (HASH_SIZE)
+# define HASH_DEFERRABLE_OFFSET (0)
+#endif

-struct tvec {
- struct hlist_head vec[TVN_SIZE];
-};
+/* Level offsets in the hash */
+#define LVL0_OFFS (0)
+#define LVL1_OFFS (LVL_SIZE)
+#define LVL2_OFFS (LVL1_OFFS + LVL_SIZE)
+#define LVL3_OFFS (LVL2_OFFS + LVL_SIZE)
+#define LVL4_OFFS (LVL3_OFFS + LVL_SIZE)
+#define LVL5_OFFS (LVL4_OFFS + LVL_SIZE)
+#define LVL6_OFFS (LVL5_OFFS + LVL_SIZE)
+#define LVL7_OFFS (LVL6_OFFS + LVL_SIZE)
+
+/* Clock divisor for the next level */
+#define LVL_CLK_SHIFT 3
+#define LVL_CLK_DIV (1 << LVL_CLK_SHIFT)
+#define LVL_CLK_MASK (LVL_CLK_DIV - 1)
+
+/* The shift constants for selecting the bucket at the levels */
+#define LVL1_SHIFT (1 * LVL_CLK_SHIFT)
+#define LVL2_SHIFT (2 * LVL_CLK_SHIFT)
+#define LVL3_SHIFT (3 * LVL_CLK_SHIFT)
+#define LVL4_SHIFT (4 * LVL_CLK_SHIFT)
+#define LVL5_SHIFT (5 * LVL_CLK_SHIFT)
+#define LVL6_SHIFT (6 * LVL_CLK_SHIFT)
+#define LVL7_SHIFT (7 * LVL_CLK_SHIFT)
+
+/* The granularity of each level */
+#define LVL0_GRAN 0x00000001
+#define LVL1_GRAN (LVL0_GRAN << LVL_CLK_SHIFT)
+#define LVL2_GRAN (LVL1_GRAN << LVL_CLK_SHIFT)
+#define LVL3_GRAN (LVL2_GRAN << LVL_CLK_SHIFT)
+#define LVL4_GRAN (LVL3_GRAN << LVL_CLK_SHIFT)
+#define LVL5_GRAN (LVL4_GRAN << LVL_CLK_SHIFT)
+#define LVL6_GRAN (LVL5_GRAN << LVL_CLK_SHIFT)
+#define LVL7_GRAN (LVL6_GRAN << LVL_CLK_SHIFT)

-struct tvec_root {
- struct hlist_head vec[TVR_SIZE];
-};
+/*
+ * The time start value for each level to select the bucket at enqueue
+ * time.
+ */
+#define LVL1_TSTART (LVL_SIZE - 1)
+#define LVL2_TSTART (LVL1_TSTART << LVL_CLK_SHIFT)
+#define LVL3_TSTART (LVL2_TSTART << LVL_CLK_SHIFT)
+#define LVL4_TSTART (LVL3_TSTART << LVL_CLK_SHIFT)
+#define LVL5_TSTART (LVL4_TSTART << LVL_CLK_SHIFT)
+#define LVL6_TSTART (LVL5_TSTART << LVL_CLK_SHIFT)
+#define LVL7_TSTART (LVL6_TSTART << LVL_CLK_SHIFT)

struct tvec_base {
- spinlock_t lock;
- struct timer_list *running_timer;
- unsigned long timer_jiffies;
- unsigned long next_timer;
- unsigned long active_timers;
- unsigned long all_timers;
- int cpu;
- bool migration_enabled;
- bool nohz_active;
- struct tvec_root tv1;
- struct tvec tv2;
- struct tvec tv3;
- struct tvec tv4;
- struct tvec tv5;
+ spinlock_t lock;
+ struct timer_list *running_timer;
+ unsigned long timer_jiffies;
+ unsigned int cpu;
+ bool migration_enabled;
+ bool nohz_active;
+ DECLARE_BITMAP(pending_map, TOT_HASH_SIZE);
+ struct hlist_head vectors[TOT_HASH_SIZE];
} ____cacheline_aligned;

-
static DEFINE_PER_CPU(struct tvec_base, tvec_bases);

#if defined(CONFIG_SMP) && defined(CONFIG_NO_HZ_COMMON)
@@ -371,84 +479,77 @@ void set_timer_slack(struct timer_list *
EXPORT_SYMBOL_GPL(set_timer_slack);

/*
- * If the list is empty, catch up ->timer_jiffies to the current time.
- * The caller must hold the tvec_base lock. Returns true if the list
- * was empty and therefore ->timer_jiffies was updated.
+ * Helper function to calculate the array index for a given expiry
+ * time.
*/
-static inline bool catchup_timer_jiffies(struct tvec_base *base)
+static inline unsigned calc_index(unsigned expires, unsigned gran,
+ unsigned sft, unsigned base)
{
- if (!base->all_timers) {
- base->timer_jiffies = jiffies;
- return true;
- }
- return false;
+ return base + (((expires + gran) >> sft) & LVL_MASK);
+}
+
+static inline unsigned int get_hash_offset(struct tvec_base *base,
+ struct timer_list *timer)
+{
+ /* Compiled out if CONFIG_NO_HZ_COMMON=n */
+ if ((timer->flags & TIMER_DEFERRABLE) && base->nohz_active)
+ return HASH_DEFERRABLE_OFFSET;
+ return 0;
}

static void
__internal_add_timer(struct tvec_base *base, struct timer_list *timer)
{
unsigned long expires = timer->expires;
- unsigned long idx = expires - base->timer_jiffies;
+ unsigned long delta = expires - base->timer_jiffies;
struct hlist_head *vec;
+ unsigned int idx;

- if (idx < TVR_SIZE) {
- int i = expires & TVR_MASK;
- vec = base->tv1.vec + i;
- } else if (idx < 1 << (TVR_BITS + TVN_BITS)) {
- int i = (expires >> TVR_BITS) & TVN_MASK;
- vec = base->tv2.vec + i;
- } else if (idx < 1 << (TVR_BITS + 2 * TVN_BITS)) {
- int i = (expires >> (TVR_BITS + TVN_BITS)) & TVN_MASK;
- vec = base->tv3.vec + i;
- } else if (idx < 1 << (TVR_BITS + 3 * TVN_BITS)) {
- int i = (expires >> (TVR_BITS + 2 * TVN_BITS)) & TVN_MASK;
- vec = base->tv4.vec + i;
- } else if ((signed long) idx < 0) {
- /*
- * Can happen if you add a timer with expires == jiffies,
- * or you set a timer to go off in the past
- */
- vec = base->tv1.vec + (base->timer_jiffies & TVR_MASK);
+ if (delta < LVL1_TSTART) {
+ idx = (expires + LVL0_GRAN) & LVL_MASK;
+ } else if (delta < LVL2_TSTART) {
+ idx = calc_index(expires, LVL1_GRAN, LVL1_SHIFT, LVL1_OFFS);
+ } else if (delta < LVL3_TSTART) {
+ idx = calc_index(expires, LVL2_GRAN, LVL2_SHIFT, LVL2_OFFS);
+ } else if (delta < LVL4_TSTART) {
+ idx = calc_index(expires, LVL3_GRAN, LVL3_SHIFT, LVL3_OFFS);
+ } else if (delta < LVL5_TSTART) {
+ idx = calc_index(expires, LVL4_GRAN, LVL4_SHIFT, LVL4_OFFS);
+ } else if (delta < LVL6_TSTART) {
+ idx = calc_index(expires, LVL5_GRAN, LVL5_SHIFT, LVL5_OFFS);
+ } else if ((long) delta < 0) {
+ idx = base->timer_jiffies & LVL_MASK;
} else {
- int i;
- /* If the timeout is larger than MAX_TVAL (on 64-bit
- * architectures or with CONFIG_BASE_SMALL=1) then we
- * use the maximum timeout.
+ /*
+ * The long timeouts go into the last hash level. They
+ * have to be cascaded eventually, but most of them
+ * are removed way before cascading takes place.
*/
- if (idx > MAX_TVAL) {
- idx = MAX_TVAL;
- expires = idx + base->timer_jiffies;
- }
- i = (expires >> (TVR_BITS + 3 * TVN_BITS)) & TVN_MASK;
- vec = base->tv5.vec + i;
+ idx = calc_index(expires, LVL7_GRAN, LVL7_SHIFT, LVL7_OFFS);
}

+ idx += get_hash_offset(base, timer);
+ /*
+ * Enqueue the timer into the hash bucket, mark it pending in
+ * the bitmap and store the index in the timer flags.
+ */
+ vec = base->vectors + idx;
hlist_add_head(&timer->entry, vec);
+ __set_bit(idx, base->pending_map);
+ timer->flags = (timer->flags & TIMER_HASHMASK) | idx << TIMER_HASHSHIFT;
}

static void internal_add_timer(struct tvec_base *base, struct timer_list *timer)
{
- /* Advance base->jiffies, if the base is empty */
- if (!base->all_timers++)
- base->timer_jiffies = jiffies;
-
__internal_add_timer(base, timer);
- /*
- * Update base->active_timers and base->next_timer
- */
- if (!(timer->flags & TIMER_DEFERRABLE)) {
- if (!base->active_timers++ ||
- time_before(timer->expires, base->next_timer))
- base->next_timer = timer->expires;
- }

/*
* Check whether the other CPU is in dynticks mode and needs
- * to be triggered to reevaluate the timer wheel.
- * We are protected against the other CPU fiddling
- * with the timer by holding the timer base lock. This also
- * makes sure that a CPU on the way to stop its tick can not
- * evaluate the timer wheel.
+ * to be triggered to reevaluate the timer wheel. We are
+ * protected against the other CPU fiddling with the timer by
+ * holding the timer base lock. This also makes sure that a
+ * CPU on the way to stop its tick can not evaluate the timer
+ * wheel.
*
* Spare the IPI for deferrable timers on idle targets though.
* The next busy ticks will take care of it. Except full dynticks
@@ -728,30 +829,23 @@ static inline void detach_timer(struct t
entry->next = LIST_POISON2;
}

-static inline void
-detach_expired_timer(struct timer_list *timer, struct tvec_base *base)
+static inline void detach_expired_timer(struct timer_list *timer)
{
detach_timer(timer, true);
- if (!(timer->flags & TIMER_DEFERRABLE))
- base->active_timers--;
- base->all_timers--;
}

static int detach_if_pending(struct timer_list *timer, struct tvec_base *base,
bool clear_pending)
{
+ unsigned idx = (timer->flags & TIMER_HASHMASK) >> TIMER_HASHSHIFT;
+
if (!timer_pending(timer))
return 0;

+ if (hlist_is_last_node(&timer->entry, base->vectors + idx))
+ __clear_bit(idx, base->pending_map);
+
detach_timer(timer, clear_pending);
- if (!(timer->flags & TIMER_DEFERRABLE)) {
- base->active_timers--;
- if (timer->expires == base->next_timer)
- base->next_timer = base->timer_jiffies;
- }
- /* If this was the last timer, advance base->jiffies */
- if (!--base->all_timers)
- base->timer_jiffies = jiffies;
return 1;
}

@@ -785,14 +879,34 @@ static struct tvec_base *lock_timer_base
}
}

-static inline int
-__mod_timer(struct timer_list *timer, unsigned long expires,
- bool pending_only, int pinned)
+static int __mod_timer(struct timer_list *timer, unsigned long expires,
+ bool pending_only, int pinned)
{
struct tvec_base *base, *new_base;
unsigned long flags;
int ret = 0;

+ /*
+ * TODO: Calculate the hash bucket of the timer right here w/o
+ * holding the base lock. This allows us to check not only
+ * timer->expires == expires below, but also whether the timer
+ * ends up in the same bucket. If we really need to requeue
+ * the timer then we check whether base->timer_jiffies has
+ * advanced between here and locking the timer base. If
+ * jiffies advanced we have to recalc the hash bucket with the
+ * lock held.
+ */
+
+ /*
+ * This is a common optimization triggered by the
+ * networking code - if the timer is re-modified
+ * to be the same thing then just return:
+ */
+ if (timer_pending(timer)) {
+ if (timer->expires == expires)
+ return 1;
+ }
+
timer_stats_timer_set_start_info(timer);
BUG_ON(!timer->function);

@@ -851,45 +965,6 @@ int mod_timer_pending(struct timer_list
}
EXPORT_SYMBOL(mod_timer_pending);

-/*
- * Decide where to put the timer while taking the slack into account
- *
- * Algorithm:
- * 1) calculate the maximum (absolute) time
- * 2) calculate the highest bit where the expires and new max are different
- * 3) use this bit to make a mask
- * 4) use the bitmask to round down the maximum time, so that all last
- * bits are zeros
- */
-static inline
-unsigned long apply_slack(struct timer_list *timer, unsigned long expires)
-{
- unsigned long expires_limit, mask;
- int bit;
-
- if (timer->slack >= 0) {
- expires_limit = expires + timer->slack;
- } else {
- long delta = expires - jiffies;
-
- if (delta < 256)
- return expires;
-
- expires_limit = expires + delta / 256;
- }
- mask = expires ^ expires_limit;
- if (mask == 0)
- return expires;
-
- bit = find_last_bit(&mask, BITS_PER_LONG);
-
- mask = (1UL << bit) - 1;
-
- expires_limit = expires_limit & ~(mask);
-
- return expires_limit;
-}
-
/**
* mod_timer - modify a timer's timeout
* @timer: the timer to be modified
@@ -912,16 +987,6 @@ unsigned long apply_slack(struct timer_l
*/
int mod_timer(struct timer_list *timer, unsigned long expires)
{
- expires = apply_slack(timer, expires);
-
- /*
- * This is a common optimization triggered by the
- * networking code - if the timer is re-modified
- * to be the same thing then just return:
- */
- if (timer_pending(timer) && timer->expires == expires)
- return 1;
-
return __mod_timer(timer, expires, false, TIMER_NOT_PINNED);
}
EXPORT_SYMBOL(mod_timer);
@@ -947,9 +1012,6 @@ EXPORT_SYMBOL(mod_timer);
*/
int mod_timer_pinned(struct timer_list *timer, unsigned long expires)
{
- if (timer->expires == expires && timer_pending(timer))
- return 1;
-
return __mod_timer(timer, expires, false, TIMER_PINNED);
}
EXPORT_SYMBOL(mod_timer_pinned);
@@ -1120,27 +1182,6 @@ int del_timer_sync(struct timer_list *ti
EXPORT_SYMBOL(del_timer_sync);
#endif

-static int cascade(struct tvec_base *base, struct tvec *tv, int index)
-{
- /* cascade all the timers from tv up one level */
- struct timer_list *timer;
- struct hlist_node *tmp;
- struct hlist_head tv_list;
-
- hlist_move_list(tv->vec + index, &tv_list);
-
- /*
- * We are removing _all_ timers from the list, so we
- * don't have to detach them individually.
- */
- hlist_for_each_entry_safe(timer, tmp, &tv_list, entry) {
- /* No accounting, while moving them */
- __internal_add_timer(base, timer);
- }
-
- return index;
-}
-
static void call_timer_fn(struct timer_list *timer, void (*fn)(unsigned long),
unsigned long data)
{
@@ -1184,66 +1225,97 @@ static void call_timer_fn(struct timer_l
}
}

-#define INDEX(N) ((base->timer_jiffies >> (TVR_BITS + (N) * TVN_BITS)) & TVN_MASK)
+static void expire_timers(struct tvec_base *base, struct hlist_head *head)
+{
+ while (!hlist_empty(head)) {
+ struct timer_list *timer;
+ void (*fn)(unsigned long);
+ unsigned long data;
+
+ timer = hlist_entry(head->first, struct timer_list, entry);
+ fn = timer->function;
+ data = timer->data;
+
+ timer_stats_account_timer(timer);
+
+ base->running_timer = timer;
+ detach_expired_timer(timer);
+
+ if (timer->flags & TIMER_IRQSAFE) {
+ spin_unlock(&base->lock);
+ call_timer_fn(timer, fn, data);
+ spin_lock(&base->lock);
+ } else {
+ spin_unlock_irq(&base->lock);
+ call_timer_fn(timer, fn, data);
+ spin_lock_irq(&base->lock);
+ }
+ }
+}
+
+static int collect_expired_timers(struct tvec_base *base,
+ struct hlist_head *heads,
+ unsigned offset)
+{
+ unsigned clock = base->timer_jiffies++;
+ struct hlist_head tmp, *vec;
+ int i, levels = 0;
+ unsigned idx;
+
+ /* Expire the regular buckets */
+ for (i = 0; i < LVL_DEPTH - 1; i++) {
+ idx = offset + (clock & LVL_MASK) + i * LVL_SIZE;
+
+ if (__test_and_clear_bit(idx, base->pending_map)) {
+ vec = base->vectors + idx;
+ hlist_move_list(vec, heads++);
+ levels++;
+ }
+ /* Is it time to look at the next level? */
+ if (clock & LVL_CLK_MASK)
+ return levels;
+ /* Shift clock for the next level granularity */
+ clock >>= LVL_CLK_SHIFT;
+ }
+
+ /* Cascading, sigh... */
+ idx = offset + (clock & LVL_MASK) + i * LVL_SIZE;
+
+ if (__test_and_clear_bit(idx, base->pending_map)) {
+ vec = base->vectors + idx;
+ hlist_move_list(vec, &tmp);
+ while (!hlist_empty(&tmp)) {
+ struct timer_list *timer;
+
+ timer = hlist_entry(tmp.first, struct timer_list, entry);
+ trace_timer_cascade(timer);
+ __hlist_del(&timer->entry);
+ __internal_add_timer(base, timer);
+ }
+ }
+ return levels;
+}

/**
* __run_timers - run all expired timers (if any) on this CPU.
* @base: the timer vector to be processed.
- *
- * This function cascades all vectors and executes all expired timer
- * vectors.
*/
static inline void __run_timers(struct tvec_base *base)
{
- struct timer_list *timer;
-
spin_lock_irq(&base->lock);

while (time_after_eq(jiffies, base->timer_jiffies)) {
- struct hlist_head work_list;
- struct hlist_head *head = &work_list;
- int index;
-
- if (catchup_timer_jiffies(base))
- break;
+ struct hlist_head heads[TOT_LVL_DEPTH];
+ int levels = collect_expired_timers(base, heads, 0);

- index = base->timer_jiffies & TVR_MASK;
+#ifdef CONFIG_NO_HZ_COMMON
+ if (base->nohz_active)
+ levels += collect_expired_timers(base, heads + levels,
+ HASH_SIZE);
+#endif

- /*
- * Cascade timers:
- */
- if (!index &&
- (!cascade(base, &base->tv2, INDEX(0))) &&
- (!cascade(base, &base->tv3, INDEX(1))) &&
- !cascade(base, &base->tv4, INDEX(2)))
- cascade(base, &base->tv5, INDEX(3));
- ++base->timer_jiffies;
- hlist_move_list(base->tv1.vec + index, head);
- while (!hlist_empty(head)) {
- void (*fn)(unsigned long);
- unsigned long data;
- bool irqsafe;
-
- timer = hlist_entry(head->first, struct timer_list, entry);
- fn = timer->function;
- data = timer->data;
- irqsafe = timer->flags & TIMER_IRQSAFE;
-
- timer_stats_account_timer(timer);
-
- base->running_timer = timer;
- detach_expired_timer(timer, base);
-
- if (irqsafe) {
- spin_unlock(&base->lock);
- call_timer_fn(timer, fn, data);
- spin_lock(&base->lock);
- } else {
- spin_unlock_irq(&base->lock);
- call_timer_fn(timer, fn, data);
- spin_lock_irq(&base->lock);
- }
- }
+ while (levels--)
+ expire_timers(base, heads + levels);
}
base->running_timer = NULL;
spin_unlock_irq(&base->lock);
@@ -1251,78 +1323,57 @@ static inline void __run_timers(struct t

#ifdef CONFIG_NO_HZ_COMMON
/*
- * Find out when the next timer event is due to happen. This
- * is used on S/390 to stop all activity when a CPU is idle.
- * This function needs to be called with interrupts disabled.
+ * Find the next pending bucket of a level. Search from @offset + @clk
+ * upwards and if nothing there, search from start of the level
+ * (@offset) up to @offset + @clk.
+ */
+static int next_pending_bucket(struct tvec_base *base, unsigned offset,
+ unsigned clk)
+{
+ unsigned pos, start = offset + clk;
+ unsigned end = offset + LVL_SIZE;
+
+ pos = find_next_bit(base->pending_map, end, start);
+
+ if (pos < end)
+ return pos - start;
+ pos = find_next_bit(base->pending_map, start, offset);
+ return pos < start ? pos + end - start : -1;
+}
+
+/*
+ * Search for the first expiring timer in the various clock levels.
+ *
+ * Note: This implementation might be suboptimal vs. timers enqueued
+ * in the cascade level because we do not look at the timers to
+ * figure out when they really expire. So for now, we just treat
+ * the cascading timers like any other timer. If each cascading
+ * bucket has a timer, we wake up with the granularity of the
+ * last level. For now I wanted to keep this simple and maintain
+ * the constant retrieval time.
*/
static unsigned long __next_timer_interrupt(struct tvec_base *base)
{
- unsigned long timer_jiffies = base->timer_jiffies;
- unsigned long expires = timer_jiffies + NEXT_TIMER_MAX_DELTA;
- int index, slot, array, found = 0;
- struct timer_list *nte;
- struct tvec *varray[4];
-
- /* Look for timer events in tv1. */
- index = slot = timer_jiffies & TVR_MASK;
- do {
- hlist_for_each_entry(nte, base->tv1.vec + slot, entry) {
- if (nte->flags & TIMER_DEFERRABLE)
- continue;
-
- found = 1;
- expires = nte->expires;
- /* Look at the cascade bucket(s)? */
- if (!index || slot < index)
- goto cascade;
- return expires;
- }
- slot = (slot + 1) & TVR_MASK;
- } while (slot != index);
+ unsigned long delta = NEXT_TIMER_MAX_DELTA;
+ unsigned long clk = base->timer_jiffies;
+ unsigned lvl, offset = 0;

-cascade:
- /* Calculate the next cascade event */
- if (index)
- timer_jiffies += TVR_SIZE - index;
- timer_jiffies >>= TVR_BITS;
-
- /* Check tv2-tv5. */
- varray[0] = &base->tv2;
- varray[1] = &base->tv3;
- varray[2] = &base->tv4;
- varray[3] = &base->tv5;
-
- for (array = 0; array < 4; array++) {
- struct tvec *varp = varray[array];
-
- index = slot = timer_jiffies & TVN_MASK;
- do {
- hlist_for_each_entry(nte, varp->vec + slot, entry) {
- if (nte->flags & TIMER_DEFERRABLE)
- continue;
-
- found = 1;
- if (time_before(nte->expires, expires))
- expires = nte->expires;
- }
- /*
- * Do we still search for the first timer or are
- * we looking up the cascade buckets ?
- */
- if (found) {
- /* Look at the cascade bucket(s)? */
- if (!index || slot < index)
- break;
- return expires;
- }
- slot = (slot + 1) & TVN_MASK;
- } while (slot != index);
-
- if (index)
- timer_jiffies += TVN_SIZE - index;
- timer_jiffies >>= TVN_BITS;
+ spin_lock(&base->lock);
+ for (lvl = 0; lvl < LVL_DEPTH; lvl++, offset += LVL_SIZE) {
+ int pos = next_pending_bucket(base, offset, clk & LVL_MASK);
+
+ if (pos >= 0) {
+ unsigned long tmp, expires = pos + clk;
+
+ expires <<= lvl * LVL_CLK_SHIFT;
+ tmp = expires - base->timer_jiffies;
+ if (tmp < delta)
+ delta = tmp;
+ }
+ clk >>= LVL_CLK_SHIFT;
}
- return expires;
+ spin_unlock(&base->lock);
+ return delta;
}

/*
@@ -1379,17 +1430,11 @@ u64 get_next_timer_interrupt(unsigned lo
if (cpu_is_offline(smp_processor_id()))
return expires;

- spin_lock(&base->lock);
- if (base->active_timers) {
- if (time_before_eq(base->next_timer, base->timer_jiffies))
- base->next_timer = __next_timer_interrupt(base);
- nextevt = base->next_timer;
- if (time_before_eq(nextevt, basej))
- expires = basem;
- else
- expires = basem + (nextevt - basej) * TICK_NSEC;
- }
- spin_unlock(&base->lock);
+ nextevt = __next_timer_interrupt(base);
+ if (time_before_eq(nextevt, basej))
+ expires = basem;
+ else
+ expires = basem + (nextevt - basej) * TICK_NSEC;

return cmp_next_hrtimer_event(basem, expires);
}
@@ -1588,17 +1633,8 @@ static void migrate_timers(int cpu)

BUG_ON(old_base->running_timer);

- for (i = 0; i < TVR_SIZE; i++)
- migrate_timer_list(new_base, old_base->tv1.vec + i);
- for (i = 0; i < TVN_SIZE; i++) {
- migrate_timer_list(new_base, old_base->tv2.vec + i);
- migrate_timer_list(new_base, old_base->tv3.vec + i);
- migrate_timer_list(new_base, old_base->tv4.vec + i);
- migrate_timer_list(new_base, old_base->tv5.vec + i);
- }
-
- old_base->active_timers = 0;
- old_base->all_timers = 0;
+ for (i = 0; i < HASH_SIZE; i++)
+ migrate_timer_list(new_base, old_base->vectors + i);

spin_unlock(&old_base->lock);
spin_unlock_irq(&new_base->lock);
@@ -1635,7 +1671,6 @@ static void __init init_timer_cpu(int cp
spin_lock_init(&base->lock);

base->timer_jiffies = jiffies;
- base->next_timer = base->timer_jiffies;
}

static void __init init_timer_cpus(void)
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/