Re: [RFC] Extend mwait idle to optimize away IPIs when possible

From: Venki Pallipadi
Date: Mon Feb 06 2012 - 16:26:18 EST

On Mon, Feb 6, 2012 at 1:02 PM, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
> On Mon, 2012-02-06 at 12:42 -0800, Venkatesh Pallipadi wrote:
>> smp_call_function_single and ttwu_queue_remote send an unconditional IPI
>> to the target CPU. However, if the target CPU is in mwait based idle, we can
>> do IPI-less wakeups using the magical powers of monitor-mwait.
>> Doing this has certain advantages:
>> * Lower overhead on Async IPI send path. Measurements on Westmere based
>>   systems show savings on "no wait" smp_call_function_single with idle
>>   target CPU (as measured on the sender side).
>>   local socket smp_call_func cost goes from ~1600 to ~1200 cycles
>>   remote socket smp_call_func cost goes from ~2000 to ~1800 cycles
>> * Avoiding actual interrupts shows a measurable reduction (10%) in system
>>   non-idle cycles and cache-references with micro-benchmark sending IPI from
>>   one CPU to all the other mostly idle CPUs in the system.
>> * On a mostly idle system, turbostat shows a tiny decrease in C0(active) time
>>   and a corresponding increase in C6 state (Each row being 10min avg)
>>           %c0   %c1   %c6
>>   Before
>>   Run 1  1.51  2.93 95.55
>>   Run 2  1.48  2.86 95.65
>>   Run 3  1.46  2.78 95.74
>>   After
>>   Run 1  1.35  2.63 96.00
>>   Run 2  1.46  2.78 95.74
>>   Run 3  1.37  2.63 95.98
>> * As a bonus, we can avoid sched/call IPI overhead altogether in a special case.
>>   When CPU Y has woken up CPU X (which can take 50-100us to actually wake up
>>   from a deep idle state) and CPU Z wants to send an IPI to CPU X in this
>>   period, Z can get it for free.
>> We started looking at this with one of our workloads where the system is
>> partially busy and we noticed some kernel hotspots in find_next_bit and
>> default_send_IPI_mask_sequence_phys coming from sched wakeup (futex wakeups)
>> and networking call functions. So, this change addresses those two specific
>> IPI types. This could be extended to nohz_kick, etc.
>> Note:
>> * This only helps when target CPU is idle. When it is busy we will still send
>>   IPI as before.
>> * Only for X86_64 and mwait_idle_with_hints for now, with limited testing.
>> * Will need some accounting for these wakeups exported for powertop and friends.
>> Comments?
> Curiously you avoided the existing tsk_is_polling() magic, which IIRC is
> doing something similar for waking from the idle loop.

Yes. tsk_is_polling() needs the remote CPU's current task, and getting at
that pulls in the rq lock, which I was trying to avoid. So I went with a
conditional wait on idle exit to cover the small window of the WAKING to
WOKEN state change, since we know we are always polling in the mwait loop.
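
For illustration, the WAKING to WOKEN handshake could look roughly like the
user-space C11 sketch below. This is not the code from the patch. The
RUNNING and IDLE states, struct cpu_slot, cpu_slot_init(), enter_idle(),
kick_cpu() and idle_exit() are all hypothetical names, the "pending" counter
is only a stand-in for the real queued work, and the mwait wake-on-write is
described in comments rather than executed.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical per-CPU states; only WAKING and WOKEN come from the thread. */
enum cpu_state { CPU_RUNNING, CPU_IDLE, CPU_WAKING, CPU_WOKEN };

struct cpu_slot {
    _Atomic int state;   /* word an idle CPU would arm monitor/mwait on */
    _Atomic int pending; /* stand-in for queued ttwu/call-function work */
    int ipis_sent;       /* counts fallback IPIs, for the demo only */
};

static void cpu_slot_init(struct cpu_slot *c)
{
    atomic_init(&c->state, CPU_RUNNING);
    atomic_init(&c->pending, 0);
    c->ipis_sent = 0;
}

/* Target side: advertise that we are about to sit in the mwait loop. */
static void enter_idle(struct cpu_slot *c)
{
    atomic_store(&c->state, CPU_IDLE);
}

/* Sender side: returns true if the target was kicked without an IPI. */
static bool kick_cpu(struct cpu_slot *c)
{
    int expected = CPU_IDLE;

    /* IDLE -> WAKING: on real hardware this store to the monitored
     * word is what would bring the target out of mwait. */
    if (atomic_compare_exchange_strong(&c->state, &expected, CPU_WAKING)) {
        atomic_fetch_add(&c->pending, 1);   /* hand over the work */
        atomic_store(&c->state, CPU_WOKEN); /* close the window */
        return true;
    }

    /* Target is not idle: in this sketch, fall back to a normal IPI. */
    c->ipis_sent++;
    return false;
}

/* Target side, on leaving idle: returns the number of work items taken. */
static int idle_exit(struct cpu_slot *c)
{
    int expected = CPU_IDLE;

    /* Woken by something else (e.g. a device interrupt): nothing to wait for. */
    if (atomic_compare_exchange_strong(&c->state, &expected, CPU_RUNNING))
        return 0;

    assert(expected != CPU_RUNNING); /* only valid when called from idle */

    /* A sender claimed us: wait out the short WAKING -> WOKEN window. */
    while (atomic_load(&c->state) == CPU_WAKING)
        ; /* the kernel would use cpu_relax() here */

    atomic_store(&c->state, CPU_RUNNING);
    return atomic_exchange(&c->pending, 0);
}
```

The point of the sketch is that the sender only touches one state word on
the target, with no rq lock involved, and the target only spins in the rare
case where it leaves idle while a sender is between WAKING and WOKEN.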