Re: [PATCH] sched/idle: Make idle poll dynamic per-cpu

From: Ingo Molnar
Date: Sun Jan 15 2023 - 04:16:05 EST



* Daniel Bristot de Oliveira <bristot@xxxxxxxxxx> wrote:

> idle=poll is frequently used on ultra-low-latency systems. Examples of
> such systems are high-performance trading and 5G NVRAM. The performance
> gain is given by avoiding the idle driver machinery and by keeping the
> CPU is always in an active state - avoiding (odd) hardware heuristics that
> are out of the control of the OS.
>
> Currently, idle=poll is an all-or-nothing static option defined at
> boot time. The motivation for creating this option dynamic and per-cpu
> are two:
>
> 1) Reduce the power usage/heat by allowing only selected CPUs to
> do idle polling;
> 2) Allow multi-tenant systems (e.g., Kubernetes) to enable idle
> polling only when ultra-low-latency applications are present
> on specific CPUs.
>
> Joe Mario did some experiments with this option enabled, and the results
> were significant. For example, by using dynamic idle polling on
> selected CPUs, cyclictest performance is optimal (like when using
> idle=poll), but cpu power consumption drops from 381 to 233 watts.
>
> Also, limiting idle=poll to the set of CPUs that benefits from
> it allows other CPUs to benefit from frequency boosts. Joe also
> shows that the results can be in the order of 80nsec round trip
> improvement when system-wide idle=poll was not used.
>
> The user can enable idle polling with this command:
> # echo 1 > /sys/devices/system/cpu/cpu{CPU_ID}/idle_poll
>
> And disable it via:
> # echo 0 > /sys/devices/system/cpu/cpu{CPU_ID}/idle_poll
>
> By default, all CPUs have idle polling disabled (the current behavior).
> A static key avoids the CPU mask check overhead when no idle polling
> is enabled.

Sounds useful in general.

A couple of observations:

ABI: how about putting the new file into the existing
/sys/devices/system/cpu/cpuidle/ directory - the sysfs space of cpuidle?
Arguably this flag is an extension of it.


> extern char __cpuidle_text_start[], __cpuidle_text_end[];
>
> +/*
> + * per-cpu idle polling selector.
> + */
> +static struct cpumask cpu_poll_mask;
> +DEFINE_STATIC_KEY_FALSE(cpu_poll_enabled);
> +
> +/*
> + * Protects the mask/static key relation.
> + */
> +DEFINE_MUTEX(cpu_poll_mutex);
> +
> +static ssize_t idle_poll_store(struct device *dev, struct device_attribute *attr,
> + const char *buf, size_t count)
> +{
> + int cpu = dev->id;
> + int retval, set;
> + bool val;
> +
> + retval = kstrtobool(buf, &val);
> + if (retval)
> + return retval;
> +
> + mutex_lock(&cpu_poll_mutex);
> +
> + if (val) {
> + set = cpumask_test_and_set_cpu(cpu, &cpu_poll_mask);
> +
> + /*
> + * If the CPU was already on, do not increase the static key usage.
> + */
> + if (!set)
> + static_branch_inc(&cpu_poll_enabled);
> + } else {
> + set = cpumask_test_and_clear_cpu(cpu, &cpu_poll_mask);
> +
> + /*
> + * If the CPU was already off, do not decrease the static key usage.
> + */
> + if (set)
> + static_branch_dec(&cpu_poll_enabled);
> + }

Nit: I think 'old_bit' or so is easier to read than a generic 'set'?


> +
> + mutex_unlock(&cpu_poll_mutex);

Also, is cpu_poll_mutex locking really necessary, given that these bitops
methods are atomic, and CPUs observe cpu_poll_enabled without taking any
locks?

> +static int is_cpu_idle_poll(int cpu)
> +{
> + if (static_branch_unlikely(&cpu_poll_enabled))
> + return cpumask_test_cpu(cpu, &cpu_poll_mask);
> +
> + return 0;
> +}

static inline might be justified in this case I guess.

> @@ -51,18 +136,21 @@ __setup("hlt", cpu_idle_nopoll_setup);
>
> static noinline int __cpuidle cpu_idle_poll(void)
> {
> - trace_cpu_idle(0, smp_processor_id());
> + int cpu = smp_processor_id();
> +
> + trace_cpu_idle(0, cpu);
> stop_critical_timings();
> ct_idle_enter();
> local_irq_enable();
>
> while (!tif_need_resched() &&
> - (cpu_idle_force_poll || tick_check_broadcast_expired()))
> + (cpu_idle_force_poll || tick_check_broadcast_expired()
> + || is_cpu_idle_poll(cpu)))
> cpu_relax();
>
> ct_idle_exit();
> start_critical_timings();
> - trace_cpu_idle(PWR_EVENT_EXIT, smp_processor_id());
> + trace_cpu_idle(PWR_EVENT_EXIT, cpu);
>
> return 1;

So I think the introduction of the 'cpu' local variable to clean up the
flow of cpu_idle_poll() should be a separate preparatory patch, which will
make the addition of the is_cpu_idle_poll() call a bit easier to read in
the second patch.

> }
> @@ -296,7 +384,8 @@ static void do_idle(void)
> * broadcast device expired for us, we don't want to go deep
> * idle as we know that the IPI is going to arrive right away.
> */
> - if (cpu_idle_force_poll || tick_check_broadcast_expired()) {
> + if (cpu_idle_force_poll || tick_check_broadcast_expired()
> + || is_cpu_idle_poll(cpu)) {
> tick_nohz_idle_restart_tick();
> cpu_idle_poll();

Shouldn't we check is_cpu_idle_poll() right after the cpu_idle_force_poll
check, and before the tick_check_broadcast_expired() check?

Shouldn't matter to the outcome, but for consistency's sake.

Plus, if we are doing this anyway, maybe cpu_idle_force_poll could now be
implemented as 0/all setting of cpu_poll_mask, eliminating the
cpu_idle_force_poll flag? As a third patch on top.

Thanks,

Ingo