Re: vmstat: On demand vmstat workers V3

From: Frederic Weisbecker
Date: Sat Nov 16 2013 - 10:42:39 EST


On Thu, Oct 03, 2013 at 05:40:40PM +0000, Christoph Lameter wrote:
> V2->V3:
> - Introduce a new tick_get_housekeeping_cpu() function. Not sure
> if that is exactly what we want but it is a start. Thomas?

Not really. Thomas suggested an infrastructure to offload the handling of
CPU-local periodic jobs to a set of remote housekeeping CPUs.

This could potentially be useful for many kinds of stats that rely on
periodic updates, the scheduler tick being a candidate (I have yet to
check if we can really apply that in practice though).

Now the problem is that vmstat updates use pure local lockless
operations. It may be possible to offload these updates to remote CPUs,
but then we need to convert the vmstat updates to use locks, which is
potentially costly, unless we can find some clever lockless update
scheme. Do you think that is possible?
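
For instance (a rough sketch only, with a hypothetical fold_one_remote()
helper, and assuming cmpxchg() works on the byte-sized diffs, which is not
true on every architecture), the remote side could look like this, while
the local fast path would have to switch from plain __this_cpu ops to the
same cmpxchg scheme, which is the cost I'm worried about:

/*
 * Hypothetical helper: fold one counter of a remote CPU back into the
 * zone without taking a lock.  Only correct if the local update path
 * is converted to the same cmpxchg based scheme, and if the
 * architecture supports cmpxchg() on byte-sized values.
 */
static void fold_one_remote(struct zone *zone, struct per_cpu_pageset *p,
                            enum zone_stat_item item)
{
        s8 diff;

        do {
                diff = p->vm_stat_diff[item];
        } while (cmpxchg(&p->vm_stat_diff[item], diff, 0) != diff);

        if (diff)
                atomic_long_add(diff, &zone->vm_stat[item]);
}

Whether that ends up cheaper than simply taking a lock in the local path is
exactly what I can't tell yet.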

See below for more detailed review:

[...]
>
> /*
> @@ -1213,12 +1229,15 @@ static const struct file_operations proc
> #ifdef CONFIG_SMP
> static DEFINE_PER_CPU(struct delayed_work, vmstat_work);
> int sysctl_stat_interval __read_mostly = HZ;
> +static struct cpumask *monitored_cpus;
>
> static void vmstat_update(struct work_struct *w)
> {
> - refresh_cpu_vm_stats();
> - schedule_delayed_work(&__get_cpu_var(vmstat_work),
> - round_jiffies_relative(sysctl_stat_interval));
> + if (refresh_cpu_vm_stats())
> + schedule_delayed_work(this_cpu_ptr(&vmstat_work),
> + round_jiffies_relative(sysctl_stat_interval));
> + else
> + cpumask_set_cpu(smp_processor_id(), monitored_cpus);

That looks racy against other CPUs that may set their own bit and also
against the shepherd that clears processed monitored CPUs.

That seems to matter because a CPU could simply be forgotten entirely
by vmstat and never processed again.
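
If we keep that hand-off, I guess the ownership needs to be made explicit.
Something in this direction maybe (purely hypothetical sketch, nothing the
patch does): a per-cpu state with atomic transitions so that exactly one
side, the local worker or the shepherd, wins the right to rearm the per-cpu
timer:

/* Hypothetical per-cpu ownership state, not part of this patch. */
enum {
        VMSTAT_ACTIVE,          /* local vmstat_work timer is armed      */
        VMSTAT_MONITORED,       /* timer is off, the shepherd watches us */
};

static DEFINE_PER_CPU(atomic_t, vmstat_state) = ATOMIC_INIT(VMSTAT_ACTIVE);

/* Local side: only go quiet if we still own the ACTIVE state. */
static bool vmstat_go_quiet(void)
{
        return atomic_cmpxchg(this_cpu_ptr(&vmstat_state),
                              VMSTAT_ACTIVE, VMSTAT_MONITORED) == VMSTAT_ACTIVE;
}

/* Shepherd side: only restart the remote timer if we win the transition. */
static bool vmstat_reactivate(int cpu)
{
        return atomic_cmpxchg(per_cpu_ptr(&vmstat_state, cpu),
                              VMSTAT_MONITORED, VMSTAT_ACTIVE) == VMSTAT_MONITORED;
}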

> }
>
> static void start_cpu_timer(int cpu)
> @@ -1226,7 +1245,70 @@ static void start_cpu_timer(int cpu)
> struct delayed_work *work = &per_cpu(vmstat_work, cpu);
>
> INIT_DEFERRABLE_WORK(work, vmstat_update);
> - schedule_delayed_work_on(cpu, work, __round_jiffies_relative(HZ, cpu));
> + schedule_delayed_work_on(cpu, work,
> + __round_jiffies_relative(sysctl_stat_interval, cpu));
> +}
> +
> +/*
> + * Check if the diffs for a certain cpu indicate that
> + * an update is needed.
> + */
> +static bool need_update(int cpu)
> +{
> + struct zone *zone;
> +
> + for_each_populated_zone(zone) {
> + struct per_cpu_pageset *p = per_cpu_ptr(zone->pageset, cpu);
> +
> + /*
> + * The fast way of checking if there are any vmstat diffs.
> + * This works because the diffs are byte sized items.
> + */
> + if (memchr_inv(p->vm_stat_diff, 0, NR_VM_ZONE_STAT_ITEMS))
> + return true;
> + }
> + return false;
> +}
> +
> +static void vmstat_shepherd(struct work_struct *w)
> +{
> + int cpu;
> + int s = tick_get_housekeeping_cpu();
> + struct delayed_work *d = per_cpu_ptr(&vmstat_work, s);
> +
> + refresh_cpu_vm_stats();
> +
> + for_each_cpu(cpu, monitored_cpus)
> + if (need_update(cpu)) {
> + cpumask_clear_cpu(cpu, monitored_cpus);
> + start_cpu_timer(cpu);
> + }
> +
> + if (s != smp_processor_id()) {
> + /* Timekeeping was moved. Move the shepherd worker */
> + cpumask_set_cpu(smp_processor_id(), monitored_cpus);
> + cpumask_clear_cpu(s, monitored_cpus);
> + cancel_delayed_work_sync(d);
> + INIT_DEFERRABLE_WORK(d, vmstat_shepherd);
> + }
> +
> + schedule_delayed_work_on(s, d,
> + __round_jiffies_relative(sysctl_stat_interval, s));

Note that on dynticks idle (CONFIG_NO_HZ_IDLE=y), the timekeeper CPU can change quickly and often.

I can imagine a nasty race there: CPU 0 is the timekeeper. It schedules the
vmstat shepherd work to run in 2 seconds. But CPU 0 goes to sleep for a long
while and some other CPU takes over the timekeeping duty. The shepherd timer
won't be processed until CPU 0 wakes up, although we may have CPUs to monitor.

CONFIG_NO_HZ_FULL may work incidentally because CPU 0 is the only timekeeper there,
but this is a temporary limitation. Expect the timekeeper to become dynamic in the
future under that config as well.
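
One direction that might avoid depending on a frozen timekeeper (again just
a sketch, not what V3 does, and a deferrable timer still won't fire while
the CPU that armed it is idle, so it should be rearmed from a CPU that is
expected to stay busy) would be to stop pinning the shepherd altogether and
keep a single unbound deferrable work:

/*
 * Hypothetical variant: a single unbound shepherd work, so that it is
 * not stuck on a CPU that stopped being the timekeeper and went idle.
 * Placement becomes a workqueue decision on every round instead of
 * being frozen into the timer.
 */
static void vmstat_shepherd(struct work_struct *w);
static DECLARE_DEFERRABLE_WORK(shepherd, vmstat_shepherd);

static void vmstat_shepherd(struct work_struct *w)
{
        int cpu;

        refresh_cpu_vm_stats();

        for_each_cpu(cpu, monitored_cpus)
                if (need_update(cpu)) {
                        cpumask_clear_cpu(cpu, monitored_cpus);
                        start_cpu_timer(cpu);
                }

        schedule_delayed_work(&shepherd,
                round_jiffies_relative(sysctl_stat_interval));
}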

> +
> +}
> +
> +static void __init start_shepherd_timer(void)
> +{
> + int cpu = tick_get_housekeeping_cpu();
> + struct delayed_work *d = per_cpu_ptr(&vmstat_work, cpu);
> +
> + INIT_DEFERRABLE_WORK(d, vmstat_shepherd);
> + monitored_cpus = kmalloc(BITS_TO_LONGS(nr_cpu_ids) * sizeof(long),
> + GFP_KERNEL);
> + cpumask_copy(monitored_cpus, cpu_online_mask);
> + cpumask_clear_cpu(cpu, monitored_cpus);
> + schedule_delayed_work_on(cpu, d,
> + __round_jiffies_relative(sysctl_stat_interval, cpu));
> }

So another issue with the whole design of this patch, beyond its races, is that
a CPU can run full dynticks, do some quick system call at some point, and thus
have the shepherd disturb it after it has gone back to userland in full dynticks.

So a system that dynamically schedules timers on demand like this is enough if
we want to _minimize_ timers. But what we want is a strong guarantee that the
CPU won't be disturbed, at least while it runs in userland, right?

I mean, we are not only interested in optimizations but also in guarantees, in
case we have an extreme workload that strongly depends on the CPU not being
disturbed at all. I know that some people in realtime want that. And I thought
it's also what you want, but maybe I misunderstood your use case?