Re: [PATCH RFC] sched: add notifier for process migration

From: Peter Zijlstra
Date: Fri Oct 09 2009 - 18:04:11 EST


On Fri, 2009-10-09 at 14:01 -0700, Jeremy Fitzhardinge wrote:

> I'm working on adding vsyscall (vread) support for
> arch/x86/kernel/pvclock.c. The algorithm needs to look up per-cpu tsc
> parameters (aka pvclock_vcpu_time_info) so that it can compute global
> system time from the tsc. To do this, it needs to grab a consistent
> snapshot of (tsc, time_info).

time_info as in gettimeofday()? That's supposed to be globally
consistent, so get that first and then get the tsc, and you're as race
free as you're ever going to get from userspace.

> Obviously this is all racy from usermode, because there are two levels
> of scheduling going on in the virtual case: kernel scheduling of tasks to
> vcpus, and hypervisor scheduling of vcpus to pcpus. The latter is dealt
> with by a version number in the tsc parameter structure to indicate changes
> in the params (which could be due to scheduling, power events, etc).
>
> To deal with kernel scheduling I want a second version number to let
> usermode know they've been migrated to a new (v)cpu and need to try
> again with updated time parameters. Specifically, update the version on
> the "from" vcpu so that usermode (vsyscall) code holding an old pointer
> can see the number change and reload the cpu number and get a pointer to
> the new cpu's time_info.

/me utterly confused.

> Initially I was doing this with a preempt notifier on sched_out, but Avi
> pointed out that this was a pessimistic approximation of what I really
> want, which is notification on cross-cpu migration. And since migration
> is an inherently expensive operation, the overhead of a notifier here
> should be negligible. (Aside from that, the preempt notifier mechanism
> isn't intended to be enabled on every process on the system.)

And here you're utterly failing to explain what you want such a notifier
to do.

> So I'm proposing this patch. My questions are:
>
> 1. Does this look generally reasonable?

I'm generally confused and not at all clear as to how things would work.
Afaik the vdso is a global entity and does not contain per-cpu or
per-task state.

If you're proposing to increment a global seq count on every task
migration, then I think it's a terribly bad idea.
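
For reference, if the idea is a sequence count in the usual seqlock
style, the reader side would presumably look something like the sketch
below. This is only a guess at the scheme: struct time_params,
params_write() and params_read_ns() are made-up names, not anything from
the patch or from pvclock, and the "one tsc tick is one nanosecond"
scaling is an assumption to keep the example short.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical parameter block; NOT the real pvclock_vcpu_time_info. */
struct time_params {
	atomic_uint seq;		/* even: stable, odd: update in progress */
	_Atomic uint64_t tsc_base;	/* tsc value at the last update */
	_Atomic uint64_t ns_base;	/* system time (ns) at the last update */
};

/* Writer: make seq odd, update the fields, then make seq even again. */
static void params_write(struct time_params *p,
			 uint64_t tsc_base, uint64_t ns_base)
{
	unsigned int s = atomic_load_explicit(&p->seq, memory_order_relaxed);

	atomic_store_explicit(&p->seq, s + 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&p->tsc_base, tsc_base, memory_order_relaxed);
	atomic_store_explicit(&p->ns_base, ns_base, memory_order_relaxed);
	atomic_store_explicit(&p->seq, s + 2, memory_order_release);
}

/*
 * Reader: retry until seq is even and unchanged across the reads.
 * Assumes 1 tsc tick == 1 ns purely for illustration.
 */
static uint64_t params_read_ns(struct time_params *p, uint64_t tsc_now)
{
	unsigned int s1, s2;
	uint64_t tsc_base, ns_base;

	do {
		s1 = atomic_load_explicit(&p->seq, memory_order_acquire);
		tsc_base = atomic_load_explicit(&p->tsc_base,
						memory_order_relaxed);
		ns_base = atomic_load_explicit(&p->ns_base,
					       memory_order_relaxed);
		atomic_thread_fence(memory_order_acquire);
		s2 = atomic_load_explicit(&p->seq, memory_order_relaxed);
	} while ((s1 & 1) || s1 != s2);

	return ns_base + (tsc_now - tsc_base);
}
```

With a single global count, every bump of seq makes any reader that is
in the middle of that loop go around again, whether or not the bump had
anything to do with the task doing the reading.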

> 2. Will this notifier actually be called every time a task gets
> migrated between CPUs? Are there cases where migration may happen
> via some other path? (Though for my particular case I only care
> about migration when the task is actually preempted; if it goes to
> sleep on one cpu and happens to wake on another then it wasn't in
> the middle of getting time so it doesn't matter.)

No, you've missed quite a lot of cases.

> 3. Or is there a better way to achieve what I want?
>
> This might also be a generally useful extension to vgetcpu() caching so
> that usermode can definitively tell whether the cpu number has changed
> under its feet and needs to be reloaded via lsl/rdtscp, rather than
> having to rely on a jiffies-based approximation.

I've got no idea how vgetcpu() works, but since the vdso page is global
and not per-task, I can't really see how it could work sanely.
