Re: [patch] voluntary-preempt-2.6.8-rc2-J3

From: Ingo Molnar
Date: Mon Jul 26 2004 - 16:08:08 EST



* Andrew Morton <akpm@xxxxxxxx> wrote:

> The bigger this thing gets, the more worried I get. Sometime this is
> going to need to be split up into individual fixes, and they need to
> be based upon an overall approach which we haven't yet settled on.

I will do that splitup. Right now I'm simply mapping how widespread the
problem is and what types of fixes we need. The situation isn't all that
bad, but we might need an (optional) mechanism to make softirqs
synchronous. All of this stuff is nicely modular and I'll do the splitup
after 2.6.8 (I don't think we want to disturb 2.6.8 with any of this).

> In particular your whole approach (with voluntary_need_resched())
> doesn't work on SMP.

(for the record, voluntary_need_resched() == need_resched() - I'm only
keeping a distinction between the two to be able to track progress and
regressions between the vanilla and modified kernels.)
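
i.e., roughly (a sketch only - the alias carries no semantics of its
own, it exists purely so the two trees can be compared):

	/* instrumentation alias, identical to need_resched() */
	#define voluntary_need_resched()	need_resched()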

need_resched() indeed doesn't do a lock-break for SMP purposes.

> The approach I'm using is to unconditionally drop locks on every Nth pass
> around the loop to allow another CPU to grab the lock, do some work, drop
> the lock, then be preempted. eg:
>
> @@ -773,6 +774,12 @@ int get_user_pages(struct task_struct *t
> 		struct page *map = NULL;
> 		int lookup_write = write;
>
> +		if ((++nr_pages & 63) == 0) {
> +			spin_unlock(&mm->page_table_lock);
> +			cpu_relax();
> +			spin_lock(&mm->page_table_lock);
> +		}
> +

Guaranteeing latencies on SMP is hard, because the latency seen on one
CPU may depend on a task running on another CPU - and that other CPU
isn't notified of the rescheduling request.

One alternative technique to yours would be to notify _all_ CPUs that a
high-prio RT task is ready to run (via a broadcast need-resched). That
way the UP latency-break techniques map to SMP in a 1:1 way.

Non-RT tasks don't get this benefit, which is a difference from the UP
situation, but I don't think it would be appropriate to mirror the UP
behavior for them, due to the overhead of broadcasting.
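
Something like this (a rough sketch only - broadcast_need_resched() is
a made-up name, and since cpu_curr() is only visible in kernel/sched.c
it would have to live there, next to resched_task()):

	/*
	 * Sketch: when a high-prio RT task becomes runnable, mark every
	 * online CPU as needing a reschedule, so that the per-CPU
	 * need_resched() latency-break points fire everywhere, just
	 * like on UP.
	 */
	static void broadcast_need_resched(void)
	{
		int cpu;

		for_each_online_cpu(cpu) {
			set_tsk_need_resched(cpu_curr(cpu));
			smp_send_reschedule(cpu);
		}
	}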

A combination of the two techniques could be used too: a global 'break
locks from now on' flag that gets set when a (RT?) task wants to
reschedule. Normally this flag would be zero and its cacheline would be
clean and shared between all CPUs, causing no overhead. Once a task
wants to reschedule, the flag would be increased by 1 until that task
has been scheduled. This would remove the ugliness factor too; your
sample code would look like:

> +		if (need_lock_break()) {
> +			spin_unlock(&mm->page_table_lock);
> +			cpu_relax();
> +			spin_lock(&mm->page_table_lock);
> +		}
> +

Or a shortcut:

> +		cond_lock_break(&mm->page_table_lock);
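
For illustration, a rough sketch of what the helpers could look like
(the names and the global counter are hypothetical, nothing of this
exists in the tree):

	/*
	 * Sketch: lock_break_count is increased when a (RT) task wants
	 * to reschedule and decreased once it has got a CPU; until
	 * then, lock holders voluntarily break their locks.  In the
	 * common case the counter is zero and its cacheline stays
	 * clean and shared.
	 */
	extern atomic_t lock_break_count;

	#define need_lock_break()	unlikely(atomic_read(&lock_break_count))

	#define cond_lock_break(lock)				\
		do {						\
			if (need_lock_break()) {		\
				spin_unlock(lock);		\
				cpu_relax();			\
				spin_lock(lock);		\
			}					\
		} while (0)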

There are two problems with the unconditional lock-break:

- I'm not sure cpu_relax() guarantees that another CPU can grab the
  lock. It all depends on whether the lock cacheline can bounce over
  to the other CPU before cpu_relax() finishes and this CPU
  re-acquires the lock.

- The latencies will still be higher than on UP, in a hard-to-predict
  way. Every time the codepath has to take a spinlock on its way out
  to a reschedule, the latency gets re-added. Also, the now-scheduled
  high-prio task will hit that latency again every time it acquires a
  spinlock held elsewhere - while on UP it has the CPU to itself, with
  no interference from other CPUs.

Ingo