Re: [PATCH for 2.5] preemptible kernel

From: Rusty Russell (rusty@rustcorp.com.au)
Date: Tue Mar 20 2001 - 03:43:50 EST


In message <Pine.LNX.4.05.10103141653350.3094-100000@cosmic.nrg.org> you write:
> Kernel preemption is not allowed while spinlocks are held, which means
> that this patch alone cannot guarantee low preemption latencies. But
> as long held locks (in particular the BKL) are replaced by finer-grained
> locks, this patch will enable lower latencies as the kernel also becomes
> more scalable on large SMP systems.

Hi Nigel,

        I can see three problems with this approach, only one of which
is serious.

The first is that code which is already SMP-unsafe now becomes a
problem for everyone, not just for the 0.1% who run SMP machines. I
consider this a good thing for 2.5, though.

The second is that there are "manual" locking schemes which are used
in several places in the kernel which rely on non-preemptability;
de-facto spinlocks if you will. I consider all these uses flawed: (1)
they are often subtly broken anyway, (2) they make reading those parts
of the code much harder, and (3) they break when things like this are
done.
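
To make the second point concrete, here is a hypothetical sketch (the
names are illustrative, not from any real driver) of the kind of
"de-facto spinlock" I mean: a check-then-set that is only atomic
because nothing can preempt it on a non-preemptible uniprocessor.

```c
/* Hypothetical "manual locking" pattern.  On a non-preemptible UP
 * kernel the test and the store below are atomic by accident: no
 * other task can run between them.  With kernel preemption, a task
 * switch can land between the "if" and the assignment, and two
 * callers can both believe they won. */
static int reserved;            /* 0 = free, 1 = taken */

int try_reserve(void)
{
        if (reserved)
                return 0;       /* someone already has it */
        reserved = 1;           /* claim it -- the unprotected window */
        return 1;
}
```

With preemption enabled this needs a real spinlock (with preemption
disabled while it is held) or an atomic test-and-set, which is exactly
why these implicit schemes break when something like this patch is done.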

The third is that preemptibility conflicts with the naive
quiescent-period approach proposed for module unloading in 2.5, which
is also useful for several other things (eg. hotplugging CPUs). This
method relies on the fact that once a schedule() has occurred on every
CPU, no one can still be holding certain references. The simplest
example is a singly linked list: you can traverse it without a lock as
long as you don't sleep, and then someone can unlink a node and wait
for a schedule on every other CPU before freeing it. The non-SMP case
is a noop. See synchronize_kernel() below.
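
The linked-list case might look like the following userspace sketch
(synchronize_kernel_stub() merely stands in for the real
synchronize_kernel(), which waits for a schedule() on every CPU; the
list helpers are my own illustrative names):

```c
#include <stdlib.h>

/* Singly linked list protected by the quiescent-period scheme:
 * readers traverse without locks but never sleep; a writer unlinks a
 * node, waits for every CPU to schedule, then frees it. */
struct node {
        struct node *next;
        int key;
};

static struct node *head;

static void list_add(int key)
{
        struct node *n = malloc(sizeof(*n));
        n->key = key;
        n->next = head;
        head = n;
}

/* Reader: safe without a lock as long as it does not sleep. */
static int list_contains(int key)
{
        struct node *n;
        for (n = head; n != NULL; n = n->next)
                if (n->key == key)
                        return 1;
        return 0;
}

/* Stand-in for the real synchronize_kernel() below. */
static void synchronize_kernel_stub(void) { }

/* Writer: unlink first, so new readers cannot find the node; then
 * wait out the quiescent period, so old readers have finished; only
 * then is it safe to free. */
static void list_remove(int key)
{
        struct node **pp, *n;
        for (pp = &head; (n = *pp) != NULL; pp = &n->next) {
                if (n->key == key) {
                        *pp = n->next;             /* unlink */
                        synchronize_kernel_stub(); /* wait for readers */
                        free(n);
                        return;
                }
        }
}
```

The whole trick is the ordering: unlink, wait, free. Nothing here
works if a reader can be preempted mid-traversal and hold its
reference across a schedule.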

This, too, is soluble, but it means that synchronize_kernel() must
guarantee that each task which was running or preempted in kernel
space when it was called has been non-preemptively scheduled before
synchronize_kernel() can exit. Icky.
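
One hedged way to picture that stronger guarantee (the state names and
helpers here are purely illustrative, not a proposed implementation):
each task carries a mark saying whether it was preempted in kernel
space, the mark is cleared only by a voluntary schedule(), and
synchronize_kernel() may not return while any mark is still set.

```c
#define NTASKS 4

enum task_state {
        RUNNING_USER,           /* not in the kernel: no reference held */
        PREEMPTED_IN_KERNEL,    /* may still hold a reference */
        VOLUNTARILY_SCHEDULED,  /* passed a real quiescent point */
};

static enum task_state task_state[NTASKS];

/* A blocking schedule() is a genuine quiescent point: the task cannot
 * be holding a traversal reference across it. */
static void voluntary_schedule(int task)
{
        task_state[task] = VOLUNTARILY_SCHEDULED;
}

/* synchronize_kernel() may only exit once no task is still marked as
 * preempted in kernel space. */
static int may_exit_synchronize(void)
{
        int i;
        for (i = 0; i < NTASKS; i++)
                if (task_state[i] == PREEMPTED_IN_KERNEL)
                        return 0;
        return 1;
}
```

The icky part is that a preemptive schedule no longer counts as a
quiescent point, so the simple per-CPU "has scheduled" test below is
no longer sufficient on its own.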

Thoughts?
Rusty.

--
Premature optmztion is rt of all evl. --DK

/* We could keep a schedule count for each CPU and make idle tasks
   schedule (some don't unless need_resched), but this scales quite
   well (eg. 64 processors, average time to wait for first schedule =
   jiffie/64.  Total time for all processors = jiffie/63 + jiffie/62...

   At 1024 cpus, this is about 7.5 jiffies.  And that assumes no one
   schedules early. --RR */
void synchronize_kernel(void)
{
	unsigned long cpus_allowed, policy, rt_priority;

	/* Save current state */
	cpus_allowed = current->cpus_allowed;
	policy = current->policy;
	rt_priority = current->rt_priority;

	/* Create an unreal time task. */
	current->policy = SCHED_FIFO;
	current->rt_priority = 1001 + sys_sched_get_priority_max(SCHED_FIFO);

	/* Make us schedulable on all CPUs. */
	current->cpus_allowed = (1UL << smp_num_cpus) - 1;

	/* Eliminate current cpu, reschedule */
	while ((current->cpus_allowed &= ~(1 << smp_processor_id())) != 0)
		schedule();

	/* Back to normal. */
	current->cpus_allowed = cpus_allowed;
	current->policy = policy;
	current->rt_priority = rt_priority;
}



This archive was generated by hypermail 2b29 : Fri Mar 23 2001 - 21:00:14 EST