On Tue, 14 Mar 2000 yodaiken@fsmlabs.com wrote:
> But the SMP system should do n-times as many user<->kernel transitions
> and so should still run into preemption points at a high speed: even
> if each processor schedules less often.
yes, but as i outlined in the dual-CPU case, this is a statistical thing.
There are workloads which sit in high-latency code paths, and there having
2 CPUs does not help - 100 msec or 50 msec doesn't matter much, the sound
buffer is getting starved. There will also still be 100 msec peaks, at a 50%
chance (compared to the nonpreemptive 1-CPU case). i don't think this needs
much arguing: mixing two bad-latency kernel 'points of execution' just
does not help, it reduces the noise but does not help conceptually.
Latencies get shorter by a factor of sqrt(2) (i.e. ~30% shorter) on average,
if i remember correctly. With N CPUs it gets better by sqrt(N), so a 4-CPU
box will have average latencies twice as good as a 1-CPU box. The peaks will
still be just as bad - users will just hear half as many sound skips as
before, but it won't be gone.
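
To make the statistics concrete, here is a toy user-space model (not kernel
code - the assumption that every CPU sits at a uniformly random point inside
a worst-case 100 msec non-preemptible section is mine, and under that
assumption the average improves like 1/(N+1) rather than 1/sqrt(N), but the
qualitative point is the same): the average wakeup latency drops with more
CPUs, the worst case does not.

/*
 * toy model: the audio wakeup is serviced by whichever CPU hits a
 * scheduling point first, so the observed latency is the minimum of
 * N independent "time left in the current non-preemptible section"
 * values. (the uniform distribution is an assumption.)
 */
#include <stdio.h>
#include <stdlib.h>

#define WORST_MSEC	100.0
#define SAMPLES		1000000

static double wakeup_latency(int ncpus)
{
	double min = WORST_MSEC, t;
	int i;

	for (i = 0; i < ncpus; i++) {
		t = WORST_MSEC * rand() / (RAND_MAX + 1.0);
		if (t < min)
			min = t;
	}
	return min;
}

int main(void)
{
	int ncpus, i;

	for (ncpus = 1; ncpus <= 4; ncpus *= 2) {
		double sum = 0.0, worst = 0.0, l;

		for (i = 0; i < SAMPLES; i++) {
			l = wakeup_latency(ncpus);
			sum += l;
			if (l > worst)
				worst = l;
		}
		printf("%d CPU(s): avg %5.1f msec, worst %5.1f msec\n",
			ncpus, sum / SAMPLES, worst);
	}
	return 0;
}

with this model the averages come out around 50, 33 and 20 msec for 1, 2 and
4 CPUs, while the worst case stays pinned near 100 msec - exactly the 'users
hear fewer skips, but the skips do not go away' effect.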
> > of course it has more footprint than 0 instructions [ == the current SMP
> > kernel].
>
> I thought that we had had this conversation in another setting and you
> had argued that cache bloat was absolutely critical path.
yes, that's possible :-) but here there is a (i believe big) quality
difference, and in that case i believe the argument was about some minor
microoptimization. But i do agree that doing the 'complete' preemptible
solution adds more per-spinlock-operation (cache footprint) overhead than
i'd like to have. Changing the icache footprint of spinlocks by 1 byte adds
about 5k to the kernel image size (i.e. there are on the order of 5000
inlined lock/unlock sites), so we are very sensitive about that. One
solution would be to make spinlocks a (highly optimized) function call
[ducking down]:
de8: 68 00 00 00 00 pushl $0x0
ded: e8 e2 ff ff ff call dd4 <__spin_lock>
de8: 68 00 00 00 00 pushl $0x0
ded: e8 e2 ff ff ff call dd4 <__spin_unlock>
20 bytes of icache footprint. (__spin_lock is a special function that
cleans up its stack argument itself, so no addl $4,%esp is needed at the
call site.)
A typical spinlock acquire+release currently is:
de8: f0 0f ba 2d 00 00 00 lock btsl $0x0,0x0
def: 00 00
df1: 0f 82 a0 00 00 00 jb e97 <dummyy+0xaf>
df7: f0 0f ba 35 00 00 00 lock btrl $0x0,0x0
dfe: 00 00
24 bytes. So a function-call based spin lock/unlock has ~20% less icache
footprint (20 vs. 24 bytes) - but obviously higher runtime overhead.
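
For reference, __spin_lock/__spin_unlock could look roughly like this (only
a sketch of the idea, not real code - the lock structure is made up, and the
stdcall attribute is just one way to make the callee pop its own stack
argument, which is where the saved addl $4,%esp comes from):

/*
 * sketch only: out-of-line spinlock helpers. stdcall makes the callee
 * clean up its stack argument (it returns with "ret $4"), so the call
 * site is just "pushl $lock; call __spin_lock" - 10 bytes of icache.
 */
#include <asm/bitops.h>		/* test_and_set_bit(), clear_bit() */

struct sketch_spinlock { volatile unsigned long lock; };	/* hypothetical */

void __attribute__((stdcall)) __spin_lock(struct sketch_spinlock *lp)
{
	while (test_and_set_bit(0, &lp->lock))	/* the "lock btsl" */
		while (lp->lock)		/* spin on plain reads */
			/* relax */ ;
}

void __attribute__((stdcall)) __spin_unlock(struct sketch_spinlock *lp)
{
	clear_bit(0, &lp->lock);		/* the "lock btrl" */
}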
The function-call based spinlock implementation could be made even cheaper
for global spinlocks, where a dedicated helper function does the
lock/unlock:
spin_lock_tasklist();
spin_unlock_tasklist();
this variant (two bare 5-byte calls, 10 bytes) is less than half the icache
footprint of inlined spinlocks, since the lock address no longer has to be
pushed at the call site.
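
e.g. along these lines, filling in what spin_lock_tasklist() could look like
(again only a sketch, reusing the __spin_lock helpers from above):

/*
 * sketch: one dedicated helper per heavily-used global lock, so the
 * call site is a single "call spin_lock_tasklist" instruction.
 */
static struct sketch_spinlock tasklist_sketch_lock;	/* hypothetical */

void spin_lock_tasklist(void)
{
	__spin_lock(&tasklist_sketch_lock);
}

void spin_unlock_tasklist(void)
{
	__spin_unlock(&tasklist_sketch_lock);
}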
Ingo