I believe the TLB flush argument is not necessarily relevant for the
following reasons:
1) This assumes that the next thread to be scheduled will be
sharing the address space. Otherwise, the TLB flush is
necessary anyway (except, of course, for pages marked with
the global bit, which the per-processor data page would not
be).
2) Typical multithreaded applications which are designed to
run on SMP systems will create N threads where N is equal
to or less than the number of processors on the system.
With good processor affinity in a scheduler one would very
seldom be scheduling more than one thread from a single
address space on the same processor.
3) Typical workloads contain more than a single multithreaded
application process, and thus the likelihood of two
consecutive threads sharing an address space is decreased.
[Measuring this under some real-world workloads would
be interesting. I can try this with some Oracle benchmark
runs under 2.2]
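
The measurement suggested in point 3 could be sketched roughly as
follows. This is a user-space illustration only, not kernel code: it
assumes a per-processor trace of address space identifiers has already
been recorded at each context switch (how that trace is collected is
left open), and the function name is hypothetical.

```c
#include <stddef.h>

/*
 * Given the sequence of address-space ids scheduled on ONE processor,
 * count the context switches where the incoming thread shares its
 * address space with the outgoing one (i.e. where a TLB flush could
 * have been skipped). A trace of n entries contains n - 1 switches.
 */
size_t count_same_mm_switches(const int *mm_ids, size_t n)
{
    size_t same = 0;

    for (size_t i = 1; i < n; i++) {
        if (mm_ids[i] == mm_ids[i - 1])
            same++;
    }
    return same;
}
```

Dividing the result by n - 1 gives the fraction of switches for which
the TLB flush argument would apply at all.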
I suspect the additional cost of the clone operation (if it
is truly that significant) is outweighed by the benefits of the
functionality provided by per-processor data areas.
Some additional examples of benefits of a per-processor data
area (e.g. one 4MB page) would include local storage for per-processor
kernel virtual memory allocator pools and for the processor's IDT, GDT
and TSS (to prevent LOCK# conflicts when hardware accesses the GDT/IDT,
etc.). Access to such data structures would not require any lock
protocol or locked bus transactions, as they are only visible to one
processor.
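
To make the idea concrete, here is a minimal sketch of what such an
area might contain. The structure layout, the pool size and all names
are assumptions made for illustration; only the descriptor sizes
(8-byte GDT/IDT entries, 104-byte TSS on i386) are taken from the
architecture. The allocator takes no lock because, by construction,
only the owning processor can see the area.

```c
#include <stddef.h>
#include <stdint.h>

#define PERCPU_AREA_SIZE  (4u * 1024u * 1024u)  /* one 4MB page */
#define PERCPU_POOL_BYTES (64u * 1024u)         /* hypothetical pool size */

struct percpu_area {
    uint64_t gdt[32];      /* 8-byte segment descriptors (count is made up) */
    uint64_t idt[256];     /* 8-byte gate descriptors, 256 vectors */
    uint32_t tss[26];      /* i386 TSS is 104 bytes */
    size_t   pool_used;    /* bytes handed out from pool[] so far */
    uint64_t pool[PERCPU_POOL_BYTES / 8];  /* 8-byte aligned backing store */
};

/* The whole area must fit in the single 4MB mapping. */
_Static_assert(sizeof(struct percpu_area) <= PERCPU_AREA_SIZE,
               "per-processor area exceeds one 4MB page");

/*
 * Lock-free bump allocation from the local pool. No lock or locked bus
 * cycle is needed, since no other processor maps this area.
 * Returns NULL when the pool cannot satisfy the request.
 */
void *percpu_pool_alloc(struct percpu_area *area, size_t size)
{
    size_t aligned = (size + 7u) & ~(size_t)7u;  /* round up to 8 bytes */

    if (aligned < size || aligned > PERCPU_POOL_BYTES - area->pool_used)
        return NULL;

    void *p = (unsigned char *)area->pool + area->pool_used;
    area->pool_used += aligned;
    return p;
}
```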
Only one level 2 page table will be different between
threads running in the same address space, and that page table only
describes kernel virtual addresses, in a region where no changes
will be made to the page table during system operation. I therefore
believe the additional cost for clone will be much less than assumed
(while the page directories will be different, all but one of the page
tables would be shared).
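
The sharing argument can be illustrated with a small user-space model
of non-PAE i386 page directories (1024 entries, each covering 4MB).
This is a sketch under stated assumptions: the function names, the
choice of directory slot and the entry values are made up, and real
kernel code would operate on actual page tables rather than plain
arrays.

```c
#include <stdint.h>
#include <string.h>

#define PDE_COUNT       1024u   /* entries in a non-PAE i386 page directory */
#define PERCPU_PDE_SLOT 1023u   /* hypothetical slot for the per-processor area */

typedef uint32_t pde_t;

/*
 * Build a processor-private page directory: copy every entry from the
 * shared directory (so all lower-level page tables stay shared), then
 * override the single entry that maps the per-processor area.
 */
void make_percpu_pgdir(pde_t *dst, const pde_t *shared, pde_t percpu_pde)
{
    memcpy(dst, shared, PDE_COUNT * sizeof(pde_t));
    dst[PERCPU_PDE_SLOT] = percpu_pde;
}

/* Count the directory entries in which two page directories differ. */
unsigned count_differing_pdes(const pde_t *a, const pde_t *b)
{
    unsigned diff = 0;

    for (unsigned i = 0; i < PDE_COUNT; i++) {
        if (a[i] != b[i])
            diff++;
    }
    return diff;
}
```

In this model the directories built for two processors differ in
exactly one entry; the per-clone overhead is the 4KB directory copy
plus that one entry.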
Processor local data, by its very inaccessibility to other processors,
can help to increase system reliability as well, by precluding
inadvertent modification by another processor, algorithmic
difficulties with structures shared between CPUs, and broken
device drivers and/or loadable modules.
I believe a structure should be associated with
a fixed virtual address in vmlinux.lds such that all processors
will automatically access processor local data through the
per-processor mappings in the kernel virtual address space.
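
From the C side, access through such a fixed address might look
roughly like the sketch below. The address 0xFFC00000 (the top 4MB of
a 32-bit address space), the structure fields and the macro and
function names are all hypothetical; the real address would come from
whatever symbol vmlinux.lds assigns. The accessor macro is only
meaningful inside a kernel that has the mapping in place, so it must
not be dereferenced in user space.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical fixed kernel virtual address of the per-processor area. */
#define PERCPU_VADDR 0xFFC00000u

/* Hypothetical contents; each processor sees its own copy here. */
struct percpu_data {
    uint32_t cpu_id;
    uint32_t irq_count;
};

/*
 * Same expression on every processor; the per-processor page directory
 * entry decides which physical memory it actually reaches.
 */
#define local_cpu_data \
    ((volatile struct percpu_data *)(uintptr_t)PERCPU_VADDR)

/* Virtual address of a field at the given offset in the area. */
uintptr_t percpu_field_vaddr(size_t offset)
{
    return (uintptr_t)PERCPU_VADDR + offset;
}
```

Because the address is a link-time constant, no processor number
lookup or indexing is needed to reach local data.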
scott lurndal
sgi