Differences with `Kernelver':
(the description is almost patch-ordered, I am writing this while browsing
the patch from top to bottom)
o SMP lock profiling. It's a config option (so there is no overhead if
you are not interested in seeing how well your CPUs are scaling).
With this option enabled you'll get in /proc/stat a new per-CPU field
after the number of idle cycles. This is the number of cycles
that the CPU spent spinning on an SMP lock (that is, the time the
CPU wasted because SMP does not parallelize perfectly). It would be nice
if somebody hacked xosview to take care of this field. Note: the
time that the kernel spends on an SMP lock is also accounted as system
time, so xosview should subtract the locktime from the system time
and show the locktime at the end of the bar (I imagine it as a cold
green ;). Maybe someday I'll hack xosview myself to have a bit of
fun if nobody does it before me ;). Ah, and you also get a new
Locktime line at the end of /proc/self/status that tells you
how many cycles the process spent spinning on an SMP lock (with
one such field for every CPU). If you disable the option
in the kernel configuration, all lock fields will always be
zero. The overhead of this option should not be noticeable (at least
if you don't have tons of modules, but hey, you are just impacting
page-fault performance ;). One last note: to
get the stats for kernel modules too, you need to patch insmod;
you can find that patch on the e-mind.com ftp site as well.
o moved get_wchan() to the arch section.
o i386 irq_desc cacheline aligned.
o global_bh_count changed from atomic_t to unsigned long because it's
always used with bitops.
o bottom half improvement (no need to sti/cli when called by schedule).
o i386 with TSC will now recover lost timer ticks (lost because interrupts
were disabled for too long; this is an issue on SMP too, for example
while pressing SYSRQ+M).
o more verbose monitoring of SYSRQ+M.
o mips gettimeofday fix after jiffy wrap (it should already be merged in
the mips tree though).
o ll_rw_block must reverse pgpgout and pgpgin after a read/write ahead
request otherwise the vmstats will be screwed up.
o bttv stops generating 50 irq/sec when not used.
o sbpro 16bit/44.1khz stereo software emulation (nothing needs to change
on the userspace side). I did this because I have a sbpro and I got
tired of getting complaints about unsupported sound formats from
userspace ;).
o heavy changes to buffer.c (some parts completely rewritten from scratch),
now all buffers live in an rb-tree indexed by (in order) offset, device
and size. Fixed many problems, from a missing b_state = 0
to set_blocksize/invalidate_buffer races and flushtime not working.
o prune the dcache before starting to swap. The prune won't be a complete
prune but will be a function of the priority and will scale nicely.
Well, the implementation is not that clever but works fine here:
- prune_dcache(0);
+ prune_dcache(dentry_stat.nr_unused / (priority+1));
;).
o inodes can't grow over inode-max even if there are really inode-max
inodes in use. This means that you may get the message
"grow_inodes: allocation failed" but you'll never have allocated
more inodes than inode-max (so there's no way to leak memory; I say
leak because inodes are never shrunk). Ah, and a process that
has CAP_SYS_RESOURCE is allowed to override
inode-max (so root will never see that message on the console ;).
I don't know if this paranoid behaviour is really something we want
or not, but at least with the large-fd-set patch I think it will make
the admin happier ;). Otherwise any user can open > inode-max file
descriptors pointing to different inodes, leaking
memory that the admin can only get back by rebooting.
In the stock kernel a malicious user would have to fork many tasks to do
that (right now one task can hold open at most 256 inodes), so such a
user would probably be noticed quite easily... so for now the issue
is minor.
o /proc/ race fixes. And it's not slower because I added a parameter
to grab_mm or grab_tsk that tells if we need the mm semaphore because
we are going to sleep or not.
o semaphore race fixes. I don't know why they have not been included yet.
You can find the relevant email, which shows a real
subtle-and-unlikely-to-happen race, in one of my past posts to linux-kernel
from some months ago (I think that searching for the keyword `semaphore'
will be enough to find it).
o there's no point in not running a bh handler if there is an irq
on the other cpu, since the irq on the other CPU can start a nsec
after the check for global_irq_count anyway.
o fixed a bh_mask_count/bh_mask race with an i386_bh_lock spinlock
(following the sparc track) and some other, even more minor, races
(wmb() issues).
o getblk, get_hash_table, and find_buffer have to be FASTCALL ;).
o rb-tree in the page cache (not per-inode rb but only one whole rb).
o My own rb-tree implementation from scratch. Everything common is placed
in include/linux/rbtree.h, everything else has to be case-specific.
Everything is inline; if you don't want it inline, wrap it in a
separate function in a .c file and call that function instead.
o my VM thrashing heuristic.
o two level lru in the page cache, and mapped/unfreeable pages are removed
from the lru list and moved to a separate queue to improve shrink_mmap
even more. This implies heavy changes to the whole VM. The thing
is not as simple as written here; some things that may seem
minor are actually major. Removed all tuning VM parameters (except
bdflush).
o killed tqueue_lock using smart SMP write ordering to memory
(run_task_queue doesn't need to cli anymore).
o fixed a bug in shm.c (swap_id must be set to 0 before failing,
otherwise the next pass will run on with an out-of-bounds value).
o implemented the slab kmem_cache_destroy() to allow modules to use
the slab and to be insmodded more than once. I also did
some cleanup of the slab code.
o fixed a bound check in sys_init_module() where we weren't backwards
compatible with obsolete insmod binaries.
o reschedule_idle() fix: we were not checking if there was an idle CPU
(maybe the preferred one) before rescheduling the current task.
o SYSRQ+T will show the get_wchan information in the unknown field ;).
o update_shared_mappings (will greatly improve performance while
writing from many tasks to the same shared memory).
o some changes to do_wp_page. I think we should release the big kernel
lock after setting the pte to dirty (if we are taking all the access
over a swap cache page for example). And there was a missing
unlock_kernel in the end_wp_page path.
o killed the free_after bit because we don't need it anymore. It probably
also makes sense to keep it in case we need it sometime in the future,
but I want to go faster in the meantime ;). Removing it is trivial,
there is no way to introduce bugs doing that (I hope these are not the
famous last words ;).
o be fair while setting the pagetables from young to old.
o fixed the shm_swap nono.
o auto-increase the rcvbuf if we are going to lose packets because
it's been set to a too low value (lower than the MTU of the device)
by userspace (avoid TCP deadlock).
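
To make the last item concrete, here is a minimal sketch of the idea only.
It is not the code from the patch: fix_rcvbuf() and its parameters are
hypothetical names, and the real change lives in the networking code. The
point is just that a receive buffer smaller than the device MTU gets raised
so a full-sized packet can always be queued.

```c
#include <assert.h>

/*
 * Hypothetical helper (not from the patch): given the receive buffer
 * size requested by userspace and the MTU of the device, return the
 * size that should actually be used.
 */
unsigned int fix_rcvbuf(unsigned int rcvbuf, unsigned int mtu)
{
	assert(mtu > 0);

	/*
	 * A buffer smaller than one MTU-sized packet would drop every
	 * full-sized packet, so raise it to at least the MTU.
	 */
	if (rcvbuf < mtu)
		return mtu;

	return rcvbuf;
}
```

With an MTU of 1500, a requested rcvbuf of 512 would be raised to 1500,
while a requested 65536 is left alone.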
Everything above comes from my own ideas and has been developed by me in
my spare time except:
o the shm.c swap_id fix, which was spotted and fixed by
Kanoj Sarcar <kanoj@google.engr.sgi.com> (on linux-kernel in the last few
days; I merged it to avoid getting an Oops ;).
o the /proc (lock field) part of the SMP profiling patch was
developed by Andi Kleen.
o some smp.c changes/cleanup that I merged from a patch by Ingo Molnar
that was equivalent to my recover_lost_ticks() code (except that it
wasn't recovering the lost ticks).
o the original idea/implementation of removing the tqueue spinlock is
from yodaiken@chelm.cs.nmt.edu (for RT-linux if I remember well). I
fixed all the races in it, and Patrik Rak <patrik@raxoft.cz> also
suggested some good improvements on top of my fixes.
I think the most interesting code to analyze is my new rb-tree and my
buffer.c/VM code. It's working great here. If you try it let me know your
impressions!
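
For anyone who wants to start from the buffer.c side: the buffer tree is
indexed by (in order) offset, device and size. A comparison over such a key
could look roughly like the sketch below. The struct and function names are
made up for illustration and are not the ones used in the patch or in
include/linux/rbtree.h.

```c
/*
 * Hypothetical lookup key for a buffer; the field names and types are
 * assumptions, only the ordering (offset, device, size) comes from the
 * description above.
 */
struct buf_key {
	unsigned long offset;
	unsigned int dev;
	unsigned int size;
};

/*
 * Three-way comparison: returns <0, 0 or >0 like memcmp().
 * Offset is the most significant part of the key, then device,
 * then size, so a tree walk goes left on <0 and right on >0.
 */
int buf_key_cmp(const struct buf_key *a, const struct buf_key *b)
{
	if (a->offset != b->offset)
		return a->offset < b->offset ? -1 : 1;
	if (a->dev != b->dev)
		return a->dev < b->dev ? -1 : 1;
	if (a->size != b->size)
		return a->size < b->size ? -1 : 1;
	return 0;
}
```

Since everything in the rb-tree code is meant to be inlined and
case-specific, a comparison like this would be open-coded in the lookup
loop rather than called through a function pointer.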
This time I tried to make this patch a bit more serious than the previous
ones (even if it's far from compiling on other archs: to compile it on a
non-i386 arch, at least the _lock_start and _lock_end symbols must be
defined in vmlinux.lds, and the pageskip code must use page->next instead of
page->next_hash).
I suggest you try this patch under any kind of load and memory
configuration and send feedback on how it performs compared to a clean
2.2.6 (or whatever).
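
If you report numbers from an SMP box, the lock profiling field described
in the first item can be summarized as sketched below. This is only an
illustration: the struct and function names are hypothetical, and it
assumes the lock counter and the system counter are in the same units.
Because lock time is also accounted as system time, it has to be
subtracted to get the "real" system time (which is what xosview would
have to do too).

```c
/* Hypothetical per-CPU counters, as a monitoring tool might hold them. */
struct cpu_times {
	unsigned long system;	/* system time, lock spinning included */
	unsigned long lock;	/* time spent spinning on SMP locks */
};

/* System time with the lock spinning removed (never goes below 0). */
unsigned long system_without_lock(const struct cpu_times *t)
{
	return t->lock < t->system ? t->system - t->lock : 0;
}

/* Share of the system time that was wasted spinning, in percent. */
unsigned int lock_percent(const struct cpu_times *t)
{
	if (t->system == 0)
		return 0;
	/* widen before multiplying to avoid overflow on 32bit */
	return (unsigned int)((unsigned long long)t->lock * 100 / t->system);
}
```

For example, with system = 1000 and lock = 250 the CPU did 750 units of
real system work and wasted 25% of its system time on locks.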
There is also an issue I just remembered: the lower bound of the RTO. It
kills performance on very fast networks. Do we really have to bound the RTO
below at 200 msec? (the same applies to the ato, which is bounded below by
the rto). rfc793 tells us:
RTO = min[UBOUND,max[LBOUND,(BETA*SRTT)]]
and specifies that LBOUND can be, for example, 1 sec ;). rfc1122 doesn't
seem to say anything more specific.
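
To show why the lower bound hurts, here is the rfc793 formula written out as
a small function. It is a sketch of the formula only, not the kernel's RTO
code: rfc793_rto() is a made-up name, and the units (milliseconds) and the
use of doubles are my choice for readability.

```c
/*
 * RTO = min[UBOUND, max[LBOUND, (BETA * SRTT)]]   (rfc793)
 *
 * srtt_ms   smoothed round trip time
 * beta      delay variance factor (rfc793 suggests 1.3 to 2.0)
 * lbound_ms lower bound on the timeout
 * ubound_ms upper bound on the timeout
 */
double rfc793_rto(double srtt_ms, double beta,
		  double lbound_ms, double ubound_ms)
{
	double rto = beta * srtt_ms;

	if (rto < lbound_ms)	/* max[LBOUND, ...] */
		rto = lbound_ms;
	if (rto > ubound_ms)	/* min[UBOUND, ...] */
		rto = ubound_ms;

	return rto;
}
```

On a fast LAN with an SRTT of 1 msec and BETA = 2, the formula alone would
give 2 msec, but with LBOUND = 200 msec the result is 200 msec, so a single
lost packet stalls the connection 100 times longer than the measured round
trip time suggests.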
Andrea Arcangeli