Re: Interesting scheduling times - NOT

Kurt Garloff (garloff@kg1.ping.de)
Tue, 22 Sep 1998 13:12:19 +0200


On Fri, Sep 18, 1998 at 10:51:16AM -0600, Larry McVoy wrote:
> ..
> another 2.4. To get to your 28 usecs, there would have to be 11 cache
> misses (in both L1 and L2) per link walked. If that's the case, my
> apologies and I'll go yell at Linus. But I don't think that's the case.
> I just walked the code to make sure and it looks like 6-7 cache misses
> to me: the fields referenced are (in order of reference):
>
> has_cpu // referenced in can_schedule()
> policy // referenced in goodness()
> counter // referenced in goodness()
> processor // referenced in goodness()
> mm // referenced in goodness()
> priority // referenced in goodness()
> next_run // referenced in schedule()
>
> processor is right next to has_cpu so if they don't cross a boundary,
> that's one cache line.
>
> ...
>
> An interesting test would be to go whack sched.h to put the listed fields
> all right next to each other, starting on a cache line boundary, in the
> order listed above. It's 20 bytes total, that is just 2 cache lines
> instead of 6-7. If that made a big difference, you'd have a case to make
> that the run queue hurts. Even if it doesn't make a big difference, it
> would be a good thing - cache misses are bad and they get nastier on SMP.
> IRIX has been heavily whacked to put all the data you need and the lock
> on the same cache line for all the data structures that are commonly
> used. They didn't do all that work because they were bored.

Good idea, Larry. So I also spent some time without being bored ...

Here are my own statistics (a condensed sketch of the corresponding loop
follows the lists):
These fields are referenced in schedule() for the previous task:
* processor
* need_resched
* counter
* policy
* priority
* state
* sigpending (signal_pending)
* timeout
* has_cpu (SMP)

These fields are referenced for each runnable task:
* has_cpu (SMP) (can_schedule)
* next_run
and in goodness()
* policy
* rt_priority
* counter
* processor (SMP)
* mm
* priority
if we recalc counters:
* next_task (for_each_task)

If the runq is changed this is referenced:
* prev_run
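
To make the access pattern explicit, here is a condensed sketch of the
run queue walk in schedule() -- not the verbatim 2.1.122 code, just the
field references that matter for the cache (p, prev and this_cpu are as
in the real function):

	/* Condensed sketch of the schedule() inner loop. Every
	 * iteration touches the fields listed above, so each runnable
	 * task costs as many cache misses as cache lines those fields
	 * are spread over. */
	struct task_struct *p = init_task.next_run;
	struct task_struct *next = idle_task;
	int c = -1000;			/* best goodness seen so far */

	while (p != &init_task) {
		if (can_schedule(p)) {	/* reads p->has_cpu on SMP */
			/* goodness() reads p->policy, p->rt_priority,
			 * p->counter, p->processor (SMP), p->mm and
			 * p->priority */
			int weight = goodness(p, prev, this_cpu);
			if (weight > c)
				c = weight, next = p;
		}
		p = p->next_run;	/* the link walked per task */
	}
	/* if all counters have run down: for_each_task() via
	 * next_task, recalculating p->counter from p->priority */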

So I reordered the fields of task_struct in sched.h to be more cache
friendly. All fields referenced within the runnable-task loop are within
the second cache line (0x20-0x3f), whereas the other fields normally used
in schedule() are placed within the first cache line (0x00-0x1f). Given
that the struct is properly aligned (i.e. to 2^5 = 0x20), we have a
minimal cache footprint.
(This is true for 32-bit machines only, because the sizes of long and of
pointers are different on 64-bit machines.)
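
Distilled from the diff appended below, the head of the reordered struct
then looks like this (the offset comments assume 32-bit longs and
pointers and a struct starting on a 0x20-byte boundary):

	struct task_struct {
	/* 0x00: 1st cache line: fields used for the previous task */
		volatile long state;
		unsigned long flags;
		int sigpending;
		mm_segment_t addr_limit;
	/* 0x10: */
		struct exec_domain *exec_domain;
		long need_resched;
		unsigned long timeout;
		struct task_struct *prev_run;
	/* 0x20: 2nd cache line: fields used per runnable task */
		struct task_struct *next_run;
		struct task_struct *next_task;
		unsigned long policy, rt_priority;
	/* 0x30: */
		long counter, priority;
		struct mm_struct *mm;
		char processor, has_cpu, last_processor, lock_depth;
	/* 0x40: the rest is not needed in the scheduler hot path */
		struct task_struct *prev_task;
		/* ... */
	};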

I made one change that some may dislike: I changed the type of the SMP
fields from int to char, which keeps machines with more than 128 CPUs
from being happy. I don't know if this is an issue; the problem is that
otherwise I would not have been able to fit everything important into
0x40 bytes. It might be that lock_depth is the most critical of them and
needs to be changed back to int? I'm no SMP expert. Only has_cpu and
processor are critical to fit into the cache line.
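
If someone wants to guard against this layout silently rotting, a
compile-time check along the following lines would do. This is a
hypothetical snippet, not part of the patch; CHECK is the usual
negative-array-size trick, and the offsets hold for 32-bit
longs/pointers only:

	#include <stddef.h>
	#include <linux/sched.h>

	/* Hypothetical layout check, not part of the patch:
	 * compilation fails (negative array size) if a hot field
	 * leaves its intended cache line. */
	#define CHECK(expr) extern char layout_check[(expr) ? 1 : -1]

	CHECK(offsetof(struct task_struct, next_run)  == 0x20);
	CHECK(offsetof(struct task_struct, mm)        <  0x40);
	CHECK(offsetof(struct task_struct, processor) <  0x40);
	CHECK(offsetof(struct task_struct, has_cpu)   <  0x40);
	/* The char fields also limit processor numbers to what fits
	 * in a (possibly signed) char, i.e. 128 CPUs at most. */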

I didn't perform timings on the code, but I did make sure Linux (UP)
works properly after applying the patch.
Richard, could you perform your tests? You could also send me your test
program. As the K6-2 L1 cache is only 2-way associative, IIRC, it should
make some difference here, too.
Linus, you may want to include it in the kernel if it helps make
schedule() a little bit faster?

Appended is the diff against 2.1.122.

-- 
Kurt Garloff, Dortmund 
<K.Garloff@ping.de>
PGP key on http://student.physik.uni-dortmund.de/homepages/garloff
----------------------------------------------------------------------------
The following is a Python RSA implementation. According to the US Government
posting these four lines makes me an international arms trafficker!  Join me
in civil disobedience; add these lines of code to your .sig block to help get
this stupid and unconstitutional law changed.
============================================================================
from sys import*;from string import*;a=argv;[s,p,q]=filter(lambda x:x[:1]!=
'-',a);d='-d'in a;e,n=atol(p,16),atol(q,16);l=(len(q)+1)/2;o,inb=l-d,l-1+d
while s:s=stdin.read(inb);s and map(stdout.write,map(lambda i,b=pow(reduce(
lambda x,y:(x<<8L)+y,map(ord,s)),e,n):chr(b>>8*i&255),range(o-1,-1,-1)))

[Attachment: 21122-task_struct.diff (text/plain)]

--- linux/include/linux/sched.h.orig	Thu Sep 17 10:33:04 1998
+++ linux/include/linux/sched.h	Tue Sep 22 13:10:26 1998
@@ -207,7 +207,9 @@
 struct user_struct;
 
 struct task_struct {
-/* these are hardcoded - don't touch */
+/* First put all info important needed in schedule() for every task
+ * (according to L.McVoy) in order to be cache friendly.
+ * Offset comments are for true 32bit archs, only. K.Garloff, 98/09/22 */
 	volatile long state;	/* -1 unrunnable, 0 runnable, >0 stopped */
 	unsigned long flags;	/* per process flags, defined below */
 	int sigpending;
@@ -215,19 +217,28 @@
 					0-0xBFFFFFFF for user-thead
 					0-0xFFFFFFFF for kernel-thread
 				 */
+/* 0x10: */
 	struct exec_domain *exec_domain;
 	long need_resched;
+	unsigned long timeout;
+	struct task_struct *prev_run;
+/* 0x20: */
+	struct task_struct *next_run;
+	struct task_struct *next_task;
+	unsigned long policy, rt_priority;
+/* 0x30: */
 
 /* various fields */
-	long counter;
-	long priority;
+	long counter, priority;
+/* memory management info */
+	struct mm_struct *mm;
 /* SMP and runqueue state */
-	int has_cpu;
-	int processor;
-	int last_processor;
-	int lock_depth;		/* Lock depth. We can context switch in and out of holding a syscall kernel lock... */
-	struct task_struct *next_task, *prev_task;
-	struct task_struct *next_run, *prev_run;
+	char processor, has_cpu, last_processor, lock_depth;
+	/* Lock depth. We can context switch in and out of holding a syscall kernel lock... */
+
+/* 0x40: */
+
+	struct task_struct *prev_task;
 
 /* task state */
 	struct linux_binfmt *binfmt;
@@ -258,7 +269,6 @@
 	struct task_struct **tarray_ptr;
 
 	struct wait_queue *wait_chldexit;	/* for wait4() */
-	unsigned long timeout, policy, rt_priority;
 	unsigned long it_real_value, it_prof_value, it_virt_value;
 	unsigned long it_real_incr, it_prof_incr, it_virt_incr;
 	struct timer_list real_timer;
@@ -295,8 +305,6 @@
 	struct fs_struct *fs;
 /* open file information */
 	struct files_struct *files;
-/* memory management info */
-	struct mm_struct *mm;
 /* signal handlers */
 	spinlock_t sigmask_lock;	/* Protects signal and blocked */
 	struct signal_struct *sig;
@@ -338,9 +346,12 @@
  */
 #define INIT_TASK \
 /* state etc */	{ 0,0,0,KERNEL_DS,&default_exec_domain,0, \
+/* timeout */	0,&init_task,&init_task,&init_task,\
+/* policy */	SCHED_OTHER,0, \
 /* counter */	DEF_PRIORITY,DEF_PRIORITY, \
-/* SMP */	0,0,0,-1, \
-/* schedlink */	&init_task,&init_task, &init_task, &init_task, \
+/* mm */	&init_mm, \
+/* SMP */	(char)0,(char)0,(char)0,(char)-1, \
+/* schedlink */	&init_task, \
 /* binfmt */	NULL, \
 /* ec,brk... */	0,0,0,0,0,0, \
 /* pid etc.. */	0,0,0,0,0, \
@@ -348,7 +359,7 @@
 /* pidhash */	NULL, NULL, \
 /* tarray */	&task[0], \
 /* chld wait */	NULL, \
-/* timeout */	0,SCHED_OTHER,0,0,0,0,0,0,0, \
+/* it */	0,0,0,0,0,0, \
 /* timer */	{ NULL, NULL, 0, 0, it_real_fn }, \
 /* utime */	{0,0,0,0},0, \
 /* per CPU times */ {0, }, {0, }, \
@@ -367,7 +378,6 @@
 /* tss */	INIT_TSS, \
 /* fs */	&init_fs, \
 /* files */	&init_files, \
-/* mm */	&init_mm, \
 /* signals */	SPIN_LOCK_UNLOCKED, &init_signals, {{0}}, {{0}}, NULL, &init_task.sigqueue, 0, 0, \
 }
