A good thing to do, thanks for doing it.
: We use the same method as lmbench
Funny, that's how I would have done it :-)
: (one array is allocated on each thread)
Are you sure it is N arrays, one per thread, and that these are not shared
between the threads?
: size (Kb) | number of procs | threads (usec) | processes (usec)
: 1 | 2 | 4 | 5
This makes sense, everything fits in the caches.
: 128 | 2 | 8 | 205
This would make sense if the array is shared between all threads and/or
your cache is large enough to hold both arrays and they don't conflict
in the second level cache (think direct mapped, no page coloring).
: 128 | 64 | 21 | 1063
This is the interesting one. You have a working set of 128K * 64 processes
or 8MB. If the array really was shared between all threads, then the 21 usec
number makes sense, you're increasing slightly due to more contexts to deal
with, but it is basically not going up. The 1063 number makes sense because
you are touching much more data and it can't possibly fit in the cache.
So let's see what the number should be. If you really did it like I did,
then your baseline was the 2 process case, including the summing of the
data in a hot cache. So the only difference should be 128K of data pulled
in through cache misses. On Intel a cache line is 32 bytes, so that is
128K/32 = 4096 misses. Guess .2 usecs per miss (close enough), and that
works out to about 819 usecs for the cache misses.
Sounds about right to me.
So I suspect that you are really using a shared VM with a shared array at the
same address. Please check (or send me the code and I'll check), and if that
is not the case, then absolutely post again and let's figure out what is
going on.