Re: Quad core CPUs loaded at only 50% when running a CPU- and mmap-intensive multi-threaded task

From: Török Edwin
Date: Mon Aug 25 2008 - 06:22:35 EST


On 2008-08-25 13:02, Peter Zijlstra wrote:
On Mon, 2008-08-25 at 12:49 +0300, Török Edwin wrote:
On 2008-08-25 12:23, Peter Zijlstra wrote:
On Mon, 2008-08-25 at 10:04 +0300, edwin wrote:
Peter Zijlstra wrote:
On Mon, 2008-08-25 at 00:01 +0300, Török Edwin wrote:
Hi Ingo,

When I run clamd (www.clamav.net), I can only get to load my CPU 50% (according to top), and disks at 30% (according to iostat -x 3), regardless how many threads I set (I tried 4, 8, 16, 32).
Can you share your .config, and perhaps tell which kernel version
worked for you?
Sorry, I forgot to include the .config; it's at the end of this mail (the cfs-debug-info output included the .config, though).

Well, I just bought this new box, so there isn't a kernel version that I know worked on this hardware (but I am trying to boot some older versions now).
However, on my previous box (Athlon64, non-SMP) I never saw such a problem (the CPU being loaded only 50% by clamd), and I've been
running 2.6.26 and 2.6.27-rc4 there too.

Details below, short summary here:
2.6.24: WORKS, clamd 400% CPU, testprogram runs in 27.4 seconds, 67% CPU load; and 28.5 seconds w/o setting affinity
2.6.25+: DOES NOT WORK, clamd 200%-300% CPU, testprogram runs in 38-40 seconds, 48-48% CPU load, and 47-56 seconds w/o setting affinity

Debian has 2.6.18, 2.6.22, 2.6.24, 2.6.25, 2.6.26.
2.6.22 won't work with my LVM setup, so I can't boot that; I tried 2.6.24 instead:

Unfortunately the stock 2.6.24 kernel doesn't have sched_debug enabled, but the output of cfs-debug-info.sh is available here; maybe it contains some useful info:
http://edwintorok.googlepages.com/testrun-1219645937.tar.gz

Is this enough info for you to reproduce the problem, or do you want me to try and bisect?
No, I think I know what's going on..

mmap() and munmap() need to take the mmap_sem for writing (since they
modify the memory map), and you let each thread (one for each CPU) take
that process-wide lock, twice, a million times.
Are you referring to the mmap_sem lock, or my mutex lock around all_thread_time?

mmap_sem; it's process-wide, and your test prog bangs on it like there's
no tomorrow.
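
For reference, a minimal sketch of the pattern being described here (this is not the actual test program, whose source isn't quoted in this thread): each worker thread runs an mmap()/munmap() pair per iteration, and both calls take the process-wide mmap_sem for writing, so the threads end up serializing on that one lock.

#define _GNU_SOURCE             /* for MAP_ANONYMOUS on older glibc */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

#define NTHREADS   4            /* one thread per core */
#define ITERATIONS 1000000      /* "a million times" */
#define MAP_SIZE   (64 * 1024)  /* arbitrary mapping size for the sketch */

static void *worker(void *arg)
{
	long i;

	(void)arg;
	for (i = 0; i < ITERATIONS; i++) {
		/* mmap() takes mmap_sem for writing ... */
		void *p = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (p == MAP_FAILED) {
			perror("mmap");
			exit(1);
		}
		/* ... and so does munmap(): two writer acquisitions per loop. */
		munmap(p, MAP_SIZE);
	}
	return NULL;
}

int main(void)
{
	pthread_t tid[NTHREADS];
	int i;

	for (i = 0; i < NTHREADS; i++)
		pthread_create(&tid[i], NULL, worker, NULL);
	for (i = 0; i < NTHREADS; i++)
		pthread_join(tid[i], NULL);
	return 0;
}

Built with something like "gcc -O2 -pthread"; the point is that every iteration acquires mmap_sem for writing twice, so adding threads mostly adds contention rather than throughput.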

Well, the real program (clamd) that this testprogram tries to simulate does an mmap for almost every file, and I have lots of small files.
6.5G, 114122 files, average size 57k.

I'll run latencytop again; last time it showed 100ms - 500ms latency for clamd, and it was about mmap. I'll provide you with the exact output.
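
For concreteness, a rough sketch of the per-file pattern described above (not clamd's actual code; scan_buffer() is only a placeholder for the real signature matcher): every file gets an open/mmap/scan/munmap cycle, i.e. two mmap_sem writer acquisitions per file.

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Placeholder for the real scanning code. */
static void scan_buffer(const unsigned char *buf, size_t len)
{
	(void)buf;
	(void)len;
}

static int scan_file(const char *path)
{
	struct stat st;
	void *map;
	int fd;

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;
	if (fstat(fd, &st) < 0 || st.st_size == 0) {
		close(fd);
		return -1;
	}

	/* mmap() takes mmap_sem for writing ... */
	map = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
	close(fd);
	if (map == MAP_FAILED)
		return -1;

	scan_buffer(map, st.st_size);

	/* ... and munmap() takes it again: with ~114000 files of ~57k
	 * each, that lock traffic adds up once several threads scan
	 * concurrently. */
	munmap(map, st.st_size);
	return 0;
}

int main(int argc, char **argv)
{
	int i;

	for (i = 1; i < argc; i++)
		scan_file(argv[i]);
	return 0;
}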

Guess what happens ;-)
So the problem is that doing mmap() doesn't scale well with multiple threads, because there is contention on mmap_sem?

Indeed.

Why did 2.6.24 seem to work better?

Perhaps the scheduler overhead did increase; can you try:

echo NO_HRTICK > /debug/sched_features

(after mounting debugfs on /debug, or adjusting the path to wherever you
have it mounted)

That might cause some overhead on very high context switch rates.

No difference, and turning off the other sched_features flags doesn't seem to help either.

Best regards,
--Edwin