Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

From: Waiman Long
Date: Fri Aug 30 2013 - 16:15:35 EST

Next message: Al Viro: "Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount"
Previous message: Stephen Warren: "Re: [PATCH 4/4] Documentation: Add device tree bindings for Freescale FTM PWM"
In reply to: Linus Torvalds: "Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount"
Next in thread: Linus Torvalds: "Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount"

On 08/30/2013 03:33 PM, Linus Torvalds wrote:
> On Fri, Aug 30, 2013 at 12:20 PM, Waiman Long <waiman.long@xxxxxx> wrote:
>> Below is the perf data of my short workloads run in an 80-core DL980:
>
> Ok, that doesn't look much like d_lock any more. Sure, there's a small
> amount of spinlocking going on with lockref being involved, but on the
> whole even that looks more like getcwd and other random things.

Yes, d_lock contention isn't a major item in the perf profile. However, sometimes a small reduction in contention can lead to a noticeable improvement in performance.
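
For readers who have not followed the earlier patches, the "lockless update of refcount" idea in the subject line can be roughly sketched in userspace C11 as below. This is only an illustration under stated assumptions, not the code from the patch: the names my_lockref and my_lockref_get_fast, and the layout (lock word in the low 32 bits, count in the high 32 bits of one 64-bit word) are hypothetical. The point is that the count is bumped with a single compare-and-swap, and only while the lock half of the word reads as unlocked.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical layout: low 32 bits = lock word (0 means unlocked),
 * high 32 bits = reference count. Not the kernel's struct lockref. */
struct my_lockref {
    _Atomic uint64_t lock_count;
};

#define MY_LOCK_MASK 0xffffffffULL
#define MY_COUNT_ONE (1ULL << 32)

/* Try to take a reference without acquiring the lock.
 * Returns false if the lock is held, in which case the caller would
 * fall back to the usual lock / increment / unlock slow path.
 * Count overflow is ignored in this sketch. */
bool my_lockref_get_fast(struct my_lockref *ref)
{
    assert(ref);
    uint64_t old = atomic_load(&ref->lock_count);

    while ((old & MY_LOCK_MASK) == 0) {
        /* Swap in count + 1 only if lock and count are both unchanged. */
        if (atomic_compare_exchange_weak(&ref->lock_count, &old,
                                         old + MY_COUNT_ONE))
            return true;
        /* A failed CAS reloaded 'old'; re-check the lock half and retry. */
    }
    return false;
}

/* Read the current reference count. */
uint32_t my_lockref_count(struct my_lockref *ref)
{
    return (uint32_t)(atomic_load(&ref->lock_count) >> 32);
}
```

With many CPUs taking and dropping references on the same object, each fast-path update is one atomic operation on the cache line instead of a lock acquisition plus a release, which is where a reduction in contention would be expected to come from.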

> I do agree that getcwd() can probably be hugely optimized. Nobody has
> ever bothered, because it's never really performance-critical, and I
> think AIM7 ends up just doing something really odd. I bet we could fix
> it entirely if we cared enough.

The prepend_path() time isn't all due to getcwd(). The correct profile should be:

|--12.81%-- prepend_path
|          |
|          |--67.35%-- d_path
|          |          |
|          |          |--60.72%-- proc_pid_readlink
|          |          |          sys_readlinkat
|          |          |          sys_readlink
|          |          |          system_call_fastpath
|          |          |          __GI___readlink
|          |          |          0x302f64662f666c
|          |          |
|          |           --39.28%-- perf_event_mmap_event
|          |
|           --32.65%-- sys_getcwd
|                     system_call_fastpath
|                     __getcwd

Yes, the perf subsystem itself can contribute a sizeable portion of the spinlock contention. In fact, I have also applied the seqlock patch that I sent a while ago to the test kernel in order to get a more accurate perf profile. The seqlock patch allows concurrent d_path() calls to proceed without one blocking the others. On the 240-core prototype machine, it was not possible to get an accurate perf profile for some workloads because more than 50% of the time was spent in spinlock contention caused by the use of perf itself. In those cases, an accurate perf profile can only be obtained by applying my lockref and seqlock patches. I hope someone will have the time to review my seqlock patch to see what additional changes are needed. I would really like to see it merged in some form in 3.12.
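
To illustrate why a sequence-lock style read side lets several readers run at the same time, here is a minimal userspace C11 sketch of the generic pattern. It is not the seqlock patch itself and may differ from what that patch does; the names seq_pair, seq_write and seq_read are hypothetical, and the two integers merely stand in for whatever state a d_path()-like walk reads. Readers take no lock at all; they only retry if a writer changed the sequence number while they were reading.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical structure; not the kernel's seqlock_t. */
struct seq_pair {
    atomic_uint seq;  /* even: data stable, odd: write in progress */
    atomic_int a;     /* stand-ins for the protected data */
    atomic_int b;
};

/* Writer side. Assumption: writers are already serialized against each
 * other by some other means (the kernel pairs the counter with a lock). */
void seq_write(struct seq_pair *p, int a, int b)
{
    atomic_fetch_add(&p->seq, 1);  /* sequence becomes odd */
    atomic_store(&p->a, a);
    atomic_store(&p->b, b);
    atomic_fetch_add(&p->seq, 1);  /* sequence becomes even again */
}

/* Reader side: never blocks other readers, retries if a write overlapped.
 * Returns the number of retries that were needed. */
int seq_read(struct seq_pair *p, int *a, int *b)
{
    assert(a && b);
    int retries = 0;

    for (;;) {
        unsigned start = atomic_load(&p->seq);
        if (!(start & 1u)) {
            *a = atomic_load(&p->a);
            *b = atomic_load(&p->b);
            /* Unchanged sequence means no writer ran in between. */
            if (atomic_load(&p->seq) == start)
                return retries;
        }
        retries++;
    }
}
```

Because readers never write to the shared structure in this pattern, any number of them can run concurrently; the cost is that a reader may have to repeat its work when a write overlaps with it.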

> I just wonder if it's even worth it (I assume AIM7 is something HP
> uses internally, because I've never really heard of anybody else
> caring)

Our performance group is actually pretty new. It was formed 2 years ago, and we began actively participating in Linux kernel development only in the past year.

We use the AIM7 benchmark internally primarily because it is easy to run and covers quite a lot of different areas of the kernel. We also use specJBB and SwingBench for performance benchmarking, and we are looking for more benchmarks to use in the future.

Regards,
Longman
