RE: >10% performance degradation since 2.6.18

From: Ma, Chinang
Date: Mon Jul 06 2009 - 17:17:06 EST


Trying to answer Andi's questions below

>-----Original Message-----
>From: Andi Kleen [mailto:andi@xxxxxxxxxxxxxx]
>Sent: Friday, July 03, 2009 11:19 AM
>To: Matthew Wilcox
>Cc: linux-kernel@xxxxxxxxxxxxxxx; Styner, Douglas W; Ma, Chinang; Prickett,
>Terry O; Wilcox, Matthew R; Eric.Moore@xxxxxxx; DL-MPTFusionLinux@xxxxxxx
>Subject: Re: >10% performance degradation since 2.6.18
>
>
>Matthew Wilcox <matthew@xxxxxx> writes:
>>
>> ======oprofile CPU_CLK_UNHALTED for top 30 functions
>> Cycles% 2.6.18-92.el5-op Cycles% 2.6.30
>> 70.1409 <database> 67.0207 <database>
>> 1.3556 mpt_interrupt 1.7029 mpt_interrupt
>
>It's strange that mpt_interrupt is that much more costly in 2.6.30
>than in 2.6.18. I diffed 2.6.30's drivers/message/fusion/mptbase.c
>against the RHEL 5.3 one and they seem to be about the same.
>
>So why does it cost 0.5% more in 2.6.30?
>
>[adding MPT maintainers]
>
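One way to narrow this down would be to time mpt_interrupt directly on both kernels, to see whether the handler itself got slower or it is simply being called more often. An untested SystemTap sketch (it assumes mpt_interrupt is not inlined, and pairs entry/return per CPU, which should be close enough for a rough average):

global mpt_start, mpt_ns, mpt_calls

probe kernel.function("mpt_interrupt") {
	# remember the entry time on this CPU
	mpt_start[cpu()] = gettimeofday_ns()
}

probe kernel.function("mpt_interrupt").return {
	if (mpt_start[cpu()]) {
		# accumulate per-invocation cost
		mpt_ns <<< gettimeofday_ns() - mpt_start[cpu()]
		mpt_calls++
		delete mpt_start[cpu()]
	}
}

probe end {
	if (mpt_calls)
		printf("mpt_interrupt: %d calls, avg %d ns, max %d ns\n",
		       mpt_calls, @avg(mpt_ns), @max(mpt_ns))
}
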
>> 1.1622 __blockdev_direct_IO 1.1443 kmem_cache_alloc
>
>It would be interesting to find out why kmem_cache_alloc
>is that much more expensive. Either it is called more often or the allocator
>is slower. Any chance of a callgraph profile run so we
>can see the callers?
>

We will try collecting the oprofile callgraph data.
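In the meantime, a SystemTap aggregation of the hottest kmem_cache_alloc call chains might give a first answer. Untested sketch, along the lines of the counting script at the end of this mail; useful backtraces will also depend on the frame-pointer rebuild, and on a path this hot it probably needs a short run and larger map limits:

global alloc_callers

probe kernel.function("kmem_cache_alloc") {
	# tally one hit per unique kernel call chain
	alloc_callers[backtrace()]++
}

probe end {
	# print the ten most common call chains, most frequent first
	foreach (bt in alloc_callers- limit 10) {
		printf("%d calls from:\n", alloc_callers[bt])
		print_stack(bt)
		printf("\n")
	}
}
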

>(you would need to rebuild the kernel with frame pointers)
>
>An interesting experiment would be to simply count them on both kernels, e.g. with a
>simple systemtap script (see script at end of mail)
>
>> 0.8246 kmem_cache_free 0.8801 kmem_cache_free
>> 0.7108 schedule 0.7774 __blockdev_direct_IO
>> 0.6733 scsi_request_fn 0.7031 scsi_request_fn
>> 0.6114 kmem_cache_alloc 0.5317 __schedule
>> 0.4207 follow_hugetlb_page 0.3922 task_rq_lock
>> 0.4062 list_del 0.3629 sd_prep_fn
>> 0.3400 __switch_to 0.3504 list_del
>
>A callgraph for the list_del/list_dels would be interesting.
>Presumably that's cache misses, maybe we can do something
>here with some strategic prefetches or more batching.
>
>> 0.3339 generic_make_request 0.3382 __sigsetjmp
>
>__sigsetjmp is in glibc, isn't it?

Yes, __sigsetjmp is from glibc.

>
>> 0.3204 memmove 0.3270 __switch_to
>> 0.3088 __sigsetjmp 0.3257 generic_make_request
>> 0.2848 get_request 0.3116 kfree
>> 0.2804 lock_timer_base 0.2895 memmove
>
>memmove? Is that glibc or the kernel? (it would be good if that
>was marked in the listing)

memmove is from glibc. I will filter out all user-space libraries in future reports.

>
>Back when I wrote the x86-64 string functions I didn't consider memmove
>too performance critical. So the x86-64 kernel memmove is actually
>not very optimized, especially the backwards case is very dumb
>and copies byte by byte.
>
>If it's that there might be some low hanging fruit here,
>on the other hand it's not a lot of CPU time. The systemtap
>script at the end measures this also.
>
>A callgraph for it would be useful if it's kernel.
>
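If it does turn out to be the kernel memmove, a probe could show the size distribution and the callers in one go. Untested sketch; ulong_arg(3) assumes memmove's third argument (the length), using the same register-argument convention as the script at the end of this mail, and the backtraces again need frame pointers or unwind data:

global mm_len, mm_callers, mm_calls

probe kernel.function("memmove") {
	mm_calls++
	# log2 histogram of copy sizes, plus a tally per call chain
	mm_len <<< ulong_arg(3)
	mm_callers[backtrace()]++
}

probe end {
	if (mm_calls) {
		print(@hist_log(mm_len))
		foreach (bt in mm_callers- limit 5) {
			printf("%d memmoves from:\n", mm_callers[bt])
			print_stack(bt)
			printf("\n")
		}
	}
}
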
>> 0.2572 __end_that_request_first 0.2402 aio_complete
>> 0.2567 fget_light 0.2382 mptscsih_qcmd
>> 0.2531 submit_page_section 0.2342 fget
>> 0.2428 mempool_alloc 0.2277 gup_huge_pmd
>> 0.2428 __generic_file_aio_read 0.2264 submit_page_section
>> 0.2368 touch_atime 0.2204 touch_atime
>
>Did you not run with noatime? If it runs with noatime and touch_atime
>still takes that many cycles something must be wrong.
>

Did not try the noatime option yet. Will try it in the next run.

>Ah it might be the lock in mnt_want_write(). How about the
>appended dumb (untested) patch to first check for all the noatime
>cases before taking the lock? Even without noatime, but standard
>relatime that will hopefully eliminate the locking cost
>in most cases.
>
>> 0.2270 __aio_get_req 0.2165 __list_add
>> 0.2223 mptscsih_qcmd 0.2063 scsi_dispatch_cmd
>> 0.2198 init_request_from_bio 0.2040 lock_timer_base
>> 0.2191 fget 0.2036 irq_entries_start
>> 0.2141 device_not_available 0.2036 plist_del
>
>That plist_del is new. calltrace?

Not sure who uses this. It is part of vmlinux.
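A small SystemTap probe could grab the first few call traces so we can see who that is (untested sketch; it needs frame pointers or unwind data for readable stacks, and the same probe pointed at list_del would answer the earlier list_del question too). Given that pick_next_highest_task_rt also shows up in the 2.6.30 profile, the RT scheduler's pushable-tasks plist is one plausible source, but a trace would confirm it.

global hits

probe kernel.function("plist_del") {
	# dump the first few call traces, then stop
	if (hits < 5) {
		hits++
		printf("=== plist_del hit %d ===\n", hits)
		print_backtrace()
	} else {
		exit()
	}
}
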

>
>> 0.2125 try_to_wake_up 0.2004 elv_queue_empty
>> 0.2065 mptscsih_io_done 0.2004 get_user_pages_fast
>> 0.2059 math_state_restore 0.1997 copy_user_generic_string
>> 0.2035 __errno_location 0.1925 kref_get
>> 0.2022 find_vma 0.1905 scsi_finish_command
>> 0.1967 _setjmp 0.1888 aio_rw_vect_retry
>> 0.1966 kref_get 0.1882 __errno_location
>
>That's not kernel, but it's weird that __errno_location is that
>costly. AFAIK it should be a simple thread local data lookup
>
>> 0.1952 memset 0.1849 scsi_device_unbusy
>> 0.1924 __list_add 0.1846 pick_next_highest_task_rt
>> 0.1917 copy_user_generic 0.1826 memset_c
>> 0.1907 acpi_os_read_port 0.1819 _setjmp
>> 0.1842 elv_queue_empty 0.1816 ipc_lock
>
>So it looks like ipc_lock needs some more scalability work.
>
ipc_lock is not too high in cycles in this case. We can reduce the semaphore array size to try to lower the contention.

>> 0.1809 scsi_dispatch_cmd 0.1809 mod_timer
>> 0.1808 sd_init_command 0.1800 noop_queue_empty
>> 0.1789 swiotlb_unmap_sg 0.1796 scsi_softirq_done
>> 0.1766 rw_verify_area 0.1757 scsi_run_queue
>> -----------------------------------
>
>---
>
>Systemtap script to measure some interesting metrics. Run
>with stap script.stp and then ctrl-c to get data out.
>
>---

Will try to collect this data.

>
>global kmem_cache_alloc
>global kmem_cache_free
>
>probe kernel.function("kmem_cache_alloc") {
> kmem_cache_alloc++
>
>}
>
>probe kernel.function("kmem_cache_free") {
> kmem_cache_free++
>}
>
>global memmove_total, memmove_backwards
>
>probe kernel.function("memmove") {
> memmove_total++
> if (ulong_arg(1) < ulong_arg(2)) {
> memmove_backwards++
> }
>}
>
>probe end {
> printf("%u kmem_cache_allocs\n", kmem_cache_alloc)
> printf("%u kmem_cache_frees\n", kmem_cache_free)
> printf("%u memmoves, %u backwards\n", memmove_total,
>memmove_backwards);
>}
>
>
>---
>
>Only do mnt_want_write after all other options to avoid atime update are
>exhausted
>
>mnt_want_write takes a lock and can be fairly expensive. So do it as late
>as possible.
>
>Signed-off-by: Andi Kleen <ak@xxxxxxxxxxxxxxx>
>
>---
> fs/inode.c | 20 ++++++++++----------
> 1 file changed, 10 insertions(+), 10 deletions(-)
>
>Index: linux-2.6.30-ak/fs/inode.c
>===================================================================
>--- linux-2.6.30-ak.orig/fs/inode.c
>+++ linux-2.6.30-ak/fs/inode.c
>@@ -1361,31 +1361,31 @@ void touch_atime(struct vfsmount *mnt, s
> 	struct inode *inode = dentry->d_inode;
> 	struct timespec now;
> 
>-	if (mnt_want_write(mnt))
>-		return;
> 	if (inode->i_flags & S_NOATIME)
>-		goto out;
>+		return;
> 	if (IS_NOATIME(inode))
>-		goto out;
>+		return;
> 	if ((inode->i_sb->s_flags & MS_NODIRATIME) && S_ISDIR(inode->i_mode))
>-		goto out;
>+		return;
> 
> 	if (mnt->mnt_flags & MNT_NOATIME)
>-		goto out;
>+		return;
> 	if ((mnt->mnt_flags & MNT_NODIRATIME) && S_ISDIR(inode->i_mode))
>-		goto out;
>+		return;
> 
> 	now = current_fs_time(inode->i_sb);
> 
> 	if (!relatime_need_update(mnt, inode, now))
>-		goto out;
>+		return;
> 
> 	if (timespec_equal(&inode->i_atime, &now))
>-		goto out;
>+		return;
>+
>+	if (mnt_want_write(mnt))
>+		return;
> 
> 	inode->i_atime = now;
> 	mark_inode_dirty_sync(inode);
>-out:
> 	mnt_drop_write(mnt);
> }
> EXPORT_SYMBOL(touch_atime);
>
>
>--
>ak@xxxxxxxxxxxxxxx -- Speaking for myself only.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/