Re: Excessive System time in SMP kernel.

Riley Williams (rhw@MemAlpha.CX)
Wed, 28 Jul 1999 12:40:27 +0100 (GMT)


Hi there.

> I am sending you this after the following correspondence:

>>> We have a problem on a 2-processor Pentium system. The system
>>> has 512 MB of memory. It is used for large FEM computations. In
>>> general it performs well. There is one situation where it
>>> degrades fast: if we run our test suite, which consists of about
>>> 1500 small jobs, then after say 150 jobs, the system is mainly
>>> running in system mode (orange in xosview). This can be solved
>>> by running a small memory-allocation (and initialization) job,
>>> which seems to 'flush' memory. But after the next 150 jobs, the
>>> same thing happens. We are running Red Hat 6.0 with kernel
>>> 2.2.5. Later kernels behave the same. What are we missing? I
>>> remember something about the update or kflushd processes...

>> You can check where in the kernel the excessive time is being
>> spent by booting with profile=2 on the lilo command line and
>> using the /sbin/readprofile tool to read the profiling
>> statistics. Clear the counters before your run with
>> echo > /proc/profile; afterwards run readprofile (make sure it
>> finds the correct System.map for your kernel) and send the
>> result to linux-kernel@vger.rutgers.edu.

>> One possibility is that your workload is harmed by the kernel
>> lock; if that is the case, "stext_lock" should show prominently
>> in the profile.
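
For anyone wanting to repeat that exercise, the above boils down to
something like this (assuming the kernel was booted with profile=2
and that /boot/System.map matches the running kernel - adjust the
path if yours lives elsewhere):

    echo > /proc/profile                        # clear the sample counters
    # ...run the workload that shows the problem...
    readprofile -m /boot/System.map | sort -n   # ticks per symbol, busiest last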

> readprofile produces the following results:

===8<=== Cut and fed through sort, then reinserted ===>8==

> 1 tcp_do_sendmsg 0.0006
> 2 load_elf_binary 0.0006
> 1 sys_select 0.0007
> 1 ip_queue_xmit 0.0010
> 1 kmem_cache_grow 0.0010
> 1 rw_swap_page_base 0.0011
> 2 read_chan 0.0011
> 1 tcp_ack 0.0012
> 1 tcp_v4_rcv 0.0012
> 2 tcp_rcv_established 0.0013
> 1 add_request 0.0015
> 1 bdflush 0.0015
> 1 sys_dup2 0.0015
> 1 ip_local_deliver 0.0016
> 1 kmem_cache_reap 0.0017
> 1 udp_rcv 0.0017
> 1 try_to_swap_out 0.0020
> 2 ext2_read_inode 0.0020
> 1 array_read 0.0021
> 1 __scm_send 0.0021
> 2 sys_wait4 0.0021
> 2 ext2_add_entry 0.0024
> 1 opost_block 0.0025
> 1 put_dirty_page 0.0025
> 1 tcp_write_xmit 0.0025
> 2 ext2_free_blocks 0.0025
> 1 sys_mprotect 0.0026
> 1 sys_sigprocmask 0.0027
> 1 trunc_dindirect 0.0027
> 2 ip_rcv 0.0027
> 1 avl_rebalance 0.0028
> 1 tcp_v4_do_rcv 0.0028
> 1 change_protection 0.0030
> 3 tcp_transmit_skb 0.0030
> 1 read_exec 0.0032
> 1 setup_sigcontext 0.0032
> 1 __wait_on_page 0.0032
> 1 ip_route_input 0.0034
> 1 ip_create 0.0035
> 1 sync_inodes 0.0035
> 3 ext2_update_inode 0.0035
> 1 tcp_data 0.0036
> 5 ip_fw_check 0.0036
> 5 tcp_recvmsg 0.0037
> 1 proc_get_inode 0.0038
> 1 setup_arg_pages 0.0038
> 1 tcp_poll 0.0039
> 2 load__block_bitmap 0.0041
> 1 alloc_skb 0.0042
> 2 flush_old_exec 0.0043
> 2 sys_rt_sigprocmask 0.0043
> 1 sys_unlink 0.0045
> 1 open_dentry 0.0046
> 1 copy_thread 0.0047
> 1 tcp_ioctl 0.0047
> 1 __wait_on_inode 0.0047
> 1 dev_queue_xmit 0.0050
> 4 create_elf_tables 0.0050
> 1 __pollwait 0.0051
> 3 ext2_bmap 0.0051
> 3 iput 0.0052
> 1 max_select_fd 0.0053
> 1 sys_sysinfo 0.0053
> 1 ext2_check_dir_entry 0.0054
> 1 lookup_swap_cache 0.0054
> 1 netif_rx 0.0054
> 2 get_root_array 0.0054
> 4 do_exit 0.0055
> 1 cleanup_rbuf 0.0056
> 1 get_empty_inode 0.0056
> 1 sock_recvmsg 0.0056
> 1 eth_type_trans 0.0057
> 1 __tcp_select_window 0.0057
> 3 ll_rw_block 0.0057
> 1 sock_sendmsg 0.0058
> 1 interruptible_sleep_on 0.0060
> 1 tcp_send_ack 0.0060
> 2 __global_cli 0.0060
> 4 copy_files 0.0060
> 2 net_timer 0.0062
> 2 get_pid 0.0063
> 1 csum_partial 0.0064
> 1 sock_write 0.0064
> 1 write_profile 0.0064
> 1 file_send_actor 0.0066
> 1 skb_clone 0.0066
> 2 get_new_inode 0.0067
> 1 eth_copy_and_sum 0.0069
> 1 inet_sendmsg 0.0069
> 1 sys_getcwd 0.0071
> 1 sys_newlstat 0.0071
> 2 unmap_fixup 0.0071
> 2 search_binary_handler 0.0072
> 3 inet_ioctl 0.0073
> 1 free_pgtables 0.0074
> 1 get_pte_slow 0.0074
> 1 ext2_get_group_desc 0.0076
> 17 do_fork 0.0077
> 4 net_bh 0.0080
> 1 do_gettimeofday 0.0081
> 1 sys_execve 0.0081
> 1 iget 0.0083
> 12 make_request 0.0083
> 3 do_sigaction 0.0085
> 1 clear_inode 0.0089
> 1 free_uid 0.0093
> 1 get_unmapped_area 0.0093
> 1 qdisc_restart 0.0093
> 2 sys_rt_sigaction 0.0093
> 1 memcpy_toiovec 0.0096
> 2 ide_wait_stat 0.0098
> 1 tcp_v4_send_check 0.0100
> 3 try_to_read_ahead 0.0103
> 7 do_rw_disk 0.0104
> 2 ip_output 0.0106
> 9 open_namei 0.0106
> 27 ext2_new_block 0.0107
> 5 kfree 0.0108
> 4 get_irq_list 0.0109
> 16 ide_do_request 0.0113
> 1 del_timer 0.0114
> 1 tcp_write_space 0.0119
> 3 filp_open 0.0125
> 3 do_check_pgt_cache 0.0132
> 1 sock_wmalloc 0.0139
> 2 locks_remove_posix 0.0139
> 1 d_rehash 0.0147
> 2 try_to_free_buffers 0.0152
> 4 exit_mmap 0.0152
> 2 ext2_file_lseek 0.0156
> 9 do_munmap 0.0163
> 16 do_mmap 0.0165
> 1 device_not_available 0.0167
> 1 d_alloc_root 0.0167
> 1 is_read_only 0.0167
> 10 exit_notify 0.0170
> 5 cp_new_stat 0.0171
> 5 sys_access 0.0171
> 2 init_private_file 0.0172
> 7 kmalloc 0.0173
> 1 schedule_tail 0.0179
> 3 __kfree_skb 0.0179
> 3 get_unused_buffer_head 0.0187
> 1 mmput 0.0192
> 1 udp_check 0.0192
> 8 copy_strings 0.0192
> 6 get_empty_filp 0.0197
> 7 get_unused_fd 0.0199
> 2 ipfw_output_check 0.0200
> 2 __mark_inode_dirty 0.0208
> 7 remove_from_queues 0.0213
> 2 do_hwgroup_request 0.0217
> 2 filp_close 0.0217
> 2 sock_wake_async 0.0217
> 10 sync_old_buffers 0.0221
> 1 end_buffer_io_sync 0.0227
> 1 math_state_restore 0.0227
> 1 sock_def_readable 0.0227
> 2 unplug_device 0.0227
> 6 old_mmap 0.0227
> 16 trunc_indirect 0.0231
> 2 make_pages_present 0.0238
> 11 inode_getblk 0.0239
> 4 startup_32 0.0244
> 9 merge_segments 0.0247
> 1 insert_file_free 0.0250
> 2 pfifo_fast_dequeue 0.0250
> 2 cached_lookup 0.0263
> 2 put_unused_buffer_head 0.0263
> 8 shrink_mmap 0.0263
> 2 ide_set_handler 0.0278
> 10 insert_vm_struct 0.0284
> 4 sys_close 0.0294
> 5 find_vma_prev 0.0298
> 3 sys_munmap 0.0300
> 9 sys_brk 0.0304
> 3 free_page_and_swap_cache 0.0312
> 3 ipfw_input_check 0.0312
> 4 vm_enough_memory 0.0312
> 5 wake_up_process 0.0312
> 7 proc_lookup 0.0312
> 6 __switch_to 0.0319
> 11 d_alloc 0.0327
> 22 get_kstat 0.0327
> 25 ext2_find_entry 0.0327
> 7 ext2_permission 0.0330
> 2 flush_signal_handlers 0.0333
> 5 generic_readpage 0.0347
> 1 ext2_release_file 0.0357
> 1 release_segments 0.0357
> 1 sock_def_wakeup 0.0357
> 2 __const_udelay 0.0357
> 10 sys_llseek 0.0362
> 37 timer_bh 0.0369
> 48 ext2_file_write 0.0373
> 20 ext2_getblk 0.0382
> 25 block_getblk 0.0383
> 16 kmem_cache_free 0.0392
> 13 dput 0.0396
> 5 generic_file_read 0.0403
> 3 clear_user 0.0417
> 3 reserved_lookup 0.0417
> 6 __bforget 0.0417
> 21 number 0.0423
> 13 sys_write 0.0445
> 39 vsprintf 0.0445
> 5 smp_flush_tlb 0.0446
> 7 sys_newfstat 0.0449
> 2 d_instantiate 0.0455
> 4 __namei 0.0455
> 74 do_generic_file_read 0.0481
> 14 insert_into_queues 0.0493
> 1 __get_user_1 0.0500
> 3 in_group_p 0.0500
> 5 update_atime 0.0500
> 16 proc_root_lookup 0.0533
> 3 find_inode 0.0536
> 8 getname 0.0541
> 2 ext2_open_file 0.0556
> 4 __fput 0.0556
> 45 do_page_fault 0.0565
> 31 getblk 0.0566
> 4 __global_restore_flags 0.0588
> 14 sys_read 0.0603
> 31 copy_page_range 0.0605
> 1 get_fast_time 0.0625
> 2 ret_from_intr 0.0625
> 5 __wake_up 0.0625
> 62 schedule 0.0651
> 4 init_buffer 0.0667
> 12 sys_open 0.0698
> 2 sock_wfree 0.0714
> 36 lookup_dentry 0.0726
> 3 put_page 0.0750
> 19 permission 0.0766
> 36 handle_mm_fault 0.0783
> 2 __get_user_4 0.0833
> 6 fput 0.0833
> 29 kmem_cache_alloc 0.0833
> 10 real_lookup 0.0862
> 6 __generic_copy_to_user 0.0938
> 2 is_bad_inode 0.1000
> 14 do_follow_link 0.1129
> 23 clear_page_tables 0.1173
> 8 set_writetime 0.1250
> 29 do_no_page 0.1272
> 18 sys_newstat 0.1286
> 24 refile_buffer 0.1304
> 61 zap_page_range 0.1374
> 44 free_pages 0.1618
> 164 brw_page 0.1798
> 16 strncpy_from_user 0.2000
> 4 ret_from_sys_call 0.2353
> 32 handle_IRQ_event 0.2353
> 135 do_wp_page 0.2446
> 9 get_hardblocksize 0.2500
> 21 __udelay 0.2500
> 215 filemap_nopage 0.2687
> 95 ext2_alloc_block 0.2861
> 31 find_vma 0.3100
> 5 ret_with_reschedule 0.3125
> 151 __get_free_pages 0.3199
> 18 __brelse 0.3462
> 3 bad_get_user 0.3750
> 3 update_process_times 0.3750
> 21 error_code 0.4038
> 27 system_call 0.4821
> 18 get_hash_table 0.5000
> 118 __free_page 0.5086
> 54 find_buffer 0.7105
> 11 restore_all 0.7333
> 64 mark_buffer_uptodate 0.7619
> 285 update_vm_cache 0.9375
> 169 do_bottom_half 1.0562
> 48 strlen_user 1.0909
> 449 si_swapinfo 2.8062
> 792 si_meminfo 6.1875
> 647 file_read_actor 8.0875
> 110 page_fault 9.1667
> 1178 do_anonymous_page 9.2031
> 694 __generic_copy_from_user 10.8438
> 16751 d_lookup 80.5337

> 23934 total 0.0276

===8<=== Cut and fed through sort, then reinserted ===>8==

> It seems to me that a lot of time is spent in d_lookup....

It certainly appears to be the case - and __generic_copy_from_user
looks rather expensive as well. Also, sorting just by the first
column (the raw tick count), here's the tail end of the list...

> 169 do_bottom_half 1.0562
> 215 filemap_nopage 0.2687
> 285 update_vm_cache 0.9375
> 449 si_swapinfo 2.8062
> 647 file_read_actor 8.0875
> 694 __generic_copy_from_user 10.8438
> 792 si_meminfo 6.1875
> 1178 do_anonymous_page 9.2031
> 16751 d_lookup 80.5337
> 23934 total 0.0276

Six of the seven entries with a third-column value (ticks normalised
by the function's size) greater than 2.0 are in that list as well
(the exception is page_fault), so the short list probably covers all
the items of interest...

I'm no expert on what to do about these figures, but the experts who
are can almost certainly be found here...
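
One simple check that might be worth making, though, is to watch the
dentry cache itself while the test suite runs: if d_lookup really is
where the time goes, the number of cached dentries should climb
steadily as the 150-odd jobs create and stat their files. Something
along these lines should do, assuming /proc/slabinfo is available on
the kernel in question (the exact column layout varies):

    # first numeric column is the count of active dentry objects
    while true; do grep dentry /proc/slabinfo; sleep 60; done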

> xosview tells me that I use 496M, of which more than 300M is
> USED+SHAR; the rest is evenly divided between BUFF and CACHE.

> The system has 512M. My impression is that my previous system
> with only 128M ran better (in the case of running the 1500
> small test jobs!). It looks like there is so much kept in
> memory that searching it for something useful takes a lot of
> time.

> I want to express my gratitude to anyone who spends time on
> this. Also, for Linux we must say: let's make things better!

Best wishes from Riley.

+----------------------------------------------------------------------+
| There is something frustrating about the quality and speed of Linux |
| development, ie., the quality is too high and the speed is too high, |
| in other words, I can implement this XXXX feature, but I bet someone |
| else has already done so and is just about to release their patch. |
+----------------------------------------------------------------------+
* ftp://ftp.MemAlpha.cx/pub/rhw/Linux
* http://www.MemAlpha.cx/kernel.versions.html

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/