Re: [PATCHv 2] tcp: properly initialize tcp memory limits part 2(fix nfs regression)

From: Sergei Trofimovich
Date: Fri Mar 02 2012 - 12:45:32 EST


> > > The change looks like a typo (division flipped to multiplication):
> > >> limit = nr_free_buffer_pages() / 8;
> > >> limit = nr_free_buffer_pages() << (PAGE_SHIFT - 10);
> >
> > Hi, thanks for the report. It's not a typo. It was previously:
> > sysctl_tcp_mem[1] << (PAGE_SHIFT - 7). Looks like we need to do the
> > limit check before shifting the value. Please try the following patch, thanks.
>
> Still does not help. I tested it by checking the sha1sum of a large file over NFS
> (small files seem to work sometimes):
>
> $ strace sha1sum /gentoo/distfiles/gcc-4.6.2.tar.bz2
> ...
> open("/gentoo/distfiles/gcc-4.6.2.tar.bz2", O_RDONLY
> <HUNG>
> After a certain timeout dmesg gets odd spam:
> [ 314.848094] nfs: server vmhost not responding, still trying
> [ 314.848134] nfs: server vmhost not responding, still trying
> [ 314.848145] nfs: server vmhost not responding, still trying
> [ 314.957047] nfs: server vmhost not responding, still trying
> [ 314.957066] nfs: server vmhost not responding, still trying
> [ 314.957075] nfs: server vmhost not responding, still trying
> [ 314.957085] nfs: server vmhost not responding, still trying
> [ 314.957100] nfs: server vmhost not responding, still trying
> [ 314.958023] nfs: server vmhost not responding, still trying
> [ 314.958035] nfs: server vmhost not responding, still trying
> [ 314.958044] nfs: server vmhost not responding, still trying
> [ 314.958054] nfs: server vmhost not responding, still trying
>
> These look like bogus messages. They might be related to mishandled timing
> somewhere else or to a bug in the NFS code.

And after 120 seconds the hung-task report shows it might be an OOM issue.
It is likely caused by the patch, as this is a 2GB RAM + 4GB swap amd64 box
not running anything heavy:

[ 720.798052] INFO: task sha1sum:3811 blocked for more than 120 seconds.
[ 720.798056] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 720.798059] sha1sum D ffff88007bd11d40 0 3811 1 0x00000005
[ 720.798065] ffff880073de9c08 0000000000000082 ffff880073de9af8 ffff880073de9fd8
[ 720.798070] ffff880070db1620 ffff880073de9fd8 ffff880073de8000 0000000000004000
[ 720.798075] ffff880073de8000 ffff880073de9fd8 ffff8800790e0000 ffff880070db1620
[ 720.798079] Call Trace:
[ 720.798089] [<ffffffff810fdd53>] ? kfree+0x123/0x150
[ 720.798094] [<ffffffff8123227d>] ? nfs_access_free_entry+0x1d/0x30
[ 720.798097] [<ffffffff810fdd53>] ? kfree+0x123/0x150
[ 720.798101] [<ffffffff8123227d>] ? nfs_access_free_entry+0x1d/0x30
[ 720.798104] [<ffffffff81233cb8>] ? nfs_do_access+0x3a8/0x3d0
[ 720.798109] [<ffffffff8166525a>] schedule+0x3a/0x50
[ 720.798112] [<ffffffff8166390e>] __mutex_lock_slowpath+0xee/0x190
[ 720.798117] [<ffffffff81639228>] ? put_rpccred+0x48/0x130
[ 720.798120] [<ffffffff8166374e>] mutex_lock+0x1e/0x40
[ 720.798125] [<ffffffff81114927>] do_lookup+0x277/0x3a0
[ 720.798128] [<ffffffff811162b8>] do_last.clone.39+0x148/0x7e0
[ 720.798132] [<ffffffff81116a61>] path_openat+0xd1/0x3e0
[ 720.798136] [<ffffffff810604d1>] ? get_parent_ip+0x11/0x50
[ 720.798140] [<ffffffff81060675>] ? add_preempt_count+0x95/0xd0
[ 720.798144] [<ffffffff81666677>] ? _raw_spin_lock_irq+0x17/0x40
[ 720.798147] [<ffffffff81116e84>] do_filp_open+0x44/0xa0
[ 720.798151] [<ffffffff810605a5>] ? sub_preempt_count+0x95/0xd0
[ 720.798154] [<ffffffff81666371>] ? _raw_spin_unlock+0x11/0x40
[ 720.798158] [<ffffffff81123014>] ? alloc_fd+0xe4/0x130
[ 720.798163] [<ffffffff81106f7d>] do_sys_open+0xfd/0x1e0
[ 720.798169] [<ffffffff8100f290>] ? syscall_trace_enter+0xf0/0x1a0
[ 720.798172] [<ffffffff8110707c>] sys_open+0x1c/0x20
[ 720.798176] [<ffffffff81667219>] tracesys+0xd0/0xd5
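
For reference, a rough userspace sketch of why the two quoted expressions
are not interchangeable. It assumes 4KB pages (PAGE_SHIFT = 12) and uses
total RAM as a stand-in for nr_free_buffer_pages(), whose real value on this
box I have not measured. pages_in_mb() is a made-up helper for illustration,
not a kernel function:

```c
#include <stdint.h>

/* Assumption: 4 KB pages, as on typical amd64 configurations. */
#define PAGE_SHIFT 12

/* Hypothetical helper: number of pages in the given amount of megabytes. */
static uint64_t pages_in_mb(uint64_t mb)
{
    return (mb << 20) >> PAGE_SHIFT;
}

/* First quoted expression: one eighth of the pages, result still in pages. */
static uint64_t limit_div8_pages(uint64_t free_pages)
{
    return free_pages / 8;
}

/* Second quoted expression: the same page count converted to kilobytes. */
static uint64_t limit_shift_kb(uint64_t free_pages)
{
    return free_pages << (PAGE_SHIFT - 10);
}
```

With 2048 MB as input, pages_in_mb() gives 524288 pages, limit_div8_pages()
gives 65536 pages (256 MB), and limit_shift_kb() gives 2097152, which is the
whole 2 GB expressed in KB. So the two lines differ in both unit and
magnitude, which fits the "not a typo" answer, but says nothing yet about
where the hang comes from.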

--

Sergei
