Trond Myklebust wrote:
> On Wed, 2008-10-22 at 15:55 -0700, Harry Edmon wrote:
> > Trond Myklebust wrote:
> > > On Wed, 2008-10-22 at 08:35 -0700, Harry Edmon wrote:
> > > > I have a dual quad-core Xeon system running software (http://www.unidata.ucar.edu/software/ldm) that relays and processes weather data through RPC calls, keeping a queue of data in a memory-mapped file. Up until 2.6.26 the system has run just fine (for example 18.104.22.168). But starting with 2.6.26 through 22.214.171.124 the system runs into a problem after approximately 24 hours. The symptom is that the processing slows down to a crawl. Using "top" I can see that the System time is up over 90%, with almost no User and Wait time. If I stop and restart the software, most of the time it gets better, but sometimes it takes a reboot to fix the problem. I have an identical system that does just processing and ingesting data from remote systems, and it does not have this problem. I have tried a number of different kernel configurations, but they all show the same problem.
> > > >
> > > > I suspect a problem with SUNRPC. I notice that there were a large number of SUNRPC patches in 2.6.26. I am looking for suggestions on how to pin down which patches are causing the problem. Are there ways to figure out where in the kernel the time is being spent? I am willing to work on isolating the problem, but I need some suggestions on the best way to do it given the large number of SUNRPC patches in 2.6.26 and the fact that each experiment takes a day.
> > >
> > > The kernel sunrpc interface is not exported to user land: the glibc code
> > > uses its own, entirely separate implementation of sunrpc.
> > > I cannot therefore see how your application's RPC calls can be affected
> > > by kernel sunrpc changes.
> >
> > Then how do you explain the large system time used with 2.6.26 and beyond? Is it some other patch I should be looking at?
>
> I'm not explaining it. I'm saying that nothing outside the kernel NFS
> and NLM code uses the kernel sunrpc implementation. Your userland RPC
> calls are using glibc's implementation of sunrpc. Those are unaffected
> by patches to the kernel sunrpc layer.
>
> If you are seeing a hang, then I suggest you start by using the strace
> utility to figure out which system call is actually involved.

The problem is that it is not hanging. The processes are running through a lot of system calls. It is just that the system time jumps up to over 95% on all 8 processors with 2.6.26 and beyond. I never see that with 220.127.116.11. I will try looking again and see if there are certain calls that are taking a lot of time.
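
One possible way to see which system calls account for the time is strace's summary mode, for example `strace -c -f -p <pid>` attached to one of the busy processes for a short while and then interrupted. The Python sketch below only ranks the rows of such a summary table by the time attributed to each call. The names `parse_strace_summary`, `top_syscalls` and `SAMPLE` are made up for this illustration, and the numbers in `SAMPLE` are synthetic, not measurements from the system discussed in this thread.

```python
# Sketch: rank system calls from an `strace -c` summary table.
# Assumption: the input has the usual column layout
#   % time, seconds, usecs/call, calls, errors (may be blank), syscall

# Synthetic example input, NOT real measurements.
SAMPLE = """\
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 60.00    0.000900          10        90           read
 30.00    0.000450           9        50         4 futex
 10.00    0.000150           7        20           write
------ ----------- ----------- --------- --------- ----------------
100.00    0.001500                   160         4 total
"""


def parse_strace_summary(text):
    """Return one dict per syscall row; header, rule and total lines are skipped."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 5:
            continue
        try:
            pct = float(fields[0])
            seconds = float(fields[1])
            calls = int(fields[3])
        except ValueError:
            # Header and separator lines do not start with numbers.
            continue
        name = fields[-1]
        if name == "total":
            continue
        rows.append(
            {"syscall": name, "pct_time": pct, "seconds": seconds, "calls": calls}
        )
    return rows


def top_syscalls(text, n=3):
    """Return the n rows with the most time attributed to them, largest first."""
    rows = parse_strace_summary(text)
    return sorted(rows, key=lambda r: r["seconds"], reverse=True)[:n]


if __name__ == "__main__":
    for row in top_syscalls(SAMPLE):
        print(row["syscall"], row["seconds"], row["calls"])
```

If one call dominates the ranking while the system time is high, that call is a reasonable place to start looking. Comparing the ranking against a run on the older kernel may show whether the mix of calls changed or only their cost.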