Re: [2.6.14-rc1/sparc64]: BUG: soft lockup detected on CPU#0!

From: Tomasz Kłoczko
Date: Fri Sep 16 2005 - 07:43:31 EST


On Thu, 15 Sep 2005, David S. Miller wrote:

From: Tomasz Kłoczko <kloczek@xxxxxxxxxxxxxxxxxx>
Date: Thu, 15 Sep 2005 19:40:27 +0200 (CEST)

I just caught a series of kernel messages with "soft lockup detected"
reports (attached).
It occurs while storing a large amount of data on an NFS volume.

As the NIC I now use a Sun Swift gigabit eth (cassini driver). This is
probably NFS related, because I just browsed yesterday's logs and similar
reports also appeared with the Sun Happy Meal.

Interesting. Can you reproduce this with SLAB poisoning disabled?
That debugging feature is extremely expensive, although it shouldn't
make the CPU stop scheduling processes for more than 10 seconds.

I wonder if the NFS daemon code needs to have some limits put on
how much cpu it consumes handling requests before it gives up the
cpu. Perhaps, it has such throttling already, I don't know.

But this is not the NFS server case, it is the NFS client. During these lockups I observe rpciod taking 90-99% of the time of a single processor. Load is between 10 and 20.

I'll also try to see if there can be some kind of sparc64 specific
issue which would cause this.

Where did you get that Cassini driver btw? It's not upstream,
although if it exists it should be.

It is available on Sun's pages (copyright by Sun, but it is GPLed):

http://www.sun.com/download/index.jsp?cat=Hardware%20Drivers&tab=3

A few days ago I talked about this driver with spot on the #aurora channel, and it was included while preparing the next kernel package.

But to continue ...

Yesterday, before testing NFS, I tried to load the NIC using only ftp/http/rsync/scp, and everything worked correctly (no lockups reported). At ~23:00 CET, after the reported series of lockups, the system hung completely. Since the restart I see some known messages in dmesg:

[root@boss]# dmesg | sort | uniq -c | sort -n | tail -n 2
87 svc: bad direction 268435456, dropping request <==========
285 hw tcp v4 csum failed

I also have a question about the second one (hw tcp v4 csum failed):

[root@boss]# ifconfig eth0
eth0 Link encap:Ethernet HWaddr 00:03:BA:18:41:F9
inet addr:153.19.33.230 Bcast:153.19.33.255 Mask:255.255.255.0
inet6 addr: fe80::203:baff:fe18:41f9/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:2361199 errors:0 dropped:0 overruns:0 frame:0
TX packets:4060978 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:352428791 (336.1 Mb) TX bytes:4884521951 (4658.2 Mb)
Interrupt:192

As you can see, "errors" is "0" for both RX and TX. Is it correct that broken packets are reported in kernel messages instead of in the RX errors counter?
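
To make the comparison above repeatable, here is a minimal sketch that puts the two numbers side by side: the RX error counter of an interface and the number of matching kernel log lines. It assumes the usual Linux /proc/net/dev layout (interface name, then RX bytes, packets, errs, ...). The function names rx_errs and count_msg are hypothetical helpers made up for this illustration, not existing tools.

```shell
#!/bin/sh
# Hypothetical helper: print the RX "errs" counter for one interface.
# Usage: rx_errs IFACE [FILE]   (FILE defaults to /proc/net/dev)
rx_errs() {
    iface=$1
    file=${2:-/proc/net/dev}
    # Data lines look like "  eth0: bytes packets errs drop ...".
    # Turn the colon into a space so fields split cleanly even when the
    # byte counter is glued to the name ("eth0:352428791").
    awk -v ifc="$iface" '
        { sub(/:/, " ") }
        $1 == ifc { print $4; found = 1 }
        END { exit found ? 0 : 1 }
    ' "$file"
}

# Hypothetical helper: count input lines containing a fixed string.
# Usage: dmesg | count_msg "hw tcp v4 csum failed"
count_msg() {
    grep -c -F -- "$1"
}
```

On the machine above this could be used as `rx_errs eth0` and `dmesg | count_msg "hw tcp v4 csum failed"`. Note that the dmesg ring buffer is finite, so the second number is only a lower bound on how many times the message was logged.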

kloczek
--
-----------------------------------------------------------
*People don't have problems, they just create them for themselves*
-----------------------------------------------------------
Tomasz Kłoczko, sys adm @zie.pg.gda.pl|*e-mail: kloczek@xxxxxxxxxxxxxxxxxx*