Sparc SMP 2.2.x spin_lock - More Data...

Robert Dinse (nanook@eskimo.com)
Mon, 6 Sep 1999 01:31:17 -0700 (PDT)



As many of you know I've been trying to get ANY 2.2.x kernel to run stable
on a multi-CPU Sparc platform. Specifically, I've got two SS-10's with quad
Ross Hypersparc 100Mhz CPU's. One machine has 256MB of RAM and a 2nd disk
controller (just the 10mb + ethernet card that Sun sold), and the other machine
has 384MB of RAM and only the built-in disk controller. I used to have a Sun
4/670MP which I attempted to run a 2.2.x kernel on and it exhibited the same
problem but that machine has since been retired so I can no longer test any
fixes on it.

The first machine is a news server and is running INN 1.7.2 (yes I know an
ancient version), the second machine a web server running Apache 1.3.3p3.

Both machines were stable with 2.0.35 kernel, except that during the boot
up procedure they would sometimes hang during fsck or mounting of the file
systems, but once they got past that they were always stable.

With 2.2.x they no longer exhibit the hang during fsck or mounting of file
systems, but after some period of time crash with these spin_lock messages, the
details of which I have posted previously but I will post again below this
message.

I am currently running 2.2.10, because both 2.2.11 and 2.2.12 had other
severe stability problems, 2.2.11 I never succeeded in getting to run more than
two hours without a crash, and 2.2.12 the longest it's survived is about eight
hours before it totally ran the machine out of memory, including swap.

2.2.10, I've got Anton Blanchard, I'll quote from Anton:

"There are some races during context creation and switching on SMP
sparc32. This fixes one problem and if you have contexts > NR_TASKS then
you probably wont see the others. (There are problems when we need to
steal a context from antother cpu)".

This patch DOES really help, it makes the difference between several hours
of uptime and several days. On these machines, it is my understanding there
are 4096 contexts (experts - correct me if I'm wrong), and NR_TASKS is
currently set to 512.

And then I've got Andrea's schedualer patch, which did not noticably
reduce the occurances of the spin_lock lock-up, but did very noticably improve
the responsiveness of the machine and cut the compile times almost in half for
a kernel with make -j4.

And then I have a patch from David Miller that adds some nop's in the
Sparc assembly code after loading the psr at various locations. That patch if
it made any difference it was very minor. Given the semi-random nature of the
lock-ups it's kind of difficult to tell.

Then tonight for shits and grins I tried to apply Mingo's low latency
patch. Initially I wasn't able to get it to compile on Sparc, and then at his
suggestion I backed out changes to drivers/char, linux/console_struct.h,
linux/selection.h, and linux/vt_buffer.h and then it compiled fine.

With this patch in place the machine absolutely flies interactively! I've
never had a Unix box respond like that, I mean instantaneous keyboard response.
Audio and what not aside (I haven't started to play with that under Linux yet),
the patch is very nice just from a standpoint of how the machine responds.

But.... The spin_lock thing happens within one minute of coming up in
multi-user mode with the patch in place, which is a real bummer, but maybe it
will help chase this problem down.

Another clue, with this patch in place, an fsck, particularly on a file
system on a stripped partition, will lock it up within two seconds.

There are two common ways it locks, the most common way prints the
following over and over again until I L1-A the machine, and then the weird
thing is even though I get the PROM prompt, it keeps printing these messages
out in yellow in the background and they scroll while the PROM text in black on
white, does not. It's as if L1-A is only halting some of the CPU's and others
continue to grind in the spin_lock thing. If I type "reset" then it stops
printing the spin_lock after it resets.

The common message looks like this:

spin_lock (f0156c4c) CPU #1 stuck at f002dc78, owner PC (f002dc78)
.
.
.
ad nausium

An alternative alternates:

spin_lock (f0156c4c) CPU #1 stuck at f002dc78, owner PC (f002dc78)
spin_lock_irqsave (f015fa78) CPU #0 stuck at f0114e00, owner PC (f00d26dc):CPU
#1

When this happens, these two messages alternate indefinitely.

In the System.map:

-----

From first message:

f0156c4c D kernel_flag

f002dc78 falls between two symbols:

f002dbd4 T patchme_store_new_current
... f002dc78 ... no exact match
f002df20 T __wake_up

_____

From the second message:

f015fa34 D read_ahead
... f015fa78 ... no exact match
f015fe30 D blk_size

f0114dcc t esp_intr
... f0114e00 ... no exact match
f0114ea0 t esp_intr_4d

f00d2564 T add_request
... f00d26dc ... no exact match
f00d2918 T make_request

I can't help but wonder if it's relavent that this one seems to involve
some signals in the esp device driver and this lock-up happens within a couple
of seconds of starting an fsck (something disk I/O intensive).

I've got this kernel configured really minimally, that is no drivers or
capabilities compiled in that I don't need for this machine and application.
Configuration follows:

#
# Automatically generated by make menuconfig: don't edit
#

#
# Code maturity level options
#
CONFIG_EXPERIMENTAL=y

#
# Loadable module support
#
# CONFIG_MODULES is not set

#
# General setup
#
CONFIG_VT=y
CONFIG_VT_CONSOLE=y
# CONFIG_AP1000 is not set
CONFIG_SMP=y
# CONFIG_SUN4 is not set
# CONFIG_PCI is not set

#
# Console drivers
#
CONFIG_PROM_CONSOLE=y
CONFIG_FB=y
CONFIG_DUMMY_CONSOLE=y
CONFIG_FB_SBUS=y
CONFIG_FB_CGSIX=y
# CONFIG_FB_BWTWO is not set
# CONFIG_FB_CGTHREE is not set
# CONFIG_FB_TCX is not set
# CONFIG_FB_CGFOURTEEN is not set
# CONFIG_FB_LEO is not set
# CONFIG_FB_VIRTUAL is not set
CONFIG_FBCON_ADVANCED=y
CONFIG_FBCON_MFB=y
# CONFIG_FBCON_CFB2 is not set
# CONFIG_FBCON_CFB4 is not set
CONFIG_FBCON_CFB8=y
# CONFIG_FBCON_CFB16 is not set
# CONFIG_FBCON_CFB24 is not set
# CONFIG_FBCON_CFB32 is not set
# CONFIG_FBCON_AFB is not set
# CONFIG_FBCON_ILBM is not set
# CONFIG_FBCON_IPLAN2P2 is not set
# CONFIG_FBCON_IPLAN2P4 is not set
# CONFIG_FBCON_IPLAN2P8 is not set
# CONFIG_FBCON_MAC is not set
# CONFIG_FBCON_VGA is not set
CONFIG_FBCON_FONTWIDTH8_ONLY=y
CONFIG_FONT_SUN8x16=y
# CONFIG_FBCON_FONTS is not set
CONFIG_SBUS=y
CONFIG_SBUSCHAR=y
CONFIG_SUN_MOUSE=y
CONFIG_SERIAL=y
CONFIG_SUN_SERIAL=y
CONFIG_SERIAL_CONSOLE=y
CONFIG_SUN_KEYBOARD=y
CONFIG_SUN_CONSOLE=y
CONFIG_SUN_AUXIO=y
CONFIG_SUN_IO=y
# CONFIG_SUN_OPENPROMIO is not set
CONFIG_SUN_MOSTEK_RTC=y
# CONFIG_SUN_BPP is not set
# CONFIG_SUN_VIDEOPIX is not set
# CONFIG_SUN_AURORA is not set
# CONFIG_SPARCAUDIO is not set
# CONFIG_SPARCAUDIO_AMD7930 is not set
# CONFIG_SPARCAUDIO_CS4231 is not set
# CONFIG_SPARCAUDIO_DBRI is not set
# CONFIG_SPARCAUDIO_DUMMY is not set
# CONFIG_SUN_OPENPROMFS is not set
CONFIG_NET=y
CONFIG_SYSVIPC=y
# CONFIG_BSD_PROCESS_ACCT is not set
CONFIG_SYSCTL=y
CONFIG_BINFMT_AOUT=y
CONFIG_BINFMT_ELF=y
# CONFIG_BINFMT_MISC is not set
# CONFIG_BINFMT_JAVA is not set

#
# Floppy, IDE, and other block devices
#
CONFIG_BLK_DEV_FD=y
CONFIG_BLK_DEV_MD=y
# CONFIG_MD_LINEAR is not set
CONFIG_MD_STRIPED=y
# CONFIG_MD_MIRRORING is not set
# CONFIG_MD_RAID5 is not set
# CONFIG_BLK_DEV_RAM is not set
CONFIG_BLK_DEV_LOOP=y
# CONFIG_BLK_DEV_NBD is not set

#
# Networking options
#
# CONFIG_PACKET is not set
# CONFIG_NETLINK is not set
# CONFIG_FIREWALL is not set
# CONFIG_FILTER is not set
CONFIG_UNIX=y
CONFIG_INET=y
# CONFIG_IP_MULTICAST is not set
# CONFIG_IP_ADVANCED_ROUTER is not set
# CONFIG_IP_PNP is not set
# CONFIG_IP_ROUTER is not set
# CONFIG_NET_IPIP is not set
# CONFIG_NET_IPGRE is not set
CONFIG_IP_ALIAS=y
CONFIG_SYN_COOKIES=y
# CONFIG_INET_RARP is not set
CONFIG_SKB_LARGE=y
# CONFIG_IPV6 is not set
# CONFIG_IPX is not set
# CONFIG_ATALK is not set
# CONFIG_X25 is not set
# CONFIG_LAPB is not set
# CONFIG_BRIDGE is not set
# CONFIG_LLC is not set
# CONFIG_ECONET is not set
# CONFIG_WAN_ROUTER is not set
# CONFIG_NET_FASTROUTE is not set
# CONFIG_NET_HW_FLOWCONTROL is not set
# CONFIG_CPU_IS_SLOW is not set

#
# QoS and/or fair queueing
#
# CONFIG_NET_SCHED is not set

#
# ISDN subsystem
#
# CONFIG_ISDN is not set

#
# SCSI support
#
CONFIG_SCSI=y
CONFIG_BLK_DEV_SD=y
CONFIG_CHR_DEV_ST=y
CONFIG_BLK_DEV_SR=y
CONFIG_BLK_DEV_SR_VENDOR=y
CONFIG_CHR_DEV_SG=y
# CONFIG_SCSI_MULTI_LUN is not set
CONFIG_SCSI_CONSTANTS=y

#
# SCSI low-level drivers
#
CONFIG_SCSI_SUNESP=y
# CONFIG_SCSI_QLOGICPTI is not set

#
# Fibre Channel support
#
# CONFIG_FC4 is not set
# CONFIG_FC4_SOC is not set
# CONFIG_FC4_SOCAL is not set
# CONFIG_SCSI_PLUTO is not set
# CONFIG_SCSI_FCAL is not set

#
# Network device support
#
CONFIG_NETDEVICES=y
# CONFIG_DUMMY is not set
# CONFIG_PPP is not set
# CONFIG_SLIP is not set
CONFIG_SUNLANCE=y
# CONFIG_HAPPYMEAL is not set
# CONFIG_SUNBMAC is not set
# CONFIG_SUNQE is not set
# CONFIG_MYRI_SBUS is not set

#
# Unix98 PTY support
#
CONFIG_UNIX98_PTYS=y
CONFIG_UNIX98_PTY_COUNT=256

#
# Filesystems
#
CONFIG_QUOTA=y
# CONFIG_AUTOFS_FS is not set
# CONFIG_ADFS_FS is not set
# CONFIG_AFFS_FS is not set
# CONFIG_HFS_FS is not set
# CONFIG_FAT_FS is not set
# CONFIG_MSDOS_FS is not set
# CONFIG_UMSDOS_FS is not set
# CONFIG_VFAT_FS is not set
# CONFIG_ISO9660_FS is not set
# CONFIG_JOLIET is not set
# CONFIG_MINIX_FS is not set
# CONFIG_NTFS_FS is not set
# CONFIG_HPFS_FS is not set
CONFIG_PROC_FS=y
CONFIG_DEVPTS_FS=y
# CONFIG_QNX4FS_FS is not set
# CONFIG_ROMFS_FS is not set
CONFIG_EXT2_FS=y
# CONFIG_SYSV_FS is not set
# CONFIG_UFS_FS is not set

#
# Network File Systems
#
# CONFIG_CODA_FS is not set
CONFIG_NFS_FS=y
CONFIG_NFSD=y
# CONFIG_NFSD_SUN is not set
CONFIG_SUNRPC=y
CONFIG_LOCKD=y
# CONFIG_SMB_FS is not set
# CONFIG_NCP_FS is not set

#
# Partition Types
#
CONFIG_BSD_DISKLABEL=y
# CONFIG_MAC_PARTITION is not set
CONFIG_SMD_DISKLABEL=y
# CONFIG_SOLARIS_X86_PARTITION is not set
# CONFIG_UNIXWARE_DISKLABEL is not set
# CONFIG_NLS is not set

#
# Watchdog
#
CONFIG_SOFT_WATCHDOG=y

#
# Kernel hacking
#
# CONFIG_MAGIC_SYSRQ is not set

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/