I influenced my company to install Linux servers to provide internet,
mail and remote access services to the organisation. The oops's (or
rather the freezes) are becoming more & more frequent, 2 or 3 times
a day now since I moved to the 2.2.xx series kernel using RH6.0. I am
now under a lot of pressure to get this right.
My question is --- DO I NEED TO CHANGE TO NT ??? (It's what others are
telling me!)
I posted an Oops report July 9th, and have subscribed to the mailing list
but nobody has picked up on this. It seems to me that something in the
tcp stack is broken & needs to be fixed. Looking through other postings
to the list it seems to me that there are a number of oops's which are
being caused by problems somewhere in the tcp stack.
In my case a oops, or freeze, is always preceded by the following message:
Warning: kfree_skb passed an skb still on a list (from xxxxxx)
Looking at the code it is clear that this is a bad thing - the skb structure
is freed, but the list (which points to the structure) knows nothing about
this. It does not take long before the reference to a NULL address happens.
Here is the full oops report...
=====================================================================
1. PROBLEM: Kernel 2.2.xx - Oops in tcp_recvmsg()
===========================================================================
2. Full description
Normally my system hangs with no Oops message and no information at all
on screen. I have once recieved an "oops" message, the detail of which
is included below. The sequence of events, according to the syslog file,
is always the same - A person logs in from a Windows NT box via ISDN and
picks up mail using MS Outlook (yechh!) (using ipop3d service). Perhaps
by coincidence, although a number of people login remotely to the system,
the crashes have only occurred when one particular user logs in. The
final message, regarding "kfree_skb" does not always make it into the
syslog file although the preceding sequence of events is always the same.
When the "Oops" occurred the active process was "ipop3d".
The "Oops" was received using kernel 2.2.7 #SMP. However, the problem
has occurred with 2.2.5 and 2.2.10.
NB: Since the first posting, I have noticed that the difference between
the user who crashes the system and all the others is that, in her dial-
up-networking setup on NT she had enabled "modem error control" and
"modem compression". I have switched these off - hopefully the problem
will dissapear. If it does, I will post to that effect. However, if this
is the problem I would hate for M$ to know that they can deliberately crash
a Linux server!
I have tried to include as much information as I think is relevant.
Thanks for your assistance in this matter. Please e-mail (or call) if
further information is required.
==============================
Peter Colman
email: pete@mssint.demon.co.uk
Tel: +44 1932 342 797
===========================================================================
3. Keywords kernel networking ISDN pop3 mail ipppd oops
===========================================================================
4. Version Information
4.1 ipop3d v7.59
4.2 isdn drivers (drivers/isdn): isdn-2.2-1999-06-21
4.3 isdn utils: isdn4k-utils-3.0beta2
4.4 kernel (/proc/version)
Linux version 2.2.7 (root@amber.mss) (gcc version egcs-2.91.66 19990314/Linux (egcs-1.1.2 release)) #9 SMP Thu Jul 8 09:42:50 BST 1999
NOTE 1: Kernel's 2.2.5 and 2.2.10 have the same problem
NOTE 2: Older version of ipop3d, isdn utils and isdn drivers also.
The base system is Redhat V6
===========================================================================
5. Oops message (run through ksymoops)
Options used: -v /usr/src/linux/vmlinux (specified)
-o /usr/src/linux/modules (specified)
-k /proc/ksyms (specified)
-l /proc/modules (default)
-m /boot/System.map (specified)
-c 1 (default)
Jul 1 18:33:02 amber kernel: Warning: kfree_skb passed an skb still on a list (from c9fdd6e0).
Jul 1 18:33:10 amber kernel: Unable to handle kernel NULL pointer dereference at virtual address 00000004
Jul 1 18:33:10 amber kernel: current->tss.cr3 = 020c7000, %cr3 = 020c7000
Jul 1 18:33:10 amber kernel: *pde = 00000000
Jul 1 18:33:10 amber kernel: Oops: 0002
Jul 1 18:33:10 amber kernel: CPU: 1
Jul 1 18:33:10 amber kernel: EIP: 0010:[cleanup_rbuf+68/184]
Jul 1 18:33:10 amber kernel: EFLAGS: 00010246
Jul 1 18:33:10 amber kernel: eax: 00000000 ebx: c17f7644 ecx: c9fddb40 edx: 00000000
Jul 1 18:33:10 amber kernel: esi: c17f7600 edi: c17f7600 ebp: c17f776c esp: c228de9c
Jul 1 18:33:10 amber kernel: ds: 0018 es: 0018 ss: 0018
Jul 1 18:33:10 amber kernel: Process ipop3d (pid: 3915, process nr: 68, stackpage=c228d000)
Jul 1 18:33:10 amber kernel: Stack: c016a49e c17f7600 00000000 c17f7600 00000000 c228df74 00000400 0ea1cc64
Jul 1 18:33:10 amber kernel: 00000001 00000000 00000001 00000000 c17f76b0 c228c000 c1a7fd64 00015478
Jul 1 18:33:10 amber kernel: c017875a c17f7600 c228df74 00000400 00000000 00000000 c228df08 c1a7fd4c
Jul 1 18:33:10 amber kernel: Call Trace: [tcp_recvmsg+758/1412] [inet_recvmsg+142/168] [sock_recvmsg+58/172] [sock_read+150/160] [sys_read+210/260] [system_call+52/56]
Jul 1 18:33:10 amber kernel: Code: 89 42 04 89 10 8d 41 70 f0 ff 49 70 0f 94 c0 84 c0 74 b5 51
Warning: trailing garbage ignored on Code: line
Text: 'Code: 89 42 04 89 10 8d 41 70 f0 ff 49 70 0f 94 c0 84 c0 74 b5 51 '
Garbage: ' '
DEBUG: Oops_decode - 0: 89 42 04 movl %eax,0x4(%edx)
DEBUG: Oops_decode - 3: 89 10 movl %edx,(%eax)
DEBUG: Oops_decode - 5: 8d 41 70 leal 0x70(%ecx),%eax
DEBUG: Oops_decode - 8: f0 ff 49 70 lock decl 0x70(%ecx)
DEBUG: Oops_decode - c: 0f 94 c0 sete %al
DEBUG: Oops_decode - f: 84 c0 testb %al,%al
DEBUG: Oops_decode - 11: 74 b5 je ffffffc8 <_EIP+0xffffffc8>
DEBUG: Oops_decode - 13: 51 pushl %ecx
DEBUG: Oops_decode - ...
DEBUG: Oops_format
Code: 00000000 Before first symbol 00000000 <_IP>: <===
Code: 00000000 Before first symbol 0: 89 42 04 movl %eax,0x4(%edx) <===
Code: 00000003 Before first symbol 3: 89 10 movl %edx,(%eax)
Code: 00000005 Before first symbol 5: 8d 41 70 leal 0x70(%ecx),%eax
Code: 00000008 Before first symbol 8: f0 ff 49 70 lock decl 0x70(%ecx)
Code: 0000000c Before first symbol c: 0f 94 c0 sete %al
Code: 0000000f Before first symbol f: 84 c0 testb %al,%al
Code: 00000011 Before first symbol 11: 74 b5 je ffffffc8 <END_OF_CODE+2f78b470/????>
Code: 00000013 Before first symbol 13: 51 pushl %ecx
7 warnings issued. Results may not be reliable.
===========================================================================
6. Cannot force the problem to happen. The person who logs in is successful
most of the time. She logs in maybe 20 times a day. The problem happens once
or twice a week. Here are the last few lines out of syslog before the crash.
The lines are the same regardless of whether the Oops followed or not.
********************SYSLOG OUTPUT*********************
Jul 8 08:45:03 amber ipppd[708]: Check_passwd called with user=tina
Jul 8 08:45:03 amber ipppd[708]: user tina logged in
Jul 8 08:45:03 amber ipppd[708]: MPPP negotiation, He: No We: No
Jul 8 08:45:03 amber ipppd[708]: CCP enabled! Trying CCP.
Jul 8 08:45:03 amber ipppd[708]: CCP: got ccp-unit 0 for link 0 (protocol: 0x80fd)
Jul 8 08:45:03 amber ipppd[708]: ccp_resetci!
Jul 8 08:45:03 amber ipppd[708]: Kernel check for LZS failed
Jul 8 08:45:03 amber ipppd[708]: local IP address 192.168.1.103
Jul 8 08:45:03 amber ipppd[708]: remote IP address 192.168.22.103
Jul 8 08:45:09 amber ipppd[708]: ccp_resetci!
Jul 8 08:45:09 amber ipppd[708]: Kernel check for LZS failed
Jul 8 08:45:10 amber kernel: ippp: no decompressor defined!
Jul 8 08:45:13 amber last message repeated 2 times
Jul 8 08:45:13 amber kernel: Warning: kfree_skb passed an skb still on a list (from c016d797).
********************END OF SYSLOG OUTPUT************************
===========================================================================
7. Environment
7.1 ver_linux
-- Versions installed: (if some fields are empty or looks
-- unusual then possibly you have very old versions)
Linux amber.mss 2.2.7 #9 SMP Thu Jul 8 09:42:50 BST 1999 i686 unknown
Kernel modules 2.1.121
Gnu C egcs-2.91.66
Binutils 2.9.1.0.23
Linux C Library 2.1.1
Dynamic linker ldd (GNU libc) 2.1.1
Procps 2.0.2
Mount 2.9o
Net-tools 1.51
Kbd [option...]
Sh-utils 1.16
Modules Loaded ip_masq_vdolive ip_masq_raudio ip_masq_quake ip_masq_irc ip_masq_ftp ip_masq_cuseeme hisax isdn slhc tulip aic7xxx
[7.2] /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 5
model name : Pentium II (Deschutes)
stepping : 2
cpu MHz : 451.032446
cache size : 512 KB
fdiv_bug : no
hlt_bug : no
sep_bug : no
f00f_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 2
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 mmx osfxsr
bogomips : 448.92
processor : 1
vendor_id : GenuineIntel
cpu family : 6
model : 5
model name : Pentium II (Deschutes)
stepping : 2
cpu MHz : 451.032446
cache size : 512 KB
fdiv_bug : no
hlt_bug : no
sep_bug : no
f00f_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 2
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 mmx osfxsr
bogomips : 450.56
[7.3] /proc/modules
ip_masq_vdolive 1000 0 (unused)
ip_masq_raudio 2624 0 (unused)
ip_masq_quake 1008 0 (unused)
ip_masq_irc 1324 0 (unused)
ip_masq_ftp 2096 0
ip_masq_cuseeme 776 0 (unused)
hisax 112832 5
isdn 90056 18 [hisax]
slhc 4132 7 [isdn]
tulip 23568 1 (autoclean)
aic7xxx 86088 5
[7.4] /proc/scsi/scsi
Attached devices:
Host: scsi0 Channel: 00 Id: 00 Lun: 00
Vendor: SEAGATE Model: ST39140W Rev: 1444
Type: Direct-Access ANSI SCSI revision: 02
Host: scsi0 Channel: 00 Id: 01 Lun: 00
Vendor: IBM Model: DCAS-34330W Rev: S61A
Type: Direct-Access ANSI SCSI revision: 02
Host: scsi0 Channel: 00 Id: 04 Lun: 00
Vendor: IBM Model: DCAS-34330W Rev: S65A
Type: Direct-Access ANSI SCSI revision: 02
Host: scsi0 Channel: 00 Id: 06 Lun: 00
Vendor: IBM Model: DCAS-34330W Rev: S65A
Type: Direct-Access ANSI SCSI revision: 02
[7.5] ISDN subsystem
Command to load hisax driver...
/sbin/modprobe -v hisax id=HiSax1%HiSax2 type=27,27 protocol=2,2
[7.6] /proc/pci
Isdn cards (from /proc/pci)
Bus 0, device 17, function 0:
Network controller: AVM A1 (Fritz) (rev 2).
Medium devsel. Fast back-to-back capable. IRQ 18.
Non-prefetchable 32 bit memory at 0xe9002000 [0xe9002000].
I/O at 0xdc00 [0xdc01].
Bus 0, device 18, function 0:
Network controller: AVM A1 (Fritz) (rev 2).
Medium devsel. Fast back-to-back capable. IRQ 19.
Non-prefetchable 32 bit memory at 0xe9003000 [0xe9003000].
I/O at 0xe000 [0xe001].
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/