Re: tty->count(1) != #fd's(2)

Doug Ledford (dledford@dialnet.net)
Fri, 28 Feb 1997 16:50:35 -0600


--------
> On Tue, 25 Feb 1997, Jon Lewis wrote:
>
> > I got this with 2.0.28 today. I don't remember getting these since very
> > shortly after upgrading to 2.0.x and the new 256 ptys.
> >
> > Feb 24 22:17:14 yoda sshd[17348]: log: Password authentication for flaboy
> > accepted.
> > Feb 24 22:17:16 yoda kernel: Warning: dev (03:b3) tty->count(1) != #fd's(2)
> > in do_tty_hangup
> > Feb 24 22:17:17 yoda sshd[17348]: log: Closing connection to 205.229.48.120
>
> I just got this again, and again it's right about the same time as an sshd
> session ending:
>
> Feb 27 16:51:46 yoda kernel: Warning: dev (03:b4) tty->count(1) !=
> #fd's(2) in do_tty_hangup
> Feb 27 16:51:46 yoda sshd[25049]: log: Closing connection to 205.229.48.42
>
> I only have the new tty/pty devices...the old ones were rm'd many months
> ago. It seems to me this must be a kernel bug sshd is exposing. I'm
> using sshd 1.2.17 (ELF) and kernel 2.0.28 on the server.

I haven't looked very deeply into the problem, but I can at least say this
much: I think Jon has a genuine problem here. I have a similar report from
kernel 2.0.26, with ssh on a serial port. During a close operation I got this
message, after which the kernel's pointer into the wake_up function for the
whole range of serial ports the offending one belonged to was toast. The
serial ports happen to be Equinox SST ports, and the entire module gets
toasted whenever this happens. In my case, I've gone back to 2.0.14
(suspecting that the changes to down() or some of the changes to the semaphore
code might be responsible) and haven't seen the problem since. So I think this
behavior was introduced somewhere between 2.0.14 and 2.0.26. The severity of
this bug is affected by the exact serial driver in use. It would appear to be
harmless to the RocketPort and Cyclades drivers (I assume that this is on some
of your Cyclades ports, Jon, and I saw the error on some of my RocketPort
ports), but lethal to the Equinox driver. I can't speak for other drivers
though.
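
For readers unfamiliar with the warning itself: it reports that the tty's own
open count disagrees with the number of open file entries that point at that
tty. The following is a minimal user-space sketch of that kind of consistency
check, not the kernel's code. The struct and function names (tty_sketch,
file_sketch, count_tty_fds, check_tty_count_sketch) are hypothetical
stand-ins, and the real check may count additional references that this
sketch ignores.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct tty_sketch {
    int count;              /* opens the tty layer believes exist */
};

struct file_sketch {
    int f_count;            /* 0 means this file table slot is unused */
    void *private_data;     /* points at the tty for an open tty file */
};

/* Count the live file entries that reference the given tty. */
static int count_tty_fds(const struct file_sketch *files, size_t nfiles,
                         const struct tty_sketch *tty)
{
    int n = 0;
    size_t i;

    for (i = 0; i < nfiles; i++) {
        if (files[i].f_count == 0)
            continue;                       /* skip unused slots */
        if (files[i].private_data == (const void *)tty)
            n++;
    }
    return n;
}

/* Return 1 and print a warning on a mismatch, 0 when the counts agree. */
static int check_tty_count_sketch(const struct file_sketch *files,
                                  size_t nfiles,
                                  const struct tty_sketch *tty,
                                  const char *routine)
{
    int fds = count_tty_fds(files, nfiles, tty);

    if (tty->count != fds) {
        printf("Warning: tty->count(%d) != #fd's(%d) in %s\n",
               tty->count, fds, routine);
        return 1;
    }
    return 0;
}
```

In this sketch, a tty whose count is 1 while two live file entries point at it
produces a message of the same shape as the one in Jon's log.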

Here's the offending original problem:

Unable to handle kernel paging request at virtual address ca0d0a11
current->tss.cr3 = 007ff000, 8r3 = 007ff000
*pde = 00000000
Oops: 0000
CPU: 0
EIP: 0010:[<00110604>]
EFLAGS: 00010006
eax: 0a0d0a0d ebx: 03c20eb4 ecx: 03c74018 edx: 0a0d0a0d
esi: 04863908 edi: 03c20eb4 ebp: 03c20ebc esp: 03c20ea4
ds: 0018 es: 0018 fs: 002b gs: 002b ss: 0018
Process ssh (pid: 21733, process nr: 157, stackpage=03c20000)
Stack: 0808b000 0488fcc0 02c86000 00000002 0342e414 03c74018 00000fff 04847e2c
04863908 0000001d 0808b000 02c86000 00000000 00001000 03c20f08 0014565c
012bc410 012bc410 03c20f7c 00000000 00000000 00000000 012bc42c 0808b000
Call Trace: [<0488fcc0>] [<04847e2c>] [<04863908>] [<0014565c>] [<0488fcc0>]
[<0489f400>] [<0489f400>]
[<0489f400>] [<0017067e>] [<0016d435>] [<00170a60>] [<00121217>] [<0010a615>]
Code: 8b 42 04 39 d8 74 05 89 c2 eb f5 90 89 4a 04 ff 75 f4 9d 31

Using `/boot/System.map' to map addresses to symbols.

>>EIP: 110604 <__down+6c/94>
Trace: 488fcc0
Trace: 4847e2c
Trace: 4863908
Trace: 14565c <tcp_recvmsg+3fc/40c>
Trace: 488fcc0
Trace: 489f400
Trace: 489f400
Trace: 489f400
Trace: 17067e <read_chan+2f6/6d8>
Trace: 16d435 <tty_write+dd/130>
Trace: 170a60 <write_chan>
Trace: 121217 <sys_write+13b/174>
Trace: 10a615 <system_call+55/80>

Code: 110604 <__down+6c/94> movl 0x4(%edx),%eax
Code: 110607 <__down+6f/94> cmpl %ebx,%eax
Code: 110609 <__down+71/94> je 110610 <__down+78/94>
Code: 11060b <__down+73/94> movl %eax,%edx
Code: 11060d <__down+75/94> jmp 110604 <__down+6c/94>
Code: 11060f <__down+77/94> nop
Code: 110610 <__down+78/94> movl %ecx,0x4(%edx)
Code: 110613 <__down+7b/94> pushl 0xfffffff4(%ebp)
Code: 110616 <__down+7e/94> popf
Code: 110617 <__down+7f/94> xorl %eax,(%eax)
Code: 110619 <__down+81/94> nop
Code: 11061a <__down+82/94> nop
Code: 11061b <__down+83/94> nop
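
The first instructions of the disassembly above (load 0x4(%edx), compare with
%ebx, advance %edx, then store %ecx to 0x4(%edx)) have the shape of a loop
that unlinks an entry from a circular singly linked list. The sketch below is
one reading of that sequence as C, assuming it is the wait queue removal
inlined into __down. The type and function names are hypothetical, and the
register annotations are an interpretation of the dump rather than something
the oops states.

```c
#include <stddef.h>

/* Hypothetical wait queue entry. On 32-bit x86 the next pointer sits
 * at offset 4, which would match the 0x4(%edx) operand in the dump. */
struct wait_queue_sketch {
    void *task;
    struct wait_queue_sketch *next;
};

/* Unlink wait from the circular list it is on. */
static void remove_wait_sketch(struct wait_queue_sketch *wait)
{
    struct wait_queue_sketch *next = wait->next;   /* kept in %ecx   */
    struct wait_queue_sketch *tmp = next;          /* cursor in %edx */

    /* Walk the ring until tmp is the entry just before wait.
     * movl 0x4(%edx),%eax ; cmpl %ebx,%eax ; je ... */
    while (tmp->next != wait)
        tmp = tmp->next;                           /* movl %eax,%edx */

    /* Bypass wait: movl %ecx,0x4(%edx) */
    tmp->next = next;
}
```

If one next pointer in the ring has been overwritten, the walk follows it to a
bogus address and faults on the load. That would be consistent with the oops
occurring at __down+6c with the same bad value in both %eax and %edx.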

Then the following errors occurred the first time you tried to write to the
serial port after the original error:

general protection: 0000
CPU: 0
EIP: 0010:[<001103f1>]
EFLAGS: 00010006
eax: 03c76018 ebx: c3f000ef ecx: 0486390c edx: 6f000000
esi: 0805ef2a edi: 04863908 ebp: 00407ebc esp: 00407eb0
ds: 0018 es: 0018 fs: 002b gs: 002b ss: 0018
Process mgetty (pid: 31772, process nr: 137, stackpage=00407000)
Stack: 00000001 0805ef2a 03c76019 00000fff 04847e80 0486390c 00000001 0805ef29
02eca000 00000000 00001000 00000000 7f1c0300 01000415 00131100 00000000
00000000 00000000 00000000 00407f24 03c76018 00000001 00000000 00000283
Call Trace: [<04847e80>] [<0486390c>] [<00131100>] [<0488f990>] [<0489f300>]
[<0489f300>] [<0489f300>]
[<00170b7e>] [<0016d435>] [<00170a60>] [<00121217>] [<0010a615>]
Code: 8b 02 83 f8 02 74 07 8b 02 83 f8 01 75 5f 9c 5e fa c7 02 00

Using `/boot/System.map' to map addresses to symbols.

>>EIP: 1103f1 <wake_up+35/e4>
Trace: 4847e80
Trace: 486390c
Trace: 131100 <read_dquot+10/150>
Trace: 488f990
Trace: 489f300
Trace: 489f300
Trace: 489f300
Trace: 170b7e <write_chan+11e/190>
Trace: 16d435 <tty_write+dd/130>
Trace: 170b7e <write_chan+11e/190>
Trace: 121217 <sys_write+13b/174>
Trace: 10a615 <system_call+55/80>

Code: 1103f1 <wake_up+35/e4> movl (%edx),%eax
Code: 1103f3 <wake_up+37/e4> cmpl $0x2,%eax
Code: 1103f6 <wake_up+3a/e4> je 1103ff <wake_up+43/e4>
Code: 1103f8 <wake_up+3c/e4> movl (%edx),%eax
Code: 1103fa <wake_up+3e/e4> cmpl $0x1,%eax
Code: 1103fd <wake_up+41/e4> jne 11045e <wake_up+a2/e4>
Code: 1103ff <wake_up+43/e4> pushf
Code: 110400 <wake_up+44/e4> popl %esi
Code: 110401 <wake_up+45/e4> cli
Code: 110402 <wake_up+46/e4> movl $0x90900000,(%edx)
Code: 110408 <wake_up+4c/e4> nop

I would get the second oops on each and every port exactly once; after that,
each port that had already produced this oops would forever cause programs
to go into an uninterruptible state on further accesses. All of the 48xxxxx
addresses happen to map into the Equinox module. A final interesting tidbit:
the addresses above such as 488f990 and 489f300 all map into
eqnx_callout_driver, which fits with the respawning mgetty trying to
set parameters on and initialize the port, since mgetty uses the callout
driver during initialization and the regular driver for incoming calls
after CD goes high.
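
Mapping a raw address to a name, as in 110604 becoming __down+6c, comes down
to finding the symbol with the greatest start address that does not exceed the
target and printing the difference as the offset. The sketch below shows that
lookup over a sorted table. The names sym_sketch, lookup_symbol and
format_symbol are hypothetical, and the symbol start addresses in the usage
note are derived by subtracting the offsets shown in the traces above. The
decoded output above also prints the symbol size after a slash, which this
sketch omits. Addresses inside a loaded module would need that module's
symbols in the table as well.

```c
#include <stdio.h>
#include <stddef.h>

/* One entry of a symbol table sorted by ascending address. */
struct sym_sketch {
    unsigned long addr;
    const char *name;
};

/* Binary search for the last symbol whose address is <= target.
 * Returns NULL if target lies below the first symbol. */
static const struct sym_sketch *lookup_symbol(const struct sym_sketch *tab,
                                              size_t n,
                                              unsigned long target)
{
    size_t lo = 0, hi = n;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (tab[mid].addr <= target)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo == 0 ? NULL : &tab[lo - 1];
}

/* Write "addr <name+offset>" (hex) into buf, or just the address
 * if no symbol covers it. Returns what snprintf returns. */
static int format_symbol(char *buf, size_t len,
                         const struct sym_sketch *tab, size_t n,
                         unsigned long target)
{
    const struct sym_sketch *s = lookup_symbol(tab, n, target);

    if (s == NULL)
        return snprintf(buf, len, "%lx", target);
    return snprintf(buf, len, "%lx <%s+%lx>",
                    target, s->name, target - s->addr);
}
```

With a table holding wake_up at 1103bc, __down at 110598 and write_chan at
170a60, looking up 110604 selects __down with offset 6c, and 170b7e selects
write_chan with offset 11e, matching the decoded traces.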

Now, before someone suggests I use this machine to do a binary search through
kernel versions to find when the problem was introduced: I can't. This machine
holds 144 active modem connections on it now between Equinox and Comtrol
serial ports, and my boss and my customers both would scream bloody murder if
I started rebooting the machine to start up different kernels and find the
source of the problem. I picked kernel 2.0.14 because, if I remember
correctly, it predates the introduction of the __down function, as well as
some changes to header files that I thought might impact the Equinox driver
badly, since it is pre-processed as shipped from Equinox. Under it, the
Equinox driver is infinitely more reliable.

-- 
*****************************************************************************
* Doug Ledford                      *   Unix, Novell, Dos, Windows 3.x,     *
* dledford@dialnet.net    873-DIAL  *     WfW, Windows 95 & NT Technician   *
*   PPP access $14.95/month         *****************************************
*   Springfield, MO and surrounding * Usenet news, e-mail and shell account.*
*   communities.  Sign-up online at * Web page creation and hosting, other  *
*   873-9000 V.34                   * services available, call for info.    *
*****************************************************************************