SMP sock_select problem (GPF and such)

Rob Riggs (rob@pangalactic.org)
Wed, 23 Jun 1999 13:09:03 -0600


This is a multi-part message in MIME format.
--------------C9BEFE783FA9FD1067D432F3
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit

There is a problem with the select code in 2.0.36 (and probably 2.0.37
as well, but this hasn't been tested) that results in kernel GPF on
SMP machines. There also seems to exist a related problem in 2.2.x, that
results in processes stuck in skb_recv_datagram rather than a GPF. This
problem has been duplicated on numerous machines.

There is a Python script at ftp://ftp.pangalactic.org/pub/outgoing/pytest
that will show these problems. This script requires Python-1.5.2/glibc to
run. It just spawns up to 50 threads all doing gethostbyname_r(). It
will do 1000 calls with up to 50 outstanding requests at a time.

On a 2.0.36 SMP box pytest will die with a segmentation fault. A view of
the syslog should show a GPF in sock_select. An oops log is attached.

Under 2.2.x SMP the script will never finish. A 'ps awl | grep python'
will show a number of processes stuck in skb_recv_datagram. My best
guess as to what happens is that the gethostbyaddr_r() sends a request
and then waits in select() for a response. Something is triggering
select() improperly and gethostbyaddr_r() then calls recv() when,
in fact, there isn't any data and quite probably the initial request
was actually lost. And since it's UDP, there is no guarantee of
delivery and therefore no guarantee of a response. I've verified
some of this with gdb.

I've looked at the select code, but don't quite get it yet. I need
someone with more experience in the fs and net code to have a look.

-- 
Rob Riggs
Technical Staff
Tummy.com, Ltd.
http://www.tummy.com/
--------------C9BEFE783FA9FD1067D432F3
Content-Type: text/plain; charset=us-ascii;
 name="oops.diamond"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline;
 filename="oops.diamond"

general protection: 0000 CPU: 0 EFLAGS: 00010206 eax: f000ef6f ebx: 00000001 ecx: 00000000 edx: 00000090 esi: 062aa044 edi: 00000001 ebp: 00000000 esp: 04e5fe44 ds: 0018 es: 0018 fs: 002b gs: 002b ss: 0018 Process python (pid: 871, process nr: 62, stackpage=04e5f000) Stack: 0013fc84 001368d2 00000000 062aa044 00000001 00000000 00000006 062aa044 00000000 00000000 00000000 00136ae5 00000001 00000000 062aa044 00002710 00000000 00000007 beffa324 00000001 06476000 00000001 06476000 00136e62 Call Trace: [sock_select+0/48] [select_check+50/132] [do_select+449/804] [sys_select+394/704] [def_callback2+38/44] [udp_rcv+1005/1024] [ip_rcv+1232/1648] [allow_interrupts+146/224] [do_bottom_half+59/96] [system_call+259/316] Code: 8b 40 24 85 c0 74 0c 51 53 52 ff d0 83 c4 0c 5b c3 89 f6 31

--------------C9BEFE783FA9FD1067D432F3--

- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.rutgers.edu Please read the FAQ at http://www.tux.org/lkml/