Re[4]: Help needed on Serial driver issues

Shashi Ramamurthy (sramamurthy@equinox.com)
Thu, 22 Jan 98 14:10:09 -0500


Ted,
My explanation is quite long, but hopefully it will make sense to you by the
end. My theory is based on the kernel missing the flush_to_ldisc routine, as
you mentioned.
When I originally wrote the driver, I only checked tty->flip.count against
TTY_FLIPBUF_SIZE (the intermediate buffer) in the input routines. Only when we
started to lose data did I add checks of the tty->read_cnt and tty->read_head
variables of the line discipline. So the latter checks were done in ADDITION to
the checks of the intermediate flip buffers, but these checks are not helping.
I know for a fact that the flip buffers are not being overwritten, as the
checks are done in the input routine and the tty->flip.count variable is
incremented in the driver as more data is copied from the onboard buffer to the
flip buffer. However, checking the line discipline buffers only makes sense if
flush_to_ldisc runs every time it is scheduled from the input routine on a
timer tick. If the kernel misses this on a busy system, as you mentioned, then
tty->read_cnt and tty->read_buf are not updated (that is done only in the line
discipline) and we will lose data. Let me try to explain this with an example.

...at instant t1.........

1) rcv_cnt = MIN(rcv_cnt,TTY_FLIPBUF_SIZE - tty->flip.count);

/* if we assume flip.count was zero and the hardware had 198
characters in its buffer,
then rcv_cnt = MIN(198,512) = 198 */

2) rcv_cnt = MIN(rcv_cnt, MIN(N_TTY_BUF_SIZE - tty->read_cnt,
N_TTY_BUF_SIZE - tty->read_head));

/* if read_cnt was 3918 and read_head was 3517
then rcv_cnt = MIN(198,178) = 178 */

...... does processing, increments flip.count and
the other flip buffer pointers

3) calls queue_task(&tty->flip.tqueue, &tq_timer);

Now if the kernel misses running flush_to_ldisc and we get to the
polling routine at instant t2.......

Then at step (1) above, the space left in the flip buffer
is 512 - 178 = 334 and if, for argument's sake, the number of characters to
be transferred from the hardware is now 50, then rcv_cnt is 50.
After step (2) rcv_cnt is still 50, as read_cnt and read_head never
got updated because we missed the flush_to_ldisc routine. Now the flip
buffer has 228 characters but the line discipline only has space for
178 characters, and we will lose 50 characters.
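The arithmetic at t1 and t2 can be reproduced with a small userspace model.
This is only a sketch: struct model_tty, poll_accept and overflow_if_flushed
are hypothetical names that stand in for the driver's input routine and the
few tty fields used above, and the "room" calculation follows the
simplification used in this example rather than the real n_tty code.

```c
#include <assert.h>

#define TTY_FLIPBUF_SIZE 512
#define N_TTY_BUF_SIZE   4096

#define MIN(a, b) ((a) < (b) ? (a) : (b))

/* Hypothetical stand-in for the tty fields used in the example. */
struct model_tty {
    int flip_count;  /* models tty->flip.count */
    int read_cnt;    /* models tty->read_cnt   */
    int read_head;   /* models tty->read_head  */
};

/* Steps (1) and (2): how many characters one poll accepts from the board. */
static int poll_accept(struct model_tty *tty, int hw_count)
{
    int rcv_cnt;

    assert(hw_count >= 0);

    /* Step (1): limit by the space left in the flip buffer. */
    rcv_cnt = MIN(hw_count, TTY_FLIPBUF_SIZE - tty->flip_count);

    /* Step (2): limit by the line discipline's apparent free space.
     * read_cnt and read_head only change when flush_to_ldisc runs. */
    rcv_cnt = MIN(rcv_cnt, MIN(N_TTY_BUF_SIZE - tty->read_cnt,
                               N_TTY_BUF_SIZE - tty->read_head));

    /* The driver copies the data into the flip buffer. */
    tty->flip_count += rcv_cnt;
    return rcv_cnt;
}

/* Characters in the flip buffer that would not fit if it were flushed now. */
static int overflow_if_flushed(const struct model_tty *tty)
{
    int room = N_TTY_BUF_SIZE - tty->read_cnt;

    return tty->flip_count > room ? tty->flip_count - room : 0;
}
```

Starting from flip_count = 0, read_cnt = 3918 and read_head = 3517, two calls
to poll_accept (with 198 and then 50 characters) and no flush in between leave
228 characters in the flip buffer against 178 characters of room.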

My understanding of this problem is that we might lose data when the
routine flush_to_ldisc is not run. As you suggested, the only solution is to
write to the line discipline directly. As I mentioned before, the driver was
based on the serial driver model, and that will not work because the tty
subsystem has been designed that way to handle the dumb serial cards. Am I
right so far? Anyway, I am going to download the rocketport driver written for
the experimental 2.1 kernels and check the input routines there as per your
suggestion. Thanks for the hint :-).

Now you were asking me about the flow control and throttle handling in
our driver. The way it works is as follows,
1. we set RTSCTS in the hardware and hardware handles it based on the
status of the buffer and control signals.

2. When the driver's throttle function is called from the line
discipline, I set a software flag that tells the input routine that the upper
layers cannot handle any more data. The input routine then does no more input
processing until it sees the software flag turned off, which happens when the
unthrottle function is called.
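A minimal userspace sketch of the software-flag scheme in item 2, assuming a
polled input routine. The names (struct model_port, model_throttle,
model_unthrottle, model_poll_input) are hypothetical and are not taken from
the actual driver.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-port state for the sketch. */
struct model_port {
    bool throttled;  /* software flag set by throttle, cleared by unthrottle */
    int hw_pending;  /* characters waiting in the onboard buffer */
    int delivered;   /* characters passed to the upper layers so far */
};

/* Called by the line discipline when it cannot take more data. */
static void model_throttle(struct model_port *p)
{
    p->throttled = true;
}

/* Called by the line discipline when it has room again. */
static void model_unthrottle(struct model_port *p)
{
    p->throttled = false;
}

/* Polled input routine: does nothing while the flag is set. */
static int model_poll_input(struct model_port *p)
{
    int n;

    assert(p->hw_pending >= 0);

    if (p->throttled)
        return 0;  /* data stays on the board; hardware RTS/CTS takes over */

    n = p->hw_pending;
    p->hw_pending = 0;
    p->delivered += n;
    return n;
}
```

While the flag is set the data simply stays in the onboard buffer, which is
what lets the hardware RTS/CTS handling from item 1 push back on the sender.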

In the example above, the throttle function could not be called because
the line discipline still had space for 178 characters left and it ultimately
resulted in loss of data because of missing the flush_to_ldisc call.

The application that I use to get to this scenario is a diagnostics
application that we have written called "ssdiag". It can run both internal and
external loopback tests on the port modules that are attached to our PC
card through a host expansion cable. I was trying to run loopback tests on 64
ports at 115200 baud on my 75 MHz Pentium machine. Yes, I agree the system will
slow down, but we should not lose data though!!!

I really appreciate your patience in responding to my questions and
concerns. Please let me know what you think. I will be looking at
making the changes to write to the line discipline.

Thanks
Shashi
______________________________ Reply Separator _________________________________
Subject: Re: Re[2]: Help needed on Serial driver issues
Author: "Theodore Y. Ts'o" <tytso@MIT.EDU> at Internet-Mail
Date: 1/22/98 9:39 AM

Date: Wed, 21 Jan 98 18:23:35 -0500
From: Shashi Ramamurthy <sramamurthy@equinox.com>

It is not a flow control problem. Flow control has been set to RTSCTS.
We are not losing data at the driver/board level for sure, and I have
implemented the throttle function for the line discipline to call
in the event of "read_buf" getting to 128 or
below. The amount of space left in the tty->read_buf as calculated in
the driver input routine, which is based on tty->read_head and
tty->read_cnt, seems to be more than when it is calculated in the
n_tty_receive_buf routine. This is causing the loss of data and I know
this for sure as I am printing the value of count at the end of the
routine. This problem only starts happening when a lot of ports are
being used (on my 75 MHz box, about 48 ports or more). The CPU idle time
as reported by "top" becomes zero.

Err... why is your driver input routine trying to figure out how much
space the line discipline can take? First of all, if the line discipline
buffer is full, there's very little you can do except drop characters on
the floor, and secondly, if you're getting to the point where the line
discipline is full, that means that the flow control can't be working
correctly, since RTS should have been dropped a full 128 characters ago.

How often are you polling, and how are you implementing the flow
control? Are you perhaps ignoring the throttle message, and letting the
board drop RTS when its FIFO is full? You can do this, but it's rather
dangerous to get right; I don't recommend it.

In any case, trying to figure out how many characters the line discipline
can take by measuring read_buf and read_cnt is surely wrong, though,
since that doesn't take into account how many characters are in the flip
buffer. What I suspect is happening is that when the CPU gets really
busy, the kernel manages to skip the flush_to_ldisc for a particular
timer tick, and so it doesn't run. That means that your driver input
routine runs twice without the flush_to_ldisc running; then your driver
doesn't accurately estimate how many characters to send the line
discipline, since it didn't know about the characters still in the flip
buffer.

The big question, though, is why you had enough characters in your
board's internal FIFOs such that you were in danger of running out of
buffer space in the first place. If the flow control was working
correctly, there should have been 128 bytes worth of grace to empty out
your board's buffers and to let the other side stop transmitting.
Regardless of the bug in your driver in terms of trying to guess how
many characters the line discipline could take, there's no reason for
your driver to make that estimation in the first place, since if flow
control was working correctly, it should have never come to that anyway.

Finally, a free hint. If you're writing a polling device driver,
there's no reason to use the flip buffers. The flip buffers are designed
to be used when a driver needs to minimize interrupt latency (especially
when the board is generating an interrupt for every character), at the
cost of increasing the latency that it takes to actually process incoming
characters. However, for polling drivers, that's not an issue, since the
board has already buffered the characters once. So, you can
simply call the tty->ldisc.receive_buf directly.

If you want an example of how to do this right, see the Rocketport
driver in the latest 2.1 kernel (drivers/char/rocket.c, in
rp_do_receive). You'll note that rp_do_receive doesn't bother checking
how many characters are left in the line discipline, since if that buffer
has filled, there's nothing you can do about it.
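As a rough userspace sketch of the direct-delivery idea (not the rocket.c
code, and not kernel code): the polling routine hands the board's data
straight to the line discipline's receive_buf hook and never estimates how
much room is left. All model_* names are hypothetical; the hook's argument
list is assumed to mirror tty->ldisc.receive_buf (tty, data, flags, count),
and the drop-on-overflow behaviour is a simplification.

```c
#include <assert.h>
#include <string.h>

#define MODEL_BUF_SIZE 4096

struct model_tty;

/* Hypothetical line discipline with a receive_buf-style hook. */
struct model_ldisc {
    void (*receive_buf)(struct model_tty *tty, const unsigned char *cp,
                        char *fp, int count);
};

struct model_tty {
    struct model_ldisc ldisc;
    unsigned char read_buf[MODEL_BUF_SIZE];
    int read_cnt;  /* characters currently held by the line discipline */
    int dropped;   /* characters discarded because the buffer was full */
};

/* Toy line discipline: takes what fits and drops the rest itself. */
static void model_receive_buf(struct model_tty *tty, const unsigned char *cp,
                              char *fp, int count)
{
    int room = MODEL_BUF_SIZE - tty->read_cnt;
    int n = count < room ? count : room;

    (void)fp;  /* per-character flags are ignored in this sketch */
    memcpy(tty->read_buf + tty->read_cnt, cp, (size_t)n);
    tty->read_cnt += n;
    tty->dropped += count - n;
}

/* Polling routine: no flip buffer and no guess about ldisc free space. */
static void model_poll_direct(struct model_tty *tty,
                              const unsigned char *hw_data, int hw_count)
{
    assert(hw_count >= 0);

    if (hw_count > 0)
        tty->ldisc.receive_buf(tty, hw_data, NULL, hw_count);
}
```

Because the data reaches the line discipline within the same poll, there is no
window in which a skipped flush_to_ldisc can leave the driver working from
stale counters.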

- Ted

P.S. What application is running on all of these ports? Are they
running PPP, or is it some kind of user-mode login processes, or UUCP,
or something else? If the CPU is down to 0% idle, the machine is obviously
underpowered anyway, and the users are getting degraded service one way
or another. Granted, dropping characters is bad, but if the CPU is
maxed out, overall performance isn't going to be good no matter what....