Re: Today Linus redesigns the networking driver interface (was Re: tulip driver in ...)

Donald Becker (becker@cesdis1.gsfc.nasa.gov)
Sun, 20 Sep 1998 01:43:11 -0400 (EDT)


[[ There are a whole slew of points to respond to here.
It's difficult writing a response that covers high-level design, historical
baggage, and implementation details all at the same time, so forgive any
gaps in coverage. ]]

On Sat, 19 Sep 1998, Theodore Y. Ts'o wrote:

> SPL's are just completely braindamaged. Linux does the only sane thing,
..
> Note that all of the post-BSD 4.4 implementations have abandoned SPL's
> as having any kind of semantics related to priority. They just keep the
> name SPL for historical reasons.

I've written a few BSD drivers, and I agree. The spl() approach is
awkward to implement and has serious performance implications for modern
machines. It's clear that BSD was designed for machines of a different era.

[[ General comment: There are people that think any BSD approach is the Right
Way just because the source code they have read did it that way. We should
resist the pressure to redesign things just to match BSD. ]]

Likewise, the world is different today than it was in 1992-1993 when the
networking stack was taking shape. Some of the decisions made (well, we
didn't know they were decisions at the time!) were shortsighted. Like using
a sloppy tbusy (a queue-level flag!) as a watchdog timer because we didn't
yet have general-purpose timers. Other obvious mistakes were
ether=<IRQ>,<IOADDR> instead of ether=<IOADDR>,<IRQ>, and probing for only
one network card. In retrospect pretty much all of Space.c was a
short-sighted attempt to minimize code size that later proved too difficult
to change and expensive to support.

A few things were cleaned up over the years, some for wrong reasons -- I
cleaned up the receive interface (netif_rx()) just because I didn't like
the assembly code generated with the old interface. I should have cleaned
it up because the old source code was ugly.

> The one thing which the BSD implementations have that we *don't* have is
> the ability mask just one kind of interrupts (say, serial or networking
> interrupts) during some critical section of code. The only thing we can
> do is use cli(), which blocks all interrupts.

Hmmm, I don't know if that's a valid point.
We have busses with shared interrupt lines and multifunction boards.
Masking and priority are no longer viable design assumptions.

> So one of the things which the BSD networking stack can do is to (in
> some critical section of code), prevent network interupts from being
> handled. If a network interrupt comes in during the critical section of
> code, it is queued, and would be executed only after the networking code
> had unblocked network interrupts. (However, if a disk interrupt
> happened during the critical section, it would be handled.)

This leads to a great example of my previous point. AMD and Symbios have
multifunction SCSI+Ethernet PCI chips. 3Com has Ethernet+Modem cards.
Both functions share a single interrupt pin. It's no longer possible to
segregate interrupts by type: network, disk, and serial interrupts cannot be
isolated and selectively turned off.

BSD is built around the assumption that it's harmless to block all
interrupts of a specific type, and that such interrupt blocking is
inexpensive, although not free. That assumption led to large regions of code
that do their work with some types of interrupts disabled (aka masked).
Specifically, the network layer holds a single mask while doing protocol
work that prevents packets from being received (on any interface!) and
prevents interrupts from any lower-priority device from being handled.

I think the current Linux design is a good one: decoupling the interrupt
handler from the protocol/filesystem work with a queue layer. The current
implementation may have a few warts, but let's not throw it out just for the
sake of Change. Many of the alternatives are worse.

One specific design difference from BSD is that network packets are never
pulled from the queue layer at interrupt time. Instead dev->tbusy is
cleared, the NET_BH flag is set, and the packets are Tx-queued in the normal
non-interrupt network code. This approach means that the protocol layer
doesn't have to be concerned about locking out network interrupts when
adding to the transmit queue.
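
As a concrete illustration, the Tx-done path of a driver under the current
scheme looks roughly like the sketch below; the ring-bookkeeping names
(cur_tx, dirty_tx, TX_RING_SIZE) are only illustrative:

#include <linux/netdevice.h>    /* struct device, dev->tbusy, dev->priv */
#include <linux/interrupt.h>    /* mark_bh(), NET_BH */

#define TX_RING_SIZE 16         /* illustrative ring size */

struct example_private {        /* illustrative per-card state */
    unsigned int cur_tx;        /* next ring entry the driver will fill */
    unsigned int dirty_tx;      /* next ring entry the chip will finish */
};

/* Tail of a driver's Tx-done interrupt path: nothing is pulled from the
 * protocol queues here.  The handler only notes that the Tx ring has room
 * again and marks the network bottom half, which does the real queue work
 * later, in normal non-interrupt context. */
static void example_tx_done(struct device *dev)
{
    struct example_private *lp = (struct example_private *)dev->priv;

    /* ... free the completed Tx descriptors and advance lp->dirty_tx ... */

    if (dev->tbusy  &&  lp->cur_tx - lp->dirty_tx < TX_RING_SIZE - 2) {
        dev->tbusy = 0;         /* room again: lift the queue-level flag */
        mark_bh(NET_BH);        /* NET_BH will Tx-queue pending packets */
    }
}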

This can introduce transmit latency, but only when the device level
transitions from Tx-full to room-for-more. This was painful back in the
days of network hardware that could only handle a single Tx packet, but with
modern descriptor-based cards configured with Tx queues of 8 or 16 packets,
the latency should never be seen.

A driver interface change I would like to see is pretty much the opposite of
what has been proposed:

  Change the name of dev->hard_start_xmit() to dev->queue_tx().

  Remove all queue locking code from queue_tx().
    The locking code only exists for the timeout watchdog.  Without the
    watchdog check the driver layer shouldn't care what kind of locking the
    queue layer implements.

  Change the name of the driver-visible lock from dev->tbusy to the more
  descriptive dev->tx_full.
    Perhaps add a macro NETDEV_SET_TXFULL(dev) for those that want to
    architecture-optimize the locking details?

  Make the queue layer responsible for never entering queue_tx()
  concurrently.  The driver should never see the lock that protects against
  this.  The code that calls dev->open (rename it to dev->hw_start?) and
  dev->stop should initialize and clear both the new lock and the dev->start
  flag.
    Note: the dev->start flag must be retained for driver use, e.g. by
    dev->get_stats().
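
To make the proposal concrete, here is a rough sketch of what a driver
transmit routine might look like under such an interface. Everything in it
is hypothetical -- queue_tx(), tx_full, and the minimal structures stand in
for the proposed interface, not for any existing code:

#define TX_RING_SIZE 16

struct sk_buff;                         /* opaque here */

struct example_private {
    struct sk_buff *tx_skbuff[TX_RING_SIZE];
    unsigned int cur_tx, dirty_tx;
};

struct device {                         /* only the fields this sketch touches */
    void *priv;
    int   tx_full;                      /* proposed rename of tbusy */
};

/* The queue layer guarantees this is never entered concurrently, so the
 * driver has no locking code of its own: it queues the packet and then
 * decides whether the Tx ring is now full. */
static int example_queue_tx(struct sk_buff *skb, struct device *dev)
{
    struct example_private *lp = (struct example_private *)dev->priv;
    int entry = lp->cur_tx % TX_RING_SIZE;

    lp->tx_skbuff[entry] = skb;
    /* ... fill in the Tx descriptor and hand it to the chip ... */
    lp->cur_tx++;

    if (lp->cur_tx - lp->dirty_tx >= TX_RING_SIZE)
        dev->tx_full = 1;       /* protecting this short region efficiently
                                   is the remaining issue, discussed below */
    return 0;
}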

I know that there is one issue this change doesn't address: how a driver
can *efficiently* protect the short critical region in its Tx routine. The
critical region is usually between queuing a packet and setting tx_full.
Doing cli()/sti() is presumably ruled out for SMP systems.

Turning off the chip's interrupts -- an approach I've seen in a few
drivers -- isn't viable in a shared interrupt environment. Yes, the
interrupt handler can check for disabled interrupts, but that's just an
expensive way of setting a flag.

One approach is having the drivers do something like this:

    if (tx-queue-full) {
        dev->tx_full = 1;
        if (tx-queue-no-longer-full)
            dev->tx_full = 0;
    }

This guards against the *very* unlikely race condition of the Tx queue
emptying completely in the time between the first queue check and the second.
IMNSHO, this race is so unlikely that it's better handled in a watchdog
routine.
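
Such a watchdog is cheap: run from an ordinary kernel timer
(init_timer()/add_timer()) every second or two, it only has to catch the
rare lost-race case. A rough sketch, continuing the hypothetical names from
the queue_tx() example above:

/* Called from a kernel timer with timer.data = (unsigned long)dev. */
static void example_tx_watchdog(unsigned long data)
{
    struct device *dev = (struct device *)data;
    struct example_private *lp = (struct example_private *)dev->priv;

    if (dev->tx_full && lp->cur_tx - lp->dirty_tx < TX_RING_SIZE) {
        /* The ring has room but tx_full is stuck set: we lost the race
         * (or a Tx-done event).  Recover here, at leisure. */
        dev->tx_full = 0;
        /* ... kick the queue layer so it resumes transmitting ... */
    }
    /* ... re-arm the timer for the next interval ... */
}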

> In practice, the lack of this ability to defer only interrupts from a
> single class of devices hasn't been that much of a problem. It does
> mean that device driver writers have to be very careful not to leave
> interrupts turned off for too long, since a driver which decides to mask
> all interrupts for too long might adversely affect the behaviour of
> another driver.

It's not a serious problem.
Hmmm, I mean it's an educational problem.
It seems that every time someone modifies one of my drivers, it's to put in
a usleep(1000000). (No, sadly, that's not a joke.)
I see many drivers without 'boguscnt' iteration limits on the work that
they will do per interrupt (something like the sketch below).
And you can fill in my usual flame about people writing drivers that use
SA_INTERRUPT on shared interrupt lines. (Perhaps if it had been called
"ugly interrupt" instead of "fast interrupt"...)

Donald Becker becker@cesdis.gsfc.nasa.gov
USRA-CESDIS, Center of Excellence in Space Data and Information Sciences.
Code 930.5, Goddard Space Flight Center, Greenbelt, MD. 20771
301-286-0882 http://cesdis.gsfc.nasa.gov/people/becker/whoiam.html

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/