Re: PCI DMA into USER space

From: Richard B. Johnson (root@chaos.analogic.com)
Date: Wed Jun 14 2000 - 17:05:09 EST


On Wed, 14 Jun 2000, Jeff Garzik wrote:
[SNIPPED...]
>
> Are you saying that, on a normal PCI-based x86 workstation, grabbing to
> the Meteor's onboard RAM, and allowing xawtv to access that memory
> across the PCI bus, will be the faster case?
>
> Regards,
>
> Jeff
>

The generic answer is.... "It depends".

Let's define three "DMA" operations known to exist in hardware
designed for use in the Intel/PC architecture.

(1) DMA using the 8237x controller.
(2) DMA to/from SDRAM and PCI using the host bridge.
(3) DMA to/from SDRAM using a hardware-specific "Bus Master".

In all three cases, the CPU is not involved in the actual data
transfer. However, the CPU is usually involved in setting up the
transfer.

Let's take case (1). This is an old slow chip. It has been improved and
now there are two of them. Early on, you couldn't do memory-to-memory
transfers because channel 0, the only one capable of doing this, was
dedicated to doing dummy memory reads for refresh. Early PC/ATs, and
before, didn't have memory controllers that did transparent refresh.

You can do memory-to-memory transfers now. However, because of the
slow speed I don't think anybody would want to do this. These DMA
devices are usually used to transparently empty buffers going to and
from disk drives, usually only the floppy. They read/write to/from
RAM and a port. When I talk about DMA generically, this is what I
mean.

Since the CPU can usually do something else when waiting for the DMA
controller to finish, this gives some improvement over so-called
programmed I/O where the CPU does everything.

Now look at (2). The PCI bus and its controller(s) are really DMA
devices. They don't involve the CPU when making shared-RAM magically
contain the data that a device wrote to it. Of course they work both
ways, both reads and writes. A great advantage is that the BUS
is "snooped" and it contains a FIFO. This allows writes to be posted
as previously described.

Although the data are transferred at a reasonable rate, these
transfers have one failing: the data area to and from which the
transfer occurs is not dynamic. The BIOS (and, with Linux, the
kernel) sets the base addresses in the device's 32-bit configuration
registers upon
startup. Since both the device and the operating system now "know"
where the data transfer will occur, the operating system is able to
reserve some pages somewhere so the devices don't corrupt somebody
else's memory.

To the CPU, this looks just like memory that is shared between a device
and the CPU address space. It's a lot better than regular shared memory
because the operating system was able to put it where it wanted, instead
of the hardware saying it's gonna be at 0xd0000 whether you like it
or not. Further, until the FIFO gets full, writes don't incur any
wait-states.

The problem remains that data goes to and from fixed addresses. This
means that somebody (the driver?) has to copy it somewhere else, perhaps
to user-space. The copy operation takes time and, although you can
copy around 1000 MB/s, you probably have to copy it twice, once out
of the device buffer in an ISR, then later to the user.

Enter method (3). This method is supposed to fix all the above problems.
The idea is, instead of a fixed transfer address, as with the PCI
specification, we make the transfer address dynamic.

In principle, such a Bus Mastering device can write directly to user
memory because it can write to any physical address space anywhere. It
still has the overhead of PCI devices (it uses the same techniques), but
the double copy operations are eliminated.

That said, I still haven't answered the question. And, I don't know
the answer. That's why I write lots of code to test things in the
actual environment in which various methods will be run.

I think that everybody's gut feeling is that Bus Mastering DMA directly
to/from the user application will be faster. However, it may not be.

Here is a simple naive example. Let's say I need to write the contents
of this shared buffer to a file. While the write is in progress, I
would have to "tell" the Bus Master not to write to the buffer until
it's been flushed. So I tell the Bus Master to use another buffer so
that these operations can overlap.

Normally, I would expect that the operations would simply proceed.
Unfortunately, the machine can't use the same bus for two or more
different operations at the same time. They end up being interleaved.
There are even lost bus-cycles in the interleaving operation.

So you end up with very fast single operations, great for quoting in
specifications, and horrible transfer-rates when you are trying to do
operations in parallel.

An advantage of the PCI bus is that it's separate! Parallel operations
can occur. Nothing's perfect though.
 

Cheers,
Dick Johnson

Penguin : Linux version 2.3.36 on an i686 machine (400.59 BogoMips).

"Memory is like gasoline. You use it up when you are running. Of
course you get it all back when you reboot..."; Actual explanation
obtained from the Micro$oft help desk.
