Re: PATCH: Raw device IO for 2.1.131

Stephen C. Tweedie (sct@redhat.com)
Sat, 12 Dec 1998 15:26:39 GMT


Hi,

On Fri, 11 Dec 1998 19:20:37 -0800, "David S. Miller"
<davem@dm.cobaltmicro.com> said:

> It doesn't handle shared writable pages correctly at all.

> In order to handle that you'd have to flush the page table
> entries for all other processes mapping that shared writable page, and
> then rerun your page lockdown loop until things converge and the
> process stabilizes.

> That's one solution, another one (which I'd prefer) is to just
> snapshot the page into freshly allocated one and free this one at the
> end of the transfer.

Why?

There are several reasons why we might want to do this, but the current
behaviour was chosen deliberately, simply because so far I have not seen
any reason why we need to do anything more complex. There are several
potential arguments for trying to snapshot the write, but each argument
has a counter-argument. As such, your comments are exactly the kind of
debate we need to decide which is the correct solution, but I don't
think you've explained yet precisely what is wrong with the semantics of
the original code.

Reasons why shared writable pages might be a problem:

1) You aren't taking an atomic snapshot of the contents of that
memory.

I don't agree with this: there *is* no atomic snapshot of memory,
especially in a threaded application or on SMP, but not even for a
single-threaded application on UP (due to paging/swapping effects).

Not even the current write(2) implementation tries to implement
such semantics.

2) You don't want to DMA from memory still being accessed by the CPU.

I am definitely prepared to believe that there may be dragons
lurking here. However, the fact is that the kernel _already_ does
this in at least one other situation: ext2fs access to cached
filesystem bitmap blocks has a fast-path cache of the 8 most
recently used bitmap buffers for each filesystem, and access to
these bitmaps does not perform any sort of wait_on_buffer(). As a
result, ext2fs is already quite happy to modify a bitmap buffer
even if we are currently DMA-writing that buffer to disk.

3) There is no synchronisation with other applications accessing the
data.

Again, we don't make any such synchronisation guarantees for normal
reads or writes, so I don't see why we need to incur that extra
overhead for raw IO.

There _is_ one place where I do think that the current implementation
is lacking: I expect that we probably need a flush-page-to-memory for
writable pages before doing any WRITE, and a cache flush after doing a
READ operation. Dave, comments on this?
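
To make the proposed ordering concrete, here is a minimal user-space
sketch. It is not kernel code: every toy_* name is hypothetical, the
"flushes" only record events in a log, and whether these flushes are
really required is exactly the open question above.

```c
#include <assert.h>
#include <stddef.h>

/* WRITE: memory -> device.  READ: device -> memory. */
enum toy_dir { TOY_READ, TOY_WRITE };

enum toy_event { TOY_EV_FLUSH_TO_MEMORY, TOY_EV_DMA, TOY_EV_CACHE_FLUSH };

/* Event log so the ordering can be inspected afterwards. */
struct toy_log {
    enum toy_event ev[8];
    size_t n;
};

static void toy_record(struct toy_log *log, enum toy_event e)
{
    assert(log->n < sizeof(log->ev) / sizeof(log->ev[0]));
    log->ev[log->n++] = e;
}

/* Hypothetical stand-in: push dirty CPU cache lines for the page to RAM. */
static void toy_flush_page_to_memory(struct toy_log *log)
{
    toy_record(log, TOY_EV_FLUSH_TO_MEMORY);
}

/* Hypothetical stand-in: the device transfer itself. */
static void toy_dma_transfer(struct toy_log *log)
{
    toy_record(log, TOY_EV_DMA);
}

/* Hypothetical stand-in: discard stale CPU cache lines for the page. */
static void toy_cache_flush(struct toy_log *log)
{
    toy_record(log, TOY_EV_CACHE_FLUSH);
}

/* One raw IO on a single page, with the flushes placed as proposed. */
static void toy_raw_io(struct toy_log *log, enum toy_dir dir, int page_writable)
{
    /* Before a WRITE: the device must see what the CPU last stored. */
    if (dir == TOY_WRITE && page_writable)
        toy_flush_page_to_memory(log);

    toy_dma_transfer(log);

    /* After a READ: the CPU must not keep using stale cached data. */
    if (dir == TOY_READ)
        toy_cache_flush(log);
}
```

Calling toy_raw_io() with TOY_WRITE on a writable page logs the flush
followed by the DMA; with TOY_READ it logs the DMA followed by the cache
flush.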

> Another side comment I have is that the code can be significantly fast
> path'd for the common case by just walking the page tables directly
> much like copy_page_range() and friends do, I bet this will be several
> orders of magnitude faster.

Indeed. That and the moving of the page stuff into linux/mm/ are
already on the TODO list, but I want to make sure that the core
functionality and semantics are sound before splitting the code up or
fine-tuning the performance.

> I suppose the entry to the fast path can be guarded by checking that
> at least the VA range of the transfer resides inside one VMA and only
> one.

I'm not sure: all we need is a quick page table traversal to check that
you have read access with user privilege and that, if appropriate, the
pages are writable. If that is satisfied, then I'm not sure that we need
to consult the vma at all.
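
A minimal user-space sketch of that check, assuming a toy flat page
table: the TOY_PTE_* bits, struct toy_pgtable and toy_range_ok() are all
hypothetical names for illustration, and real page tables are
multi-level and architecture-specific.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical PTE permission bits for this toy model only. */
#define TOY_PTE_PRESENT 0x1u
#define TOY_PTE_USER    0x2u
#define TOY_PTE_WRITE   0x4u

#define TOY_PAGE_SHIFT  12

/* Toy flat "page table": one entry per virtual page number. */
struct toy_pgtable {
    const uint32_t *pte;
    size_t npages;
};

/*
 * Return 1 if every page covering [addr, addr + len) is present and
 * user-accessible, and also writable when need_write is set (that is,
 * when the IO will store into the pages).  Return 0 otherwise, in which
 * case the caller would fall back to the slow path.
 */
static int toy_range_ok(const struct toy_pgtable *pt, uintptr_t addr,
                        size_t len, int need_write)
{
    uint32_t required = TOY_PTE_PRESENT | TOY_PTE_USER;
    size_t first, last, vpn;

    assert(pt != NULL);

    if (len == 0)
        return 1;
    if (addr + (len - 1) < addr)        /* address range wraps around */
        return 0;
    if (need_write)
        required |= TOY_PTE_WRITE;

    first = (size_t)(addr >> TOY_PAGE_SHIFT);
    last = (size_t)((addr + (len - 1)) >> TOY_PAGE_SHIFT);
    if (last >= pt->npages)
        return 0;

    /* No vma lookup here: only the per-page permission bits matter. */
    for (vpn = first; vpn <= last; vpn++)
        if ((pt->pte[vpn] & required) != required)
            return 0;

    return 1;
}
```

Note that the loop never looks at which mapping a page belongs to, so a
range spanning several vmas passes as long as each page has the right
bits.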

--Stephen
