On Fri, Oct 27, 2000 at 02:23:04PM -0700, Linus Torvalds wrote:
>
> [...]
>
> That solution, btw, might be as simple as just saying:
>
> - raw IO is based on physical pages, and the COW mapping created by
> fork() may cause the changes to be visible to either child or parent
> or both, depending on usage patterns to the page in question. For
> repeatable behaviour, do not have outstanding direct IO in progress
> over a fork().
>
> Ie, just _document_ it. It's not _wrong_, it can just be surprising (but
> it is actually entirely straightforward and sane if you just look at it
> the right way).
Ok, here is an updated patch without that change, but instead with a little
piece of kiobuf documentation that documents this and other things
related to kiobufs.
Christoph
-- 
Always remember that you are unique.  Just like everyone else.

--- linux.orig/drivers/char/raw.c	Thu Oct 19 13:21:24 2000
+++ linux/drivers/char/raw.c	Sun Oct 29 20:55:43 2000
@@ -277,8 +277,11 @@
 
 	if ((*offp & sector_mask) || (size & sector_mask))
 		return -EINVAL;
-	if ((*offp >> sector_bits) > limit)
+	if ((*offp >> sector_bits) > limit) {
+		if (size)
+			return -ENXIO;
 		return 0;
+	}
 
 	/*
 	 * We'll just use one kiobuf
--- linux.orig/fs/buffer.c	Fri Oct 27 12:28:40 2000
+++ linux/fs/buffer.c	Sun Oct 29 20:55:43 2000
@@ -1924,6 +1924,8 @@
 
 	spin_unlock(&unused_list_lock);
 
+	if (!iosize)
+		return -EIO;
 	return iosize;
 }
 
--- linux.orig/mm/memory.c	Fri Oct 27 12:28:42 2000
+++ linux/mm/memory.c	Sun Oct 29 20:56:09 2000
@@ -382,9 +382,12 @@
 
 
 /*
- * Do a quick page-table lookup for a single page.
+ * Do a quick page-table lookup for a single page. We have already verified
+ * access type, and done a fault in. But, kswapd might have stolen the page
+ * in the meantime. Return an indication of whether we should retry the fault
+ * in. Writability test is superfluous but conservative.
  */
-static struct page * follow_page(unsigned long address)
+static struct page * follow_page(unsigned long address, int writeacc, int * ret)
 {
 	pgd_t *pgd;
 	pmd_t *pmd;
@@ -393,10 +396,15 @@
 	pmd = pmd_offset(pgd, address);
 	if (pmd) {
 		pte_t * pte = pte_offset(pmd, address);
-		if (pte && pte_present(*pte))
+		if (pte && pte_present(*pte)) {
+			if (writeacc && !pte_write(*pte))
+				goto retry;
 			return pte_page(*pte);
+		}
 	}
-
+
+retry:
+	*ret = 1;
 	return NULL;
 }
 
@@ -428,7 +436,8 @@
 	struct page * map;
 	int i;
 	int datain = (rw == READ);
-
+	int failed;
+
 	/* Make sure the iobuf is not already mapped somewhere. */
 	if (iobuf->nr_pages)
 		return -EINVAL;
@@ -467,18 +476,22 @@
 			}
 			if (((datain) && (!(vma->vm_flags & VM_WRITE))) ||
 					(!(vma->vm_flags & VM_READ))) {
-				err = -EACCES;
 				goto out_unlock;
 			}
 		}
+
+faultin:
 		if (handle_mm_fault(current->mm, vma, ptr, datain) <= 0)
 			goto out_unlock;
 		spin_lock(&mm->page_table_lock);
-		map = follow_page(ptr);
-		if (!map) {
+		map = follow_page(ptr, datain, &failed);
+		if (failed) {
+			/*
+			 * Page got stolen before we could lock it down.
+			 * Retry.
+			 */
 			spin_unlock(&mm->page_table_lock);
-			dprintk (KERN_ERR "Missing page in map_user_kiobuf\n");
-			goto out_unlock;
+			goto faultin;
 		}
 		map = get_page_map(map);
 		if (map)
diff -uNr linux.orig/Documentation/kiobuf.txt linux/Documentation/kiobuf.txt
--- linux.orig/Documentation/kiobuf.txt	Thu Jan  1 01:00:00 1970
+++ linux/Documentation/kiobuf.txt	Sun Oct 29 21:38:20 2000
@@ -0,0 +1,100 @@
+		Abstract Kernel IO Buffers
+		       Under Linux
+
+	     Christoph Hellwig <hch@caldera.de>
+
+
+This document describes the kiobuf concept used in the Linux Kernel
+IO/memory subsystem. It describes its uses and the functions working
+with kernel IO buffers, and shows some examples of kiobuf usage.
+
+
+The main reason for implementing kernel IO buffers (by Stephen Tweedie)
+was the lack of raw devices support in Linux kernels <= 2.2. Raw devices
+are the character devices that AT&T-derived UNIX versions implement to
+allow character based uncached access to mass storage devices. In
+Linux kernels <= 2.2 all blockdevice IO goes either through the buffer-
+or pagecache, so that applications like databases cannot get full
+control over their data.
+
+The solution in Linux 2.3 and higher is that the new raw devices driver
+locks down the virtual memory it gets passed by the ->read and ->write
+methods and does physical page io on them, bypassing the caches.
+NOTE: the physical memory referenced by kiobufs does - unlike nearly
+everything else in the Linux memory management - not have reasonable COW
+semantics. So don't even try to fork when doing rawio or using
+user-space memory in kiobufs in another way.
+
+
+To use iobufs in this way you need to allocate one or more kiobufs (an
+array of kiobufs is called kiovec - do not confuse those with BSD iovecs).
+
+	err = alloc_kiovec (count, iovec);
+
+This allocates the memory for the wanted number of kiobufs (and adds them
+to a cache) and initializes some variables - in an OO-language this would be
+the constructor. Then you force the virtual memory to be faulted in and locked
+in physical memory and reference it by the kiobuf. (NOTE: this must be done
+for each iobuf, not for the whole iovec).
+
+	err = map_user_kiobuf (rw, iobuf, address, len);
+
+After that you request IO against the wanted device. For the case of
+raw devices where IO should be requested against a blockdevice, there
+is a function in fs/buffer.c that does exactly this. (the parameter
+'blocks' is an array of the block numbers the IO should be requested
+against)
+
+	err = brw_kiovec (rw, count, iovec, dev, blocks, sector_size);
+
+After the IO for this iobuf is done, unmap the virtual memory.
+
+	unmap_kiobuf (iobuf);
+
+And when we are finished with the iovec, free it.
+
+	free_kiovec (count, iovec);
+
+
+Locking down user memory and doing mass storage device IO with it is not
+the only purpose of kiobufs. Another use for kiobufs is allowing
+user-space mmapping of dma memory, e.g. in sound drivers. To do so you
+need to lock down kernel virtual memory and reference it using kiobufs.
+The code that does exactly this is not yet in the kernel - get Stephen
+Tweedie's kiobuf patchset if you want to use this.
+
+
+In the long term it looks like all blockdev IO will be done using
+kiobufs. In the SGI XFS tree there is code that allows passing kiovecs
+to the individual low-level block drivers. There are lots of advantages
+of doing it this way: the page cache doesn't need to fit the outstanding
+io into lots of bufferheads, passing each bufferhead to ll_rw_block()
+where the elevator merges some of them together for better device usage
+and submits them to the drivers. Instead the cache locks down the pages
+and submits the kiovec to the low-level driver. The low-level driver knows
+better how the request should be split for dmaing or whatever. On the
+other hand software RAID or LVM get more complicated: instead of just
+doing block-remapping they must split the kiobufs and - in case of LVM -
+find ways to do efficient IO on contiguous areas.
+
+
+
+References:
+
+	Linux Kernel Sourcecode
+	(fs/buffer.c, fs/iobuf.c, mm/memory.c, drivers/char/raw.c)
+
+	SGI XFS for Linux
+	(http://oss.sgi.com/projects/linux-xfs/)
+
+	Stephen Tweedie's kiobuf patchset
+	(ftp://ftp.linux.org.uk/pub/linux/sct/fs/raw-io/)
+
+	Linux MM mailinglist
+	(http://humbolt.geo.uu.nl/Linux-MM/linux-mm.html)
+
+
+Thanks to Arjan van de Ven, Daniel Phillips and Marcelo Tosatti for
+proofreading this document and giving useful hints.
+
+$Id: kiobuf.txt,v 1.2 2000/10/29 20:37:54 hch Exp hch $