Re: open(2) says O_DIRECT works on 512 byte boundaries?

From: KAMEZAWA Hiroyuki
Date: Mon Feb 02 2009 - 23:15:19 EST


On Mon, 2 Feb 2009 23:08:56 +0100
Andrea Arcangeli <aarcange@xxxxxxxxxx> wrote:

> Hi Greg!
>
> > Thanks for the pointers, I'll go read the thread and follow up there.
>
> If you also run into this, the final fix is attached below. Porting it to
> mainline is a bit hard because of gup-fast... Perhaps we can use mmu
> notifiers to fix gup-fast... I need to think more about it, then I'll
> post something.
>
> Please help test the patch below on pre-gup-fast kernels, thanks!
>
> From: Andrea Arcangeli <aarcange@xxxxxxxxxx>
> Subject: fork-o_direct-race
>
> Think of a thread writing constantly to the last 512 bytes of a page, while
> another thread reads from and writes to the first 512 bytes of the page. We can
> lose O_DIRECT reads (or any other get_user_pages write=1 I/O, not just
> bio/O_DIRECT) the very moment we mark any pte write-protected because a third,
> unrelated thread forks off a child.
>
> This fixes it by never write-protecting anon ptes if there can be any direct I/O
> in flight to the page, and by instantiating a read-only pte and triggering a COW
> in the child instead. The only troublesome case here is O_DIRECT reads (writes to
> memory, reads from disk). Checking the page_count under the PT lock guarantees no
> get_user_pages could be running under us, because if somebody wants to write to
> the page it has to break any COW first, and that requires taking the PT lock in
> follow_page before increasing the page count. We are guaranteed mapcount is 1 if
> fork is write-protecting the pte, so the PT lock is enough to serialize against
> get_user_pages->get_page.
>
> The COW triggered inside fork will run while the parent pte is read-only, to
> provide, as usual, the per-page atomic copy from parent to child during fork.
> However, timings will be altered by having to copy the pages that might be under
> O_DIRECT.
>
> The pagevec code calls get_page while the page is sitting in the pagevec (before
> it becomes PageLRU), which can generate false positives. To avoid slowing down
> fork all the time, even for pages that could never possibly be under O_DIRECT
> write=1, the PG_gup bitflag is added; this eliminates most of the overhead of the
> fix in fork.
>
> The patch doesn't break kABI despite introducing a new page flag.
>
> This is a fixed version of the original patch from Nick Piggin.
>
> Signed-off-by: Andrea Arcangeli <aarcange@xxxxxxxxxx>
> ---
>
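
My reading of the fork-side check being described is roughly the sketch below.
This is only my own illustration of the page_count-under-the-PT-lock test; the
helper name pte_needs_fork_copy() and the PageGup() accessor for the new PG_gup
bit are stand-ins of mine, not the actual patch:

==
/*
 * Illustrative sketch only, not the real patch.  Imagined to be called
 * from the fork copy-pte path with the parent's PT lock held, deciding
 * whether the anon pte can simply be write-protected and shared, or must
 * be copied right away because direct I/O may be in flight.
 */
static inline int pte_needs_fork_copy(struct page *page)
{
	if (!PageAnon(page))
		return 0;
	/* PG_gup never set on this page: no get_user_pages ever ran on it */
	if (!PageGup(page))
		return 0;
	/*
	 * mapcount is 1 here (we are the mapping being write-protected), so
	 * any reference beyond the mappings can only come from
	 * get_user_pages(): O_DIRECT may still write into this page.
	 */
	return page_count(page) != page_mapcount(page);
}
==

If that is the shape of the check, it relies on page_count() staying elevated
for the whole time the I/O is in flight.
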
Just curious, could you confirm the following?

(The following is from RHEL 5.3.)

It seems page_cache_release() can be called before page_cache_get().
Doesn't this break your assumption?
(I mean, the following code looks buggy in that respect.)
Do we hold an extra page_count() reference somewhere before page_cache_release()?

==
static int
submit_page_section(struct dio *dio, struct page *page,
		    unsigned offset, unsigned len, sector_t blocknr)
{
	int ret = 0;

	/*
	 * Can we just grow the current page's presence in the dio?
	 */
	if ((dio->cur_page == page) &&
	    (dio->cur_page_offset + dio->cur_page_len == offset) &&
	    (dio->cur_page_block +
	     (dio->cur_page_len >> dio->blkbits) == blocknr)) {
		dio->cur_page_len += len;

		/*
		 * If dio->boundary then we want to schedule the IO now to
		 * avoid metadata seeks.
		 */
		if (dio->boundary) {
			ret = dio_send_cur_page(dio);
			page_cache_release(dio->cur_page);
			dio->cur_page = NULL;
		}
		goto out;
	}

	/*
	 * If there's a deferred page already there then send it.
	 */
	if (dio->cur_page) {
		ret = dio_send_cur_page(dio);
		page_cache_release(dio->cur_page);
		dio->cur_page = NULL;
		if (ret)
			goto out;
	}

	page_cache_get(page);		/* It is in dio */
	dio->cur_page = page;
	dio->cur_page_offset = offset;
	dio->cur_page_len = len;
	dio->cur_page_block = blocknr;
out:
	return ret;
}
==
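
Reduced to the ordering I am worried about (my own sketch, for the case where
the same page is submitted again but is not contiguous with dio->cur_page, and
assuming nothing else is counted on to pin the page):

==
	ret = dio_send_cur_page(dio);		/* I/O may still be in flight  */
	page_cache_release(dio->cur_page);	/* "it is in dio" ref dropped  */
	dio->cur_page = NULL;
	/*
	 * <-- window: if no other reference pins the page here, a concurrent
	 *     fork() checking page_count() under the PT lock would not see
	 *     this page as being under direct I/O.
	 */
	page_cache_get(page);			/* reference re-taken only now */
	dio->cur_page = page;
==

If some other reference is guaranteed to be held across that window, the
assumption holds; that is the extra page_count() I am asking about above.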

Regards,
-Kame



