Re: [PATCHv4 21/39] thp, libfs: initial support of thp in simple_read/write_begin/write_end

From: Dave Hansen
Date: Tue May 21 2013 - 17:49:49 EST


On 05/11/2013 06:23 PM, Kirill A. Shutemov wrote:
> From: "Kirill A. Shutemov" <kirill.shutemov@xxxxxxxxxxxxxxx>
>
> For now we try to grab a huge cache page if gfp_mask has __GFP_COMP.
> It's probably to weak condition and need to be reworked later.
>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@xxxxxxxxxxxxxxx>
> ---
> fs/libfs.c | 50 ++++++++++++++++++++++++++++++++++++-----------
> include/linux/pagemap.h | 8 ++++++++
> 2 files changed, 47 insertions(+), 11 deletions(-)
>
> diff --git a/fs/libfs.c b/fs/libfs.c
> index 916da8c..ce807fe 100644
> --- a/fs/libfs.c
> +++ b/fs/libfs.c
> @@ -383,7 +383,7 @@ EXPORT_SYMBOL(simple_setattr);
>
> int simple_readpage(struct file *file, struct page *page)
> {
> - clear_highpage(page);
> + clear_pagecache_page(page);
> flush_dcache_page(page);
> SetPageUptodate(page);
> unlock_page(page);
> @@ -394,21 +394,44 @@ int simple_write_begin(struct file *file, struct address_space *mapping,
> loff_t pos, unsigned len, unsigned flags,
> struct page **pagep, void **fsdata)
> {
> - struct page *page;
> + struct page *page = NULL;
> pgoff_t index;

I know ramfs uses simple_write_begin(), but it's not the only one. I
think you probably want to create a new ->write_begin() function just
for ramfs rather than modifying this one.

The optimization that you just put in a few patches ago:

>> +static inline struct page *grab_cache_page_write_begin(
>> +{
>> + if (!transparent_hugepage_pagecache() && (flags & AOP_FLAG_TRANSHUGE))
>> + return NULL;
>> + return __grab_cache_page_write_begin(mapping, index, flags);


is now worthless for any user of simple_readpage().

> index = pos >> PAGE_CACHE_SHIFT;
>
> - page = grab_cache_page_write_begin(mapping, index, flags);
> + /* XXX: too weak condition? */

Why would it be too weak?

> + if (mapping_can_have_hugepages(mapping)) {
> + page = grab_cache_page_write_begin(mapping,
> + index & ~HPAGE_CACHE_INDEX_MASK,
> + flags | AOP_FLAG_TRANSHUGE);
> + /* fallback to small page */
> + if (!page) {
> + unsigned long offset;
> + offset = pos & ~PAGE_CACHE_MASK;
> + len = min_t(unsigned long,
> + len, PAGE_CACHE_SIZE - offset);
> + }

Why does this have to muck with 'len'? It doesn't appear to be undoing
anything from earlier in the function. What is it fixing up?

> + BUG_ON(page && !PageTransHuge(page));
> + }

So, those semantics for AOP_FLAG_TRANSHUGE are actually pretty strong.
They mean that you can only return a transparent pagecache page, but you
better not return a small page.

Would it have been possible for a huge page to get returned from
grab_cache_page_write_begin(), but had it split up between there and the
BUG_ON()?

Which reminds me... under what circumstances _do_ we split these huge
pages? How are those circumstances different from the anonymous ones?

> + if (!page)
> + page = grab_cache_page_write_begin(mapping, index, flags);
> if (!page)
> return -ENOMEM;
> -
> *pagep = page;
>
> - if (!PageUptodate(page) && (len != PAGE_CACHE_SIZE)) {
> - unsigned from = pos & (PAGE_CACHE_SIZE - 1);
> -
> - zero_user_segments(page, 0, from, from + len, PAGE_CACHE_SIZE);
> + if (!PageUptodate(page)) {
> + unsigned from;
> +
> + if (PageTransHuge(page) && len != HPAGE_PMD_SIZE) {
> + from = pos & ~HPAGE_PMD_MASK;
> + zero_huge_user_segment(page, 0, from);
> + zero_huge_user_segment(page,
> + from + len, HPAGE_PMD_SIZE);
> + } else if (len != PAGE_CACHE_SIZE) {
> + from = pos & ~PAGE_CACHE_MASK;
> + zero_user_segments(page, 0, from,
> + from + len, PAGE_CACHE_SIZE);
> + }
> }
> return 0;
> }
> @@ -443,9 +466,14 @@ int simple_write_end(struct file *file, struct address_space *mapping,
>
> /* zero the stale part of the page if we did a short copy */
> if (copied < len) {
> - unsigned from = pos & (PAGE_CACHE_SIZE - 1);
> -
> - zero_user(page, from + copied, len - copied);
> + unsigned from;
> + if (PageTransHuge(page)) {
> + from = pos & ~HPAGE_PMD_MASK;
> + zero_huge_user(page, from + copied, len - copied);
> + } else {
> + from = pos & ~PAGE_CACHE_MASK;
> + zero_user(page, from + copied, len - copied);
> + }
> }

When I see stuff going in to the simple_* functions, I fear that this
code will end up getting copied in to each and every one of the
filesystems that implement these on their own.

I guess this works for now, but I'm worried that the the next fs is just
going to copy-and-paste these. Guess I'll yell at them when they do it. :)

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/