Re: [PATCH 3/3] [RFC] tmpfs: Add FALLOC_FL_MARK_VOLATILE/UNMARK_VOLATILEhandlers

From: KOSAKI Motohiro
Date: Fri Jun 08 2012 - 02:39:52 EST


(6/6/12 7:56 PM), John Stultz wrote:
> On 06/06/2012 12:52 PM, KOSAKI Motohiro wrote:
>>> The key point is we want volatile ranges to be purged in the order they
>>> were marked volatile.
>>> If we use the page lru via shmem_writeout to trigger range purging, we
>>> wouldn't necessarily get this desired behavior.
>> Ok, so can you please explain your ideal order to reclaim. your last mail
>> described old and new volatiled region. but I'm not sure regular tmpfs pages
>> vs volatile pages vs regular file cache order. That said, when using shrink_slab(),
>> we choose random order to drop against page cache. I'm not sure why you sure
>> it is ideal.
>
> So I'm not totally sure its ideal, but I can tell you what make sense to
> me. If there is a more ideal order, I'm open to suggestions.
>
> So volatile ranges should be purged first-in-first-out. So the first
> range marked volatile should be purged first. Since volatile ranges
> might have different costs depending on what filesystem the file is
> backed by, this LRU order is per-filesystem.
>
> It seems that if we have tmpfs volatile ranges, we should purge them
> before we swap out any regular tmpfs pages. Thus why I'm purging any
> available ranges on shmem_writepage before swapping, rather then using a
> shrinker now (I'm hoping you saw the updated patchset I sent out friday).
>
> Does that make sense?
>
>> And, now I guess you think nobody touch volatiled page, yes? because otherwise
>> volatile marking order is silly choice. If yes, what's happen if anyone touch
>> a patch which volatiled. no-op? SIGBUS?
>
> So more of a noop. If you read a page that has been marked volatile, it
> may return the data that was there, or it may return an empty nulled page.
>
> I guess we could throw a signal to help avoid developers making
> programming mistakes, but I'm not sure what the extra cost would be to
> set up and tare that down each time. One important aspect of this is
> that in order to make it attractive for an application to mark ranges as
> volatile, it has to be very cheap to mark and unmark ranges.

ok, i agree we don't need to pay any extra cost.

>> Which worklord didn't work. Usually, anon pages reclaim are only
>> happen when 1) tmpfs streaming io workload or 2) heavy vm pressure.
>> So, this scenario are not so inaccurate to me.
>
> So it was more of a theoretical issue in my discussions, but once it was
> brought up, ashmems' global range lru made more sense.

No. Every global lru is evil. Please don't introduce numa unaware code for
a new feature. That's a legacy and poor performance.


> I think the workload we're mostly concerned with here is heavy vm pressure.

I don't admit it. but note, when under heavy workload, shrink_slab() behave
stupid seriously.



>>> That's when I added the LRU tracking at the volatile range level (which
>>> reverted back to the behavior ashmem has always used), and have been
>>> using that model sense.
>>>
>>> Hopefully this clarifies things. My apologies if I don't always use the
>>> correct terminology, as I'm still a newbie when it comes to VM code.
>> I think your code is enough clean. But I'm still not sure your background
>> design. Please help me to understand clearly.
> Hopefully the above helps. But let me know where you'd like more
> clarification.
>
>
>> btw, Why do you choice fallocate instead of fadvise? As far as I skimmed,
>> fallocate() is an operation of a disk layout, not of a cache. And, why
>> did you choice fadvise() instead of madvise() at initial version. vma
>> hint might be useful than fadvise() because it can be used for anonymous
>> pages too.
> I actually started with madvise, but quickly moved to fadvise when
> feeling that the fd based ranges made more sense. With ashmem, fds are
> often shared, and coordinating volatile ranges on a shared fd made more
> sense on a (fd, offset, len) tuple, rather then on an offset and length
> on an mmapped region.
>
> I moved to fallocate at Dave Chinner's request. In short, it allows
> non-tmpfs filesystems to implement volatile range semantics allowing
> them to zap rather then writeout dirty volatile pages. And since the
> volatile ranges are very similar to a delayed/cancel-able hole-punch, it
> made sense to use a similar interface to FALLOC_FL_HOLE_PUNCH.
>
> You can read the details of DaveC's suggestion here:
> https://lkml.org/lkml/2012/4/30/441

Hmmm...

I'm sorry. I can't imagine how to integrate FALLOCATE_VOLATILE into regular
file systems. do you have any idea?


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/