Re: [RFC 00/11] THP swap: Delay splitting THP during swapping out

From: Huang\, Ying
Date: Thu Aug 18 2016 - 21:57:52 EST

Next message: Josh Poimboeuf: "[PATCH 8/8] x86: remove 64-byte gap at end of irq stack"
Previous message: Vineet Gupta: "[PATCH v2] tools lib: Reinstate strlcpy() header guard with __UCLIBC__"
In reply to: Minchan Kim: "Re: [RFC 00/11] THP swap: Delay splitting THP during swapping out"
Next in thread: Minchan Kim: "Re: [RFC 00/11] THP swap: Delay splitting THP during swapping out"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

Minchan Kim <minchan@xxxxxxxxxx> writes:

> Hi Tim,
>
> On Wed, Aug 17, 2016 at 10:24:56AM -0700, Tim Chen wrote:
>> On Wed, 2016-08-17 at 14:07 +0900, Minchan Kim wrote:
>> > On Tue, Aug 16, 2016 at 07:06:00PM -0700, Huang, Ying wrote:
>> > >
>> > >
>> > > >
>> > > > I think Tim and me discussed about that a few weeks ago.
>> > > I work closely with Tim on swap optimization. This patchset is the part
>> > > of our swap optimization plan.
>> > >
>> > > >
>> > > > Please search below topics.
>> > > >
>> > > > [1] mm: Batch page reclamation under shink_page_list
>> > > > [2] mm: Cleanup - Reorganize the shrink_page_list code into smaller functions
>> > > >
>> > > > It's different with yours which focused on THP swapping while the suggestion
>> > > > would be more general if we can do so it's worth to try it, I think.
>> > > I think the general optimization above will benefit both normal pages
>> > > and THP at least for now. And I think there are no hard conflict
>> > > between those two patchsets.
>> > If we could do general optimzation, I guess THP swap without splitting
>> > would be more straight forward.
>> >
>> > If we can reclaim batch a certain of pages all at once, it helps we can
>> > do scan_swap_map(si, SWAP_HAS_CACHE, nr_pages). The nr_pages could be
>> > greater or less than 512 pages. With that, scan_swap_map effectively
>> > search empty swap slots from scan_map or free cluser list.
>> > Then, needed part from your patchset is to just delay splitting of THP.
>> >
>> > >
>> > >
>> > > The THP swap has more opportunity to be optimized, because we can batch
>> > > 512 operations together more easily. For full THP swap support, unmap a
>> > > THP could be more efficient with only one swap count operation instead
>> > > of 512, so do many other operations, such as add/remove from swap cache
>> > > with multi-order radix tree etc. And it will help memory fragmentation.
>> > > THP can be kept after swapping out/in, need not to rebuild THP via
>> > > khugepaged.
>> > It seems you increased cluster size to 512 and search a empty cluster
>> > for a THP swap. With that approach, I have a concern that once clusters
>> > will be fragmented, THP swap support doesn't take benefit at all.
>> >
>> > Why do we need a empty cluster for swapping out 512 pages?
>> > IOW, below case could work for the goal.
>> >
>> > A : Allocated slot
>> > F : Free slot
>> >
>> > cluster A cluster B
>> > AAAAFFFF - FFFFAAAA
>> >
>> > That's one of the reason I suggested batch reclaim work first and
>> > support THP swap based on it. With that, scan_swap_map can be aware of nr_pages
>> > and selects right clusters.
>> >
>> > With the approach, justfication of THP swap support would be easier, too.
>> > IOW, I'm not sure how only THP swap support is valuable in real workload.
>> >
>> > Anyways, that's just my two cents.
>>
>> Minchan,
>>
>> Scanning for contiguous slots that span clusters may take quite a
>> long time under fragmentation, and may eventually fail. In that case the addition scan
>> time overhead may go to waste and defeat the purpose of fast swapping of large page.
>>
>> The empty cluster lookup on the other hand is very fast.
>> We treat the empty cluster available case as an opportunity for fast path
>> swap out of large page. Otherwise, we'll revert to the current
>> slow path behavior of breaking into normal pages so there's no
>> regression, and we may get speed up. We can be considerably faster when a lot of large
>> pages are used.
>
> I didn't mean we should search scan_swap_map firstly without peeking
> free cluster but what I wanted was we might abstract it into
> scan_swap_map.
>
> For example, if nr_pages is greather than the size of cluster, we can
> get empty cluster first and nr_pages - sizeof(cluster) for other free
> cluster or scanning of current CPU per-cpu cluster. If we cannot find
> used slot during scanning, we can bail out simply. Then, although we
> fail to get all* contiguous slots, we get a certain of contiguous slots
> so it would be benefit for seq write and lock batching point of view
> at the cost of a little scanning. And it's not specific to THP algorighm.

Firstly, if my understanding were correct, to batch the normal pages
swapping out, the swap slots need not to be continuous. But for the THP
swap support, we need the continuous swap slots. So I think the
requirements are quite different between them.

And with the current design of the swap space management, it is quite
hard to implement allocating nr_pages continuous free swap slots. To
reduce the contention of sis->lock, even to scan one free swap slot, the
sis->lock is unlocked during scanning. When we scan nr_pages free swap
slots, and there are no nr_pages continuous free swap slots, we need to
scan from sis->lowest_bit to sis->highest_bit, and record the largest
continuous free swap slots. But when we lock sis->lock again to check,
some swap slot inside the largest continuous free swap slots we found
may be allocated by other processes. So we may end up with a much
smaller number of swap slots or we need to startover again. So I think
the simpler solution is to

- When a whole cluster is requested (for the THP), try to allocate a
free cluster. Give up if there are no free clusters.

- When a small number of swap slots are requested (for normal swap
batching), check only sis->percpu_cluster and return next N free swap
slots in it. Because we only scan very small number of swap slots, we
can do that with sis->lock held.

BTW: The sis->lock is under heavy contention after the lock contention of
swap cache radix tree lock is reduced via batching in 8 processes
sequential swapping out test.

Best Regards,
Huang, Ying

Next message: Josh Poimboeuf: "[PATCH 8/8] x86: remove 64-byte gap at end of irq stack"
Previous message: Vineet Gupta: "[PATCH v2] tools lib: Reinstate strlcpy() header guard with __UCLIBC__"
In reply to: Minchan Kim: "Re: [RFC 00/11] THP swap: Delay splitting THP during swapping out"
Next in thread: Minchan Kim: "Re: [RFC 00/11] THP swap: Delay splitting THP during swapping out"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]