Re: [PATCH -mm -v7 9/9] mm, THP, swap: Delay splitting THP during swap out

From: Huang\, Ying
Date: Fri Mar 31 2017 - 22:54:51 EST


Johannes Weiner <hannes@xxxxxxxxxxx> writes:

> On Thu, Mar 30, 2017 at 12:15:13PM +0800, Huang, Ying wrote:
>> Johannes Weiner <hannes@xxxxxxxxxxx> writes:
>> > On Tue, Mar 28, 2017 at 01:32:09PM +0800, Huang, Ying wrote:
>> >> @@ -198,6 +240,18 @@ int add_to_swap(struct page *page, struct list_head *list)
>> >> VM_BUG_ON_PAGE(!PageLocked(page), page);
>> >> VM_BUG_ON_PAGE(!PageUptodate(page), page);
>> >>
>> >> + if (unlikely(PageTransHuge(page))) {
>> >> + err = add_to_swap_trans_huge(page, list);
>> >> + switch (err) {
>> >> + case 1:
>> >> + return 1;
>> >> + case 0:
>> >> + /* fallback to split firstly if return 0 */
>> >> + break;
>> >> + default:
>> >> + return 0;
>> >> + }
>> >> + }
>> >> entry = get_swap_page();
>> >> if (!entry.val)
>> >> return 0;
>> >
>> > add_to_swap_trans_huge() is too close a copy of add_to_swap(), which
>> > makes the code error prone for future modifications to the swap slot
>> > allocation protocol.
>> >
>> > This should read:
>> >
>> > retry:
>> > entry = get_swap_page(page);
>> > if (!entry.val) {
>> > if (PageTransHuge(page)) {
>> > split_huge_page_to_list(page, list);
>> > goto retry;
>> > }
>> > return 0;
>> > }
>>
>> If the swap space is used up, that is, get_swap_page() cannot allocate
>> even 1 swap entry for a normal page. We will split THP unnecessarily
>> with the change, but in the original code, we just skip the THP. There
>> may be a performance regression here. Similar problem exists for
>> mem_cgroup_try_charge_swap() too. If the mem cgroup exceeds the swap
>> limit, the THP will be split unnecessary with the change too.
>
> If we skip the page, we're swapping out another page hotter than this
> one. Giving THP preservation priority over LRU order is an issue best
> kept for a separate patch set;

In my original patch, if we failed to allocate the swap space for a THP,
and we can allocate the swap space for a normal page, we will split the
THP. We skip the page only if we cannot allocate the swap space for a
normal page, that is, nr_swap_pages is 0. So we will not give THP
preservation priority over LRU order in the patch.

> this one is supposed to be a mechanical
> implementation of THP swapping. Let's nail down the basics first.

Yes. So I tried to keep the original behavior to deal with THP if we
cannot allocate the swap space (a swap cluster) for a whole THP.

Per my understanding, the difference between what you suggested and the
original behavior is that, when nr_swap_pages is 0, whether to split the
THP.

> Such a decision would need proof that splitting THPs on full swap
> devices is a concern for real applications. I would assume that we're
> pretty close to OOM anyway; it's much more likely that a single slot
> frees up than a full cluster, at which point we'll be splitting THPs
> anyway; etc. I have my doubts that this would be measurable.
>
> But even if so, I don't think we'd have to duplicate the main code
> flow to handle this corner case. You can extend get_swap_page() to
> return an error code that tells add_to_swap() whether to split and
> retry, or to fail and move on. So this way should be future proof.

Yes. I will try to merge add_to_swap_trans_huge() into add_to_swap() in
the next version. But if we want to keep the original behavior, we will
need an extra "nr_entries" parameter for mem_cgroup_try_charge_swap().

Best Regards,
Huang, Ying