Re: [PATCH -V5 RESEND 03/21] swap: Support PMD swap mapping in swap_duplicate()

From: Daniel Jordan
Date: Thu Sep 27 2018 - 17:13:01 EST


On Thu, Sep 27, 2018 at 09:34:36AM +0800, Huang, Ying wrote:
> Daniel Jordan <daniel.m.jordan@xxxxxxxxxx> writes:
> > On Wed, Sep 26, 2018 at 08:55:59PM +0800, Huang, Ying wrote:
> >> Daniel Jordan <daniel.m.jordan@xxxxxxxxxx> writes:
> >> > On Tue, Sep 25, 2018 at 03:13:30PM +0800, Huang Ying wrote:
> >> >> /*
> >> >> * Increase reference count of swap entry by 1.
> >> >> - * Returns 0 for success, or -ENOMEM if a swap_count_continuation is required
> >> >> - * but could not be atomically allocated. Returns 0, just as if it succeeded,
> >> >> - * if __swap_duplicate() fails for another reason (-EINVAL or -ENOENT), which
> >> >> - * might occur if a page table entry has got corrupted.
> >> >> + *
> >> >> + * Return error code in following case.
> >> >> + * - success -> 0
> >> >> + * - swap_count_continuation is required but could not be atomically allocated.
> >> >> + * *entry is used to return swap entry to call add_swap_count_continuation().
> >> >> + * -> ENOMEM
> >> >> + * - otherwise same as __swap_duplicate()
> >> >> */
> >> >> -int swap_duplicate(swp_entry_t entry)
> >> >> +int swap_duplicate(swp_entry_t *entry, int entry_size)
> >> >> {
> >> >> int err = 0;
> >> >>
> >> >> - while (!err && __swap_duplicate(entry, 1) == -ENOMEM)
> >> >> - err = add_swap_count_continuation(entry, GFP_ATOMIC);
> >> >> + while (!err &&
> >> >> + (err = __swap_duplicate(entry, entry_size, 1)) == -ENOMEM)
> >> >> + err = add_swap_count_continuation(*entry, GFP_ATOMIC);
> >> >> return err;
> >> >
> >> > Now we're returning any error we get from __swap_duplicate, apparently to
> >> > accommodate ENOTDIR later in the series, which is a change from the behavior
> >> > introduced in 570a335b8e22 ("swap_info: swap count continuations"). This might
> >> > belong in a separate patch given its potential for side effects.
> >>
> >> I have checked all the calls of the function and found there will be no
> >> bad effect. Do you have any side effect?
> >
> > Before I was just being vaguely concerned about any unintended side effects,
> > but looking again, yes I do.
> >
> > Now when swap_duplicate returns an error in copy_one_pte, copy_one_pte returns
> > a (potentially nonzero) entry.val, which copy_pte_range interprets
> > unconditionally as 'try adding a swap count continuation.' Not what we want
> > for returns other than -ENOMEM.
>
> Thanks for pointing this out! Before the change in the patchset, the
> behavior is,
>
> Something wrong is detected in swap_duplicate(), but the error is
> ignored. Then copy_one_pte() will think everything is OK, so that it
> can proceed to call set_pte_at(). The system will be in inconsistent
> state and some data may be polluted!

Yes, the part about page table corruption in the comment above swap_duplicate.

> But this doesn't cause any problem in practical. Per my understanding,
> because if other part of the kernel works correctly, it's impossible for
> swap_duplicate() return any error except -ENOMEM before the change in
> this patchset.

I agree with that, but it's not what I'm trying to explain. I didn't go into
enough detail, let me try again. Hopefully I'm understanding this right.

While running with these patches, say we're at

copy_pte_range
copy_one_pte
swap_duplicate
__swap_duplicate
__swap_duplicate_locked

And say __swap_duplicate_locked returns an error that isn't -ENOMEM, such as
-EEXIST. That means __swap_duplicate and swap_duplicate also return -EEXIST.
copy_one_pte returns entry.val, which can be and usually is nonzero, so we
break out of the loop in copy_pte_range and then--erroneously--call
add_swap_count_continuation.

The add_swap_count_continuation call was added in 570a335b8e22 and relies on
the assumption that callers can only get -ENOMEM from swap_duplicate. This
patch changes that assumption.

Not a big deal: the continuation call just returns early, no harm done, but it
allocs and frees a page needlessly, so we should fix it. One way is to change
copy_one_pte's return to int so we can just pass the error code back to
copy_pte_range so it knows whether to try adding the continuation.

The other swap_duplicate caller, try_to_unmap_one, seems ok.

> But the error may be possible during development, and it
> may serve as some kind of document too. So I suggest to add
>
> VM_BUG_ON(error != -ENOMEM);
>
> in swap_duplicate(). What do you think about that?

That doesn't seem necessary.

> > So it might make sense to have a separate patch that changes swap_duplicate's
> > return and makes callers handle it.
>
> Thanks for your help to take a deep look at this. I want to try to fix
> all potential problems firstly, because the number of the caller is
> quite limited. Do you agree?

Yes, makes sense to me.

Daniel