Re: [PATCH 2/5] swap: fix do_swap_page() race with swapoff

From: Huang, Ying
Date: Tue Apr 13 2021 - 23:07:19 EST


Miaohe Lin <linmiaohe@xxxxxxxxxx> writes:

> On 2021/4/13 9:27, Huang, Ying wrote:
>> Miaohe Lin <linmiaohe@xxxxxxxxxx> writes:
>>
>>> When I was investigating the swap code, I found the below possible race
>>> window:
>>>
>>> CPU 1 CPU 2
>>> ----- -----
>>> do_swap_page
>>> synchronous swap_readpage
>>> alloc_page_vma
>>> swapoff
>>> release swap_file, bdev, or ...
>>> swap_readpage
>>> check sis->flags is ok
>>> access swap_file, bdev...[oops!]
>>> si->flags = 0
>>>
>>> Using current get/put_swap_device() to guard against concurrent swapoff for
>>> swap_readpage() looks terrible because swap_readpage() may take really long
>>> time. And this race may not be really pernicious because swapoff is usually
>>> done when system shutdown only. To reduce the performance overhead on the
>>> hot-path as much as possible, it appears we can use the percpu_ref to close
>>> this race window(as suggested by Huang, Ying).
>>>
>>> Fixes: 235b62176712 ("mm/swap: add cluster lock")
>>
>> This isn't the commit that introduces the race. You can use `git blame`
>> find out the correct commit. For this it's commit 0bcac06f27d7 "mm,
>> swap: skip swapcache for swapin of synchronous device".
>>
>
> Sorry about it! What I refer to is commit eb085574a752 ("mm, swap: fix race between
> swapoff and some swap operations"). And I think this commit does not fix the race
> condition completely, so I reuse the Fixes tag inside it.
>
>> And I suggest to merge 1/5 and 2/5 to make it easy to get the full
>> picture.
>>
>>> Signed-off-by: Miaohe Lin <linmiaohe@xxxxxxxxxx>
>>> ---
>>> include/linux/swap.h | 2 +-
>>> mm/memory.c | 10 ++++++++++
>>> mm/swapfile.c | 28 +++++++++++-----------------
>>> 3 files changed, 22 insertions(+), 18 deletions(-)
>>>
>>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>>> index 849ba5265c11..9066addb57fd 100644
>>> --- a/include/linux/swap.h
>>> +++ b/include/linux/swap.h
>>> @@ -513,7 +513,7 @@ sector_t swap_page_sector(struct page *page);
>>>
>>> static inline void put_swap_device(struct swap_info_struct *si)
>>> {
>>> - rcu_read_unlock();
>>> + percpu_ref_put(&si->users);
>>> }
>>>
>>> #else /* CONFIG_SWAP */
>>> diff --git a/mm/memory.c b/mm/memory.c
>>> index cc71a445c76c..8543c47b955c 100644
>>> --- a/mm/memory.c
>>> +++ b/mm/memory.c
>>> @@ -3311,6 +3311,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>> {
>>> struct vm_area_struct *vma = vmf->vma;
>>> struct page *page = NULL, *swapcache;
>>> + struct swap_info_struct *si = NULL;
>>> swp_entry_t entry;
>>> pte_t pte;
>>> int locked;
>>> @@ -3339,6 +3340,11 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>> }
>>>
>>>
>>
>> I suggest to add comments here as follows (words copy from Matthew Wilcox)
>>
>> /* Prevent swapoff from happening to us */
>
> Ok.
>
>>
>>> + si = get_swap_device(entry);
>>> + /* In case we raced with swapoff. */
>>> + if (unlikely(!si))
>>> + goto out;
>>> +
>>
>> Because we wrap the whole do_swap_page() with get/put_swap_device()
>> now. We can remove several get/put_swap_device() for function called by
>> do_swap_page(). That can be another optimization patch.
>
> I tried to remove several get/put_swap_device() for function called
> by do_swap_page() only before I send this series. But it seems they have
> other callers without proper get/put_swap_device().

Then we need to revise these callers instead. Anyway, can be another
series.

Best Regards,
Huang, Ying