Re: [PATCH 2/2] mm/zswap: fix race between lru writeback and swapoff
From: Chengming Zhou
Date:  Sat Jan 27 2024 - 10:13:33 EST
On 2024/1/27 03:50, Nhat Pham wrote:
> On Fri, Jan 26, 2024 at 12:32 AM <chengming.zhou@xxxxxxxxx> wrote:
>>
>> From: Chengming Zhou <zhouchengming@xxxxxxxxxxxxx>
>>
>> LRU writeback has race problem with swapoff, as spotted by Yosry[1]:
>>
>> CPU1                    CPU2
>> shrink_memcg_cb         swap_off
>>   list_lru_isolate        zswap_invalidate
>>                           zswap_swapoff
>>                             kfree(tree)
>>   // UAF
>>   spin_lock(&tree->lock)
>>
>> The problem is that the entry in lru list can't protect the tree from
>> being swapoff and freed, and the entry also can be invalidated and freed
>> concurrently after we unlock the lru lock.
>>
>> We can fix it by moving the swap cache allocation ahead before
>> referencing the tree, then check invalidate race with tree lock,
>> only after that we can safely deref the entry. Note we couldn't
>> deref entry or tree anymore after we unlock the folio, since we
>> depend on this to hold on swapoff.
>>
>> So this patch moves all tree and entry usage to zswap_writeback_entry(),
>> we only use the copied swpentry on the stack to allocate swap cache
>> and return with folio locked, after which we can reference the tree.
>> Then check invalidate race with tree lock, the following things is
>> much the same like zswap_load().
>>
>> Since we can't deref the entry after zswap_writeback_entry(), we
>> can't use zswap_lru_putback() anymore, instead we rotate the entry
> 
> I added list_lru_putback to the list_lru API specifically for this use
> case (zswap_lru_putback()). Now that we no longer need it, maybe we
> can also remove this as well (assuming no-one else is using this?).
> 
> This can be done in a separate patch though.
Right, I can append a patch to remove it since no other users.
> 
>> in the LRU list so concurrent reclaimers have little chance to see it.
>> Or it will be deleted from LRU list if writeback success.
>>
>> Another confusing part to me is the update of memcg nr_zswap_protected
>> in zswap_lru_putback(). I'm not sure why it's needed here since
>> if we raced with swapin, memcg nr_zswap_protected has already been
>> updated in zswap_folio_swapin(). So not include this part for now.
>>
>> [1] https://lore.kernel.org/all/CAJD7tkasHsRnT_75-TXsEe58V9_OW6m3g6CF7Kmsvz8CKRG_EA@xxxxxxxxxxxxxx/
>>
>> Signed-off-by: Chengming Zhou <zhouchengming@xxxxxxxxxxxxx>
> 
> LGTM! This is quite elegant.
> Acked-by: Nhat Pham <nphamcs@xxxxxxxxx>
> 
>> ---
>>  mm/zswap.c | 93 ++++++++++++++++++------------------------------------
>>  1 file changed, 31 insertions(+), 62 deletions(-)
>>
>> diff --git a/mm/zswap.c b/mm/zswap.c
>> index 81cb3790e0dd..fa2bdb7ec1d8 100644
>> --- a/mm/zswap.c
>> +++ b/mm/zswap.c
>> @@ -277,7 +277,7 @@ static inline struct zswap_tree *swap_zswap_tree(swp_entry_t swp)
>>                  zpool_get_type((p)->zpools[0]))
>>
>>  static int zswap_writeback_entry(struct zswap_entry *entry,
>> -                                struct zswap_tree *tree);
>> +                                swp_entry_t swpentry);
>>  static int zswap_pool_get(struct zswap_pool *pool);
>>  static void zswap_pool_put(struct zswap_pool *pool);
>>
>> @@ -445,27 +445,6 @@ static void zswap_lru_del(struct list_lru *list_lru, struct zswap_entry *entry)
>>         rcu_read_unlock();
>>  }
>>
>> -static void zswap_lru_putback(struct list_lru *list_lru,
>> -               struct zswap_entry *entry)
>> -{
>> -       int nid = entry_to_nid(entry);
>> -       spinlock_t *lock = &list_lru->node[nid].lock;
>> -       struct mem_cgroup *memcg;
>> -       struct lruvec *lruvec;
>> -
>> -       rcu_read_lock();
>> -       memcg = mem_cgroup_from_entry(entry);
>> -       spin_lock(lock);
>> -       /* we cannot use list_lru_add here, because it increments node's lru count */
>> -       list_lru_putback(list_lru, &entry->lru, nid, memcg);
>> -       spin_unlock(lock);
>> -
>> -       lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(entry_to_nid(entry)));
>> -       /* increment the protection area to account for the LRU rotation. */
>> -       atomic_long_inc(&lruvec->zswap_lruvec_state.nr_zswap_protected);
>> -       rcu_read_unlock();
>> -}
>> -
>>  /*********************************
>>  * rbtree functions
>>  **********************************/
>> @@ -860,40 +839,34 @@ static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_o
>>  {
>>         struct zswap_entry *entry = container_of(item, struct zswap_entry, lru);
>>         bool *encountered_page_in_swapcache = (bool *)arg;
>> -       struct zswap_tree *tree;
>> -       pgoff_t swpoffset;
>> +       swp_entry_t swpentry;
>>         enum lru_status ret = LRU_REMOVED_RETRY;
>>         int writeback_result;
>>
>> +       /*
>> +        * First rotate to the tail of lru list before unlocking lru lock,
>> +        * so the concurrent reclaimers have little chance to see it.
>> +        * It will be deleted from the lru list if writeback success.
>> +        */
>> +       list_move_tail(item, &l->list);
>> +
>>         /*
>>          * Once the lru lock is dropped, the entry might get freed. The
>> -        * swpoffset is copied to the stack, and entry isn't deref'd again
>> +        * swpentry is copied to the stack, and entry isn't deref'd again
>>          * until the entry is verified to still be alive in the tree.
>>          */
>> -       swpoffset = swp_offset(entry->swpentry);
>> -       tree = swap_zswap_tree(entry->swpentry);
>> -       list_lru_isolate(l, item);
> 
> nit: IIUC, now that we're no longer removing the entry from the
> list_lru, we protect against concurrent shrinking action via this
> check inside zswap_writeback_entry() too right:
> 
> if (!folio_was_allocated) {
>    folio_put(folio);
>    return -EEXIST;
> }
> 
> Maybe update the comment above it too?
	 * Found an existing folio, we raced with load/swapin. We generally
	 * writeback cold folios from zswap, and swapin means the folio just
	 * became hot. Skip this folio and let the caller find another one.
So now found an existing folio not only means load/swapin, and also concurrent
shrinking action. Yes, this comment needs to be changed a little.
Thanks.