Re: [PATCH 2/2] mm: zswap: remove unnecessary tree cleanups in zswap_swapoff()

From: Chengming Zhou
Date: Wed Jan 24 2024 - 02:22:21 EST


On 2024/1/24 05:02, Yosry Ahmed wrote:
> On Tue, Jan 23, 2024 at 12:12 PM Johannes Weiner <hannes@xxxxxxxxxxx> wrote:
>>
>> On Tue, Jan 23, 2024 at 07:54:49AM -0800, Yosry Ahmed wrote:
>>> On Tue, Jan 23, 2024 at 7:38 AM Johannes Weiner <hannes@xxxxxxxxxxx> wrote:
>>>>
>>>> On Mon, Jan 22, 2024 at 12:39:16PM -0800, Yosry Ahmed wrote:
>>>>> On Mon, Jan 22, 2024 at 12:19 PM Johannes Weiner <hannes@xxxxxxxxxxx> wrote:
>>>>>>
>>>>>> On Sat, Jan 20, 2024 at 02:40:07AM +0000, Yosry Ahmed wrote:
>>>>>>> During swapoff, try_to_unuse() makes sure that zswap_invalidate() is
>>>>>>> called for all swap entries before zswap_swapoff() is called. This means
>>>>>>> that all zswap entries should already be removed from the tree. Simplify
>>>>>>> zswap_swapoff() by removing the tree cleanup loop, and leaving an
>>>>>>> assertion in its place.
>>>>>>>
>>>>>>> Signed-off-by: Yosry Ahmed <yosryahmed@xxxxxxxxxx>
>>>>>>
>>>>>> Acked-by: Johannes Weiner <hannes@xxxxxxxxxxx>
>>>>>>
>>>>>> That's a great simplification.
>>>>>>
>>>>>> Removing the tree->lock made me double take, but at this point the
>>>>>> swapfile and its cache should be fully dead and I don't see how any of
>>>>>> the zswap operations that take tree->lock could race at this point.
>>>>>
>>>>> It took me a while staring at the code to realize this loop is pointless.
>>>>>
>>>>> However, while I have your attention on the swapoff path, there's a
>>>>> slightly irrelevant problem that I think might be there, but I am not
>>>>> sure.
>>>>>
>>>>> It looks to me like swapoff can race with writeback, and there may be
>>>>> a chance of UAF for the zswap tree. For example, if zswap_swapoff()
>>>>> races with shrink_memcg_cb(), I feel like we may free the tree as it
>>>>> is being used. For example if zswap_swapoff()->kfree(tree) happen
>>>>> right before shrink_memcg_cb()->list_lru_isolate(l, item).
>>>>>
>>>>> Please tell me that I am being paranoid and that there is some
>>>>> protection against zswap writeback racing with swapoff. It feels like
>>>>> we are very careful with zswap entries refcounting, but not with the
>>>>> zswap tree itself.
>>>>
>>>> Hm, I don't see how.
>>>>
>>>> Writeback operates on entries from the LRU. By the time
>>>> zswap_swapoff() is called, try_to_unuse() -> zswap_invalidate() should
>>>> will have emptied out the LRU and tree.
>>>>
>>>> Writeback could have gotten a refcount to the entry and dropped the
>>>> tree->lock. But then it does __read_swap_cache_async(), and while
>>>> holding the page lock checks the tree under lock once more; if that
>>>> finds the entry valid, it means try_to_unuse() hasn't started on this
>>>> page yet, and would be held up by the page lock/writeback state.
>>>
>>> Consider the following race:
>>>
>>> CPU 1 CPU 2
>>> # In shrink_memcg_cb() # In swap_off
>>> list_lru_isolate()
>>> zswap_invalidate()
>>> ..
>>> zswap_swapoff() -> kfree(tree)
>>> spin_lock(&tree->lock);
>>>
>>> Isn't this a UAF or am I missing something here?
>>
>> Oof. You're right, it looks like there is a bug. Digging through the
>> history, I think this is actually quite old: the original backend
>> shrinkers would pluck something off their LRU, drop all locks, then
>> try to acquire tree->lock. There is no protection against swapoff.
>>
>> The lock that is supposed to protect this is the LRU lock. That's
>> where reclaim and invalidation should synchronize. But it's not right:
>>
>> 1. We drop the LRU lock before acquiring the tree lock. We should
>> instead trylock the tree while still holding the LRU lock to make
>> sure the tree is safe against swapoff.
>>
>> 2. zswap_invalidate() acquires the LRU lock when refcount hits 0. But
>> it must always cycle the LRU lock before freeing the tree for that
>> synchronization to work.
>>
>> Once we're holding a refcount to the entry, it's safe to drop all
>> locks for the next step because we'll then work against the swapcache
>> and entry: __read_swap_cache_async() will try to pin and lock whatever
>> swap entry is at that type+offset. This also pins the type's current
>> tree. HOWEVER, if swapoff + swapon raced, this could be a different
>> tree than what we had in @tree, so
>>
>> 3. we shouldn't pass @tree to zswap_writeback_entry(). It needs to
>> look up zswap_trees[] again after __read_swap_cache_async()
>> succeeded to validate the entry.
>>
>> Once it succeeded, we can validate the entry. The entry is valid due
>> to our refcount. The zswap_trees[type] is valid due to the cache pin.
>>
>> However, if validation failed and we have a non-zero writeback_result,
>> there is one last bug:
>>
>> 4. the original entry's tree is no longer valid for the entry put.
>>
>> The current scheme handles invalidation fine (which is good because
>> that's quite common). But it's fundamentally unsynchronized against
>> swapoff (which has probably gone undetected because that's rare).
>>
>> I can't think of an immediate solution to this, but I wanted to put my
>> analysis out for comments.
>
>
> Thanks for the great analysis, I missed the swapoff/swapon race myself :)
>
> The first solution that came to mind for me was refcounting the zswap
> tree with RCU with percpu-refcount, similar to how cgroup refs are
> handled (init in zswap_swapon() and kill in zswap_swapoff()). I think
> the percpu-refcount may be an overkill in terms of memory usage
> though. I think we can still do our own refcounting with RCU, but it
> may be more complicated.
Hello,

I also thought about this problem for some time, maybe something like below
can be changed to fix it? It's likely I missed something, just some thoughts.

IMHO, the problem is caused by the different way in which we use zswap entry
in the writeback, that should be much like zswap_load().

The zswap_load() comes in with the folio locked in swap cache, so it has
stable zswap tree to search and lock... But in writeback case, we don't,
shrink_memcg_cb() comes in with only a zswap entry with lru list lock held,
then release lru lock to get tree lock, which maybe freed already.

So we should change here, we read swpentry from entry with lru list lock held,
then release lru lock, to try to lock corresponding folio in swap cache,
if we success, the following things is much the same like zswap_load().
We can get tree lock, to recheck the invalidate race, if no race happened,
we can make sure the entry is still right and get refcount of it, then
release the tree lock.

The main differences between this writeback with zswap_load() is the handling
of lru entry and the tree lifetime. The whole zswap_load() function has the
stable reference of zswap tree, but it's not for shrink_memcg_cb() bottom half
after __swap_writepage() since we unlock the folio after that. So we can't
reference the tree after that.

This problem is easy to fix, we can zswap_invalidate_entry(tree, entry) early
in tree lock, since thereafter writeback can't fail. BTW, I think we should
also zswap_invalidate_entry() early in zswap_load() and only support the
zswap_exclusive_loads_enabled mode, but that's another topic.

The second difference is the handling of lru entry, which is easy that we
just zswap_lru_del() in tree lock.

So no any path can reference the entry from tree or lru after we release
the tree lock, so we can just zswap_free_entry() after writeback.

Thanks!

// lru list lock held
shrink_memcg_cb()
swpentry = entry->swpentry
// Don't isolate entry from lru list here, just use list_lru_putback()
spin_unlock(lru list lock)

folio = __read_swap_cache_async(swpentry)
if (!folio)
return

if (!folio_was_allocated)
folio_put(folio)
return

// folio is locked, swapcache is secured against swapoff
tree = get tree from swpentry
spin_lock(&tree->lock)

// check invalidate race? No
if (entry == zswap_rb_search())

// so we can make sure this entry is still right
// zswap_invalidate_entry() since the below writeback can't fail
zswap_entry_get(entry)
zswap_invalidate_entry(tree, entry)

// remove from lru list
zswap_lru_del()

spin_unlock(&tree->lock)

__zswap_load()

__swap_writepage() // folio unlock
folio_put(folio)

// entry is safe to free, since it's removed from tree and lru above
zswap_free_entry(entry)