Re: [PATCH 2/5] kernel.h: Add non_block_start/end()

From: Jason Gunthorpe
Date: Wed Aug 14 2019 - 19:58:10 EST


On Wed, Aug 14, 2019 at 10:20:24PM +0200, Daniel Vetter wrote:
> In some special cases we must not block, but there's not a
> spinlock, preempt-off, irqs-off or similar critical section already
> that arms the might_sleep() debug checks. Add a non_block_start/end()
> pair to annotate these.
>
> This will be used in the oom paths of mmu-notifiers, where blocking is
> not allowed to make sure there's forward progress. Quoting Michal:
>
> "The notifier is called from quite a restricted context - oom_reaper -
> which shouldn't depend on any locks or sleepable conditionals. The code
> should be swift as well but we mostly do care about it to make a forward
> progress. Checking for sleepable context is the best thing we could come
> up with that would describe these demands at least partially."

But this describes fs_reclaim_acquire() - is there some reason we are
conflating fs_reclaim with non-sleeping?

ie is there some fundamental difference between the block stack
sleeping during reclaim while it waits for a driver to write out a
page and a GPU driver sleeping during OOM while it waits for it's HW
to fence DMA on a page?

Fundamentally we have invalidate_range_start() vs invalidate_range()
as the start() version is able to sleep. If drivers can do their work
without sleeping then they should be using invalidare_range() instead.

Thus, it doesn't seem to make any sense to ask a driver that requires a
sleeping API not to sleep.

AFAICT what is really going on here is that drivers care about only a
subset of the VA space, and we want to query the driver if it cares
about the range proposed to be OOM'd, so we can OOM ranges that are
do not have SPTEs.

ie if you look pretty much all drivers do exactly as
userptr_mn_invalidate_range_start() does, and bail once they detect
the VA range is of interest.

So, I'm working on a patch to lift the interval tree into the notifier
core and then do the VA test OOM needs without bothering the
driver. Drivers can retain the blocking API they require and OOM can
work on VA's that don't have SPTEs.

This approach also solves the critical bug in this path:
https://lore.kernel.org/linux-mm/20190807191627.GA3008@xxxxxxxx/

And solves a bunch of other bugs in the drivers.

> Peter also asked whether we want to catch spinlocks on top, but Michal
> said those are less of a problem because spinlocks can't have an
> indirect dependency upon the page allocator and hence close the loop
> with the oom reaper.

Again, this entirely sounds like fs_reclaim - isn't that exactly what
it is for?

I have had on my list a second and very related possible bug. I ran
into commit 35cfa2b0b491 ("mm/mmu_notifier: allocate mmu_notifier in
advance") which says that mapping->i_mmap_mutex is under fs_reclaim().

We do hold i_mmap_rwsem while calling invalidate_range_start():

unmap_mapping_pages
i_mmap_lock_write(mapping); // ie i_mmap_rwsem
unmap_mapping_range_tree
unmap_mapping_range_vma
zap_page_range_single
mmu_notifier_invalidate_range_start

So, if it is still true that i_mmap_rwsem is under fs_reclaim then
invalidate_range_start is *always* under fs_reclaim anyhow! (this I do
not know)

Thus we should use lockdep to force this and fix all the drivers.

.. and if we force fs_reclaim always, do we care about blockable
anymore??

Jason