Re: [PATCH v2 2/4] mm/vmalloc: add support for __GFP_NOFAIL

From: Theodore Y. Ts'o
Date: Wed Nov 24 2021 - 19:34:53 EST


On Wed, Nov 24, 2021 at 04:23:31PM +1100, NeilBrown wrote:
>
> It would get particularly painful if some system call started returning
> -ENOMEM, which had never returned that before. I note that ext4 uses
> __GFP_NOFAIL when handling truncate. I don't think user-space would be
> happy with ENOMEM from truncate (or fallocate(PUNCH_HOLE)), though a
> recent commit which adds it focuses more on wanting to avoid the need
> for fsck.

If the inode is in use (via an open file descriptor) when it is
unlinked, we can't actually do the truncate until the inode is
evicted, and at that point, there is no user space to return to. For
that reason, the evict_inode() method is not *allowed* to fail. So
this is why we need to use GFP_NOFAIL or an open-coded retry loop.
The alternative would be to mark the file system corrupt, and then
either remount the file system read-only, panic the system and reboot, or leave
the file system corrupted ("don't worry, be happy"). I considered
GFP_NOFAIL to be the lesser of the evils. :-)
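
For illustration, an "open-coded retry loop" boils down to something
like the following.  This is a user-space sketch, not ext4 code:
alloc_fn, alloc_retry_forever() and flaky_alloc() are made-up names,
and sched_yield() merely stands in for whatever the kernel would do
between attempts.

```c
#include <sched.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical allocator hook, so a failing allocator can be swapped in. */
typedef void *(*alloc_fn)(size_t size);

/*
 * Open-coded retry loop: keep trying until the allocation succeeds.
 * Never returns NULL, so the caller needs no failure path.  The number
 * of failed attempts is reported through *retries if it is non-NULL.
 */
static void *alloc_retry_forever(alloc_fn alloc, size_t size,
				 unsigned long *retries)
{
	unsigned long n = 0;
	void *p;

	while ((p = alloc(size)) == NULL) {
		n++;
		sched_yield();	/* let something else run and maybe free memory */
	}
	if (retries)
		*retries = n;
	return p;
}

/* Demo allocator (an assumption): fails fail_budget times, then works. */
static int fail_budget = 3;

static void *flaky_alloc(size_t size)
{
	if (fail_budget > 0) {
		fail_budget--;
		return NULL;
	}
	return malloc(size);
}
```

The point is only that the caller of alloc_retry_forever() never sees
a failure; the cost is that it may spin for an unbounded time.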

If the VFS allowed evict_inode() to fail, all it could do is to put
the inode back on the list of inodes to be later evicted --- which is
to say, we would have to add a lot of complexity to effectively add a
gigantic retry loop.

Granted, we wouldn't need to be holding any locks in between retries,
so perhaps it's better than adding a retry loop deep in the guts of
the ext4 truncate codepath. But then we would need to worry about
userspace getting ENOMEM for system calls which, historically, users
have never seen fail. I suppose we could also solve this
problem by adding retry logic in the top-level VFS truncate codepath,
so instead of returning ENOMEM, we just retry the truncate(2) system
call and hope that we have enough memory to succeed this time.

After all, what can userspace do if truncate() fails with ENOMEM? It
can fail the userspace program, which in the case of a long-running
daemon such as mysqld, is basically the userspace equivalent of "panic
and reboot", or it can retry the truncate(2) system call at the userspace
level.
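
To make the userspace "retry" option concrete, here is a rough sketch
of what such a program would end up writing.  ftruncate_retry() is a
hypothetical helper, and the retry limit and the 10 ms delay are
arbitrary assumptions, not anything the kernel or libc prescribes.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

/*
 * Retry ftruncate(2) for as long as it fails with ENOMEM, sleeping
 * briefly between attempts, up to max_tries attempts in total.
 * Returns 0 on success, -1 with errno set otherwise.
 */
static int ftruncate_retry(int fd, off_t length, int max_tries)
{
	const struct timespec delay = { .tv_sec = 0, .tv_nsec = 10 * 1000 * 1000 };
	int i;

	for (i = 0; i < max_tries; i++) {
		if (ftruncate(fd, length) == 0)
			return 0;
		if (errno != ENOMEM)
			return -1;	/* some other error; retrying won't help */
		nanosleep(&delay, NULL);
	}
	errno = ENOMEM;		/* gave up: the caller is back to "panic" */
	return -1;
}
```

Note that once max_tries is exhausted, the caller is still left with
the same two choices: give up, or call it again.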

Are we detecting a pattern here? There will always be cases where the
choice is "panic" or "retry".

- Ted