Re: [PATCH v4 bpf-next 5/8] bpf: use module_alloc_huge for bpf_prog_pack

From: Edgecombe, Rick P
Date: Tue May 24 2022 - 12:43:15 EST


On Tue, 2022-05-24 at 10:22 +0300, Mike Rapoport wrote:
> > @@ -949,6 +947,8 @@ static void bpf_prog_pack_free(struct
> > bpf_binary_header *hdr)
> >
> > mutex_lock(&pack_mutex);
> > if (hdr->size > bpf_prog_pack_size) {
> > + set_memory_nx((unsigned long)hdr, hdr->size /
> > PAGE_SIZE);
> > + set_memory_rw((unsigned long)hdr, hdr->size /
> > PAGE_SIZE);
>
> set_memory_{nx,rw} can fail. Please take care of error handling.

I think there is nothing to do here, actually. At least on the freeing
part.

When called on a vmalloc primary address, the set_memory() calls will
try to first change the permissions on the vmalloc alias, then try to
change it on the direct map. Yea, it can fail from failing to allocate
a page from a split, but whatever memory managed to change earlier will
also succeed to change back on reset. The set_memory() functions don't
rollback on failure. So all the modules callers depend on this logic of
a second resetting call to reset anything that succeeded to change on
the first call. The split will have already happened for any memory
that succeeded to change, and the order of the changes is the same.

As for the primary alias, for 4k vmallocs, it will always succeed, and
set_memory() does this part first, so set_memory_x() will
(crucially) always succeed for kernel text mappings. For 2MB vmalloc
pages, the primary alias should succeed if the set_memory() call is 2MB
aligned.

So pre-VM_FLUSH_RESET_PERMS (and it also has pretty similar logic),
they all went something like this:
Allocate:
1. ptr = vmalloc();
2. set_memory_ro(ptr); <-Could fail halfway though

Free:
3. set_memory_rw(ptr); <-Could also fail halfway though and return an
error, but only after the split work done in
step 2.
4. vfree(ptr);

It's pretty horrible. BPF people once tried to be more proper and
change it to:
ptr = vmalloc()
if (set_memory_ro())
{
vfree(ptr);
}

It looks correct, but this caused RO pages to be freed if
set_memory_ro() half succeeded in changing permissions. If there is an
error on reset, it means the first set_memory() call failed to change
everything, but anything that succeeded is reset. So the right thing to
do is to free the pages in either case.

We could fail to allocate a pack if the initial set_memory() calls
failed, but we still have to call set_memory_rw(), etc on free and
ignore any errors.

As an aside, this is one of the reasons it's hard to improve things
incrementally here. Everything is carefully balanced like a house of
cards. The solution IMO, is to change the interface so this type of
logic is in one place instead of scattered in the callers.