Re: [PATCH] module, fix percpu reserved memory exhaustion

From: Prarit Bhargava
Date: Fri Jan 11 2013 - 09:16:30 EST




On 01/10/2013 10:48 PM, Rusty Russell wrote:
> Prarit Bhargava <prarit@xxxxxxxxxx> writes:
>> [ 15.478160] kvm: Could not allocate 304 bytes percpu data
>> [ 15.478174] PERCPU: allocation failed, size=304 align=32, alloc
>> from reserved chunk failed
> ...
>> What is happening is systemd is loading an instance of the kvm module for
>> each cpu found (see commit e9bda3b). When the module load occurs the kernel
>> currently allocates the modules percpu data area prior to checking to see
>> if the module is already loaded or is in the process of being loaded. If
>> the module is already loaded, or finishes load, the module loading code
>> releases the current instance's module's percpu data.
>
> Wow, what a cool bug! Classic unforseen side-effect.
>
> I'd prefer not to do relocations with the module_lock held: it can be
> relatively slow. Yet we can't do relocations before the per-cpu
> allocation, obviously. Did you do boot timings before and after?

Heh ... I did! :) I had a lot of concerns about moving the mutex around so I
put in a print at the end of boot to see how long the boot time actually was.

From stock kernel:

[ 22.893015] PRARIT: FINAL BOOT MESSAGE

From stock kernel + my patch:

[ 22.673214] PRARIT: FINAL BOOT MESSAGE

Both kernel boots showed the problem with kvm loading. A quick grep through my
boot logs of stock kernel + my patch doesn't show anything greater than 23.539392
or less than 20.980321. Those numbers are similar to the numbers from the
stock kernel (23.569450 - 20.898321).

i.e., I don't think there's an increase due to calling the relocation under the
module mutex, and if there is, it is definitely lost within the noise of boot.

The timings were similar. I didn't see any huge delays, etc. Can the
relocations really cause a long delay? I thought we were pretty much writing
values to memory...

[I should point out that I'm booting a 32 physical/64 logical CPU system with 64GB of memory]
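
The min/max comparison of final-boot timestamps described above is easy to
script. The following is only a rough sketch: the function name, the regular
expression, and the sample log (assembled from timestamps quoted in this mail)
are assumptions for illustration, and a real boot log would contain many other
lines.

```python
import re

# Matches dmesg-style lines such as "[ 22.893015] PRARIT: FINAL BOOT MESSAGE".
# The marker text is the one used in this mail; adjust it for other markers.
STAMP = re.compile(r"\[\s*(\d+\.\d+)\]\s+PRARIT: FINAL BOOT MESSAGE")


def boot_time_range(log_text):
    """Return (min, max) of the final-boot-message timestamps in log_text."""
    stamps = [float(m.group(1)) for m in STAMP.finditer(log_text)]
    if not stamps:
        raise ValueError("no final boot message found")
    return min(stamps), max(stamps)


# Hypothetical sample: one line per boot, using stock-kernel numbers from above.
stock_log = """\
[ 22.893015] PRARIT: FINAL BOOT MESSAGE
[ 23.569450] PRARIT: FINAL BOOT MESSAGE
[ 20.898321] PRARIT: FINAL BOOT MESSAGE
"""

low, high = boot_time_range(stock_log)
print(f"stock kernel: {low:.6f} - {high:.6f}")
```

Running the same function over the logs from both kernels gives the two ranges
to compare; overlapping ranges suggest any difference is within boot noise.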

>
> An alternative would be to put the module into the list even earlier
> (say, just after layout_and_allocate) so we could block on concurrent
> loads at that point. But then we have to make sure noone looks in the
> module too early before it's completely set up, and that's complicated
> and error-prone too. A separate list is kind of icky.

Yeah -- that was my first attempt actually, and it got very complex very
quickly. I abandoned that approach in favor of moving the percpu allocations
under the lock. I thought that was likely the easiest approach.
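
To illustrate why the ordering matters, here is a toy user-space model, not
kernel code. It assumes the worst case where every loader reaches the
allocation step before any of them finishes, treats each kvm allocation as 320
bytes (304 rounded up to the 32-byte alignment, a simplification), and uses
hypothetical function names.

```python
RESERVE = 8192  # bytes; the 8k module reserve mentioned in this thread


def allocate_then_check(loaders, size):
    """Old order: each loader allocates per-cpu space before the duplicate check."""
    used = failed = 0
    for _ in range(loaders):  # worst case: all loaders allocate before any bails out
        if used + size > RESERVE:
            failed += 1       # corresponds to "Could not allocate ... percpu data"
        else:
            used += size
    return used, failed


def check_then_allocate(loaders, size):
    """New order: the duplicate check happens first, so only one loader allocates."""
    loading = set()
    used = failed = 0
    for _ in range(loaders):
        if "kvm" in loading:  # duplicate seen: bail out before touching the reserve
            continue
        loading.add("kvm")
        if used + size > RESERVE:
            failed += 1
        else:
            used += size
    return used, failed


print(allocate_then_check(64, 320))  # peak usage and failures, old order
print(check_then_allocate(64, 320))  # peak usage and failures, new order
```

In this model the old order fills the reserve and starts failing, while the
new order never holds more than one module's worth of per-cpu data.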

>
> We currently have PERCPU_MODULE_RESERVE set at 8k: in my 32-bit
> allmodconfig build, there are only three modules with per-cpu data,
> totalling 328 bytes. So it's not reasonable to increase that number to
> paper over this.

I've been thinking about that. The problem is that at the same time the kvm
problem occurs I'm attempting to load a debug module that I've written for
some cpu timer issues, and it allocates a large amount of percpu data
(~.5K/cpu). While extending PERCPU_MODULE_RESERVE to 10k might work now, it
might not work tomorrow if I need to increase the size of my log buffer.

... that is ;), I prefer your approach and mine of actually fixing this problem.
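
A back-of-the-envelope calculation shows why bumping the reserve does not
scale. This sketch uses the 8k reserve, the 304-byte/32-byte-aligned kvm
allocation, and the 64 logical CPUs mentioned in this thread. It assumes every
allocation occupies its size rounded up to the alignment and that all loads
overlap; the real per-cpu allocator is more involved, so treat the numbers as
rough.

```python
PERCPU_MODULE_RESERVE = 8 * 1024  # 8k, the value quoted in this thread
KVM_PERCPU_SIZE = 304             # from the allocation failure message
KVM_PERCPU_ALIGN = 32             # from the allocation failure message
LOGICAL_CPUS = 64                 # one kvm load per logical CPU on the test box


def align_up(size, align):
    """Round size up to the next multiple of align."""
    return (size + align - 1) // align * align


def loads_that_fit(reserve, size, align):
    """How many same-sized allocations fit in the reserve (assumed: no other users)."""
    return reserve // align_up(size, align)


slot = align_up(KVM_PERCPU_SIZE, KVM_PERCPU_ALIGN)
fit = loads_that_fit(PERCPU_MODULE_RESERVE, KVM_PERCPU_SIZE, KVM_PERCPU_ALIGN)
demand = LOGICAL_CPUS * slot

print(f"bytes per load: {slot}")
print(f"concurrent loads that fit: {fit}")
print(f"worst-case demand: {demand} bytes vs reserve {PERCPU_MODULE_RESERVE}")
```

Under these assumptions the worst-case demand is well over twice the reserve,
so a modest increase would only move the failure point.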

>
> This is what a new boot state looks like (pains not to break ksplice).
> It's two patches, but I'll just post them back to back:
>
> module: add new state MODULE_STATE_UNFORMED
>
> You should never look at such a module, so it's excised from all paths
> which traverse the modules list.
>
> We add the state at the end, to avoid gratuitous ABI break (ksplice).
>
> Signed-off-by: Rusty Russell <rusty@xxxxxxxxxxxxxxx>
>
<snip patch>

Sure, but I'm always nervous about expanding any state machine ;). That's just
me though :).
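
For readers following the state discussion, this is a small illustrative model
of the idea in the patch description: the new state is appended after the
existing ones, and list traversals skip modules that are still unformed. It is
a Python sketch, not the kernel implementation; the `find_module` helper and
the "ext4" entry are hypothetical.

```python
from enum import Enum


class ModuleState(Enum):
    LIVE = 0
    COMING = 1
    GOING = 2
    UNFORMED = 3  # appended last so the existing states keep their values


def find_module(modules, name):
    """Look up a module by name, ignoring entries that are still UNFORMED."""
    for mod_name, state in modules:
        if state is ModuleState.UNFORMED:
            continue  # never expose a module that is not fully set up
        if mod_name == name:
            return mod_name, state
    return None


# Hypothetical module list: kvm is mid-load, ext4 is fully loaded.
modules = [("kvm", ModuleState.UNFORMED), ("ext4", ModuleState.LIVE)]

print(find_module(modules, "kvm"))
print(find_module(modules, "ext4"))
```

Every traversal path needs the same skip, which is the part that makes adding
a state feel risky.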

>
> module: put modules in list much earlier.
>
> Prarit's excellent bug report:
>> In recent Fedora releases (F17 & F18) some users have reported seeing
>> messages similar to
>>
>> [ 15.478160] kvm: Could not allocate 304 bytes percpu data
>> [ 15.478174] PERCPU: allocation failed, size=304 align=32, alloc from
>> reserved chunk failed
>>
>> during system boot. In some cases, users have also reported seeing this
>> message along with a failed load of other modules.
>>
>> What is happening is systemd is loading an instance of the kvm module for
>> each cpu found (see commit e9bda3b). When the module load occurs the kernel
>> currently allocates the modules percpu data area prior to checking to see
>> if the module is already loaded or is in the process of being loaded. If
>> the module is already loaded, or finishes load, the module loading code
>> releases the current instance's module's percpu data.
>
> Now we have a new state MODULE_STATE_UNFORMED, we can insert the
> module into the list (and thus guarantee its uniqueness) before we
> allocate the per-cpu region.
>
> Reported-by: Prarit Bhargava <prarit@xxxxxxxxxx>
> Signed-off-by: Rusty Russell <rusty@xxxxxxxxxxxxxxx>
>

<snip patch>

Tested-by: Prarit Bhargava <prarit@xxxxxxxxxx>

Rusty, you can change that to an Acked-by if you prefer that. I know some
engineers prefer one over the other. I'll also continue doing some reboot
testing and will email back in a few days to let you know what the timing looks
like.

Thanks!

P.