Re: Re: Re: Re: Re: [PATCH net v2 0/2] Revert the 'socket_alloc' life cycle change

From: Eric Dumazet
Date: Wed May 06 2020 - 10:33:56 EST


On Wed, May 6, 2020 at 5:59 AM SeongJae Park <sjpark@xxxxxxxxxx> wrote:
>
> TL;DR: It was not the kernel's fault, but the benchmark program's.
>
> So, the problem is reproducible only with lebench[1]. I carefully read
> its code again.
>
> Before running the "poll big" sub test, where the problem occurred, lebench
> runs the "context switch" sub test. For that test, it sets its own CPU
> affinity[2] to CPU 0 and its process priority[3] to -20. However, it does not
> restore the original values after the "context switch" test finishes. As a
> result, the "select big" sub test also runs bound to CPU 0 with the lowest
> nice value (i.e. the highest priority). It can therefore interfere with the
> RCU callback thread for CPU 0, which processes the deferred deallocations of
> the sockets, and as a result it triggers the OOM.
>
> We confirmed that the problem disappears when the RCU callbacks are offloaded
> from CPU 0 with the rcu_nocbs=0 boot parameter, or simply when the affinity
> and/or priority is restored.
>
> Someone _might_ still argue that this is a kernel problem because it did not
> occur on older kernels without Al's patches. However, the program could set
> the affinity and priority only because it was granted the permission to do
> so. Therefore, it would be reasonable to blame the system administrators
> rather than the kernel.
>
> So, please ignore this patchset, and apologies for the confusion. If you still
> have any doubts or need more tests, please let me know.
>
> [1] https://github.com/LinuxPerfStudy/LEBench
> [2] https://github.com/LinuxPerfStudy/LEBench/blob/master/TEST_DIR/OS_Eval.c#L820
> [3] https://github.com/LinuxPerfStudy/LEBench/blob/master/TEST_DIR/OS_Eval.c#L822
>
>
> Thanks,
> SeongJae Park
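Restoring the affinity and priority, as mentioned in the report, amounts to saving the process's scheduling state before a test phase changes it and putting it back afterwards. Below is a minimal sketch of that save/restore pattern in C, assuming Linux with glibc. It is not LEBench's code, and the helper names (struct sched_state, save_sched_state(), restore_sched_state(), pinned_phase()) are hypothetical. pinned_phase() only imitates the kind of change described above (pinning to CPU 0 and requesting nice -20); without CAP_SYS_NICE or a suitable RLIMIT_NICE the setpriority() call fails and the sketch just reports it and carries on.

```c
#define _GNU_SOURCE          /* for cpu_set_t, CPU_* macros, sched_*affinity() */
#include <assert.h>
#include <errno.h>
#include <sched.h>
#include <stdio.h>
#include <sys/resource.h>

/* Hypothetical helper type: the scheduling state a benchmark phase may change. */
struct sched_state {
    cpu_set_t affinity;   /* CPUs the process may run on */
    int nice;             /* nice value, -20 (highest priority) .. 19 */
};

/* Save the calling process's current CPU affinity and nice value. */
static int save_sched_state(struct sched_state *st)
{
    if (sched_getaffinity(0, sizeof(st->affinity), &st->affinity) != 0)
        return -1;
    /* getpriority() can legitimately return -1, so clear errno first. */
    errno = 0;
    st->nice = getpriority(PRIO_PROCESS, 0);
    if (st->nice == -1 && errno != 0)
        return -1;
    return 0;
}

/* Restore the state previously captured by save_sched_state(). */
static int restore_sched_state(const struct sched_state *st)
{
    if (sched_setaffinity(0, sizeof(st->affinity), &st->affinity) != 0)
        return -1;
    /* Going back to a higher (or equal) nice value needs no privilege. */
    if (setpriority(PRIO_PROCESS, 0, st->nice) != 0)
        return -1;
    return 0;
}

/* Hypothetical stand-in for a "context switch"-style phase: pin to CPU 0
 * and, if permitted, raise priority to nice -20. */
static void pinned_phase(void)
{
    cpu_set_t cpu0;

    CPU_ZERO(&cpu0);
    CPU_SET(0, &cpu0);
    if (sched_setaffinity(0, sizeof(cpu0), &cpu0) != 0)
        perror("sched_setaffinity");   /* e.g. CPU 0 not allowed here */
    if (setpriority(PRIO_PROCESS, 0, -20) != 0)
        perror("setpriority");         /* EACCES when unprivileged */
    /* ... timed work would run here ... */
}
```

A benchmark would call save_sched_state() once at startup and restore_sched_state() after every sub test that touches affinity or priority, so that later sub tests do not inherit an earlier test's CPU pinning or nice value.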

No harm done, thanks for running more tests and root-causing the issue !