Re: [PATCH] mm: Drop "PFNs busy" printk in an expected path.

From: Vlastimil Babka
Date: Mon Jan 02 2017 - 02:39:55 EST


On 12/30/2016 11:52 AM, Michal Hocko wrote:
> On Thu 29-12-16 23:22:20, Michal Nazarewicz wrote:
>> On Thu, Dec 29 2016, Eric Anholt wrote:
>>> Michal Hocko <mhocko@xxxxxxxxxx> writes:
>>>
>>>> This has been already brought up
>>>> http://lkml.kernel.org/r/20161130092239.GD18437@xxxxxxxxxxxxxx and there
>>>> was a proposed patch for that which ratelimited the output
>>>> http://lkml.kernel.org/r/20161130132848.GG18432@xxxxxxxxxxxxxx resp.
>>>> http://lkml.kernel.org/r/robbat2-20161130T195244-998539995Z@xxxxxxxxxxxxxxxxxx
>>>>
>>>> then the email thread just died out because the issue turned out to be a
>>>> configuration issue. Michal indicated that the message might be useful
>>>> so dropping it completely seems like a bad idea. I do agree that
>>>> something has to be done about that though. Can we reconsider the
>>>> ratelimit thing?

Agree about ratelimiting.

>>> I agree that the rate of the message has gone up during 4.9 -- it used
>>> to be a few per second.
>>
>> Sounds like a regression which should be fixed.
>>
>> This is why I don't think removing the message is a good idea. If you
>> suddenly see a lot of those messages, something changed for the worse.
>> If you remove this message, you will never know.
>
> I agree, that removing the message completely is not going to help to
> find out regressions. Swamping logs with zillions of messages is,
> however, not acceptable. It just causes even more problems. See the
> previous report.
>
>>> However, if this is an expected path during normal operation,
>>
>> This depends on your definition of "expected" and "normal".
>>
>> In general, I would argue that the fact those ever happen is a bug
>> somewhere in the kernel - if memory is allocated as movable, it should
>> be movable damn it!
>
> Yes, it should be movable but there is no guarantee it is movable
> immediately. Those pages might be pinned for some time. This is
> unavoidable AFAICS.

There was a VM_PINNED patchset some years ago from PeterZ where
long-term pins would use wrappers over get_page() that would e.g.
migrate the page from CMA blocks or movable zones. That's a possible
solution, but it would always be a bit of a whack-a-mole with code that
would do longer than expected pins, but not use the VM_PINNED API.

> So while this might be a regression which should be investigated there
> should be another fix to prevent from swamping the logs as well.

Yeah, the logs indicated rather static pfn's being logged, so either
really long-term pins or maybe outright wrong migratetype used by the
allocation, possibly as a regression. page_owner functionality would make
it possible to confirm the wrong migratetype and dump the allocating
stacktrace. Perhaps we can enhance the printk's here to do exactly that
automatically if page_owner is enabled, which would make it easier for
bug reporters.

If it's pinning, then it's trickier. Joonsoo added relevant tracepoints
recently, but it's easy to flood the system with tracing output,
especially when one would want backtraces of the pins.

It should also be possible to check for such problematic pages
periodically (outside of CMA attempts) via some script that would
combine kpagecount and page_owner output.
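
A rough sketch of what such a script might look like, in Python. It is
only an illustration under stated assumptions: the page_owner record
layout matched by the regex ("PFN <n> type <t> Block <n> type <t> ...",
records separated by blank lines) is assumed and differs between kernel
versions, and the helper names are made up for this example. It reports
pages that sit in CMA pageblocks but were not allocated as Movable, and
attaches the 64-bit per-PFN value read from a kpagecount-style file.

```python
import io
import re
import struct

# ASSUMPTION: layout of the "PFN ..." line in page_owner output.
# Check it against the kernel version you are actually running.
PFN_LINE = re.compile(r"PFN (\d+) type (\w+) Block (\d+) type (\w+)")


def parse_page_owner(text):
    """Yield (pfn, page_type, block_type, record) per page_owner record."""
    for record in text.split("\n\n"):
        m = PFN_LINE.search(record)
        if m:
            yield int(m.group(1)), m.group(2), m.group(4), record.strip()


def read_kpagecount(f, pfn):
    """Return the 64-bit count for pfn, or None if pfn is past the end."""
    f.seek(pfn * 8)  # one native-endian u64 per PFN
    data = f.read(8)
    if len(data) < 8:
        return None
    return struct.unpack("=Q", data)[0]


def suspicious_pages(owner_text, kpagecount, block_type="CMA",
                     allowed=("Movable", "CMA")):
    """Return pages in block_type pageblocks whose own type is not allowed.

    Each hit is (pfn, page_type, count, record), where record contains the
    allocating stack trace as printed by page_owner.
    """
    hits = []
    for pfn, page_type, blk_type, record in parse_page_owner(owner_text):
        if blk_type == block_type and page_type not in allowed:
            count = read_kpagecount(kpagecount, pfn)
            hits.append((pfn, page_type, count, record))
    return hits


if __name__ == "__main__":
    # Hand-made sample data standing in for the real files.
    sample = (
        "Page allocated via order 0, mask 0x24200ca(GFP_HIGHUSER_MOVABLE)\n"
        "PFN 3 type Movable Block 0 type CMA Flags 0x0()\n"
        " stack_a\n\n"
        "Page allocated via order 0, mask 0x24000c0(GFP_KERNEL)\n"
        "PFN 5 type Unmovable Block 0 type CMA Flags 0x0()\n"
        " stack_b\n\n"
    )
    counts = io.BytesIO(b"".join(struct.pack("=Q", i) for i in range(8)))
    for pfn, page_type, count, record in suspicious_pages(sample, counts):
        print("pfn %d type %s count %s\n%s\n" % (pfn, page_type, count, record))
```

On a real system the inputs would be /sys/kernel/debug/page_owner
(needs CONFIG_PAGE_OWNER and booting with page_owner=on) and
/proc/kpagecount opened in binary mode, both read as root. Note that
this only covers the wrong-migratetype case; for long-term pins one
would want to run it periodically and look for PFNs that keep
reappearing.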