Re: [PATCH v2 00/28] The new cgroup slab memory controller

From: Bharata B Rao
Date: Tue Sep 01 2020 - 01:28:44 EST


On Fri, Aug 28, 2020 at 12:47:03PM -0400, Pavel Tatashin wrote:
> There appears to be another problem that is related to the
> cgroup_mutex -> mem_hotplug_lock deadlock described above.
>
> In the original deadlock that I described, the workaround is to
> replace crash dump from piping to Linux traditional save to files
> method. However, after trying this workaround, I still observed
> hardware watchdog resets during machine shutdown.
>
> The new problem occurs for the following reason: upon shutdown systemd
> calls a service that hot-removes memory, and if hot-removing fails for
> some reason systemd kills that service after timeout. However, systemd
> is never able to kill the service, and we get hardware reset caused by
> watchdog or a hang during shutdown:
>
> Thread #1: memory hot-remove systemd service
> Loops indefinitely, because if there is something still to be migrated
> this loop never terminates. However, this loop can be terminated via
> signal from systemd after timeout.
> __offline_pages()
>     do {
>         pfn = scan_movable_pages(pfn, end_pfn);
>         # Returns 0, meaning there is nothing available to
>         # migrate, no page is PageLRU(page)
>         ...
>         ret = walk_system_ram_range(start_pfn, end_pfn - start_pfn,
>                                     NULL, check_pages_isolated_cb);
>         # Returns -EBUSY, meaning there is at least one PFN that
>         # still has to be migrated.
>     } while (ret);
>
> Thread #2: css killer kthread
> css_killed_work_fn
> cgroup_mutex <- Grab this Mutex
> mem_cgroup_css_offline
> memcg_offline_kmem.part
> memcg_deactivate_kmem_caches
> get_online_mems
> mem_hotplug_lock <- waits for Thread#1 to get read access
>
> Thread #3: systemd
> ksys_read
> vfs_read
> __vfs_read
> seq_read
> proc_single_show
> proc_cgroup_show
> mutex_lock -> wait for cgroup_mutex that is owned by Thread #2
>
> Thus, thread #3 (systemd) is stuck and unable to deliver the timeout
> interrupt to thread #1.
>
> The proper fix for both of the problems is to avoid cgroup_mutex ->
> mem_hotplug_lock ordering that was recently fixed in the mainline but
> still present in all stable branches. Unfortunately, I do not see a
> simple fix in how to remove mem_hotplug_lock from
> memcg_deactivate_kmem_caches without using Roman's series that is too
> big for stable.

We too are seeing this on Power systems when stress-testing memory
hotplug, but with the following call trace (from hung task timer)
instead of Thread #2 above:

__switch_to
__schedule
schedule
percpu_rwsem_wait
__percpu_down_read
get_online_mems
memcg_create_kmem_cache
memcg_kmem_cache_create_func
process_one_work
worker_thread
kthread
ret_from_kernel_thread

While I understand that Roman's new slab controller patchset will fix
this, I also wonder whether looping indefinitely in the memory unplug
path with mem_hotplug_lock held is the right thing to do. Earlier we had
a few other exit possibilities in this path (such as a maximum retry
count), but those were removed by commits:

72b39cfc4d75: mm, memory_hotplug: do not fail offlining too early
ecde0f3e7f9e: mm, memory_hotplug: remove timeout from __offline_memory

Or is the user-space test expected to induce a signal back-off when
unplug doesn't complete within a reasonable amount of time?

Regards,
Bharata.