Re: [PATCH v2 00/28] The new cgroup slab memory controller

From: Pavel Tatashin
Date: Tue Sep 01 2020 - 08:53:38 EST


On Tue, Sep 1, 2020 at 1:28 AM Bharata B Rao <bharata@xxxxxxxxxxxxx> wrote:
>
> On Fri, Aug 28, 2020 at 12:47:03PM -0400, Pavel Tatashin wrote:
> > There appears to be another problem that is related to the
> > cgroup_mutex -> mem_hotplug_lock deadlock described above.
> >
> > In the original deadlock that I described, the workaround is to
> > replace crash dump from piping to Linux traditional save to files
> > method. However, after trying this workaround, I still observed
> > hardware watchdog resets during machine shutdown.
> >
> > The new problem occurs for the following reason: upon shutdown systemd
> > calls a service that hot-removes memory, and if hot-removing fails for
> > some reason systemd kills that service after timeout. However, systemd
> > is never able to kill the service, and we get hardware reset caused by
> > watchdog or a hang during shutdown:
> >
> > Thread #1: memory hot-remove systemd service
> > Loops indefinitely, because if there is something still to be migrated
> > this loop never terminates. However, this loop can be terminated via
> > signal from systemd after timeout.
> > __offline_pages()
> > do {
> > pfn = scan_movable_pages(pfn, end_pfn);
> > # Returns 0, meaning there is nothing available to
> > # migrate, no page is PageLRU(page)
> > ...
> > ret = walk_system_ram_range(start_pfn, end_pfn - start_pfn,
> > NULL, check_pages_isolated_cb);
> > # Returns -EBUSY, meaning there is at least one PFN that
> > # still has to be migrated.
> > } while (ret);
> >
> > Thread #2: ccs killer kthread
> > css_killed_work_fn
> > cgroup_mutex <- Grab this Mutex
> > mem_cgroup_css_offline
> > memcg_offline_kmem.part
> > memcg_deactivate_kmem_caches
> > get_online_mems
> > mem_hotplug_lock <- waits for Thread#1 to get read access
> >
> > Thread #3: systemd
> > ksys_read
> > vfs_read
> > __vfs_read
> > seq_read
> > proc_single_show
> > proc_cgroup_show
> > mutex_lock -> wait for cgroup_mutex that is owned by Thread #2
> >
> > Thus, thread #3 systemd stuck, and unable to deliver timeout interrupt
> > to thread #1.
> >
> > The proper fix for both of the problems is to avoid cgroup_mutex ->
> > mem_hotplug_lock ordering that was recently fixed in the mainline but
> > still present in all stable branches. Unfortunately, I do not see a
> > simple fix in how to remove mem_hotplug_lock from
> > memcg_deactivate_kmem_caches without using Roman's series that is too
> > big for stable.
>
> We too are seeing this on Power systems when stress-testing memory
> hotplug, but with the following call trace (from hung task timer)
> instead of Thread #2 above:
>
> __switch_to
> __schedule
> schedule
> percpu_rwsem_wait
> __percpu_down_read
> get_online_mems
> memcg_create_kmem_cache
> memcg_kmem_cache_create_func
> process_one_work
> worker_thread
> kthread
> ret_from_kernel_thread
>
> While I understand that Roman's new slab controller patchset will fix
> this, I also wonder if infinitely looping in the memory unplug path
> with mem_hotplug_lock held is the right thing to do? Earlier we had
> a few other exit possibilities in this path (like max retries etc)
> but those were removed by commits:
>
> 72b39cfc4d75: mm, memory_hotplug: do not fail offlining too early
> ecde0f3e7f9e: mm, memory_hotplug: remove timeout from __offline_memory
>
> Or, is the user-space test is expected to induce a signal back-off when
> unplug doesn't complete within a reasonable amount of time?

Hi Bharata,

Thank you for your input, it looks like you are experiencing the same
problems that I observed.

What I found is that the reason why our machines did not complete
hot-remove within the given time is because of this bug:
https://lore.kernel.org/linux-mm/20200901124615.137200-1-pasha.tatashin@xxxxxxxxxx

Could you please try it and see if that helps for your case?

Thank you,
Pasha