Re: [PATCH 2/2] memcg: coalescing charge by percpu (Oct/9)

From: KAMEZAWA Hiroyuki
Date: Wed Oct 14 2009 - 03:06:21 EST


On Wed, 14 Oct 2009 15:42:11 +0900
Daisuke Nishimura <nishimura@xxxxxxxxxxxxxxxxx> wrote:

> It clears WORK_STRUCT_PENDING bit and initializes the list_head of the work_struct.
> So I think a following race can happen.
>
> work_pending() ->false
> INIT_WORK()
> schedule_work_on()
> queue_work_on()
> work_pending() -> false
> test_and_seet_bit()
> -> sets WORK_STRUCT_PENDING
> INIT_WORK() -> clears WORK_STRUCT_PENDING
>
>
> And actually, I've seen BUG several times in testing mmotm-2009-10-09-01-07 + these 2 patches.
> This seems not to happen on plain mmotm-2009-10-09-01-07.
>
>
> [ 2264.213803] ------------[ cut here ]------------
> [ 2264.214058] WARNING: at lib/list_debug.c:30 __list_add+0x6e/0x87()
> [ 2264.214058] Hardware name: Express5800/140Rd-4 [N8100-1065]
> [ 2264.214058] list_add corruption. prev->next should be next (ffff880029fd8e80), but was
> ffff880029fd36b8. (prev=ffff880029fd36b8).
> [ 2264.214058] Modules linked in: ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables b
> ridge stp autofs4 hidp rfcomm l2cap crc16 bluetooth lockd sunrpc ib_iser rdma_cm ib_cm iw_
> cm ib_sa ib_mad ib_core ib_addr iscsi_tcp bnx2i cnic uio ipv6 cxgb3i cxgb3 mdio libiscsi_t
> cp libiscsi scsi_transport_iscsi dm_mirror dm_multipath scsi_dh video output sbs sbshc bat
> tery ac lp sg ide_cd_mod cdrom serio_raw acpi_memhotplug button parport_pc parport rtc_cmo
> s rtc_core rtc_lib e1000 i2c_i801 i2c_core pata_acpi ata_generic pcspkr dm_region_hash dm_
> log dm_mod ata_piix libata shpchp megaraid_mbox megaraid_mm sd_mod scsi_mod ext3 jbd uhci_
> hcd ohci_hcd ehci_hcd [last unloaded: freq_table]
> [ 2264.214058] Pid: 14398, comm: shmem_test_02 Tainted: G D W 2.6.32-rc3-mmotm-2009-
> 10-09-01-07-00249-g1112566 #2
> [ 2264.214058] Call Trace:
> [ 2264.214058] [<ffffffff811e0f43>] ? __list_add+0x6e/0x87
> [ 2264.214058] [<ffffffff8104af21>] warn_slowpath_common+0x7c/0x94
> [ 2264.214058] [<ffffffff8104afb8>] warn_slowpath_fmt+0x69/0x6b
> [ 2264.214058] [<ffffffff81077354>] ? print_lock_contention_bug+0x1b/0xe0
> [ 2264.214058] [<ffffffff810b819d>] ? probe_workqueue_insertion+0x40/0x9f
> [ 2264.214058] [<ffffffff81076720>] ? trace_hardirqs_off+0xd/0xf
> [ 2264.214058] [<ffffffff81386bef>] ? _spin_unlock_irqrestore+0x3d/0x4c
> [ 2264.214058] [<ffffffff811e0f43>] __list_add+0x6e/0x87
> [ 2264.214058] [<ffffffff81065230>] insert_work+0x78/0x9a
> [ 2264.214058] [<ffffffff81065c47>] __queue_work+0x2f/0x43
> [ 2264.214058] [<ffffffff81065cdb>] queue_work_on+0x44/0x50
> [ 2264.214058] [<ffffffff81065d02>] schedule_work_on+0x1b/0x1d
> [ 2264.214058] [<ffffffff81104623>] mem_cgroup_hierarchical_reclaim+0x232/0x40e
> [ 2264.214058] [<ffffffff81077eeb>] ? trace_hardirqs_on+0xd/0xf
> [ 2264.214058] [<ffffffff81104ec9>] __mem_cgroup_try_charge+0x14c/0x25b
> [ 2264.214058] [<ffffffff811058df>] mem_cgroup_charge_common+0x55/0x82
> [ 2264.214058] [<ffffffff810d5f9d>] ? shmem_swp_entry+0x134/0x140
> [ 2264.214058] [<ffffffff81106a44>] mem_cgroup_cache_charge+0xef/0x118
> [ 2264.214058] [<ffffffff810f766e>] ? alloc_page_vma+0xe0/0xef
> [ 2264.214058] [<ffffffff810d689a>] shmem_getpage+0x6a7/0x86b
> [ 2264.214058] [<ffffffff81077eeb>] ? trace_hardirqs_on+0xd/0xf
> [ 2264.214058] [<ffffffff810ce1d4>] ? get_page_from_freelist+0x4fc/0x6ce
> [ 2264.214058] [<ffffffff81386794>] ? trace_hardirqs_on_thunk+0x3a/0x3f
> [ 2264.214058] [<ffffffff8100e385>] ? do_softirq+0x77/0x86
> [ 2264.214058] [<ffffffff8100c5bc>] ? restore_args+0x0/0x30
> [ 2264.214058] [<ffffffff810502bf>] ? current_fs_time+0x27/0x2e
> [ 2264.214058] [<ffffffff81077354>] ? print_lock_contention_bug+0x1b/0xe0
> [ 2264.214058] [<ffffffff811236c2>] ? file_update_time+0x44/0x144
> [ 2264.214058] [<ffffffff810d6b11>] shmem_fault+0x42/0x63
> [ 2264.214058] [<ffffffff810e07db>] __do_fault+0x55/0x484
> [ 2264.214058] [<ffffffff81077354>] ? print_lock_contention_bug+0x1b/0xe0
> [ 2264.214058] [<ffffffff810e28e1>] handle_mm_fault+0x1ea/0x813
> [ 2264.214058] [<ffffffff81389770>] do_page_fault+0x25a/0x2ea
> [ 2264.214058] [<ffffffff8138757f>] page_fault+0x1f/0x30
> [ 2264.214058] ---[ end trace d435092dd749cff5 ]---
> [ 2264.538186] ------------[ cut here ]------------
> [ 2264.538219] ------------[ cut here ]------------
> [ 2264.538231] kernel BUG at kernel/workqueue.c:287!
> [ 2264.538240] invalid opcode: 0000 [#2] SMP
> [ 2264.538251] last sysfs file: /sys/devices/system/cpu/cpu15/cache/index1/shared_cpu_map
> [ 2264.538259] CPU 0
> [ 2264.538270] Modules linked in: ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables b
> ridge stp autofs4 hidp rfcomm l2cap crc16 bluetooth lockd sunrpc ib_iser rdma_cm ib_cm iw_
> cm ib_sa ib_mad ib_core ib_addr iscsi_tcp bnx2i cnic uio ipv6 cxgb3i cxgb3 mdio libiscsi_t
> cp libiscsi scsi_transport_iscsi dm_mirror dm_multipath scsi_dh video output sbs sbshc bat
> tery ac lp sg ide_cd_mod cdrom serio_raw acpi_memhotplug button parport_pc parport rtc_cmo
> s rtc_core rtc_lib e1000 i2c_i801 i2c_core pata_acpi ata_generic pcspkr dm_region_hash dm_
> log dm_mod ata_piix libata shpchp megaraid_mbox megaraid_mm sd_mod scsi_mod ext3 jbd uhci_
> hcd ohci_hcd ehci_hcd [last unloaded: freq_table]
> [ 2264.538627] Pid: 51, comm: events/0 Tainted: G D W 2.6.32-rc3-mmotm-2009-10-09-01
> -07-00249-g1112566 #2 Express5800/140Rd-4 [N8100-1065]
> [ 2264.538638] RIP: 0010:[<ffffffff81065f56>] [<ffffffff81065f56>] worker_thread+0x157/0x
> 2b3
> [ 2264.538663] RSP: 0000:ffff8803a068fe10 EFLAGS: 00010207
> [ 2264.538670] RAX: 0000000000000000 RBX: ffff8803a045a478 RCX: ffff880029fd36b8
> [ 2264.538679] RDX: 0000000000001918 RSI: 00000001a068fdd0 RDI: ffffffff81386bad
> [ 2264.538684] RBP: ffff8803a068feb0 R08: 0000000000000002 R09: 0000000000000001
> [ 2264.538694] R10: ffffffff81389860 R11: ffff880029fd6518 R12: ffff880029fd36b0
> [ 2264.538701] R13: ffff880029fd8e40 R14: ffff8803a067e590 R15: ffffffff811049f3
> [ 2264.538719] FS: 0000000000000000(0000) GS:ffff880029e00000(0000) knlGS:000000000000000
> 0
> [ 2264.538734] CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
> [ 2264.538742] CR2: 00007ffcc30a0001 CR3: 0000000001001000 CR4: 00000000000006f0
> [ 2264.538755] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 2264.538766] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> [ 2264.538773] Process events/0 (pid: 51, threadinfo ffff8803a068e000, task ffff8803a067e5
> 90)
> [ 2264.538783] Stack:
> [ 2264.538788] ffffffff81065f5a ffffffff813847ec ffffffff827d6a20 0000000000000000
> [ 2264.538807] <0> ffffffff814f5a42 0000000000000002 0000000000000000 ffff8803a067e948
> [ 2264.538835] <0> 0000000000000000 ffff8803a067e590 ffffffff8106a61b ffff8803a068fe68
> [ 2264.538857] Call Trace:
> [ 2264.538868] [<ffffffff81065f5a>] ? worker_thread+0x15b/0x2b3
> [ 2264.538883] [<ffffffff813847ec>] ? thread_return+0x3e/0xee
> [ 2264.538887] [<ffffffff8106a61b>] ? autoremove_wake_function+0x0/0x3d
> [ 2264.538887] [<ffffffff81065dff>] ? worker_thread+0x0/0x2b3
> [ 2264.538887] [<ffffffff8106a510>] kthread+0x82/0x8a
> [ 2264.538887] [<ffffffff8100cc1a>] child_rip+0xa/0x20
> [ 2264.538887] [<ffffffff8100c5bc>] ? restore_args+0x0/0x30
> [ 2264.538887] [<ffffffff8106a46d>] ? kthreadd+0xd5/0xf6
> [ 2264.538887] [<ffffffff8106a48e>] ? kthread+0x0/0x8a
> [ 2264.538887] [<ffffffff8100cc10>] ? child_rip+0x0/0x20
> [ 2264.538887] Code: 00 4c 89 ef 48 8b 08 48 8b 50 08 48 89 51 08 48 89 0a 48 89 40 08 48
> 89 00 e8 34 0c 32 00 49 8b 04 24 48 83 e0 fc 4c 39 e8 74 04 <0f> 0b eb fe f0 41 80 24 24 f
> e 49 8b bd a8 00 00 00 48 8d 9d 70
> [ 2264.538887] RIP [<ffffffff81065f56>] worker_thread+0x157/0x2b3
> [ 2264.538887] RSP <ffff8803a068fe10>
> [ 2264.539304] ---[ end trace d435092dd749cff6 ]---
>
>
> How about doing INIT_WORK() once in initialization like this ?
> After this patch, it has survived the same test for more than 6 hours w/o causing the BUG,
> while it survived for only 1 hour at most before.
>
> ===
> From: Daisuke Nishimura <nishimura@xxxxxxxxxxxxxxxxx>
>
> Don't do INIT_WORK() repeatedly against the same work_struct.
> Just do it once in initialization.
>
> Signed-off-by: Daisuke Nishimura <nishimura@xxxxxxxxxxxxxxxxx>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@xxxxxxxxxxxxxx>

Thanks. you're much quicker than me.

Regards,
-Kame

> ---
> mm/memcontrol.c | 13 ++++++++-----
> 1 files changed, 8 insertions(+), 5 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index bc57916..fd8f65f 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1349,8 +1349,8 @@ static void drain_all_stock_async(void)
> /* This function is for scheduling "drain" in asynchronous way.
> * The result of "drain" is not directly handled by callers. Then,
> * if someone is calling drain, we don't have to call drain more.
> - * Anyway, work_pending() will catch if there is a race. We just do
> - * loose check here.
> + * Anyway, WORK_STRUCT_PENDING check in queue_work_on() will catch if
> + * there is a race. We just do loose check here.
> */
> if (atomic_read(&memcg_drain_count))
> return;
> @@ -1359,9 +1359,6 @@ static void drain_all_stock_async(void)
> get_online_cpus();
> for_each_online_cpu(cpu) {
> struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu);
> - if (work_pending(&stock->work))
> - continue;
> - INIT_WORK(&stock->work, drain_local_stock);
> schedule_work_on(cpu, &stock->work);
> }
> put_online_cpus();
> @@ -3327,11 +3324,17 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
>
> /* root ? */
> if (cont->parent == NULL) {
> + int cpu;
> enable_swap_cgroup();
> parent = NULL;
> root_mem_cgroup = mem;
> if (mem_cgroup_soft_limit_tree_init())
> goto free_out;
> + for_each_possible_cpu(cpu) {
> + struct memcg_stock_pcp *stock =
> + &per_cpu(memcg_stock, cpu);
> + INIT_WORK(&stock->work, drain_local_stock);
> + }
> hotcpu_notifier(memcg_stock_cpu_callback, 0);
>
> } else {
> --
> 1.5.6.1
>
>

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/