[PATCH 0/5] workqueue: fix bug when numa mapping is changed

From: Lai Jiangshan
Date: Fri Dec 12 2014 - 05:16:19 EST


The workqueue code assumes that the NUMA mapping is stable after the
system has booted. That assumption is currently incorrect.

Yasuaki Ishimatsu hit an allocation failure bug when the NUMA mapping
between CPUs and nodes changed. This was the final log output:
SLUB: Unable to allocate memory on node 2 (gfp=0x80d0)
cache: kmalloc-192, object size: 192, buffer size: 192, default order: 1, min order: 0
node 0: slabs: 6172, objs: 259224, free: 245741
node 1: slabs: 3261, objs: 136962, free: 127656

Yasuaki Ishimatsu's investigation found that it happens in the following situation:

1) Node/CPU mapping before the offline/online cycle:
         | CPU
------------------------
node 0   | 0-14, 60-74
node 1   | 15-29, 75-89
node 2   | 30-44, 90-104
node 3   | 45-59, 105-119

2) A system board (containing node 2 and node 3) is taken offline:
         | CPU
------------------------
node 0   | 0-14, 60-74
node 1   | 15-29, 75-89

3) A new system board (SB) is brought online. Two new node IDs are
allocated for the two nodes of the SB, but the old CPU IDs are reused
for it, so the NUMA mapping between nodes and CPUs changes.
(For example, the node of CPU#30 changes from node#2 to node#4.)
         | CPU
------------------------
node 0   | 0-14, 60-74
node 1   | 15-29, 75-89
node 4   | 30-59
node 5   | 90-119

4) The NUMA mapping has now changed, but wq_numa_possible_cpumask,
the cached copy of the NUMA mapping kept in workqueue.c for convenience,
is stale. As a result, the pool->node calculated by get_unbound_pool()
is incorrect.

5) When create_worker() is called with the stale pool->node, which
refers to an offlined node, it fails and the pool cannot make any progress.
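
The sequence above can be reproduced in a small userspace model. This is
only a sketch: cpu_node[], cached_mask[][] and the helper names are
hypothetical stand-ins (cached_mask plays the role of
wq_numa_possible_cpumask), not the actual workqueue.c code.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define NR_CPUS  120
#define NR_NODES 8
#define NO_NODE  (-1)

/* Hypothetical stand-in for the live (arch/firmware) CPU-to-node mapping. */
static int cpu_node[NR_CPUS];

/* Hypothetical stand-in for wq_numa_possible_cpumask: a per-node CPU set
 * that is filled once and not refreshed afterwards. */
static bool cached_mask[NR_NODES][NR_CPUS];

static void set_cpu_range(int first, int last, int node)
{
    assert(first >= 0 && last < NR_CPUS && node >= 0 && node < NR_NODES);
    for (int cpu = first; cpu <= last; cpu++)
        cpu_node[cpu] = node;
}

/* Snapshot the live mapping into the cache. */
static void build_cache(void)
{
    memset(cached_mask, 0, sizeof(cached_mask));
    for (int cpu = 0; cpu < NR_CPUS; cpu++)
        if (cpu_node[cpu] != NO_NODE)
            cached_mask[cpu_node[cpu]][cpu] = true;
}

/* Look up a CPU's node through the cache, not through the live mapping. */
static int cached_node_of(int cpu)
{
    assert(cpu >= 0 && cpu < NR_CPUS);
    for (int node = 0; node < NR_NODES; node++)
        if (cached_mask[node][cpu])
            return node;
    return NO_NODE;
}

/* Step 1: the layout at boot; the cache is built here and only here. */
static void boot_layout(void)
{
    set_cpu_range(0, 14, 0);  set_cpu_range(60, 74, 0);
    set_cpu_range(15, 29, 1); set_cpu_range(75, 89, 1);
    set_cpu_range(30, 44, 2); set_cpu_range(90, 104, 2);
    set_cpu_range(45, 59, 3); set_cpu_range(105, 119, 3);
    build_cache();
}

/* Steps 2 and 3: the board is replaced and its CPU IDs land on new nodes.
 * build_cache() is deliberately not called, so the cache goes stale. */
static void replace_board(void)
{
    set_cpu_range(30, 59, 4);
    set_cpu_range(90, 119, 5);
}
```

In this model, after boot_layout() and replace_board(), cpu_node[30] is 4
while cached_node_of(30) still reports node 2, which no longer exists.
Calling build_cache() again brings the two back in agreement.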

To fix this bug, wq_numa_possible_cpumask and pool->node both need to be
fixed up. This is done in patch 2 and patch 3.
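
As a rough illustration of what fixing up pool->node means, the sketch
below recomputes a pool's node from the current mapping. It assumes a
simplified rule (a pool belongs to a node only if all of its CPUs map to
that node) and every CPU has a valid node; struct pool, fixup_pool_node()
and the cpu_node array are hypothetical, not the code in the patches.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 8
#define NO_NODE (-1)

/* Hypothetical, much-reduced pool: a CPU set plus a cached node. */
struct pool {
    bool cpus[NR_CPUS];  /* CPUs the pool may run on */
    int node;            /* cached node; may be stale after hotplug */
};

/* Return the node shared by all CPUs of the pool, or NO_NODE if the pool
 * is empty or its CPUs span more than one node. */
static int pool_node_from_mapping(const struct pool *pool, const int *cpu_node)
{
    int node = NO_NODE;

    for (int cpu = 0; cpu < NR_CPUS; cpu++) {
        if (!pool->cpus[cpu])
            continue;
        if (node == NO_NODE)
            node = cpu_node[cpu];
        else if (node != cpu_node[cpu])
            return NO_NODE;
    }
    return node;
}

/* Overwrite the cached node with one derived from the current mapping. */
static void fixup_pool_node(struct pool *pool, const int *cpu_node)
{
    assert(pool && cpu_node);
    pool->node = pool_node_from_mapping(pool, cpu_node);
}
```

A caller would run fixup_pool_node() on each existing pool once the
mapping is known to have changed, for example when a CPU comes online on
a different node than before.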

Patch 1 fixes a memory leak related to wq_numa_possible_cpumask.
Patch 4 removes another assumption about how the NUMA mapping changes.
Patch 5 reduces allocation failures when the node is offline or is
short of memory.
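
The idea behind patch 5 (retry on NUMA_NO_NODE) can be sketched in
userspace as follows. alloc_on_node(), node_online[] and NO_NODE are
hypothetical stand-ins for a node-aware allocator, the node online state
and NUMA_NO_NODE; this shows the fallback pattern only, not the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define NR_NODES 8
#define NO_NODE  (-1)   /* stand-in for the kernel's NUMA_NO_NODE */

/* Hypothetical online state of each node. */
static bool node_online[NR_NODES];

/* Hypothetical node-aware allocator: a request for a specific node fails
 * if that node is out of range or offline; NO_NODE means "any node". */
static void *alloc_on_node(size_t size, int node)
{
    assert(size > 0);
    if (node != NO_NODE &&
        (node < 0 || node >= NR_NODES || !node_online[node]))
        return NULL;
    return malloc(size);
}

/* Try the preferred node first; on failure, retry once with no node
 * preference instead of giving up. */
static void *alloc_with_fallback(size_t size, int node)
{
    void *p = alloc_on_node(size, node);

    if (!p && node != NO_NODE)
        p = alloc_on_node(size, NO_NODE);
    return p;
}
```

With only node 0 online, alloc_on_node(192, 2) returns NULL, while
alloc_with_fallback(192, 2) still returns memory. The trade-off is that
the memory may not be local to the CPUs that use it.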

This patchset is untested. It is sent for early review.

Thanks,
Lai.

Reported-by: Yasuaki Ishimatsu <isimatu.yasuaki@xxxxxxxxxxxxxx>
Cc: Tejun Heo <tj@xxxxxxxxxx>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@xxxxxxxxxxxxxx>
Cc: "Gu, Zheng" <guz.fnst@xxxxxxxxxxxxxx>
Cc: tangchen <tangchen@xxxxxxxxxxxxxx>
Cc: Hiroyuki KAMEZAWA <kamezawa.hiroyu@xxxxxxxxxxxxxx>

Lai Jiangshan (5):
  workqueue: fix memory leak in wq_numa_init()
  workqueue: update wq_numa_possible_cpumask
  workqueue: fixup existing pool->node
  workqueue: update NUMA affinity for the node lost CPU
  workqueue: retry on NUMA_NO_NODE when create_worker() fails

kernel/workqueue.c | 129 ++++++++++++++++++++++++++++++++++++++++++++--------
1 files changed, 109 insertions(+), 20 deletions(-)

--
1.7.4.4
