[BUG] sched: leaf_cfs_rq_list use after free

From: Niklas Cassel
Date: Fri Mar 04 2016 - 05:41:42 EST


Hello

I've stumbled upon a use after free bug related to
CONFIG_FAIR_GROUP_SCHED / rq->cfs_rq->leaf_cfs_rq_list in v4.4.


Normally, a cfs_rq is immediately removed from the leaf_cfs_rq_list
and cfs_rq->on_list is set to 0, then the cfs_rq is freed at a later
time by call_rcu(&tg->rcu, free_sched_group_rcu).


What happens when we crash is that a cfs_rq is immediately removed
from the leaf_cfs_rq_list and cfs_rq->on_list is set to 0. However,
the cfs_rq is then re-added to the list and cfs_rq->on_list gets set
to 1 before the call to call_rcu(&tg->rcu, free_sched_group_rcu).

Now the cfs_rq is freed, filled with 0x6b6b6b6b by SLUB_DEBUG,
and still on the leaf_cfs_rq_list. Since the cfs_rq is still on
the list, the next call to update_blocked_averages will iterate
the list and will try to access members of the cfs_rq object,
an object which has already been freed.
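
The two sequences described above can be sketched as a small user-space
model. This is only an illustration, not kernel code: the model_* names,
the singly linked list standing in for leaf_cfs_rq_list, and the
memset-based "free" are all assumptions made for the example. It also
assumes a 32-bit int, so that four poison bytes read back as 0x6b6b6b6b.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Same byte that SLUB_DEBUG writes into freed objects. */
#define MODEL_POISON_BYTE 0x6b

/* Hypothetical stand-in for struct cfs_rq; only the fields we need. */
struct model_cfs_rq {
	int on_list;
	struct model_cfs_rq *next;	/* stand-in for leaf_cfs_rq_list linkage */
};

/* Hypothetical stand-in for struct rq. */
struct model_rq {
	struct model_cfs_rq *leaf_list;
};

/* Add to the leaf list unless already on it, then mark on_list. */
static void model_list_add_leaf(struct model_rq *rq, struct model_cfs_rq *c)
{
	if (c->on_list)
		return;
	c->next = rq->leaf_list;
	rq->leaf_list = c;
	c->on_list = 1;
}

/* Unlink from the leaf list if on it, then clear on_list. */
static void model_list_del_leaf(struct model_rq *rq, struct model_cfs_rq *c)
{
	struct model_cfs_rq **pp;

	if (!c->on_list)
		return;
	for (pp = &rq->leaf_list; *pp; pp = &(*pp)->next) {
		if (*pp == c) {
			*pp = c->next;
			break;
		}
	}
	c->next = NULL;
	c->on_list = 0;
}

/* Simulated free: poison the object but keep the memory readable. */
static void model_free(struct model_cfs_rq *c)
{
	memset(c, MODEL_POISON_BYTE, sizeof(*c));
}

/*
 * Run one teardown. Returns the on_list value that a list walker would
 * read from the first list entry after the free, or -1 if the list is
 * empty (the normal case).
 */
int model_teardown(bool readd_before_free)
{
	struct model_rq rq = { NULL };
	struct model_cfs_rq c = { 0, NULL };

	model_list_add_leaf(&rq, &c);		/* group becomes active */
	model_list_del_leaf(&rq, &c);		/* removal at offline time */
	if (readd_before_free)
		model_list_add_leaf(&rq, &c);	/* the unexpected re-add */
	model_free(&c);				/* the deferred free runs */

	return rq.leaf_list ? rq.leaf_list->on_list : -1;
}
```

In this model, model_teardown(false) leaves the list empty, while
model_teardown(true) leaves the poisoned object reachable from the list,
so a walker reads on_list as 0x6b6b6b6b, which is the same value that
shows up in the last update_blocked_averages trace line further down.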



[ 27.531374] Unable to handle kernel paging request at virtual address 6b6b706b
[ 27.538596] pgd = 8cea8000
[ 27.541295] [6b6b706b] *pgd=00000000
[ 27.544870] Internal error: Oops: 1 [#1] PREEMPT SMP ARM
[ 27.564025] CPU: 1 PID: 1252 Comm: logger Tainted: G O 4.4.0 #2
[ 27.571064] Hardware name: Axis ARTPEC-6 Platform
[ 27.575759] task: b9586540 ti: 8c84c000 task.ti: 8c84c000
[ 27.581155] PC is at update_blocked_averages+0xcc/0x748
[ 27.586372] LR is at update_blocked_averages+0xbc/0x748
[ 27.591589] pc : [<80051d78>] lr : [<80051d68>] psr: 200c0193
sp : 8c84dce8 ip : 00000500 fp : 8efb1680
[ 27.603056] r10: 00000006 r9 : 80847788 r8 : 6b6b6b6b
[ 27.608271] r7 : 00000007 r6 : ffff958a r5 : 00000007 r4 : ffff958a
[ 27.614789] r3 : 6b6b6b6b r2 : 00000101 r1 : 00000000 r0 : 00000003
[ 27.621308] Flags: nzCv IRQs off FIQs on Mode SVC_32 ISA ARM Segment user
[ 27.628521] Control: 10c5387d Table: 0cea804a DAC: 00000055
[ 27.634257] Process logger (pid: 1252, stack limit = 0x8c84c210)
[ 27.640254] Stack: (0x8c84dce8 to 0x8c84e000)
[ 27.644604] dce0: 6b6b6b6b 00000103 bad39440 80048250 00000000 bad398d0
[ 27.652774] dd00: bf6cf0d0 00000001 807e2c48 bad398d0 00000000 8054e7c8 ffff4582 bf6cec00
[ 27.660944] dd20: 00000001 8004825c 00000100 807dc400 8c84de40 bf6cb340 bad87ebc 00000100
[ 27.669114] dd40: afb50401 200c0113 00000200 807dc400 807e2100 ffff958a 00000007 8083916c
[ 27.677283] dd60: 00000100 00000006 0000001c 80058748 bf6cb340 8054e810 00000000 00000001
[ 27.685452] dd80: 807dc400 bf6cec00 00000001 bf6cec00 8083916c 00000001 c0803100 807dc400
[ 27.693622] dda0: 807e209c 000000a0 00000007 8083916c 00000100 00000006 0000001c 800282a0
[ 27.701791] ddc0: 00000001 bf6d2a80 b95fac00 0000000a ffff958b 00400000 bacf7000 807dc400
[ 27.709961] dde0: 00000000 00000000 0000001b bf0188c0 00000001 c0803100 b95fac00 80028830
[ 27.718130] de00: 807dc400 8006ca14 c0802100 c080210c 807e2db0 8081a140 8c84de40 80009420
[ 27.726300] de20: 8054e780 80122048 800c0013 ffffffff 8c84de74 00000001 00100073 800142c0
[ 27.734469] de40: b95ace70 b9586540 00000000 00000000 600c0013 00000000 024080c0 8010e8e0
[ 27.742639] de60: 00000001 00000001 00100073 b95fac00 00000000 8c84de90 8054e780 80122048
[ 27.750808] de80: 800c0013 ffffffff b95fac00 80122044 bad00640 8011c418 000001f6 b95acb70
[ 27.758978] dea0: 76f42000 b95acb70 76f42000 b95acb68 76f43000 8ce48780 00100073 8010e8e0
[ 27.767148] dec0: 00100073 00000000 b95fac00 00000000 00000000 00000001 b9421000 00000001
[ 27.775317] dee0: 00000000 76f46000 00000000 00000000 8001e8b8 76f42000 00000003 00000003
[ 27.783486] df00: b95fac00 8ce48780 00000001 00001000 807e2c64 8010efb4 00000000 00000000
[ 27.791656] df20: 0000004d 00000073 8c84df50 8ce487c4 b95fac00 00000003 00000013 00000000
[ 27.799825] df40: 8c84c000 b95fac00 7ece0b44 800faf84 00000002 00000000 00000000 8c84df64
[ 27.807995] df60: b95fac00 00000000 00000002 00000003 00000013 00000000 00000000 8010d4e8
[ 27.816163] df80: 00000002 00000000 00000003 00000003 00000000 00000003 000000c0 800104e4
[ 27.824333] dfa0: 00000020 800104b0 00000003 00000000 00000000 00000013 00000003 00000002
[ 27.832502] dfc0: 00000003 00000000 00000003 000000c0 0007ecd0 76f45958 76f45574 7ece0b44
[ 27.840671] dfe0: 00000000 7ece09fc 76f2e814 76f368d8 400c0010 00000000 00000000 00000000
[ 27.848847] [<80051d78>] (update_blocked_averages) from [<80058748>] (rebalance_domains+0x38/0x2cc)
[ 27.857889] [<80058748>] (rebalance_domains) from [<800282a0>] (__do_softirq+0x98/0x354)
[ 27.865975] [<800282a0>] (__do_softirq) from [<80028830>] (irq_exit+0xb0/0x11c)
[ 27.873281] [<80028830>] (irq_exit) from [<8006ca14>] (__handle_domain_irq+0x60/0xb8)
[ 27.881106] [<8006ca14>] (__handle_domain_irq) from [<80009420>] (gic_handle_irq+0x48/0x94)
[ 27.889452] [<80009420>] (gic_handle_irq) from [<800142c0>] (__irq_svc+0x40/0x74)
[ 27.896924] Exception stack(0x8c84de40 to 0x8c84de88)
[ 27.901969] de40: b95ace70 b9586540 00000000 00000000 600c0013 00000000 024080c0 8010e8e0
[ 27.910139] de60: 00000001 00000001 00100073 b95fac00 00000000 8c84de90 8054e780 80122048
[ 27.918306] de80: 800c0013 ffffffff
[ 27.921793] [<800142c0>] (__irq_svc) from [<80122048>] (__slab_alloc.constprop.9+0x28/0x2c)
[ 27.930139] [<80122048>] (__slab_alloc.constprop.9) from [<8011c418>] (kmem_cache_alloc+0x14c/0x204)
[ 27.939265] [<8011c418>] (kmem_cache_alloc) from [<8010e8e0>] (mmap_region+0x29c/0x680)
[ 27.947262] [<8010e8e0>] (mmap_region) from [<8010efb4>] (do_mmap+0x2f0/0x378)
[ 27.954481] [<8010efb4>] (do_mmap) from [<800faf84>] (vm_mmap_pgoff+0x74/0xa4)
[ 27.961699] [<800faf84>] (vm_mmap_pgoff) from [<8010d4e8>] (SyS_mmap_pgoff+0x94/0xf0)
[ 27.969524] [<8010d4e8>] (SyS_mmap_pgoff) from [<800104b0>] (__sys_trace_return+0x0/0x10)
[ 27.977694] Code: e59b8078 e59b309c e3a0cc05 e3580000 (e18300dc)
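
The faulting address is consistent with an access at an offset from a
poisoned pointer: in the register dump r3 holds 6b6b6b6b and ip holds
00000500, and their sum is the reported address 6b6b706b. That the
faulting load used r3 + ip is a reading of the dump, not something
confirmed here by disassembly, and the helper name below is
hypothetical. A minimal sketch of the arithmetic:

```c
#include <assert.h>
#include <stdint.h>

/* Word pattern produced by the 0x6b poison byte of SLUB_DEBUG. */
#define MODEL_POISON_WORD 0x6b6b6b6bu

/* Address touched when loading at `offset` from a base pointer. */
static uint32_t model_fault_address(uint32_t base, uint32_t offset)
{
	return base + offset;
}
```

With base MODEL_POISON_WORD and offset 0x500 (the ip value from the
dump), the helper yields 0x6b6b706b.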

A snippet of the output from the trace_printks I added while analyzing
the problem. The prints show that a certain cfs_rq gets re-added after it
has been removed, and that update_blocked_averages uses the cfs_rq after
it has been freed:

systemd-1 [000] 22.664453: bprint: alloc_fair_sched_group: allocated cfs_rq 0x8efb0780 tg 0x8efb1800 tg->css.id 0
systemd-1 [000] 22.664479: bprint: alloc_fair_sched_group: allocated cfs_rq 0x8efb1680 tg 0x8efb1800 tg->css.id 0
systemd-1 [000] 22.664481: bprint: cpu_cgroup_css_alloc: tg 0x8efb1800 tg->css.id 0
systemd-1 [000] 22.664547: bprint: cpu_cgroup_css_online: tg 0x8efb1800 tg->css.id 80
systemd-874 [001] 27.389000: bprint: list_add_leaf_cfs_rq: cfs_rq 0x8efb1680 cpu 1 on_list 0x0
migrate_cert-820 [001] 27.421337: bprint: update_blocked_averages: cfs_rq 0x8efb1680 cpu 1 on_list 0x1
kworker/0:1-24 [000] 27.421356: bprint: cpu_cgroup_css_offline: tg 0x8efb1800 tg->css.id 80
kworker/0:1-24 [000] 27.421445: bprint: list_del_leaf_cfs_rq: cfs_rq 0x8efb1680 cpu 1 on_list 0x1
migrate_cert-820 [001] 27.421506: bprint: list_add_leaf_cfs_rq: cfs_rq 0x8efb1680 cpu 1 on_list 0x0
system-status-815 [001] 27.491358: bprint: update_blocked_averages: cfs_rq 0x8efb1680 cpu 1 on_list 0x1
kworker/0:1-24 [000] 27.501561: bprint: cpu_cgroup_css_free: tg 0x8efb1800 tg->css.id 80
migrate_cert-820 [001] 27.511337: bprint: update_blocked_averages: cfs_rq 0x8efb1680 cpu 1 on_list 0x1
ksoftirqd/0-3 [000] 27.521830: bprint: free_fair_sched_group: freeing cfs_rq 0x8efb0780 tg 0x8efb1800 tg->css.id 80
ksoftirqd/0-3 [000] 27.521857: bprint: free_fair_sched_group: freeing cfs_rq 0x8efb1680 tg 0x8efb1800 tg->css.id 80
logger-1252 [001] 27.531355: bprint: update_blocked_averages: cfs_rq 0x8efb1680 cpu 1 on_list 0x6b6b6b6b


I've reproduced this on v4.4, but I've also managed to reproduce the bug
after cherry-picking the following patches
(all but one were marked for v4.4 stable):

6fe1f34 sched/cgroup: Fix cgroup entity load tracking tear-down
d6e022f workqueue: handle NUMA_NO_NODE for unbound pool_workqueue lookup
041bd12 Revert "workqueue: make sure delayed work run in local cpu"
8bb5ef7 cgroup: make sure a parent css isn't freed before its children
aa226ff cgroup: make sure a parent css isn't offlined before its children
e93ad19 cpuset: make mm migration asynchronous