Re: [RFC][PATCH 00/16] sched: Core scheduling

From: Subhra Mazumdar
Date: Mon Mar 11 2019 - 20:52:14 EST



On 3/11/19 5:20 PM, Greg Kerr wrote:
On Mon, Mar 11, 2019 at 4:36 PM Subhra Mazumdar
<subhra.mazumdar@xxxxxxxxxx> wrote:

On 3/11/19 11:34 AM, Subhra Mazumdar wrote:
On 3/10/19 9:23 PM, Aubrey Li wrote:
On Sat, Mar 9, 2019 at 3:50 AM Subhra Mazumdar
<subhra.mazumdar@xxxxxxxxxx> wrote:
expected. Most of the performance recovery happens in patch 15 which,
unfortunately, is also the one that introduces the hard lockup.

After applying Subhra's patch, the following is triggered by enabling core sched when a cgroup is under heavy load.

It seems you are facing some other deadlock where printk is involved. Can you drop the last patch (patch 16, "sched: Debug bits...") and try?

Thanks,
Subhra

Never mind, I am seeing the same lockdep deadlock output even without patch 16. Btw, the NULL fix had something missing; the following works.
Is this panic below, which occurs when I tag the first process,
related or known? If not, I will debug it tomorrow.

[ 46.831828] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
[ 46.831829] core sched enabled
[ 46.834261] #PF error: [WRITE]
[ 46.834899] PGD 0 P4D 0
[ 46.835438] Oops: 0002 [#1] SMP PTI
[ 46.836158] CPU: 0 PID: 11 Comm: migration/0 Not tainted 5.0.0everyday-glory-03949-g2d8fdbb66245-dirty #7
[ 46.838206] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
[ 46.839844] RIP: 0010:_raw_spin_lock+0x7/0x20
[ 46.840448] Code: 00 00 00 65 81 05 25 ca 5c 51 00 02 00 00 31 c0 ba ff 00 00 00 f0 0f b1 17 74 05 e9 93 80 46 ff f3 c3 90 31 c0 ba 01 00 00 00 <f0> 0f b1 17 74 07 89 c6 e9 1c 6e 46 ff f3 c3 66 2e 0f 1f 84 00 00
[ 46.843000] RSP: 0018:ffffb9d300cabe38 EFLAGS: 00010046
[ 46.843744] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000004
[ 46.844709] RDX: 0000000000000001 RSI: ffffffffaea435ae RDI: 0000000000000000
[ 46.845689] RBP: ffffb9d300cabed8 R08: 0000000000000000 R09: 0000000000020800
[ 46.846651] R10: ffffffffaf603ea0 R11: 0000000000000001 R12: ffffffffaf6576c0
[ 46.847619] R13: ffff9a57366c8000 R14: ffff9a5737401300 R15: ffffffffade868f0
[ 46.848584] FS: 0000000000000000(0000) GS:ffff9a5737a00000(0000) knlGS:0000000000000000
[ 46.849680] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 46.850455] CR2: 0000000000000000 CR3: 00000001d36fa000 CR4: 00000000000006f0
[ 46.851415] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 46.852371] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 46.853326] Call Trace:
[ 46.853678] __schedule+0x139/0x11f0
[ 46.854167] ? cpumask_next+0x16/0x20
[ 46.854668] ? cpu_stop_queue_work+0xc0/0xc0
[ 46.855252] ? sort_range+0x20/0x20
[ 46.855742] schedule+0x4e/0x60
[ 46.856171] smpboot_thread_fn+0x12a/0x160
[ 46.856725] kthread+0x112/0x120
[ 46.857164] ? kthread_stop+0xf0/0xf0
[ 46.857661] ret_from_fork+0x35/0x40
[ 46.858146] Modules linked in:
[ 46.858562] CR2: 0000000000000000
[ 46.859022] ---[ end trace e9fff08f17bfd2be ]---

- Greg

This seems to be different.
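
For what it's worth, decoding the oops above: RDI (the lock argument to
_raw_spin_lock) is 0, the #PF is a WRITE at address 0, and the faulting
bytes f0 0f b1 17 are the "lock cmpxchg %edx,(%rdi)" that tries to acquire
the lock, so spin_lock() was handed a NULL lock pointer. A minimal sketch
of that failure mode (hypothetical code, not the actual scheduler path;
null_lock_demo is made up for illustration):

#include <linux/spinlock.h>

/*
 * Hypothetical sketch only: handing raw_spin_lock() a NULL pointer
 * produces exactly this signature.  The lock word sits at offset 0 of
 * raw_spinlock_t, so the cmpxchg that tries to acquire the lock is a
 * WRITE to address 0.
 */
static void null_lock_demo(void)
{
	raw_spinlock_t *lock = NULL;	/* a lock pointer that was never set up */

	raw_spin_lock(lock);		/* NULL pointer dereference, write fault at 0 */
}

Given the call trace is __schedule() taking the runqueue lock, my guess is
that whatever lock pointer the core sched code returns there was still NULL
on that CPU, but that is only a guess from the registers, not a root-cause
analysis.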