Re: 5.6-rc3: WARNING: CPU: 48 PID: 17435 at kernel/sched/fair.c:380 enqueue_task_fair+0x328/0x440

From: Vincent Guittot
Date: Thu Mar 05 2020 - 07:33:58 EST


On Thursday 05 March 2020 at 13:12:39 (+0100), Dietmar Eggemann wrote:
> On 05/03/2020 12:28, Christian Borntraeger wrote:
> >
> > On 05.03.20 10:30, Vincent Guittot wrote:
> >> On Wednesday 04 March 2020 at 20:59:33 (+0100), Christian Borntraeger wrote:
> >>>
> >>> On 04.03.20 20:38, Christian Borntraeger wrote:
> >>>>
> >>>>
> >>>> On 04.03.20 20:19, Dietmar Eggemann wrote:
>
> [...]
>
> > It seems to speed up the issue when I do a compile job in parallel on the host:
> >
> > Do you also need the sysfs tree?
>
> [ 87.932552] CPU23 path=/machine.slice/machine-test.slice/machine-qemu\x2d18\x2dtest10. on_list=1 nr_running=1 throttled=0 p=[CPU 2/KVM 2662]
> [ 87.932559] CPU23 path=/machine.slice/machine-test.slice/machine-qemu\x2d18\x2dtest10. on_list=0 nr_running=3 throttled=0 p=[CPU 2/KVM 2662]
> [ 87.932562] CPU23 path=/machine.slice/machine-test.slice on_list=1 nr_running=1 throttled=1 p=[CPU 2/KVM 2662]
> [ 87.932564] CPU23 path=/machine.slice on_list=1 nr_running=0 throttled=0 p=[CPU 2/KVM 2662]
> [ 87.932566] CPU23 path=/ on_list=1 nr_running=1 throttled=0 p=[CPU 2/KVM 2662]
> [ 87.951872] CPU23 path=/ on_list=1 nr_running=2 throttled=0 p=[ksoftirqd/23 126]
> [ 87.987528] CPU23 path=/user.slice on_list=1 nr_running=2 throttled=0 p=[as 6737]
> [ 87.987533] CPU23 path=/ on_list=1 nr_running=1 throttled=0 p=[as 6737]
>
> Arrh, looks like 'char path[64]' is too small to hold 'machine.slice/machine-test.slice/machine-qemu\x2d18\x2dtest10.scope/vcpuX' !
> ^
> But I guess that the 'on_list=0' for 'machine-qemu\x2d18\x2dtest10.scope' could be the missing hint?

Yes, the if (cfs_bandwidth_used()) check at the end of enqueue_task_fair() is
not enough to ensure that every cfs_rq gets added back to the leaf list. It
"works" for the first enqueue, because the throttled cfs_rq is added and resets
tmp_alone_branch, but not for the next one.
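
To make the failing case concrete, here is a minimal user-space sketch of the
condition that guards list_add_leaf_cfs_rq() in enqueue_entity(). This is not
kernel code: struct toy_cfs_rq, old_should_add(), new_should_add() and
toy_check_log_case() are hypothetical names made up for illustration, and the
bandwidth_used flag merely stands in for cfs_bandwidth_used(). It only shows
that a cfs_rq in the on_list=0 nr_running=3 state from the log is skipped by a
pure nr_running == 1 test.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for the kernel's struct cfs_rq: only the two fields
 * printed in the debug log are modelled (assumption for illustration). */
struct toy_cfs_rq {
	unsigned int nr_running;
	bool on_list;
};

/* Existing condition: only try to add the cfs_rq to the leaf list
 * when it has exactly one runnable entity. */
static bool old_should_add(const struct toy_cfs_rq *cfs_rq)
{
	return cfs_rq->nr_running == 1;
}

/* Relaxed condition: also try whenever bandwidth control is in use,
 * because the cfs_rq may be off the list with nr_running > 1. */
static bool new_should_add(const struct toy_cfs_rq *cfs_rq, bool bandwidth_used)
{
	return cfs_rq->nr_running == 1 || bandwidth_used;
}

/* Replays the state seen in the log: on_list=0 nr_running=3. */
static void toy_check_log_case(void)
{
	struct toy_cfs_rq rq = { .nr_running = 3, .on_list = false };

	assert(!old_should_add(&rq));       /* old test never re-adds it */
	assert(new_should_add(&rq, true));  /* relaxed test does try */
	assert(!new_should_add(&rq, false)); /* no change without bandwidth control */
}
```

With bandwidth control disabled, both conditions behave the same, so the extra
attempt only costs something on systems that actually use throttling.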

Compared to the previously proposed fix, we can optimize it a bit with:

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9ccde775e02e..3b19e508641d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4035,10 +4035,16 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	__enqueue_entity(cfs_rq, se);
 	se->on_rq = 1;

-	if (cfs_rq->nr_running == 1) {
+	/*
+	 * When bandwidth control is enabled, cfs might have been removed because of
+	 * a parent being throttled but cfs->nr_running > 1. Try to add it
+	 * unconditionally.
+	 */
+	if (cfs_rq->nr_running == 1 || cfs_bandwidth_used())
 		list_add_leaf_cfs_rq(cfs_rq);
+
+	if (cfs_rq->nr_running == 1)
 		check_enqueue_throttle(cfs_rq);
-	}
 }

 static void __clear_buddies_last(struct sched_entity *se)