Re: 5.6-rc3: WARNING: CPU: 48 PID: 17435 at kernel/sched/fair.c:380 enqueue_task_fair+0x328/0x440

From: Vincent Guittot
Date: Wed Mar 04 2020 - 12:51:58 EST


On Wed, 4 Mar 2020 at 18:42, Christian Borntraeger
<borntraeger@xxxxxxxxxx> wrote:
>
>
>
> On 04.03.20 16:26, Vincent Guittot wrote:
> > On Tue, 3 Mar 2020 at 08:55, Vincent Guittot <vincent.guittot@xxxxxxxxxx> wrote:
> >>
> >> On Tue, 3 Mar 2020 at 08:37, Christian Borntraeger
> >> <borntraeger@xxxxxxxxxx> wrote:
> >>>
> >>>
> >>>
> > [...]
> >>>>>> ---
> >>>>>> kernel/sched/fair.c | 2 +-
> >>>>>> 1 file changed, 1 insertion(+), 1 deletion(-)
> >>>>>>
> >>>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >>>>>> index 3c8a379c357e..beb773c23e7d 100644
> >>>>>> --- a/kernel/sched/fair.c
> >>>>>> +++ b/kernel/sched/fair.c
> >>>>>> @@ -4035,8 +4035,8 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> >>>>>>  		__enqueue_entity(cfs_rq, se);
> >>>>>>  	se->on_rq = 1;
> >>>>>>
> >>>>>> +	list_add_leaf_cfs_rq(cfs_rq);
> >>>>>>  	if (cfs_rq->nr_running == 1) {
> >>>>>> -		list_add_leaf_cfs_rq(cfs_rq);
> >>>>>>  		check_enqueue_throttle(cfs_rq);
> >>>>>>  	}
> >>>>>>  }
> >>>>>
> >>>>> Now running for 3 hours. I have not seen the issue yet. I can tell tomorrow if this fixes
> >>>>> the issue.
> >>>>
> >>>>
> >>>> Still running fine. I can tell for sure tomorrow, but I have the impression that this makes the
> >>>> WARN_ON go away.
> >>>
> >>> So I guess this change "fixed" the issue. If you want me to test additional patches, let me know.
> >>
> >> Thanks for the test. For now, I don't have any other patch to test. I
> >> have to look more deeply into how this situation happens.
> >> I will let you know if I have another patch to test.
> >
> > So I haven't been able to figure out how we reach this situation yet.
> > In the meantime I'm going to make a clean patch with the fix above.
> >
> > Is it ok if I add a Reported-by and a Tested-by from you?
>
> Sure.
> I just realized that this system has something special. Some months ago I created 2 slices:
> $ head /etc/systemd/system/*.slice
> ==> /etc/systemd/system/machine-production.slice <==
> [Unit]
> Description=VM production
> Before=slices.target
> Wants=machine.slice
> [Slice]
> CPUQuota=2000%
> CPUWeight=1000
>
> ==> /etc/systemd/system/machine-test.slice <==
> [Unit]
> Description=VM production
> Before=slices.target
> Wants=machine.slice
> [Slice]
> CPUQuota=300%
> CPUWeight=100
>
>
> And the guests are then put into these slices. That also means that this test will never use more than
> 2300% CPU, no matter how many CPUs the system has.

Thanks for the information. I will try to see how this could impact the enqueue path.
