Re: [BUG] Corrupted SCHED_DEADLINE bandwidth with cpusets

From: Juri Lelli
Date: Thu Feb 04 2016 - 11:30:14 EST


Hi Steve,

On 04/02/16 12:27, Juri Lelli wrote:
> On 04/02/16 12:04, Juri Lelli wrote:
> > On 04/02/16 09:54, Juri Lelli wrote:
> > > Hi Steve,
> > >
> > > first of all thanks a lot for your detailed report, if only all bug
> > > reports were like this.. :)
> > >
> > > On 03/02/16 13:55, Steven Rostedt wrote:
> >
> > [...]
> >
> > >
> > > Right. I think this is the same thing that happens after hotplug. IIRC
> > > the code paths are actually the same. The problem is that hotplug or
> > > cpuset reconfiguration operations are destructive w.r.t. root_domains,
> > > so we lose bandwidth information when that happens. The problem is that
> > > we only store cumulative information regarding bandwidth in root_domain,
> > > while information about which task belongs to which cpuset is stored in
> > > cpuset data structures.
> > >
> > > I tried to fix this a while back, but my attempt was broken: I failed
> > > to get locking right and, even though it seemed to fix the issue for me,
> > > it was prone to race conditions. You might still want to have a look at
> > > that for reference: https://lkml.org/lkml/2015/9/2/162
> > >
> >
> > [...]
> >
> > >
> > > It's good that we can recover, but that's still a bug yes :/.
> > >
> > > I'll try to see if my broken patch makes what you are seeing apparently
> > > disappear, so that we can at least confirm that we are seeing the same
> > > problem; you could do the same if you want, I pushed that here
> > >
> >
> > No it doesn't solve this :/. I placed restoring code in the hotplug
> > workfn, so updates generated by toggling sched_load_balance don't get
> > caught, of course. But, this at least tells us that we should solve this
> > someplace else.
> >
>
> Well, if I call an unlocked version of my cpuset_hotplug_update_rd()
> from kernel/cpuset.c:update_flag() the issue seems to go away. But, we
> end up overcommitting the default null domain (try to toggle
> sched_load_balance multiple times). I updated the branch, but I still
> think we should solve this differently.
>

I've actually changed this approach a bit, and things seem better here.
Could you please give this a try? (You can also fetch the same branch).
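
For anyone following along, here is a toy userspace model of the problem
quoted above: the root domain only keeps a cumulative bandwidth sum, so a
destructive rebuild loses it unless we walk the tasks again and re-add
their contributions. This is only a sketch and not kernel code; every
name in it (toy_root_domain, toy_task, toy_ratio, toy_admit, toy_rebuild,
toy_restore) and the fixed-point shift are assumptions made up for
illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical fixed-point shift for this toy model only. */
#define TOY_BW_SHIFT 20

/* Toy stand-in for a root domain: it keeps only the cumulative sum. */
struct toy_root_domain {
	unsigned long long total_bw;
};

/* Toy stand-in for a deadline task; membership lives outside the domain. */
struct toy_task {
	unsigned long long bw;	/* runtime/period in fixed point */
	int cpuset_id;		/* which (toy) cpuset the task belongs to */
};

/* runtime/period as a fixed-point ratio; period must be non-zero. */
static unsigned long long toy_ratio(unsigned long long runtime,
				    unsigned long long period)
{
	assert(period != 0);
	return (runtime << TOY_BW_SHIFT) / period;
}

/* Admission: the task's bandwidth is folded into the cumulative sum. */
static void toy_admit(struct toy_root_domain *rd, const struct toy_task *t)
{
	assert(rd != NULL && t != NULL);
	rd->total_bw += t->bw;
}

/* Destructive rebuild: the new domain starts from zero, the sum is lost. */
static struct toy_root_domain toy_rebuild(void)
{
	struct toy_root_domain rd = { 0 };
	return rd;
}

/* Restore: re-add the bandwidth of every task in the given cpuset. */
static void toy_restore(struct toy_root_domain *rd,
			const struct toy_task *tasks, size_t n, int cpuset_id)
{
	size_t i;

	assert(rd != NULL && (tasks != NULL || n == 0));
	for (i = 0; i < n; i++)
		if (tasks[i].cpuset_id == cpuset_id)
			rd->total_bw += tasks[i].bw;
}
```

In this model, calling toy_rebuild() without toy_restore() leaves
total_bw at zero even though the tasks are still there, which is the
accounting corruption in a nutshell. The model has no locking at all, so
it says nothing about the races mentioned above.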

Thanks,

- Juri

--->8---