Re: [PATCH] sched: Do not re-read h_load_next during hierarchical load calculation

From: Mel Gorman
Date: Tue Mar 19 2019 - 08:33:59 EST


On Tue, Mar 19, 2019 at 01:06:09PM +0100, Peter Zijlstra wrote:
> > ---
> >  kernel/sched/fair.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 310d0637fe4b..34aeb40e69d2 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -7726,7 +7726,7 @@ static void update_cfs_rq_h_load(struct cfs_rq *cfs_rq)
> >  		cfs_rq->last_h_load_update = now;
> >  	}
> >
> > -	while ((se = cfs_rq->h_load_next) != NULL) {
> > +	while ((se = READ_ONCE(cfs_rq->h_load_next)) != NULL) {
> >  		load = cfs_rq->h_load;
> >  		load = div64_ul(load * se->avg.load_avg,
> >  				cfs_rq_load_avg(cfs_rq) + 1);
>
> Where there is a READ_ONCE there should also be a corresponding
> WRITE_ONCE(). Otherwise the compiler can still screw us over by doing
> store-tearing.
>
> So something like the below. But looking at this, we probably also want
> ONCE treatment on cfs_rq->h_load itself, but that's another patch.
>
> And I think we can do something with cfs_rq->last_h_load_update.
>

Ok, I hadn't taken the possibility of store tearing into account, but
you're right: it is a possibility depending on the architecture, and your
patch is the better approach. While there are some potential issues around
h_load, I had treated it as approximate data and considered only the
potential NULL dereference to be relevant.

I'll send a v2 shortly.