Re: [PATCH] psi:fix divide by zero in psi_update_stats

From: Suren Baghdasaryan
Date: Tue Nov 12 2019 - 12:27:25 EST


On Tue, Nov 12, 2019 at 8:08 AM Johannes Weiner <hannes@xxxxxxxxxxx> wrote:
>
> On Tue, Nov 12, 2019 at 10:48:46AM -0500, Johannes Weiner wrote:
> > On Tue, Nov 12, 2019 at 10:41:46AM -0500, Johannes Weiner wrote:
> > > On Fri, Nov 08, 2019 at 03:33:24PM +0800, tim wrote:
> > > > In psi_update_stats, it is possible that period has value like
> > > > 0xXXXXXXXX00000000 where the lower 32 bit is 0, then it calls div_u64 which
> > > > truncates u64 period to u32, results in zero divisor.
> > > > Use div64_u64() instead of div_u64() if the divisor is u64 to avoid
> > > > truncation to 32-bit on 64-bit platforms.
> > > >
> > > > Signed-off-by: xiejingfeng <xiejingfeng@xxxxxxxxxxxxxxxxx>
> > >
> > > This is legit. When we stop the periodic averaging worker due to an
> > > idle CPU, the period after restart can be much longer than the ~4 sec
> > > in the lower 32 bits. See the missed_periods logic in update_averages.
> >
> > Argh, that's not right. Of course I notice right after hitting send.
> >
> > missed_periods are subtracted out of the difference between now and
> > the last update, so period should be not much bigger than 2s.
> >
> > Something else is going on here.
>
> Tim, does this happen right after boot? I wonder if it's because we're
> not initializing avg_last_update, and the initial delta between the
> last update (0) and the first scheduled update (sched_clock() + 2s)
> ends up bigger than 4 seconds somehow. Later on, the delta between the
> last and the scheduled update should always be ~2s. But for that to
> happen, it would require a pretty slow boot, or a sched_clock() that
> does not start at 0.
>
> Tim, if you have a coredump, can you extract the value of the other
> variables printed in the following patch?
>
> diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
> index 84af7aa158bf..1b6836d23091 100644
> --- a/kernel/sched/psi.c
> +++ b/kernel/sched/psi.c
> @@ -374,6 +374,10 @@ static u64 update_averages(struct psi_group *group, u64 now)
> */
> avg_next_update = expires + ((1 + missed_periods) * psi_period);
> period = now - (group->avg_last_update + (missed_periods * psi_period));
> +
> + WARN(period >> 32, "period=%llu now=%llu expires=%llu last=%llu missed=%lu\n",
> + period, now, expires, group->avg_last_update, missed_periods);
> +
> group->avg_last_update = now;
>
> for (s = 0; s < NR_PSI_STATES - 1; s++) {
>
> And we may need something like this to make the tick initialization
> more robust regardless of the reported bug here:
>
> diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
> index 84af7aa158bf..ce8f6748678a 100644
> --- a/kernel/sched/psi.c
> +++ b/kernel/sched/psi.c
> @@ -185,7 +185,8 @@ static void group_init(struct psi_group *group)
>
> for_each_possible_cpu(cpu)
> seqcount_init(&per_cpu_ptr(group->pcpu, cpu)->seq);
> - group->avg_next_update = sched_clock() + psi_period;
> + group->avg_last_update = sched_clock();
> + group->avg_next_update = group->avg_last_update + psi_period;
> INIT_DELAYED_WORK(&group->avgs_work, psi_avgs_work);
> mutex_init(&group->avgs_lock);
> /* Init trigger-related members */

Both fixes, for group_init() and window_update(), make sense to me.
The window_update() division by zero would be reproducible, because
win->size is set during trigger setup and does not change afterwards.
Since userspace defines the window size in usecs, reproducing it would
require doing some math to find a value whose 32 LSBs are all zero
after conversion into nsecs (see: t->win.size = window_us * NSEC_PER_USEC
in psi_trigger_create()). I haven't seen this issue because all my test
cases (1, 2, and 10 secs) had non-zero LSBs.
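
To make the "some math" concrete, here is a rough userspace sketch of
the truncation. It is not kernel code: truncated_divisor() and
smallest_bad_window_us() are hypothetical helper names, and
NSEC_PER_USEC is redefined locally. Since 1000 = 2^3 * 125, window_us
has to contribute the remaining factor of 2^29 for the low 32 bits of
the nsec value to be zero. Whether psi_trigger_create() would accept
such a window size is not checked here.

```c
#include <stdint.h>

/* Local stand-in for the kernel constant (assumption: 1000 ns per us). */
#define NSEC_PER_USEC 1000ULL

/*
 * Hypothetical helper: models what happens when a u64 divisor is passed
 * to a function taking a u32 divisor, as with div_u64(). Only the low
 * 32 bits survive.
 */
static uint32_t truncated_divisor(uint64_t divisor)
{
	return (uint32_t)divisor;
}

/*
 * Hypothetical helper: smallest non-zero window, in usecs, whose nsec
 * value has all-zero low 32 bits. 1000 = 2^3 * 125, so window_us must
 * be a multiple of 2^32 / 2^3 = 2^29.
 */
static uint64_t smallest_bad_window_us(void)
{
	return 1ULL << 29;
}

/* Returns 1 if the window size would truncate to a zero divisor. */
static int window_truncates_to_zero(uint64_t window_us)
{
	return truncated_divisor(window_us * NSEC_PER_USEC) == 0;
}
```

With these helpers, window_truncates_to_zero() is true for 2^29 usecs
(about 537 secs) and false for the 1, 2 and 10 sec windows mentioned
above.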