Re: [PATCH 2/2] arch_topology: Sanity check cpumask in thermal pressure update

From: Bjorn Andersson
Date: Wed Jan 19 2022 - 10:21:10 EST


On Wed 19 Jan 06:43 PST 2022, Sudeep Holla wrote:

> On Tue, Jan 18, 2022 at 10:56:12AM -0800, Bjorn Andersson wrote:
> > Occasionally during boot the Qualcomm cpufreq driver was able to cause
> > an invalid memory access in topology_update_thermal_pressure() on the
> > line:
> >
> > if (max_freq <= capped_freq)
> >
> > It turns out that this was caused by a race, which resulted in the
> > cpumask passed to the function being empty, in which case
> > cpumask_first() will return a cpu beyond the number of valid cpus, which
> > when used to access the per_cpu max_freq would return invalid pointer.
> >
> > The bug in the Qualcomm cpufreq driver is being fixed, but having a
> > sanity check of the arguments would have saved quite a bit of time and
> > it's not unlikely that others will run into the same issue.
> >
> > Signed-off-by: Bjorn Andersson <bjorn.andersson@xxxxxxxxxx>
> > ---
> > drivers/base/arch_topology.c | 3 +++
> > 1 file changed, 3 insertions(+)
> >
> > diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
> > index 976154140f0b..6560a0c3b969 100644
> > --- a/drivers/base/arch_topology.c
> > +++ b/drivers/base/arch_topology.c
> > @@ -177,6 +177,9 @@ void topology_update_thermal_pressure(const struct cpumask *cpus,
> > u32 max_freq;
> > int cpu;
> >
> > + if (WARN_ON(cpumask_empty(cpus)))
> > + return;
> > +
>
> Why can't the caller check and call this only when cpus is not empty ?
> IIUC there are many such APIs that use cpumask and could result in similar
> issues if called with empty cpus. Probably we could add a note that cpus
> must not be empty if that helps the callers ?
>

As indicated in the commit message, it took me a while to conclude that
a memory fault on what seemed to be a comparison between two variables
on the stack was actually caused by this race, which isn't trivially
reproducible unless you know what the bug is.
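
To make the failure mode concrete, here is a minimal userspace sketch,
not the kernel code. mask_first() is a hypothetical stand-in for
cpumask_first(), which returns a value >= nr_cpu_ids when the mask is
empty; max_freq_khz[] and NR_CPUS_SKETCH stand in for the per-CPU
max_freq data and nr_cpu_ids. All names and frequency values are made up
for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for nr_cpu_ids (assumption for this sketch). */
#define NR_CPUS_SKETCH 8u

/* Hypothetical stand-in for the per-CPU max_freq values (made-up numbers). */
static const unsigned long max_freq_khz[NR_CPUS_SKETCH] = {
	1800000, 1800000, 1800000, 1800000,
	2400000, 2400000, 2400000, 2400000,
};

/*
 * Mimics the documented behaviour of cpumask_first(): return the index of
 * the lowest set bit, or nbits if no bit is set at all.
 */
static unsigned int mask_first(unsigned long mask, unsigned int nbits)
{
	assert(nbits <= sizeof(mask) * 8);

	for (unsigned int bit = 0; bit < nbits; bit++)
		if (mask & (1UL << bit))
			return bit;

	return nbits;
}

/*
 * Look up the max frequency of the first CPU in the mask.
 * Returns false for an empty mask instead of indexing out of bounds,
 * which is what the unchecked per-CPU access effectively did.
 */
static bool first_cpu_max_freq(unsigned long mask, unsigned long *freq)
{
	unsigned int cpu = mask_first(mask, NR_CPUS_SKETCH);

	if (cpu >= NR_CPUS_SKETCH)
		return false;	/* empty mask: max_freq_khz[cpu] would be invalid */

	*freq = max_freq_khz[cpu];
	return true;
}
```

Calling first_cpu_max_freq(0, &freq) returns false rather than reading
past the array, which is the kind of early exit the cpumask_empty() check
in the patch is meant to provide.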

Now _I_ know better and will hopefully recognize the oops signature
right away, but my hope was to put the sanity check on this side to save
the next caller of this API some time. Updating the comment would
probably have saved me a minute or two at the end, as confirmation of my
findings after the fact...

If you prefer to keep topology_update_thermal_pressure() clean(er) and
exciting I can hack around the issue in the Qualcomm driver.

PS. I'm on board with Greg's objection to the WARN_ON()...

Regards,
Bjorn