Re: [PATCH v5 0/3] Thermal ACPI APIs for generic trip points

From: srinivas pandruvada
Date: Thu Jan 19 2023 - 11:58:29 EST


On Thu, 2023-01-19 at 13:17 +0100, Rafael J. Wysocki wrote:
> On Thu, Jan 19, 2023 at 12:04 AM srinivas pandruvada
> <srinivas.pandruvada@xxxxxxxxxxxxxxx> wrote:
> >
> > On Wed, 2023-01-18 at 23:14 +0100, Daniel Lezcano wrote:
> > > On 18/01/2023 22:16, srinivas pandruvada wrote:
> > > > On Wed, 2023-01-18 at 22:01 +0100, Daniel Lezcano wrote:
> > > > > On 18/01/2023 21:53, srinivas pandruvada wrote:
> > > > > > On Wed, 2023-01-18 at 21:00 +0100, Daniel Lezcano wrote:
> > > > > > > On 18/01/2023 20:16, srinivas pandruvada wrote:
> > > > > > >
> > > > > > > [ ... ]
> > > > > > >
> > > > > > > > > > But we'd better wait for the thermald test result
> > > > > > > > > > from
> > > > > > > > > > Srinvias.
> > > > > > > > >
> > > > > > > > > A quick test show that things still work with
> > > > > > > > > thermald
> > > > > > > > > and
> > > > > > > > > these
> > > > > > > > > changes.
> > > > > > > >
> > > > > > > > But I have a question. In some devices trip point
> > > > > > > > temperature
> > > > > > > > is
> > > > > > > > not
> > > > > > > > static. When hardware changes, we get notification. For
> > > > > > > > example
> > > > > > > > INT3403_PERF_TRIP_POINT_CHANGED for INT3403 drivers.
> > > > > > > > Currently get_trip can get the latest changed value.
> > > > > > > > But if
> > > > > > > > we
> > > > > > > > preregister, we need some mechanism to update them.
> > > > > > >
> > > > > > > When the notification INT3403_PERF_TRIP_POINT_CHANGED
> > > > > > > happens, we
> > > > > > > call
> > > > > > > int340x_thermal_read_trips() which in turn updates the
> > > > > > > trip
> > > > > > > points.
> > > > > > >
> > > > > >
> > > > > > Not sure how we handle concurrency here when driver can
> > > > > > freely
> > > > > > update
> > > > > > trips while thermal core is using trips.
> > > > >
> > > > > Don't we have the same race without this patch ? The thermal
> > > > > core
> > > > > can
> > > > > call get_trip_temp() while there is an update, no ?
> > > > Yes it is. But I can add a mutex locally here to solve.
> > > > But not any longer.
> > > >
> > > > I think you need some thermal_zone_read_lock/unlock() in core,
> > > > which
> > > > can use rcu. Even mutex is fine as there will be no contention
> > > > as
> > > > updates to trips will be rare.
> > >
> > > I was planning to provide a thermal_trips_update(tz, trips) and
> > > from
> > > there handle the locking.
> > >
> > > As the race was already existing, can we postpone this change
> > > after
> > > the
> > > generic trip points changes?
> > I think so.
>
> Well, what if this bug is reported by a user and a fix needs to be
> backported to "stable"?
>
> Are we going to backport the whole framework redesign along with it?
>
> Or is this extremely unlikely?
These trips are read at the start of DTT/Thermald and will be read once
after notification is received. So extremely unlikely.
But we can add the patch before this series to address this issue,
which can be marked stable. I can submit this.

Thanks,
Srinivas

From 2157c15ab856585c421119ac4587747311cb2a99 Mon Sep 17 00:00:00 2001
From: Srinivas Pandruvada <srinivas.pandruvada@xxxxxxxxxxxxxxx>
Date: Thu, 19 Jan 2023 08:37:15 -0800
Subject: [PATCH] thermal: int340x: Protect trip temperature from dynamic
update

Trip temperatures are read using ACPI methods and stored in the memory
during zone initializtion and when the firmware sends a notification for
change. This trip temperature is returned when the thermal core calls via
callback get_trip_temp().

But it is possible that while updating the memory copy of the trips when
the firmware sends a notification for change, thermal core is reading the
trip temperature via the callback get_trip_temp(). This may return invalid
trip temperature.

To address this add a mutex to protect the invalid temperature reads in
the callback get_trip_temp() and int340x_thermal_read_trips().

Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@xxxxxxxxxxxxxxx>
---
.../intel/int340x_thermal/int340x_thermal_zone.c | 16 +++++++++++++++-
.../intel/int340x_thermal/int340x_thermal_zone.h | 1 +
2 files changed, 16 insertions(+), 1 deletion(-)

diff --git a/drivers/thermal/intel/int340x_thermal/int340x_thermal_zone.c b/drivers/thermal/intel/int340x_thermal/int340x_thermal_zone.c
index 62c0aa5d0783..fd9080640e03 100644
--- a/drivers/thermal/intel/int340x_thermal/int340x_thermal_zone.c
+++ b/drivers/thermal/intel/int340x_thermal/int340x_thermal_zone.c
@@ -49,6 +49,8 @@ static int int340x_thermal_get_trip_temp(struct thermal_zone_device *zone,
if (d->override_ops && d->override_ops->get_trip_temp)
return d->override_ops->get_trip_temp(zone, trip, temp);

+ mutex_lock(&d->trip_mutex);
+
if (trip < d->aux_trip_nr)
*temp = d->aux_trips[trip];
else if (trip == d->crt_trip_id)
@@ -65,10 +67,14 @@ static int int340x_thermal_get_trip_temp(struct thermal_zone_device *zone,
break;
}
}
- if (i == INT340X_THERMAL_MAX_ACT_TRIP_COUNT)
+ if (i == INT340X_THERMAL_MAX_ACT_TRIP_COUNT) {
+ mutex_unlock(&d->trip_mutex);
return -EINVAL;
+ }
}

+ mutex_unlock(&d->trip_mutex);
+
return 0;
}

@@ -180,6 +186,8 @@ int int340x_thermal_read_trips(struct int34x_thermal_zone *int34x_zone)
int trip_cnt = int34x_zone->aux_trip_nr;
int i;

+ mutex_lock(&int34x_zone->trip_mutex);
+
int34x_zone->crt_trip_id = -1;
if (!int340x_thermal_get_trip_config(int34x_zone->adev->handle, "_CRT",
&int34x_zone->crt_temp))
@@ -207,6 +215,8 @@ int int340x_thermal_read_trips(struct int34x_thermal_zone *int34x_zone)
int34x_zone->act_trips[i].valid = true;
}

+ mutex_unlock(&int34x_zone->trip_mutex);
+
return trip_cnt;
}
EXPORT_SYMBOL_GPL(int340x_thermal_read_trips);
@@ -230,6 +240,8 @@ struct int34x_thermal_zone *int340x_thermal_zone_add(struct acpi_device *adev,
if (!int34x_thermal_zone)
return ERR_PTR(-ENOMEM);

+ mutex_init(&int34x_thermal_zone->trip_mutex);
+
int34x_thermal_zone->adev = adev;
int34x_thermal_zone->override_ops = override_ops;

@@ -281,6 +293,7 @@ struct int34x_thermal_zone *int340x_thermal_zone_add(struct acpi_device *adev,
acpi_lpat_free_conversion_table(int34x_thermal_zone->lpat_table);
kfree(int34x_thermal_zone->aux_trips);
err_trip_alloc:
+ mutex_destroy(&int34x_thermal_zone->trip_mutex);
kfree(int34x_thermal_zone);
return ERR_PTR(ret);
}
@@ -292,6 +305,7 @@ void int340x_thermal_zone_remove(struct int34x_thermal_zone
thermal_zone_device_unregister(int34x_thermal_zone->zone);
acpi_lpat_free_conversion_table(int34x_thermal_zone->lpat_table);
kfree(int34x_thermal_zone->aux_trips);
+ mutex_destroy(&int34x_thermal_zone->trip_mutex);
kfree(int34x_thermal_zone);
}
EXPORT_SYMBOL_GPL(int340x_thermal_zone_remove);
diff --git a/drivers/thermal/intel/int340x_thermal/int340x_thermal_zone.h b/drivers/thermal/intel/int340x_thermal/int340x_thermal_zone.h
index 3b4971df1b33..8f9872afd0d3 100644
--- a/drivers/thermal/intel/int340x_thermal/int340x_thermal_zone.h
+++ b/drivers/thermal/intel/int340x_thermal/int340x_thermal_zone.h
@@ -32,6 +32,7 @@ struct int34x_thermal_zone {
struct thermal_zone_device_ops *override_ops;
void *priv_data;
struct acpi_lpat_conversion_table *lpat_table;
+ struct mutex trip_mutex;
};

struct int34x_thermal_zone *int340x_thermal_zone_add(struct acpi_device *,
--
2.38.1