Re: [patch 00/12] thermal/x86_pkg_temp: Sanitize yet another hotplug and locking trainwreck

From: Pandruvada, Srinivas
Date: Mon Nov 21 2016 - 18:16:27 EST


On Mon, 2016-11-21 at 22:34 +0100, Thomas Gleixner wrote:
> On Mon, 21 Nov 2016, Pandruvada, Srinivas wrote:

[...]

> Stupid me. I tested putting a socket offline, which works, but did
> not
> check what happens on module removal. Delta fix below. That needs to
> be
> folded into the series as the wreckage already happens before the
> last
> patch.

Your change below fixes the crash issue. Now I tested a case where the
last cpu offlined from a package, it removed thermal zone and added
zone back once any cpu from the package onlined. So this is working.

I want to try to run some workload on those cpu to bump up the
temperature and check interrupts. I am hitting some issue unrelated to
this change may be. I onlined three cpus from the package 1.

[189443.567728] smpboot: Booting Node 1 Processor 15 APIC 0x2e
[189656.625947] smpboot: Booting Node 1 Processor 8 APIC 0x20
[189829.545851] smpboot: Booting Node 1 Processor 24 APIC 0x21

But I can't schedule anything on those CPUs. For example now can't run
turbostat, it complains
"
turbostat: re-initialized with num_cpus 19
Could not migrate to CPU 8
"

Same with

#taskset 0x100 stress -c 1
taskset: failed to set pid 0's affinity: Invalid argument

I am on the latest linux-pm/linux-next tree on this server. I will
switch to latest main line and try.

Thanks,
Srinivas



8<--------------------
--- a/drivers/thermal/x86_pkg_temp_thermal.c
+++ b/drivers/thermal/x86_pkg_temp_thermal.c
@@ -63,6 +63,7 @@ struct pkg_device {
ÂÂÂÂÂÂÂÂu32ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂmsr_pkg_therm_high;
ÂÂÂÂÂÂÂÂstruct delayed_workÂÂÂÂÂÂÂÂÂÂÂÂÂwork;
ÂÂÂÂÂÂÂÂstruct thermal_zone_deviceÂÂÂÂÂÂ*tzone;
+ÂÂÂÂÂÂÂstruct cpumaskÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂcpumask;
Â};
Â
Âstatic struct thermal_zone_params pkg_temp_tz_params = {
@@ -391,6 +392,7 @@ static int pkg_temp_thermal_device_add(u
ÂÂÂÂÂÂÂÂrdmsr(MSR_IA32_PACKAGE_THERM_INTERRUPT, pkgdev-
>msr_pkg_therm_low,
ÂÂÂÂÂÂÂÂÂÂÂÂÂÂpkgdev->msr_pkg_therm_high);
Â
+ÂÂÂÂÂÂÂcpumask_set_cpu(cpu, &pkgdev->cpumask);
ÂÂÂÂÂÂÂÂspin_lock_irq(&pkg_temp_lock);
ÂÂÂÂÂÂÂÂpackages[pkgid] = pkgdev;
ÂÂÂÂÂÂÂÂspin_unlock_irq(&pkg_temp_lock);
@@ -399,13 +401,15 @@ static int pkg_temp_thermal_device_add(u
Â
Âstatic int pkg_thermal_cpu_offline(unsigned int cpu)
Â{
-ÂÂÂÂÂÂÂint target = cpumask_any_but(topology_core_cpumask(cpu), cpu);
ÂÂÂÂÂÂÂÂstruct pkg_device *pkgdev = pkg_temp_thermal_get_dev(cpu);
ÂÂÂÂÂÂÂÂbool lastcpu, was_target;
+ÂÂÂÂÂÂÂint target;
Â
ÂÂÂÂÂÂÂÂif (!pkgdev)
ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂreturn 0;
Â
+ÂÂÂÂÂÂÂtarget = cpumask_any_but(&pkgdev->cpumask, cpu);
+ÂÂÂÂÂÂÂcpumask_clear_cpu(cpu, &pkgdev->cpumask);
ÂÂÂÂÂÂÂÂlastcpu = target >= nr_cpu_ids;
Â
ÂÂÂÂÂÂÂÂ/*
@@ -492,8 +496,10 @@ static int pkg_thermal_cpu_online(unsign
ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂreturn -ENODEV;
Â
ÂÂÂÂÂÂÂÂ/* If the package exists, nothing to do */
-ÂÂÂÂÂÂÂif (pkgdev)
+ÂÂÂÂÂÂÂif (pkgdev) {
+ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂcpumask_set_cpu(cpu, &pkgdev->cpumask);
ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂreturn 0;
+ÂÂÂÂÂÂÂ}
ÂÂÂÂÂÂÂÂreturn pkg_temp_thermal_device_add(cpu);
Â}
Â