Re: [patch 05/12] timekeeping_notify: use stop_machine_fail when appropriate

From: Marcelo Tosatti
Date: Fri Feb 09 2024 - 11:09:15 EST


On Thu, Feb 08, 2024 at 04:23:58PM +0100, Thomas Gleixner wrote:
> On Wed, Feb 07 2024 at 09:58, Marcelo Tosatti wrote:
> > On Wed, Feb 07, 2024 at 12:57:19PM +0100, Thomas Gleixner wrote:
> >> On Tue, Feb 06 2024 at 15:49, Marcelo Tosatti wrote:
> >> > Change timekeeping_notify to use stop_machine_fail when appropriate,
> >> > which will fail in case the target CPU is tagged as block interference
> >> > CPU.
> >>
> >> You completely fail to explain 'appropriate'. There is zero reason for
> >> this churn, really.
> >
> > The churn is so that we can return an error to
> > current_clocksource_store (sysfs handler for writes to
> > /sys/devices/system/clocksource/clocksource0/current_clocksource).
>
> What for? Why?
>
> Writing to that file requires root. Root can rightfully screw up a
> system and adding a debugfs based "prevention" mechanism is not making
> this any better because root can just clear the CPU mask there and move
> on.
>
> So what is the actual real world problem solved by these patches?
>
> All I've seen so far is handwaving about interference prevention and TBH
> I can't squint hard enough to believe that.
>
> Thanks,
>
> tglx


Thomas,

The problem is an administrator is not aware of the relationship between

Kernel interface -> generation of IPIs.


Even less so of which applications in the system are accessing
which kernel interfaces.

Let me give some examples:

1) Change of trip temperatures from userspace.

static int
sys_set_trip_temp(struct thermal_zone_device *tzd, int trip, int temp)
{
struct zone_device *zonedev = thermal_zone_device_priv(tzd);
u32 l, h, mask, shift, intr;
int tj_max, val, ret;

tj_max = intel_tcc_get_tjmax(zonedev->cpu);
if (tj_max < 0)
return tj_max;
tj_max *= 1000;

val = (tj_max - temp)/1000;

if (trip >= MAX_NUMBER_OF_TRIPS || val < 0 || val > 0x7f)
return -EINVAL;

ret = rdmsr_on_cpu(zonedev->cpu, MSR_IA32_PACKAGE_THERM_INTERRUPT,
&l, &h);
if (ret < 0)
return ret;

if (trip) {
mask = THERM_MASK_THRESHOLD1;
shift = THERM_SHIFT_THRESHOLD1;
intr = THERM_INT_THRESHOLD1_ENABLE;
} else {
mask = THERM_MASK_THRESHOLD0;
shift = THERM_SHIFT_THRESHOLD0;
intr = THERM_INT_THRESHOLD0_ENABLE;
}
l &= ~mask;
/*
* When users space sets a trip temperature == 0, which is indication
* that, it is no longer interested in receiving notifications.
*/
if (!temp) {
l &= ~intr;
} else {
l |= val << shift;
l |= intr;
}

return wrmsr_on_cpu(zonedev->cpu, MSR_IA32_PACKAGE_THERM_INTERRUPT,
l, h);
}

It seems plausible that userspace application decides to change trip temperature.

2) Read of per-CPU registers (from sysfs).

arch/arm64/kernel/topology.c

static inline
int counters_read_on_cpu(int cpu, smp_call_func_t func, u64 *val)
{
/*
* Abort call on counterless CPU or when interrupts are
* disabled - can lead to deadlock in smp sync call.
*/
if (!cpu_has_amu_feat(cpu))
return -EOPNOTSUPP;

if (WARN_ON_ONCE(irqs_disabled()))
return -EPERM;

smp_call_function_single(cpu, func, val, 1);

return 0;
}

sysfs read -> cppc_get_perf_caps -> cpc_read -> cpc_read_ffh -> counters_read_on_cpu

#define define_one_cppc_ro(_name) \
static struct kobj_attribute _name = \
__ATTR(_name, 0444, show_##_name, NULL)

This one does not even require root.

Other interfaces on the same class:

show_pw20_wait_time (PowerPC)
uncore_read_freq (x86)
..

So while I understand your point that root can (and should be)
able to perform any operations on the system, this interface would
be along the lines of:

"I don't want isolated CPUs to be interrupted, but i am not aware of
which kernel interfaces can result in interruptions to isolated CPUs.

Lets indicate through this cpumask, which the kernel can consult,
that we'd like userspace operations to fail, if they were going
to interrupt an isolated CPU".

Its the kernel that knows which operations might interrupt
isolated CPUs, not userspace.

Also https://www.spinics.net/lists/kernel/msg5094328.html.