Re: [PATCH v6 08/11] platform/x86/intel/ifs: Add scan test support

From: Thomas Gleixner
Date: Fri May 06 2022 - 09:30:51 EST


On Thu, May 05 2022 at 18:40, Tony Luck wrote:
> +/*
> + * Note all code and data in this file is protected by
> + * ifs_sem. On HT systems all threads on a core will
> + * execute together, but only the first thread on the
> + * core will update results of the test.
> + */
> +struct workqueue_struct *ifs_wq;

Seems to be unused.

> +static bool oscan_enabled = true;

What changes this?

> +static void message_not_tested(struct device *dev, int cpu, union ifs_status status)
> +{
> + if (status.error_code < ARRAY_SIZE(scan_test_status))

Please add curly brackets as these are not one-line statements.

> + dev_info(dev, "CPU(s) %*pbl: SCAN operation did not start. %s\n",
> + cpumask_pr_args(topology_sibling_cpumask(cpu)),
> + scan_test_status[status.error_code]);
> +/*
> + * Execute the scan. Called "simultaneously" on all threads of a core
> + * at high priority using the stop_cpus mechanism.
> + */
> +static int doscan(void *data)
> +{
> + int cpu = smp_processor_id();
> + u64 *msrs = data;
> + int first;
> +
> + /* Only the first logical CPU on a core reports result */
> + first = cpumask_first(topology_sibling_cpumask(cpu));

Shouldn't that be cpu_smt_mask()?

> + /*
> + * This WRMSR will wait for other HT threads to also write
> + * to this MSR (at most for activate.delay cycles). Then it
> + * starts scan of each requested chunk. The core scan happens
> + * during the "execution" of the WRMSR. This instruction can
> + * take up to 200 milliseconds before it retires.

200ms per test chunk?

> + */
> + wrmsrl(MSR_ACTIVATE_SCAN, msrs[0]);
> +

> + while (activate.start <= activate.stop) {
> + if (time_after(jiffies, timeout)) {
> + status.error_code = IFS_SW_TIMEOUT;
> + break;
> + }
> +
> + msrvals[0] = activate.data;
> + stop_core_cpuslocked(cpu, doscan, msrvals);
> +
> + status.data = msrvals[1];
> +
> + /* Some cases can be retried, give up for others */
> + if (!can_restart(status))
> + break;
> +
> + if (status.chunk_num == activate.start) {
> + /* Check for forward progress */
> + if (retries-- == 0) {
> + if (status.error_code == IFS_NO_ERROR)
> + status.error_code = IFS_SW_PARTIAL_COMPLETION;
> + break;
> + }
> + } else {
> + retries = MAX_IFS_RETRIES;
> + activate.start = status.chunk_num;
> + }
> + }

Looks way better now.

> +}
> +/*
> + * Initiate per core test. It wakes up work queue threads on the target cpu and
> + * its sibling cpu. Once all sibling threads wake up, the scan test gets executed and
> + * wait for all sibling threads to finish the scan test.
> + */
> +int do_core_test(int cpu, struct device *dev)
> +{
> + int ret = 0;
> +
> + if (!scan_enabled)
> + return -ENXIO;
> +
> + /* Prevent CPUs from being taken offline during the scan test */
> + cpus_read_lock();
> +
> + if (!cpu_online(cpu)) {
> + dev_info(dev, "cannot test on the offline cpu %d\n", cpu);
> + ret = -EINVAL;
> + goto out;
> + }

Coming back to my points from the previous round:

1) How is that supposed to work on a system which has HT enabled in BIOS,
but disabled on the kernel command line or via /sys/..../smt/control or
when a HT sibling is offlined temporarily?

I assume it cannot work, but I can't see anything which handles those
cases.

2) That documentation for the admin/user got eaten by the gremlins in
the intertubes again.

Thanks,

tglx