Re: [PATCH v1 0/6] x86/resctrl: Avoid searching tasklist during mongrp_reparent

From: Reinette Chatre
Date: Thu Apr 04 2024 - 19:09:31 EST


Hi Peter,

On 3/25/2024 10:27 AM, Peter Newman wrote:
> Hi Reinette,
>
> I've been working with users of the recently-added mongroup rename
> operation[1] who have observed the impact of back-to-back operations on
> latency-sensitive, thread pool-based services. Because changing a
> resctrl group's CLOSID (or RMID) requires all task_structs in the system
> to be inspected with the tasklist_lock read-locked, a series of move
> operations can block out thread creation for long periods of time, as
> creating threads needs to write-lock the tasklist_lock.

Could you please give some insight into the delays experienced? "long
periods of time" mean different things to different people and this
series seems to get more ominous as is progresses with the cover letter
starting with "long periods of time" and by the time the final patch
appears it has become "disastrous".

>
> To avoid this issue, this series replaces the CLOSID and RMID values
> cached in the task_struct with a pointer to the task's rdtgroup, through
> which the current CLOSID and RMID can be obtained indirectly during a

On a high level this seems fair but using a pointer to the rdtgroup in the
task_struct means that the scheduling code needs to dereference that pointer
without any lock held. The changes do take great care and it
would be valuable to clearly document how these accesses are safe. (please
see patch #4 and #6)

> context switch. Updating a group's ID then only requires the current
> task to be switched back in on all CPUs. On server hosts with very large
> thread counts, this is much less disruptive than making thread creation
> globally unavailable. However, this is less desirable on CPU-isolated,
> realtime workloads, so I am interested in suggestions on how to reach a
> compromise for the two use cases.

As I understand this only impacts moving a monitor group? To me this sounds
like a user space triggered event associated with a particular use case that
may not be relevant to the workloads that you refer to. I think this could be
something that can be documented for users with this type of workloads.
(please see patch #6)

>
> Also note that the soft-ABMC[2] proposal is based on swapping the RMID
> values in mongroups when monitors are assigned to them, which will
> result in the same slowdown as was encountered with re-parenting
> monitoring groups.
>
> Using pointers to rdtgroups from the task_struct been used internally at

"have been used internally"?

> Google for a few years to support an alternate CLOSID management
> technique[3], which was replaced by mongroup rename. Usage of this
> technique during production revealed an issue where threads in an
> exiting process become unreachable via for_each_process_thread() before
> destruction, but can still switch in and out. Consequently, an rmdir
> operation can fail to remove references to an rdtgroup before it frees
> it, causing the pointer to freed memory to be dereferenced after the
> structure is freed. This would result in invalid CLOSID/RMID values
> being written into MSRs, causing an exception. The workaround for this
> is to clear a task's rdtgroup pointer when it exits.

This does not sound like a problem unique to this new implementation but the
"invalid CLOSID/RMID values being written into MSRs" sounds like something
that happens today? Probably not worth pulling out for this reason because
in its current form the exiting process will keep using the original
CLOSID/RMID that have no issue being written to MSRs.

Reinette