Re: [PATCH v3 0/3] Introduce per-task latency_nice for scheduler hints

From: chris hyser
Date: Thu Feb 20 2020 - 09:31:18 EST




On 2/20/20 3:50 AM, Parth Shah wrote:


On 2/20/20 2:04 PM, Parth Shah wrote:


On 2/19/20 11:53 PM, chris hyser wrote:


On 2/19/20 9:15 AM, chris hyser wrote:


On 2/19/20 5:09 AM, Parth Shah wrote:
Hi Chris,

On 2/19/20 4:30 AM, chris hyser wrote:
On 2/17/20 3:57 AM, Parth Shah wrote:


On 1/16/20 5:32 PM, Parth Shah wrote:
This is the 3rd revision of the patch set to introduce
latency_{nice/tolerance} as a per task attribute.

The previous version can be found at:
v1: https://lkml.org/lkml/2019/11/25/151
v2: https://lkml.org/lkml/2019/12/8/10

Changes in this revision are:
v2 -> v3:
- This series changes the longer attribute name to "latency_nice" as per
ÂÂÂ the comment from Dietmar Eggemann
https://lkml.org/lkml/2019/12/5/394
v1 -> v2:
- Addressed comments from Qais Yousef
- As per suggestion from Dietmar, moved content from newly created
ÂÂÂ include/linux/sched/latency_tolerance.h to kernel/sched/sched.h
- Extend sched_setattr() to support latency_tolerance in tools
headers UAPI


Introduction:
==============
This patch series introduces a new per-task attribute latency_nice to
provide the scheduler hints about the latency requirements of the
task [1].

Latency_nice is a ranged attribute of a task with the value ranging
from [-20, 19] both inclusive which makes it align with the task nice
value.

The value should provide scheduler hints about the relative latency
requirements of tasks, meaning the task with "latency_nice = -20"
should have lower latency requirements than compared to those tasks with
higher values. Similarly a task with "latency_nice = 19" can have higher
latency and hence such tasks may not care much about latency.

The default value is set to 0. The usecases discussed below can use this
range of [-20, 19] for latency_nice for the specific purpose. This
patch does not implement any use cases for such attribute so that any
change in naming or range does not affect much to the other (future)
patches using this. The actual use of latency_nice during task wakeup
and load-balancing is yet to be coded for each of those usecases.

As per my view, this defined attribute can be used in following ways
for a
some of the usecases:
1 Reduce search scan time for select_idle_cpu():
- Reduce search scans for finding idle CPU for a waking task with lower
ÂÂÂ latency_nice values.

2 TurboSched:
- Classify the tasks with higher latency_nice values as a small
ÂÂÂ background task given that its historic utilization is very low, for
ÂÂÂ which the scheduler can search for more number of cores to do task
 packing. A task with a latency_nice >= some_threshold (e.g, == 19)
ÂÂÂ and util <= 12.5% can be background tasks.

3 Optimize AVX512 based workload:
- Bias scheduler to not put a task having (latency_nice == -20) on a
ÂÂÂ core occupying AVX512 based workload.


Series Organization:
====================
- Patch 1: Add new attribute latency_nice to task_struct.
- Patch 2: Clone parent task's attribute to the child task on fork
- Patch 3: Add support for sched_{set,get}attr syscall to modify
ÂÂÂÂÂÂÂÂÂÂÂÂ latency_nice of the task


The patch series can be applied on tip/sched/core at the
commit 804d402fb6f6 ("sched/rt: Make RT capacity-aware")


References:
============
[1]. Usecases for the per-task latency-nice attribute,
ÂÂÂÂÂÂ https://lkml.org/lkml/2019/9/30/215
[2]. Task Latency-nice, "Subhra Mazumdar",
ÂÂÂÂÂÂ https://lkml.org/lkml/2019/8/30/829
[3]. Introduce per-task latency_tolerance for scheduler hints,
ÂÂÂÂÂÂ https://lkml.org/lkml/2019/12/8/10


Parth Shah (3):
ÂÂÂ sched: Introduce latency-nice as a per-task attribute
ÂÂÂ sched/core: Propagate parent task's latency requirements to the
child
ÂÂÂÂÂ task
ÂÂÂ sched: Allow sched_{get,set}attr to change latency_nice of the task

ÂÂ include/linux/sched.hÂÂÂÂÂÂÂÂÂÂÂ |Â 1 +
ÂÂ include/uapi/linux/sched.hÂÂÂÂÂÂ |Â 4 +++-
ÂÂ include/uapi/linux/sched/types.h | 19 +++++++++++++++++++
ÂÂ kernel/sched/core.cÂÂÂÂÂÂÂÂÂÂÂÂÂ | 21 +++++++++++++++++++++
ÂÂ kernel/sched/sched.hÂÂÂÂÂÂÂÂÂÂÂÂ | 18 ++++++++++++++++++
ÂÂ tools/include/uapi/linux/sched.h |Â 4 +++-
ÂÂ 6 files changed, 65 insertions(+), 2 deletions(-)


Its been a long time and few revisions since the beginning of the
discussion around the latency-nice. Hence thought of asking if there
is/are
any further work that needs to be done for adding latency-nice
attribute or
am I missing any piece in here?

All, I was asked to take a look at the original latency_nice patchset.
First, to clarify objectives, Oracle is not interested in trading
throughput for latency. What we found is that the DB has specific tasks
which do very little but need to do this as absolutely quickly as
possible,
ie extreme latency sensitivity. Second, the key to latency reduction in
the
task wakeup path seems to be limiting variations of "idle cpu" search. The
latter particularly interests me as an example of "platform size based
latency" which I believe to be important given all the varying size VMs
and
containers.

Parth, I've been using your v3 patchset as the basis of an investigation
into the measurable effects of short-circuiting this search. I'm not quite
ready to put anything out, but the patchset is working well. The only

That's a good news as you are able to get a usecase of this patch-set.

feedback I have is that currently non-root can set the value negative
which
is inconsistent with 'nice' and I would think a security hole.


I would assume you mean 'latency_nice' here.

ÂFrom my testing, I was not able to set values for any root owned task's
latency_nice value by the non-root user. Also, my patch-set just piggybacks
on the already existing sched_setattr syscall and hence it should not allow
non-root user to do any modifications. Can you confirm this by changing
nice (renice) value of a root task from non-root user.

I have done the sanity check in the code and thinking where it could
possibly have gone wrong. So, can you please specify what values were you
able to set outside the [-20, 19] range?

The checks prevent being outside that range. But negative numbers -20 to
-1 did not need root. Let me dig some more. I verified this explicitly
before sending the email so something is up.

I went digging. This is absolutely repeatable. I checked that I do not
unknowingly have CAP_SYS_NICE as a user. So first, are we tying
latency_nice to CAP_SYS_NICE? Seems like a reasonable thing, but not sure I
saw this stated anywhere. Second, the only capability checked in
__sched_setscheduler() in the patch I have is CAP_SYS_NICE and those checks
will not return a -EPERM for a negative latency_tolerance (in the code, aka
latency_nice). Do I have the correct version of the code? Am I missing
something?

You are right. I have not added permission checks for setting the
latency_nice value. For the task_nice, non-root user has no permission to
set the value lower than the current value which is not the case with the
latency_nice.

In order to align with the permission checks like task_nice, I will add the
check similar to task_nice and send out the v4 of the series soon.


Thanks for pointing out.
- Parth


The below diff works out well enough in-order to align permission checks
with NICE.

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2bfcff5623f9..ef4a397c9170 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4878,6 +4878,10 @@ static int __sched_setscheduler(struct task_struct *p,
return -EINVAL;
if (attr->sched_latency_nice < MIN_LATENCY_NICE)
return -EINVAL;
+ /* Use the same security checks as NICE */
+ if (attr->sched_latency_nice < p->latency_nice &&
+ !can_nice(p, attr->sched_latency_nice))
+ return -EPERM;
}

if (pi)

With the above in effect,
A non-root user can only increase the value upto +19, and once increased
cannot be decreased. e.g., a user once sets the value latency_nice = 19,
the same user cannot set the value latency_nice = 18. This is the same
effect as with NICE.

Is such permission checks required?

Unlike NICE, we are going to use latency_nice for scheduler hints only, and
so won't it make more sense to allow a user to increase/decrease the values
of their owned tasks?

Whether called a hint or not, it is a trade-off to reduce latency of select tasks at the expense of the throughput of the other tasks in the the system. If any of the other tasks belong to other users, you would presumably require permission.

-chrish