Re: sched/fair: scheduler not running high priority process on idle cpu

From: Dietmar Eggemann
Date: Mon Jan 20 2020 - 04:39:32 EST


On 15/01/2020 18:07, David Laight wrote:
> From Steven Rostedt
>> Sent: 15 January 2020 15:31
> ...
>>> For this case an idle cpu doing an unlocked check for a process that has
>>> been waiting 'ages' to preempt the running process may not be too
>>> expensive.
>>
>> How do you measure a process waiting for ages on another CPU? And then
>> by the time you get the information to pull it, there's always the race
>> that the process will get the chance to run. And if you think about it,
>> by looking for a process waiting for a long time, it is likely it will
>> start to run because "ages" means it's probably close to being released.
>
> Without a CBU (Crystal Ball Unit) you can always be unlucky.
> But once you get over the 'normal' delays for a system call you probably
> get an exponential (or is it logarithmic) distribution and the additional
> delay is likely to be at least some fraction of the time it has already waited.
>
> While not entirely the same issue, it is something I still need to look at further.
> This is a histogram of time taken (in ns) to send on a raw IPv4 socket.
> 0k: 1874462617
> 96k: 260350
> 160k: 30771
> 224k: 14812
> 288k: 770
> 352k: 593
> 416k: 489
> 480k: 368
> 544k: 185
> 608k: 63
> 672k: 27
> 736k: 6
> 800k: 1
> 864k: 2
> 928k: 3
> 992k: 4
> 1056k: 1
> 1120k: 0
> 1184k: 1
> 1248k: 1
> 1312k: 2
> 1376k: 3
> 1440k: 1
> 1504k: 1
> 1568k: 1
> 1632k: 4
> 1696k: 0 (5 times)
> 2016k: 1
> 2080k: 0
> 2144k: 1
> total: 1874771078, average 32k
>
> I've improved it no end by using per-thread sockets and setting
> the socket write queue size large.
> But there are still some places where it takes > 600us.
> The top end is rather more linear than one might expect.
>
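
[ Aside: a minimal sketch of how a per-send latency histogram like the
  one quoted above can be collected.  The 64k ns bucket width, the
  bucket count and the plain UDP socket standing in for the raw IPv4
  socket are illustrative assumptions, not the actual test harness. ]

#include <stdio.h>
#include <stdint.h>
#include <time.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

#define NBUCKETS        64
#define BUCKET_NS       65536ULL        /* 64k ns per bucket (assumed) */

static uint64_t hist[NBUCKETS];

static uint64_t now_ns(void)
{
        struct timespec ts;

        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (uint64_t)ts.tv_sec * 1000000000ULL + ts.tv_nsec;
}

int main(void)
{
        /* UDP to the loopback discard port stands in for the raw IPv4
         * socket; the timing and bucketing are the point here. */
        int fd = socket(AF_INET, SOCK_DGRAM, 0);
        struct sockaddr_in dst = {
                .sin_family = AF_INET,
                .sin_port = htons(9),
                .sin_addr.s_addr = htonl(INADDR_LOOPBACK),
        };
        char buf[64] = { 0 };
        int i;

        if (fd < 0) {
                perror("socket");
                return 1;
        }

        for (i = 0; i < 100000; i++) {
                uint64_t t0 = now_ns(), d;

                sendto(fd, buf, sizeof(buf), 0,
                       (struct sockaddr *)&dst, sizeof(dst));
                d = (now_ns() - t0) / BUCKET_NS;
                hist[d < NBUCKETS ? d : NBUCKETS - 1]++;
        }

        for (i = 0; i < NBUCKETS; i++)
                if (hist[i])
                        printf("%dk: %llu\n", i * 64,
                               (unsigned long long)hist[i]);

        close(fd);
        return 0;
}
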
>>> I presume the locks are in place for the migrate itself.
>>
>> Note, grabbing locks on another CPU will incur overhead on that
>> other CPU. I've seen huge latency caused by doing just this.
>
> I'd have thought this would only be significant if the cache line
> ends up being used by both cpus?
>
>>> The only downside is that the process's data is likely to be in the wrong cache,
>>> but unless the original cpu becomes available just after the migrate it is
>>> probably still a win.
>>
>> If you are doing this with just tasks that are waiting for the CPU to
>> be preemptable, then it is most likely not a win at all.
>
> You'd need a good guess that the wait would be long.
>
>> Now, the RT tasks do have an aggressive push / pull logic, that keeps
>> track of which CPUs are running lower priority tasks and will work hard
>> to keep all RT tasks running (and aggressively migrate them). But this
>> logic still only takes place at preemption points (cond_resched(), etc).
>
> I guess this only 'gives away' extra RT processes,
> rather than 'stealing' them - which is what I need.

Isn't part of the problem that RT doesn't maintain
cp->pri_to_cpu[CPUPRI_IDLE] (CPUPRI_IDLE = 0)?

So push/pull (find_lowest_rq()) never returns a mask of idle CPUs.
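
To make this concrete, here is a toy userspace model of that lookup.
CPUPRI_IDLE and CPUPRI_NORMAL mirror kernel/sched/cpupri.c, but the
single CPUPRI_RT_MIN level is a stand-in for the per-RT-priority
levels; this is a simplification for illustration, not the mainline
code:

/*
 * Toy model of the cpupri lookup: one CPU mask per priority level,
 * scanned from the lowest level upwards, the way find_lowest_rq()
 * relies on it when picking a push target.
 */
#include <stdio.h>

/* priority levels, lowest first */
enum {
        CPUPRI_IDLE,            /* CPU is idle                  */
        CPUPRI_NORMAL,          /* CPU runs a CFS task          */
        CPUPRI_RT_MIN,          /* CPU runs an RT task (one level
                                 * per RT priority in the real thing) */
        CPUPRI_NR_PRIORITIES,
};

/* one CPU bitmask per priority level */
static unsigned long pri_to_cpu[CPUPRI_NR_PRIORITIES];

/* return the first non-empty mask below @task_pri, lowest level first */
static unsigned long find_lowest_mask(int task_pri)
{
        int idx;

        for (idx = 0; idx < task_pri; idx++)
                if (pri_to_cpu[idx])
                        return pri_to_cpu[idx];
        return 0;
}

int main(void)
{
        /* CPUs 0-1 run CFS tasks, CPU 2 is idle.  If the idle level is
         * never maintained, the idle CPU is filed under NORMAL too: */
        pri_to_cpu[CPUPRI_IDLE]   = 0x0;
        pri_to_cpu[CPUPRI_NORMAL] = 0x7;

        printf("lowest mask for an RT task: 0x%lx\n",
               find_lowest_mask(CPUPRI_RT_MIN));        /* 0x7 */

        /* If the idle level were maintained, the idle CPU would be
         * returned on its own, ahead of the CPUs running CFS tasks: */
        pri_to_cpu[CPUPRI_IDLE]   = 0x4;
        pri_to_cpu[CPUPRI_NORMAL] = 0x3;

        printf("lowest mask with idle tracked: 0x%lx\n",
               find_lowest_mask(CPUPRI_RT_MIN));        /* 0x4 */

        return 0;
}

So as long as idle CPUs are only ever recorded at CPUPRI_NORMAL, they
look no different from CPUs running CFS tasks to the RT push logic.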

There was
https://lore.kernel.org/r/1415260327-30465-2-git-send-email-pang.xunlei@xxxxxxxxxx
in 2014 but it didn't go mainline.

There was a similar question in Nov last year:

https://lore.kernel.org/r/CH2PR19MB3896AFE1D13AD88A17160860FC700@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx