Re: [RFC PATCH 0/2] Introduce per-task io utilization boost

From: Vincent Guittot
Date: Fri Mar 22 2024 - 14:09:22 EST


Hi Christian,

On Mon, 4 Mar 2024 at 21:17, Christian Loehle <christian.loehle@xxxxxxx> wrote:
>
> There is a feature inside of both schedutil and intel_pstate called
> iowait boosting which tries to prevent selecting a low frequency
> during IO workloads when it impacts throughput.
> The feature is implemented by checking for task wakeups that have
> the in_iowait flag set and boost the CPU of the rq accordingly
> (implemented through cpufreq_update_util(rq, SCHED_CPUFREQ_IOWAIT)).
>
> The necessity of the feature is argued with the potentially low
> utilization of a task being frequently in_iowait (i.e. most of the
> time not enqueued on any rq and cannot build up utilization).
>
> The RFC focuses on the schedutil implementation.
> intel_pstate frequency selection isn't touched for now, suggestions are
> very welcome.
> Current schedutil iowait boosting has several issues:
> 1. Boosting happens even in scenarios where it doesn't improve
> throughput. [1]
> 2. The boost is not accounted for in EAS: a) feec() will only consider
> the actual utilization for task placement, but another CPU might be
> more energy-efficient at that capacity than the boosted one.)
> b) When placing a non-IO task while a CPU is boosted compute_energy()
> will not consider the (potentially 'free') boosted capacity, but the
> one it would have without the boost (since the boost is only applied
> in sugov).
> 3. Actual IO heavy workloads are hardly distinguished from infrequent
> in_iowait wakeups.
> 4. The boost isn't associated with a task, it therefore isn't considered
> for task placement, potentially missing out on higher capacity CPUs on
> heterogeneous CPU topologies.
> 5. The boost isn't associated with a task, it therefore lingers on the
> rq even after the responsible task has migrated / stopped.
> 6. The boost isn't associated with a task, it therefore needs to ramp
> up again when migrated.
> 7. Since schedutil doesn't know which task is getting woken up,
> multiple unrelated in_iowait tasks might lead to boosting.
>
> We attempt to mitigate all of the above by reworking the way the
> iowait boosting (io boosting from here on) works in two major ways:
> - Carry the boost in task_struct, so it is a per-task attribute and
> behaves similar to utilization of the task in some ways.
> - Employ a counting-based tracking strategy that only boosts as long
> as it sees benefits and returns to no boosting dynamically.

Thanks for working on improving IO boosting. I have started to read
your patchset and have few comments about your proposal:

The main one is that the io boosting decision should remain a cpufreq
governor decision and so the io boosting value should be applied by
the governor like in sugov_effective_cpu_perf() as an example instead
of everywhere in the scheduler code.

Then, the algorithm to track the right interval bucket and the mapping
of intervals into utilization really looks like a policy which has
been defined with heuristics and as a result further seems to be a
governor decision

Finally adding some atomic operation in the fast path is not really desirable

I will continue to review your patchset

>
> Note that some the issues (1, 3) can be solved by using a
> counting-based strategy on a per-rq basis, i.e. in sugov entirely.
> Experiments with Android in particular showed that such a strategy
> (which necessarily needs longer intervals to be reasonably stable)
> is too prone to migrations to be useful generally.
> We therefore consider the additional complexity of such a per-task
> based approach like proposed to be worth it.
>
> We require a minimum of 1000 iowait wakeups per second to start
> boosting.
> This isn't too far off from what sugov currently does, since it resets
> the boost if it hasn't seen an iowait wakeup for TICK_NSEC.
> For CONFIG_HZ=1000 we are on par, for anything below we are stricter.
> We justify this by the small possible improvement by boosting in the
> first place with 'rare' few iowait wakeups.
>
> When IO even leads to a task being in iowait isn't as straightforward
> to explain.
> Of course if the issued IO can be served by the page cache (e.g. on
> reads because the pages are contained, on writes because they can be
> marked dirty and the writeback takes care of it later) the actual
> issuing task is usually not in iowait.
> We consider this the good case, since whenever the scheduler and a
> potential userspace / kernel switch is in the critical path for IO
> there is possibly overhead impacting throughput.
> We therefore focus on random read from here on, because (on synchronous
> IO [3]) this will lead to the task being set in iowait for every IO.
> This is where iowait boosting shows its biggest throughput improvement.
> From here on IOPS (IO operations per second) and iowait wakeups may
> therefore be used interchangeably.
>
> Performance:
> Throughput for random read tries to be on par with the sugov
> implementation of iowait boosting for reasonably long-lived workloads.
> See the following table for some results, values are in IOPS, the
> tests are ran for 30s with pauses in-between, results are sorted.
>
> nvme on rk3399
> [3588, 3590, 3597, 3632, 3745] sugov mainline
> [3581, 3751, 3770, 3771, 3885] per-task tracking
> [2592, 2639, 2701, 2717, 2784] sugov no iowait boost
> [3218, 3451, 3598, 3848, 3921] performance governor
>
> emmc with cqe on rk3399
> [4146, 4155, 4159, 4161, 4193] sugov mainline
> [2848, 3217, 4375, 4380, 4454] per-task tracking
> [2510, 2665, 3093, 3101, 3105] sugov no iowait boost
> [4690, 4803, 4860, 4976, 5069] performance governor
>
> sd card on rk3399
> [1777, 1780, 1806, 1827, 1850] sugov mainline
> [1470, 1476, 1507, 1534, 1586] per-task tracking
> [1356, 1372, 1373, 1377, 1416] sugov no iowait boost
> [1861, 1890, 1901, 1905, 1908] performance governor
>
> Pixel 6 ufs Android 14 (7 runs for because device showed some variance)
> [6605, 6622, 6633, 6652, 6690, 6697, 6754] sugov mainline
> [7141, 7173, 7198, 7220, 7280, 7427, 7452] per-task tracking
> [2390, 2392, 2406, 2437, 2464, 2487, 2813] sugov no iowait boost
> [7812, 7837, 7837, 7851, 7900, 7959, 7980] performance governor
>
> Apple M1 apple-nvme
> [27421, 28331, 28515, 28699, 29529] sugov mainline
> [27274, 27344, 27345, 27384, 27930] per-task tracking
> [14480, 14512, 14625, 14872, 14967] sugov no iowait boost
> [31595, 32085, 32386, 32465, 32643] performance governor
>
> Showcasing some different IO scenarios, again all random read,
> median out of 5 runs, all on rk3399 with NVMe.
> e.g. io_uring6x4 means 6 threads with 4 iodepth each, results can be
> obtained using:
> fio --minimal --time_based --name=test --filename=/dev/nvme0n1 --runtime=30 --rw=randread --bs=4k --ioengine=io_uring --iodepth=4 --numjobs=6 --group_reporting | cut -d \; -f 8
>
> +---------------+----------------+-------------------+----------------+-------------+-----------+
> | | Sugov mainline | Per-task tracking | Sugov no boost | Performance | Powersave |
> +---------------+----------------+-------------------+----------------+-------------+-----------+
> | psyncx1 | 4073 | 3793 | 2979 | 4190 | 2788 |
> +---------------+----------------+-------------------+----------------+-------------+-----------+
> | psyncx4 | 13921 | 13503 | 10635 | 13931 | 10225 |
> +---------------+----------------+-------------------+----------------+-------------+-----------+
> | psyncx6 | 18473 | 17866 | 15902 | 19080 | 15789 |
> +---------------+----------------+-------------------+----------------+-------------+-----------+
> | psyncx8 | 22498 | 21242 | 19867 | 22650 | 18837 |
> +---------------+----------------+-------------------+----------------+-------------+-----------+
> | psyncx10 | 24801 | 23552 | 23658 | 25096 | 21474 |
> +---------------+----------------+-------------------+----------------+-------------+-----------+
> | psyncx12 | 26743 | 25377 | 26372 | 26663 | 23613 |
> +---------------+----------------+-------------------+----------------+-------------+-----------+
> | libaio1x1 | 4054 | 3542 | 2776 | 4055 | 2780 |
> +---------------+----------------+-------------------+----------------+-------------+-----------+
> | libaio1x128 | 3959 | 3516 | 2758 | 3590 | 2560 |
> +---------------+----------------+-------------------+----------------+-------------+-----------+
> | libaio4x128 | 13451 | 12517 | 10313 | 13403 | 9994 |
> +---------------+----------------+-------------------+----------------+-------------+-----------+
> | libaio6x1 | 18394 | 17432 | 15340 | 18954 | 15251 |
> +---------------+----------------+-------------------+----------------+-------------+-----------+
> | libaio6x4 | 18329 | 17100 | 15238 | 18623 | 15270 |
> +---------------+----------------+-------------------+----------------+-------------+-----------+
> | libaio6x128 | 18066 | 16964 | 15139 | 18577 | 15192 |
> +---------------+----------------+-------------------+----------------+-------------+-----------+
> | io_uring1x1 | 4043 | 3548 | 2810 | 4039 | 2689 |
> +---------------+----------------+-------------------+----------------+-------------+-----------+
> | io_uring4x64 | 35790 | 32814 | 35983 | 34934 | 33254 |
> +---------------+----------------+-------------------+----------------+-------------+-----------+
> | io_uring1x128 | 32651 | 30427 | 32429 | 33232 | 9973 |
> +---------------+----------------+-------------------+----------------+-------------+-----------+
> | io_uring2x128 | 34928 | 32595 | 34922 | 33726 | 18790 |
> +---------------+----------------+-------------------+----------------+-------------+-----------+
> | io_uring4x128 | 34414 | 32173 | 34932 | 33332 | 33005 |
> +---------------+----------------+-------------------+----------------+-------------+-----------+
> | io_uring6x4 | 31578 | 29260 | 31714 | 31399 | 31784 |
> +---------------+----------------+-------------------+----------------+-------------+-----------+
> | io_uring6x128 | 34480 | 32634 | 34973 | 33390 | 36452 |
> +---------------+----------------+-------------------+----------------+-------------+-----------+
>
> Based on the above we can basically categorize these into the following
> three:
> a) boost is useful
> b) boost irrelevant (util dominates)
> c) boost is energy-inefficient (boost dominates)
>
> The aim of the patch 1/2 is to boost as much as necessary for a) while
> boosting little for c) (thus saving energy).
>
> Energy-savings:
> Regarding sugov iowait boosting problem 1 mentioned earlier,
> some improvement can be seen:
> Tested on rk3399 (LLLL)(bb) with an NVMe, 30s runtime
> CPU0 perf domain spans 0-3 with 400MHz to 1400MHz
> CPU4 perf domain spans 4-5 with 400MHz to 1800MHz
>
> io_uring6x128:
> Sugov iowait boost:
> Average frequency for CPU0 : 1.180 GHz
> Average frequency for CPU4 : 1.504 GHz
> Per-task tracking:
> Average frequency for CPU0 : 1.070 GHz
> Average frequency for CPU4 : 1.211 GHz
>
> io_uring12x128:
> Sugov iowait boost:
> Average frequency for CPU0 : 1.324 GHz
> Average frequency for CPU4 : 1.444 GHz
> Per-task tracking:
> Average frequency for CPU0 : 1.260 GHz
> Average frequency for CPU4 : 1.062 GHz
> (In both cases actually 400MHz on both perf domains is optimal, more
> fine-tuning could get us closer [2])
>
> [1]
> There are many scenarios when it doesn't, so let's start with
> explaining when it does:
> Boosting improves throughput if there is frequent IO to a device from
> one or few origins, such that the device is likely idle when the task
> is enqueued on the rq and reducing this time cuts down on the storage
> device idle time.
> This might not be true (and boosting doesn't help) if:
> - The storage device uses the idle time to actually commit the IO to
> persistent storage or do other management activity (this can be
> observed with e.g. writes to flash-based storage, which will usually
> write to cache and flush the cache when idle or necessary).
> - The device is under thermal pressure and needs idle time to cool off
> (not uncommon for e.g. nvme devices).
> Furthermore the assumption (the device being idle while task is
> enqueued) is false altogether if:
> - Other tasks use the same storage device.
> - The task uses asynchronous IO with iodepth > 1 like io_uring, the
> in_iowait is then just to fill the queue on the host again.
> - The task just sets in_iowait to signal it is waiting on io to not
> appear as system idle, it might not send any io at all (cf with
> the various occurrences of in_iowait, io_mutex_lock and io_schedule*).
>
> [3]
> Unfortunately even for asynchronous IO iowait may be set, in the case
> of io_uring this is specifically for the iowait boost to trigger, see
> commit ("8a796565cec3 io_uring: Use io_schedule* in cqring wait")
> which is why the energy-savings are so significant here, as io_uring
> load on the CPU is minimal.
>
> Problems encountered:
> - Higher cap is not always beneficial, we might place the task away
> from the CPU where the interrupt handler is running, making it run
> on an unboosted CPU which may have a bigger impact than the difference
> between the CPU's capacity the task moved to. (Of course the boost will
> then be reverted again, but a ping-pong every interval is possible).
> - [2] tracking and scaling can be improved (io_uring12x128 still shows
> boosting): Unfortunately tracking purely per-task shows some limits.
> One task might show more iowaits per second when boosted, but overall
> throughput doesn't increase => there is still some boost.
> The task throughput improvement is somewhat limited though,
> so by fine-tuning the thresholds there could be mitigations.
>
> Christian Loehle (2):
> sched/fair: Introduce per-task io util boost
> cpufreq/schedutil: Remove iowait boost
>
> include/linux/sched.h | 15 +++
> kernel/sched/cpufreq_schedutil.c | 150 ++--------------------------
> kernel/sched/fair.c | 165 +++++++++++++++++++++++++++++--
> kernel/sched/sched.h | 4 +-
> 4 files changed, 182 insertions(+), 152 deletions(-)
>
> --
> 2.34.1
>