Re: [PATCH v2 2/2] scsi: ufs: core: move some irq handling back to hardirq (with time limit)

From: Neil Armstrong
Date: Mon Jul 28 2025 - 12:56:24 EST


Hi,

On 28/07/2025 17:19, Bart Van Assche wrote:
On 7/28/25 7:49 AM, André Draszik wrote:
Btw, my complete command was (I should probably have added it
to the commit message in the first place):

for rw in read write ; do
     echo "rw: ${rw}"
     for jobs in 1 8 ; do
         echo "jobs: ${jobs}"
         for it in $(seq 1 5) ; do
             fio --name=rand${rw} --rw=rand${rw} \
                 --ioengine=libaio --direct=1 \
                 --bs=4k --numjobs=${jobs} --size=32m \
                 --runtime=30 --time_based --end_fsync=1 \
                 --group_reporting --filename=/foo \
             | grep -E '(iops|sys=|READ:|WRITE:)'
             sleep 5
         done
     done
done

Please run performance tests in recovery mode against a block
device (/dev/block/sd...) instead of on top of a filesystem. One
possible approach for retrieving the block device name is as
follows:

adb shell readlink /dev/block/by-name/userdata

There may be other approaches for retrieving the name of the block
device associated with /data. Additionally, tuning for maximum
performance is useful because it eliminates impact from the process
scheduler on block device performance measurement. An extract from a
script that I use myself to measure block device performance on Pixel
devices is available below.
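For example, assuming userdata resolves to /dev/block/sda (substitute the
readlink output from above), the same fio job can then target the raw block
device directly:

fio --name=randread --rw=randread \
    --ioengine=libaio --direct=1 \
    --bs=4k --numjobs=1 --size=32m \
    --runtime=30 --time_based \
    --group_reporting --filename=/dev/block/sda \
| grep -E '(iops|READ:)'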

Of course, I did all that and ran on the SM8650 QRD & HDK boards; one has
a UFS 3.1 device and the other a UFS 4.0 device.

Here's the raw data:

Board: sm8650-qrd
read / 1 job
               v6.15       v6.16    v6.16 + this commit
min IOPS    3,996.00    5,921.60    3,424.80
max IOPS    4,772.80    6,491.20    4,541.20
avg IOPS    4,526.25    6,295.31    4,320.58
cpu % usr       4.62        2.96        4.50
cpu % sys      21.45       17.88       21.62
bw MB/s        18.54       25.78       17.64

read / 8 jobs
                v6.15        v6.16    v6.16 + this commit
min IOPS    51,867.60    51,575.40    45,257.00
max IOPS    67,513.60    64,456.40    56,336.00
avg IOPS    64,314.80    62,136.76    52,505.72
cpu % usr        3.98         3.72         3.52
cpu % sys       16.70        17.16        18.74
bw MB/s        263.60       254.40       215.00

write / 1 job
               v6.15       v6.16    v6.16 + this commit
min IOPS    5,654.80    8,060.00    5,730.80
max IOPS    6,720.40    8,852.00    6,981.20
avg IOPS    6,576.91    8,579.81    6,726.51
cpu % usr       7.48        3.79        8.49
cpu % sys      41.09       23.27       34.86
bw MB/s        26.96       35.16       27.52

write / 8 jobs
                 v6.15        v6.16    v6.16 + this commit
min IOPS     84,687.80    95,043.40    74,799.60
max IOPS    107,620.80   113,572.00    96,377.20
avg IOPS     97,910.86   105,927.38    87,239.07
cpu % usr         5.43         4.38         3.72
cpu % sys        21.73        20.29        30.97
bw MB/s         400.80       433.80       357.40

Board: sm8650-hdk
read / 1 job
               v6.15       v6.16    v6.16 + this commit
min IOPS    4,867.20    5,596.80    4,242.80
max IOPS    5,211.60    5,970.00    4,548.80
avg IOPS    5,126.12    5,847.93    4,370.14
cpu % usr       3.83        2.81        2.62
cpu % sys      18.29       13.44       16.89
bw MB/s        20.98       17.88       23.96

read / 8 jobs
                v6.15        v6.16    v6.16 + this commit
min IOPS    47,583.80    46,831.60    47,671.20
max IOPS    58,913.20    59,442.80    56,282.80
avg IOPS    53,609.04    44,396.88    53,621.46
cpu % usr        3.57         3.06         3.11
cpu % sys       15.23        19.31        15.90
bw MB/s        219.40       219.60       210.80

write / 1 job
               v6.15       v6.16    v6.16 + this commit
min IOPS    6,529.42    8,367.20    6,492.80
max IOPS    7,856.92    9,244.40    7,184.80
avg IOPS    7,676.21    8,991.67    6,904.67
cpu % usr      10.17        7.98        3.68
cpu % sys      37.55       34.41       23.07
bw MB/s        31.44       28.28       36.84

write / 8 jobs
                 v6.15        v6.16    v6.16 + this commit
min IOPS     86,304.60    94,288.80    78,433.60
max IOPS    105,670.80   110,373.60    96,330.80
avg IOPS     97,418.81   103,789.76    88,468.27
cpu % usr         4.98         3.27         3.67
cpu % sys        21.45        30.85        20.08
bw MB/s         399.00       362.40       425.00

Assisted analysis gives:

IOPS (Input/Output Operations Per Second):
The v6.16 kernel shows a slight increase in average IOPS compared to v6.15 (43,245.69 vs. 42,144.88).
v6.16 + this commit significantly reduces average IOPS, dropping to 36,946.17.

Bandwidth (MB/s):
The v6.16 kernel shows an increase in average bandwidth compared to v6.15 (180.72 MB/s vs. 172.59 MB/s).
v6.16 + this commit significantly reduces average bandwidth, dropping to 151.32 MB/s.
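These overall figures appear to be plain means of the eight per-configuration
"avg" rows above. For example, averaging the eight v6.15 "avg IOPS" values:

echo '4526.25 64314.80 6576.91 97910.86 5126.12 53609.04 7676.21 97418.81' \
| tr ' ' '\n' \
| awk '{ sum += $1 } END { printf "%.2f\n", sum / NR }'
# prints 42144.88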

Detailed Analysis:
Impact of the v6.16 Kernel:
The v6.16 kernel introduces a minor improvement in I/O performance compared to v6.15.
Both average IOPS and average bandwidth saw a small increase. This suggests that the
v6.16 kernel might have introduced some optimizations that slightly improved overall
I/O performance.

Impact of this Commit:
The commit appears to have a negative impact on both IOPS and bandwidth.
Both metrics show a substantial decrease compared to both v6.15 and v6.16.
This indicates that the change might be detrimental to I/O performance.

The threaded IRQ change did increase IOPS and bandwidth, and stopped starving
other interrupts. This commit, however, gives worse numbers than before the
threaded IRQ change.

Neil


Best regards,

Bart.


optimize() {
    local clkgate_enable c d devfreq disable_cpuidle governor nomerges iostats
    local target_freq ufs_irq_path

    if [ "$1" = performance ]; then
    clkgate_enable=0
    devfreq=max
    disable_cpuidle=1
    governor=performance
    # Enable I/O statistics because the performance impact is low and
    # because fio reports the I/O statistics.
    iostats=1
    # Disable merging to make tests follow the fio arguments.
    nomerges=2
    target_freq=cpuinfo_max_freq
    persist_logs=false
    else
    clkgate_enable=1
    devfreq=min
    disable_cpuidle=0
    governor=sched_pixel
    iostats=1
    nomerges=0
    target_freq=cpuinfo_min_freq
    persist_logs=true
    fi

    for c in $(adb shell "echo /sys/devices/system/cpu/cpu[0-9]*"); do
        for d in $(adb shell "echo $c/cpuidle/state[1-9]*"); do
            adb shell "if [ -e $d ]; then echo $disable_cpuidle > $d/disable; fi"
        done
        adb shell "cat $c/cpufreq/cpuinfo_max_freq > $c/cpufreq/scaling_max_freq;
                   cat $c/cpufreq/${target_freq} > $c/cpufreq/scaling_min_freq;
                   echo ${governor} > $c/cpufreq/scaling_governor; true" \
            2>/dev/null
    done

    if [ "$(adb shell grep -c ufshcd /proc/interrupts)" = 1 ]; then
    # No MCQ or MCQ disabled. Make the fastest CPU core process UFS
    # interrupts.
    # shellcheck disable=SC2016
    ufs_irq_path=$(adb shell 'a=$(echo /proc/irq/*/ufshcd); echo ${a%/ufshcd}')
    adb shell "echo ${fastest_cpucore} > ${ufs_irq_path}/smp_affinity_list; true"
    else
    # MCQ is enabled. Distribute the completion interrupts over the
    # available CPU cores.
    local i=0
    local irqs
    irqs=$(adb shell "sed -n 's/:.*GIC.*ufshcd.*//p' /proc/interrupts")
    for irq in $irqs; do
        adb shell "echo $i > /proc/irq/$irq/smp_affinity_list; true"
        i=$((i+1))
    done
    fi

    for d in $(adb shell echo /sys/class/devfreq/*); do
        case "$d" in
            *gpu0)
                continue
                ;;
        esac
        local min_freq
        min_freq=$(adb shell "cat $d/available_frequencies |
            tr ' ' '\n' |
            sort -n |
            case $devfreq in
                min) head -n1;;
                max) tail -n1;;
            esac")
        adb shell "echo $min_freq > $d/min_freq"
        # shellcheck disable=SC2086
        if [ "$devfreq" = "max" ]; then
            echo "$(basename $d)/min_freq: $(adb shell cat $d/min_freq) <> $min_freq"
        fi
    done

    for d in $(adb shell echo /sys/devices/platform/*.ufs); do
        adb shell "echo $clkgate_enable > $d/clkgate_enable"
    done

    adb shell setprop logd.logpersistd.enable ${persist_logs}

    adb shell "for b in /sys/class/block/{sd[a-z],dm*}; do
            if [ -e \$b ]; then
            [ -e \$b/queue/iostats     ] && echo ${iostats}   >\$b/queue/iostats;
            [ -e \$b/queue/nomerges    ] && echo ${nomerges}  >\$b/queue/nomerges;
            [ -e \$b/queue/rq_affinity ] && echo 2            >\$b/queue/rq_affinity;
            [ -e \$b/queue/scheduler   ] && echo ${iosched}   >\$b/queue/scheduler;
            fi
        done; true"

    adb shell "grep -q '^[^[:blank:]]* /sys/kernel/debug' /proc/mounts || mount -t debugfs none /sys/kernel/debug"
}
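
A typical invocation of the function above (the second argument name is
arbitrary; anything other than "performance" selects the power-saving
settings) would be:

optimize performance   # pin clock frequencies and IRQ affinity before measuring
# ... run the fio jobs ...
optimize restore       # back to the power-saving defaults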