Re: [PATCH v5 3/3] squashfs: implement readahead

From: Phillip Lougher
Date: Mon Aug 01 2022 - 00:56:19 EST


On 29/07/2022 06:22, Xiongwei Song wrote:
> Hi Phillip,
>
> Gentle ping.
>
> Regards,
> Xiongwei
>
> On Fri, Jul 15, 2022 at 9:45 AM Xiongwei Song <sxwjean@xxxxxxxxx> wrote:
>>
>> Please see the test results below, which are from my colleague Xiaohong Qi:
>>
>> I tested file sizes from 256KB to 5120KB with thread counts of
>> 1, 2, 4, 8, 16, 24 and 32 (each run ten times, taking the average). The
>> read performance is shown below. The difference in read performance
>> between the 4.18 kernel and 5.10 (with squashfs_readahead() patch v7)
>> seems to be caused by the files whose size is smaller than 256KB.
>>
>> All File Size
>>            T1        T2        T4       T8       T16      T24       T32
>> 4.18       136.8642  100.479   96.5523  96.1569  96.204   96.0587   96.0519
>> 5.10-v7    138.474   103.1351  99.9192  99.7091  99.7894  100.2034  100.4447
>> Delta      1.6098    2.6561    3.3669   3.5522   3.5854   4.1447    4.3928

To clarify what was mentioned later in the email - these results were
obtained using SQUASHFS_DECOMP_MULTI_PERCPU, on a 12 core system?

If so, these results are unexpected. There is very little extra
parallelism shown when increasing the threads. There is about
a 36% increase in performance moving from 1 thread to 2 threads, which
is about what I expected, but from there on there is almost no
parallelism improvement, even though you should have 12 available
Squashfs decompressors.
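
As a rough check of that reading, the sketch below recomputes the gain at
each step from the quoted 4.18 "All File Size" row. It assumes the quoted
values are elapsed times, so lower is better; the units are not stated in
the report. The helper name step_gains is made up for illustration.

```python
# Figures quoted above for the 4.18 kernel, "All File Size" row.
# Assumption: the values are elapsed times, so lower is better.
threads = [1, 2, 4, 8, 16, 24, 32]
times_418 = [136.8642, 100.479, 96.5523, 96.1569, 96.204, 96.0587, 96.0519]

def step_gains(times):
    """Percentage throughput gain of each entry over the previous one."""
    return [(prev / cur - 1.0) * 100.0 for prev, cur in zip(times, times[1:])]

gains = step_gains(times_418)
for n, g in zip(threads[1:], gains):
    print(f"{n:>2} threads: {g:+.1f}% over previous step")
```

The first step comes out at roughly 36%, and every later step stays in
the low single digits, which is the flattening described above.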

These are the results I get on a rather old 4-core X86_64 system using
virtualisation, reading off SSD from a Squashfs filesystem created from
a set of Linux kernel repositories and distro root filesystems. So a lot
of small files and some larger files.

************************
1 Thread

real 8m4.435s
user 4m1.401s
sys 2m57.680s

2 Threads

real 5m16.647s
user 3m16.984s
sys 2m35.655s

4 Threads

real 3m46.047s
user 2m58.669s
sys 2m20.193s

8 Threads

real 3m0.239s
user 2m41.253s
sys 2m27.935s

16 Threads

real 2m38.329s
user 2m34.478s
sys 2m26.303s
***************************

This is the behaviour I would expect to see: a steadily decreasing
overall clock time, as more threads in parallel mean more Squashfs
decompressors are used. Due to user-space overheads and context
switching, you should generally expect to see a decreasing clock
time even after the number of threads is more than the number of cores
available. The rule of thumb is always to use at least double the number
of real cores.
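
To put numbers on that scaling, the "real" times listed above can be
converted to seconds and compared against the single-thread run. This is
only a sketch; the helper name to_seconds is made up for illustration, and
it handles just the "XmY.YYYs" format shown above.

```python
import re

def to_seconds(stamp):
    """Convert a time(1)-style value such as '8m4.435s' to seconds."""
    m = re.fullmatch(r"(\d+)m([\d.]+)s", stamp)
    if m is None:
        raise ValueError(f"unrecognised time value: {stamp!r}")
    return int(m.group(1)) * 60 + float(m.group(2))

# "real" times from the runs listed above, keyed by thread count.
real = {1: "8m4.435s", 2: "5m16.647s", 4: "3m46.047s",
        8: "3m0.239s", 16: "2m38.329s"}

seconds = {n: to_seconds(t) for n, t in real.items()}
speedup = {n: seconds[1] / s for n, s in seconds.items()}

for n in sorted(speedup):
    print(f"{n:>2} threads: {seconds[n]:8.3f}s  speedup {speedup[n]:.2f}x")
```

The speedup keeps rising through 16 threads on a 4-core system, rather
than stopping after 2 threads.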

As such your results are confusing, because they max out after only 2
parallel threads.

This may indicate there is something wrong somewhere in your system,
where I/O is bottlenecking early, or it cannot accommodate multiple
parallel reads and it is locking reads out.

These results remind me of the old days using rotating media, where
there was an expensive disk head SEEK to data blocks. Trying to
read multiple files simultaneously was often self-defeating because the
extra SEEK time swallowed up any parallelism improvements, leading to
negligible, flat or even decreasing performance as more threads
were added.

Of course I doubt seek time is involved here, but a lot of things
can emulate seek time, such as a constant unexpected cost.

As this effect is observed with the "original" Squashfs, this is going
to be external to Squashfs, and unrelated to the readahead patches.

>>
>> Fsize < 256KB
>>            T1       T2       T4       T8       T16      T24      T32
>> 4.18       21.7949  14.6959  11.639   10.5154  10.14    10.1092  10.1425
>> 5.10-v7    23.8629  16.2483  13.1475  12.3697  12.1985  12.8799  13.3292
>> Delta      2.068    1.5524   1.5085   1.8543   2.0585   2.7707   3.1867
>>

This appears to show the readahead patch is performing much worse with
files smaller than 256KB than with larger files, which would indicate a
problem with the readahead patch.

But this may be a symptom of whatever is causing your general
lack of parallelism, i.e. external to Squashfs. When read sizes
are small, any extra fixed costs loom large in the result because
they are a significant proportion of the overall cost. When
read sizes are large, any extra fixed costs are a small proportion
of the overall cost and show up marginally or not at all in the results.
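
A toy model makes the proportion argument concrete. The numbers below are
purely hypothetical (a 1 ms fixed cost per read and 200 MB/s of transfer
bandwidth); they are not measurements from either system, and the function
name fixed_cost_share is made up for illustration.

```python
def fixed_cost_share(size_bytes, fixed_s, bandwidth_bps):
    """Fraction of total read time spent on the fixed per-read cost.

    Model: total time = fixed cost + size / bandwidth.
    """
    transfer_s = size_bytes / bandwidth_bps
    return fixed_s / (fixed_s + transfer_s)

# Hypothetical values, chosen only to illustrate the shape of the effect.
FIXED_S = 0.001          # 1 ms fixed cost per read (assumption)
BANDWIDTH_BPS = 200e6    # 200 MB/s transfer rate (assumption)

small = fixed_cost_share(64 * 1024, FIXED_S, BANDWIDTH_BPS)
large = fixed_cost_share(5120 * 1024, FIXED_S, BANDWIDTH_BPS)

print(f"64KB read:   fixed cost is {small:.0%} of the total")
print(f"5120KB read: fixed cost is {large:.0%} of the total")
```

Under these assumptions the fixed cost dominates the small read and is a
few percent of the large one, so any growth in the fixed cost between
kernels would be far more visible in the small-file results.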

In other words, there is already a suspicion there are some unexpected
fixed costs to doing I/O, which results in poor parallel performance.
These fixed costs, if they are worse on the later kernel, will show
up here where read sizes are small, and may not show up elsewhere.

I have instrumented and profiled the readahead patches on a large
number of workloads, with various degrees of parallelism and I have
not experienced any unexpected regressions in performance as reported
here on small files.

This is not to say there isn't an undiscovered issue with the
readahead patch, but I have to say the evidence points more to an
issue with your system than to the readahead patch.

What I would do here is first investigate why you appear to have
poor parallel I/O scaling.
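
One possible starting point is a small script that reads the same set of
files with an increasing number of threads and reports the wall-clock
time for each. The sketch below is an illustration only, not the tool
used for the timings above; the function names are made up, and the
directory to read is whatever is passed on the command line. The page
cache needs to be dropped between runs (for example by writing 3 to
/proc/sys/vm/drop_caches as root), otherwise later runs mostly measure
cached reads.

```python
import os
import sys
import time
from concurrent.futures import ThreadPoolExecutor

def read_file(path, chunk=1 << 20):
    """Read one file to the end and return the number of bytes read."""
    total = 0
    with open(path, "rb", buffering=0) as f:
        while True:
            data = f.read(chunk)
            if not data:
                break
            total += len(data)
    return total

def list_files(root):
    """Collect every regular file (not symlinks) below root."""
    paths = []
    for dirpath, _dirs, names in os.walk(root):
        for name in names:
            p = os.path.join(dirpath, name)
            if os.path.isfile(p) and not os.path.islink(p):
                paths.append(p)
    return paths

def timed_read(paths, nthreads):
    """Read all paths using nthreads workers; return (seconds, bytes)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=nthreads) as pool:
        total = sum(pool.map(read_file, paths))
    return time.perf_counter() - start, total

if __name__ == "__main__" and len(sys.argv) > 1:
    files = list_files(sys.argv[1])  # e.g. the mount point under test
    for n in (1, 2, 4, 8, 16):
        # Drop the page cache here between runs (see the note above).
        secs, nbytes = timed_read(files, n)
        print(f"{n:>2} threads: {secs:.3f}s for {nbytes} bytes")
```

Running the same script against the Squashfs mount and against the
underlying device's own filesystem may help show whether the flat
scaling follows Squashfs or the I/O path beneath it.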

Phillip