Re: [4.2, Regression] Queued spinlocks cause major XFS performance regression

From: Waiman Long
Date: Mon Sep 28 2015 - 22:57:20 EST


On 09/28/2015 08:47 PM, huang ying wrote:
Hi, Waiman,

On Mon, Sep 28, 2015 at 10:30 PM, Waiman Long <waiman.long@xxxxxxx> wrote:
On 09/28/2015 04:54 AM, huang ying wrote:

Hi, Peter

On Fri, Sep 4, 2015 at 7:32 PM, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
On Fri, Sep 04, 2015 at 06:12:34PM +1000, Dave Chinner wrote:
You probably don't even need a VM to reproduce it - that would
certainly be an interesting counterpoint if it didn't....
Even though you managed to restore your DEBUG_SPINLOCK performance by
changing virt_queued_spin_lock() to use __delay(1), I ran the thing on
actual hardware just to test.

[ Note: In any case, I would recommend you use (or at least try)
PARAVIRT_SPINLOCKS if you use VMs, as that is where we were looking for
performance, the test-and-set fallback really wasn't meant as a
performance option (although it clearly sucks worse than expected).

Pre qspinlock, your setup would have used regular ticket locks on
vCPUs, which mostly works as long as there is almost no vCPU
preemption; if you overload your machine such that the vCPU threads
get preempted, that will implode into silly-land. ]
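
For reference, the fallback being discussed looks roughly like the sketch
below. This is a paraphrase of the v4.2 x86 virt_queued_spin_lock()
test-and-set path, not verbatim kernel source, with the __delay(1) change
mentioned above dropped into the spin loop in place of the original
cpu_relax():

/*
 * Rough sketch of the test-and-set fallback (paraphrased, not verbatim
 * v4.2 source).  On a hypervisor it bypasses the queued-spinlock slow
 * path and just spins on a cmpxchg; the __delay(1) here is the
 * experimental back-off mentioned above, replacing cpu_relax().
 */
static inline bool virt_queued_spin_lock(struct qspinlock *lock)
{
	if (!static_cpu_has(X86_FEATURE_HYPERVISOR))
		return false;	/* bare metal: take the normal qspinlock path */

	while (atomic_cmpxchg(&lock->val, 0, _Q_LOCKED_VAL) != 0)
		__delay(1);	/* back off instead of hammering the lock cache line */

	return true;
}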

So on to native performance:

- IVB-EX, 4-socket, 15 core, hyperthreaded, for a total of 120 CPUs
- 1.1T of md-stripe (5x200GB) SSDs
- Linux v4.2 (distro style .config)
- Debian "testing" base system
- xfsprogs v3.2.1


# mkfs.xfs -f -m "crc=1,finobt=1" /dev/md0
log stripe unit (524288 bytes) is too large (maximum is 256KiB)
log stripe unit adjusted to 32KiB
meta-data=/dev/md0               isize=512    agcount=32, agsize=9157504 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1
data     =                       bsize=4096   blocks=293038720, imaxpct=5
         =                       sunit=128    swidth=640 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
log      =internal log           bsize=4096   blocks=143088, version=2
         =                       sectsz=512   sunit=8 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

# mount -o logbsize=262144,nobarrier /dev/md0 /mnt/scratch

# ./fs_mark -D 10000 -S0 -n 50000 -s 0 -L 32 \
-d /mnt/scratch/0 -d /mnt/scratch/1 \
-d /mnt/scratch/2 -d /mnt/scratch/3 \
-d /mnt/scratch/4 -d /mnt/scratch/5 \
-d /mnt/scratch/6 -d /mnt/scratch/7 \
-d /mnt/scratch/8 -d /mnt/scratch/9 \
-d /mnt/scratch/10 -d /mnt/scratch/11 \
-d /mnt/scratch/12 -d /mnt/scratch/13 \
-d /mnt/scratch/14 -d /mnt/scratch/15


Regular v4.2 (qspinlock) does:

0 6400000 0 286491.9 3500179
0 7200000 0 293229.5 3963140
0 8000000 0 271182.4 3708212
0 8800000 0 300592.0 3595722

Modified v4.2 (ticket) does:

0 6400000 0 310419.6 3343821
0 7200000 0 348346.5 4721133
0 8000000 0 328098.2 3235753
0 8800000 0 316765.3 3238971


Is the "modified v4.2 (ticket)" means you are just removing ARCH_USE_QUEUED_SPINLOCKS from the config file when building the 2 kernels in the above test? Your config file is for 4.1. If you compare a 4.1 kernel with 4.2 kernel, there are lot more changes than just the qspinlock switch.
I think you have confused PeterZ's test with my test. PeterZ's
test is for v4.2 and modified v4.2, and he didn't post his
configuration. My test is for
fc934d40178ad4e551a17e2733241d9f29fddd70 and
68722101ec3a0e179408a13708dd020e04f54aab, so my configuration
(attached in the previous email) is for v4.1.

Yes, I am sorry that I misread the quoted part.

Could you also use the perf command to profile the 2 cases to see where the performance bottleneck is?
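
For example (just one possible invocation, not a prescribed recipe), system-wide sampling with call graphs while fs_mark is running should show where the time goes, and the top symbols can then be compared between the two kernels:

# perf record -a -g -- sleep 60
# perf report --stdio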


Which shows that qspinlock is clearly slower, even for these large-ish
NUMA boxes where it was supposed to be better.

Clearly the benchmarks we used before this were not sufficient, and more
work needs to be done.


Also, I note that after running to completion, there is only 14G of
actual data on the device, so you don't need silly large storage to run
this -- I expect your previous 275G quote was due to XFS populating the
sparse file with meta-data or something along those lines.

Further note, rm -rf /mnt/scratch0/*, takes for bloody ever :-)

We are trying to reproduce your regression in our test environment (LKP). We tested fs_mark with the following command line:

# mkfs -t xfs /dev/ram0
# mount -t xfs -o nobarrier,inode64 /dev/ram0 /fs/ram0
# ./fs_mark -d /fs/ram0/1 -d /fs/ram0/2 -d /fs/ram0/3 -d /fs/ram0/4 -d /fs/ram0/5 -d /fs/ram0/6 -d /fs/ram0/7 -d /fs/ram0/8 -d /fs/ram0/9 -d /fs/ram0/10 -d /fs/ram0/11 -d /fs/ram0/12 -d /fs/ram0/13 -d /fs/ram0/14 -d /fs/ram0/15 -d /fs/ram0/16 -D 10000 -N 5 -n 49152 -L 32 -S 0 -s 0

The test was run on an IVB-EX box, with a ramdisk. We tested two commits:

fc934d40178ad4e551a17e2733241d9f29fddd70
68722101ec3a0e179408a13708dd020e04f54aab

I think these are the commits just before and after qspinlock was introduced. The test results show no regression:

fc934d40178ad4e5    68722101ec3a0e179408a13708
----------------    --------------------------
       %stddev          %change        %stddev
           \               |               \
  13214787 ± 0%        -1.0%     13088679 ± 1%  fsmark.app_overhead
     36895 ± 0%        -0.1%        36841 ± 0%  fsmark.files_per_sec
    687.69 ± 0%        +0.1%       688.68 ± 0%  fsmark.time.elapsed_time
    687.69 ± 0%        +0.1%       688.68 ± 0%  fsmark.time.elapsed_time.max
    208.00 ± 0%        +0.0%       208.00 ± 0%  fsmark.time.file_system_inputs
      8.00 ± 0%        +0.0%         8.00 ± 0%  fsmark.time.file_system_outputs
      6627 ± 1%        +0.3%         6647 ± 1%  fsmark.time.involuntary_context_switches
     10904 ± 0%        +0.0%        10904 ± 0%  fsmark.time.maximum_resident_set_size
    307635 ± 0%        -1.0%       304646 ± 0%  fsmark.time.minor_page_faults
      4096 ± 0%        +0.0%         4096 ± 0%  fsmark.time.page_size
    338.33 ± 0%        +0.5%       340.00 ± 0%  fsmark.time.percent_of_cpu_this_job_got
      2119 ± 0%        +0.5%        2130 ± 0%  fsmark.time.system_time
    211.90 ± 0%        +1.4%       214.94 ± 0%  fsmark.time.user_time
  14193260 ± 0%        +0.6%     14284812 ± 0%  fsmark.time.voluntary_context_switches

Could you give us some help on how to reproduce this regression? Could you provide your kernel configuration? Ours is attached to this email. Or could you point out other differences between our configurations?


The regression that was previously reported only happens when run in a VM without PARAVIRT_SPINLOCKS. With bare metal, you won't see that.
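
(As a quick guest-side sanity check: whether the kernel was built with the paravirt spin-lock slow path can be read from its config, for example with something like

# grep PARAVIRT_SPINLOCKS /boot/config-$(uname -r)

which should report CONFIG_PARAVIRT_SPINLOCKS=y when the option is enabled; if it is not set, the guest ends up in the test-and-set fallback discussed above. The exact config path varies by distro.)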
Yes. The regression reported in the first email of the thread is for
a VM, but the regression reported by PeterZ is for bare metal. I am
trying to reproduce that one.

Best Regards,
Huang, Ying

I see. Thanks for the clarification.

Cheers,
Longman
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/