Re: [PATCH v2 0/4] kasan, arm64, scs, stacktrace: collect stack traces from Shadow Call Stack

From: Mark Rutland
Date: Thu Apr 14 2022 - 03:02:29 EST


On Wed, Apr 13, 2022 at 09:28:00PM +0200, Andrey Konovalov wrote:
> On Thu, Apr 7, 2022 at 8:42 PM Mark Rutland <mark.rutland@xxxxxxx> wrote:
> >
> > I'm afraid from local testing (atop v5.18-rc1), with your config, I still can't
> > get anywhere near your figures. I've tried to match toolchain versions with
> > what was in your .config file, so I'm using clang 12.0.0 from the llvm.org
> > binary releases, and binutils from the kernel.org crosstool 11.1.0 release.
> >
> > I took baselines with defconfig and defconfig + SHADOW_CALL_STACK, with console
> > output completely suppressed with 'quiet loglevel=0':
> >
> > | [mark@gravadlaks:~/repro]% taskset -c 64 ./vmboot.sh --perf-trials 50 -s -k ~/kernel-images/andrey-unwind-v5.18-rc1-defconfig/Image
> > |
> > | Performance counter stats for
> > | '/home/mark/.opt/apps/qemu/bin/qemu-system-aarch64 -m 2048 -smp 1 -nographic
> > | -no-reboot -machine virt,accel=kvm,gic-version=host -cpu host -kernel
> > | /home/mark/kernel-images/andrey-unwind-v5.18-rc1-defconfig/Image -append
> > | loglevel=9 earlycon panic=-1 quiet loglevel=0' (50 runs):
> > |
> > | 0.512626031 seconds time elapsed ( +- 0.26% )
> > |
> > | [mark@gravadlaks:~/repro]% taskset -c 64 ./vmboot.sh --perf-trials 50 -s -k ~/kernel-images/andrey-unwind-v5.18-rc1-defconfig+scs/Image
> > |
> > | Performance counter stats for
> > | '/home/mark/.opt/apps/qemu/bin/qemu-system-aarch64 -m 2048 -smp 1 -nographic
> > | -no-reboot -machine virt,accel=kvm,gic-version=host -cpu host -kernel
> > | /home/mark/kernel-images/andrey-unwind-v5.18-rc1-defconfig+scs/Image -append
> > | loglevel=9 earlycon panic=-1 quiet loglevel=0' (50 runs):
> > |
> > | 0.523245952 seconds time elapsed ( +- 0.18% )
> >
> > Then I tried the same with your config, without your patches:
> >
> > | [mark@gravadlaks:~/repro]% taskset -c 64 ./vmboot.sh --perf-trials 50 -s -k ~/kernel-images/andrey-unwind-v5.18-rc1-andrey.config/Image
> > |
> > | Performance counter stats for
> > | '/home/mark/.opt/apps/qemu/bin/qemu-system-aarch64 -m 2048 -smp 1 -nographic
> > | -no-reboot -machine virt,accel=kvm,gic-version=host -cpu host -kernel
> > | /home/mark/kernel-images/andrey-unwind-v5.18-rc1-andrey.config/Image -append
> > | loglevel=9 earlycon panic=-1 quiet loglevel=0' (50 runs):
> > |
> > | 1.994692366 seconds time elapsed ( +- 0.05% )
> >
> > Then with your config, without your patches, with the stacktrace hacked out:
> >
> > | [mark@gravadlaks:~/repro]% taskset -c 64 ./vmboot.sh --perf-trials 50 -s -k ~/kernel-images/andrey-unwind-v5.18-rc1-andrey.config-nostacktrace/Image
> > |
> > | Performance counter stats for
> > | '/home/mark/.opt/apps/qemu/bin/qemu-system-aarch64 -m 2048 -smp 1 -nographic
> > | -no-reboot -machine virt,accel=kvm,gic-version=host -cpu host -kernel
> > | /home/mark/kernel-images/andrey-unwind-v5.18-rc1-andrey.config-nostacktrace/Image
> > | -append loglevel=9 earlycon panic=-1 quiet loglevel=0' (50 runs):
> > |
> > | 1.861823869 seconds time elapsed ( +- 0.05% )
> >
> > If I use those number to estimate the proportion of time spent stacktracing,
> > with the baseline SCS number discounted to remove the hypervisor+VMM overheads,
> > I get:
> >
> > (1.994692366 - 0.523245952) - (1.861823869 - 0.523245952)
> > --------------------------------------------------------- = 0.09029788358
> > (1.994692366 - 0.523245952)
> >
> > So roughly 9% when I try to maximize that figure. When actually poking hardware
> > and doing real work, that figure goes down. For example, if just using "quiet":
> >
> > | [mark@gravadlaks:~/repro]% taskset -c 64 ./vmboot.sh --perf-trials 50 -q -k ~/kernel-images/andrey-unwind-v5.18-rc1-andrey.config/Image > /dev/null
> > |
> > | Performance counter stats for
> > | '/home/mark/.opt/apps/qemu/bin/qemu-system-aarch64 -m 2048 -smp 1 -nographic
> > | -no-reboot -machine virt,accel=kvm,gic-version=host -cpu host -kernel
> > | /home/mark/kernel-images/andrey-unwind-v5.18-rc1-andrey.config/Image -append
> > | loglevel=9 earlycon panic=-1 quiet' (50 runs):
> > |
> > | 4.653286475 seconds time elapsed ( +- 0.06% )
> >
> > | [mark@gravadlaks:~/repro]% taskset -c 64 ./vmboot.sh --perf-trials 50 -q -k ~/kernel-images/andrey-unwind-v5.18-rc1-andrey.config-nostacktrace/Image > /dev/null
> > |
> > | Performance counter stats for
> > | '/home/mark/.opt/apps/qemu/bin/qemu-system-aarch64 -m 2048 -smp 1 -nographic
> > | -no-reboot -machine virt,accel=kvm,gic-version=host -cpu host -kernel
> > | /home/mark/kernel-images/andrey-unwind-v5.18-rc1-andrey.config-nostacktrace/Image
> > | -append loglevel=9 earlycon panic=-1 quiet' (50 runs):
> > |
> > | 4.585750154 seconds time elapsed ( +- 0.05% )
> >
> > Which gives an estimate of:
> >
> > (4.653286475 - 0.523245952) - (4.585750154 - 0.523245952)
> > --------------------------------------------------------- = 0.01635245964
> > (4.653286475 - 0.523245952)
> >
> > ... or ~1.6% time spent backtracing:
> >
> > FWIW, applying your patches do show some benefit, but not as drastic as I was
> > expecting:
> >
> > With console output suprressed:
> >
> > | [mark@gravadlaks:~/repro]% taskset -c 64 ./vmboot.sh --perf-trials 50 -s -k ~/kernel-images/andrey-unwind-v5.18-rc1+scs-unwind-andrey.config/Image
> > |
> > | Performance counter stats for
> > | '/home/mark/.opt/apps/qemu/bin/qemu-system-aarch64 -m 2048 -smp 1 -nographic
> > | -no-reboot -machine virt,accel=kvm,gic-version=host -cpu host -kernel
> > | /home/mark/kernel-images/andrey-unwind-v5.18-rc1+scs-unwind-andrey.config/Image
> > | -append loglevel=9 earlycon panic=-1 quiet loglevel=0' (50 runs):
> > |
> > | 1.920300410 seconds time elapsed ( +- 0.05% )
> >
> > ... down from ~9% to ~4%
> >
> > With console output merely reduced:
> >
> > | [mark@gravadlaks:~/repro]% taskset -c 64 ./vmboot.sh --perf-trials 50 -q -k ~/kernel-images/andrey-unwind-v5.18-rc1+scs-unwind-andrey.config/Image > /dev/null
> > |
> > | Performance counter stats for
> > | '/home/mark/.opt/apps/qemu/bin/qemu-system-aarch64 -m 2048 -smp 1 -nographic
> > | -no-reboot -machine virt,accel=kvm,gic-version=host -cpu host -kernel
> > | /home/mark/kernel-images/andrey-unwind-v5.18-rc1+scs-unwind-andrey.config/Image
> > | -append loglevel=9 earlycon panic=-1 quiet' (50 runs):
> > |
> > | 4.611277833 seconds time elapsed ( +- 0.04% )
> >
> > ... down from 1.6% to 0.6%
> >
> > Given the above I still think we need to understand this a bit better before we
> > consider pursuing the SCS unwinder, given the issues I laid out in my prior mails.
> >
> > My hope is that we can improve the regular unwinder or other code such that
> > this becomes moot. I'm aware of a few things we could try, but given it's very
> > easy to sink a lot of time and effort into this, I'd like to first get some
> > more details, as above.
>
> Hi Mark,
>
> I'm about to publish v3, where I'll include a detailed description of
> how I measured the performance.
>
> Perhaps we see different performance numbers because you're using
> KVM-enabled VM on an Arm host and I'm using QEMU on x86-64 host.

Hold on; are you using QEMU in TCG mode? If so that's in no way representative
of real HW performance, and there are operations it simply cannot make as fast
as HW can (e.g. pointer authentication using the architected QARMA variants).

> Although, it's suspicious that the difference is so drastic.

I'm not surprised at all. Some operations can be *orders of magnitude slower*
under TCG than on real HW even when considered relative to other operations,
and this can drasticaly skew benchmarks. We recently hit a case when PACIASP
and AUTIASP were so slow under TCG mode they appeared to be causing a boot
hang, and we eventually figured out that they were just *very* slow, adding
minutes to the boot time. Richard Henderson added options to QEMU to mitigate
that (either disabling authentication, or using an IMPLEMENTATION DEFINED
algorithm).

The bottom line is that QEMU TCG mode is in no way representative of real-world
performance, and *cannot* be used for benchmarking.

I think we first need to understand *what* is so slow under QEMU TCG mode, and
*why* TCG mode performance matters. I suspect there are other ways we could
avoid this overhead *without* adding another unwinder, but even then we need a
justification for *why* we should care.

> I'll try to get my hands on some Arm hardware in the next few days and do the
> measurements there.
>
> This new version also will not be making any changes to the entry
> code, as these changes add unwanted additional slowdown. That would be
> great, if you could check the performance impact of v3 with your
> setup.

I'll take a look at the series, but as before I do not think we should add
another unwinder. As above, I *certainly* do not think we should add another
unwinder based on TCG performance.

Thanks,
Mark.