Re: [RFC PATCH v7 00/16] Add support for eXclusive Page Frame Ownership

From: Laura Abbott
Date: Tue Jan 15 2019 - 20:29:03 EST


On 1/10/19 1:09 PM, Khalid Aziz wrote:
I am continuing to build on the work Juerg, Tycho and Julian have done
on XPFO. After the last round of updates, we were seeing very
significant performance penalties when stale TLB entries were flushed
actively after an XPFO TLB update. Benchmark for measuring performance
is kernel build using parallel make. To get full protection from
ret2dir attacks, we must flush stale TLB entries. Performance
penalty from flushing stale TLB entries goes up as the number of
cores goes up. On a desktop class machine with only 4 cores,
enabling TLB flush for stale entries causes system time for "make
-j4" to go up by a factor of 2.614x but on a larger machine with 96
cores, system time with "make -j60" goes up by a factor of 26.366x!
I have been working on reducing this performance penalty.

I implemented a solution to reduce the performance penalty, and
it has had a large impact. When XPFO code flushes stale TLB entries,
it does so for all CPUs on the system which may include CPUs that
may not have any matching TLB entries or may never be scheduled to
run the userspace task causing TLB flush. Problem is made worse by
the fact that if number of entries being flushed exceeds
tlb_single_page_flush_ceiling, it results in a full TLB flush on
every CPU. A rogue process can launch a ret2dir attack only from a
CPU that has dual mapping for its pages in physmap in its TLB. We
can hence defer TLB flush on a CPU until a process that would have
caused a TLB flush is scheduled on that CPU. I have added a cpumask
to task_struct which is then used to post pending TLB flush on CPUs
other than the one a process is running on. This cpumask is checked
when a process migrates to a new CPU and TLB is flushed at that
time. I measured system time for parallel make with unmodified 4.20
kernel, 4.20 with XPFO patches before this optimization and then
again after applying this optimization. Here are the results:

Hardware: 96-core Intel Xeon Platinum 8160 CPU @ 2.10GHz, 768 GB RAM
make -j60 all

4.20                         915.183s
4.20+XPFO                  24129.354s   26.366x
4.20+XPFO+Deferred flush    1216.987s    1.330x


Hardware: 4-core Intel Core i5-3550 CPU @ 3.30GHz, 8G RAM
make -j4 all

4.20                         607.671s
4.20+XPFO                   1588.646s    2.614x
4.20+XPFO+Deferred flush     794.473s    1.307x
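
For reference, the factors in the two tables are the patched system time
divided by the unpatched 4.20 system time. A minimal sketch of that
arithmetic follows; the function names are made up for illustration and
are not part of the patch set.

```c
#include <assert.h>

/* Ratio of a patched kernel's system time to the baseline system time. */
static double slowdown(double patched_secs, double baseline_secs)
{
    assert(baseline_secs > 0.0);
    return patched_secs / baseline_secs;
}

/* The same ratio expressed as percent overhead over the baseline. */
static double overhead_percent(double patched_secs, double baseline_secs)
{
    return (slowdown(patched_secs, baseline_secs) - 1.0) * 100.0;
}
```

With the numbers above, slowdown(1216.987, 915.183) comes out at about
1.330 on the 96-core machine, and overhead_percent(794.473, 607.671) is
a little under 31% on the 4-core machine.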

30+% overhead is still very high and there is room for improvement.
Dave Hansen had suggested batch updating TLB entries and Tycho had
created an initial implementation but I have not been able to get
that to work correctly. I am still working on it and I suspect we
will see a noticeable improvement in performance with that. In the
code I added, I post a pending full TLB flush to all other CPUs even
when number of TLB entries being flushed on current CPU does not
exceed tlb_single_page_flush_ceiling. There has to be a better way
to do this. I just haven't found an efficient way to implement a
delayed, limited TLB flush on other CPUs.
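
The deferral scheme described above can be modelled in a few lines of
userspace C. This is only a sketch of the idea, not the code from patch
16: struct task_model, xpfo_model_flush(), xpfo_model_switch_in() and
the 8-CPU limit are hypothetical, a plain bitmask stands in for the
cpumask added to task_struct, and a counter stands in for the actual
TLB flush.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NR_CPUS 8   /* hypothetical small machine for the model */

/* Stand-in for the per-task state; the patch adds a cpumask to task_struct. */
struct task_model {
    uint32_t pending_flush_mask;   /* bit n set: CPU n still owes a flush */
};

/* Per-CPU flush counters so the effect of deferral is observable. */
static unsigned int flushes_done[NR_CPUS];

/* XPFO changed a physmap mapping: flush locally, post a flush elsewhere. */
static void xpfo_model_flush(struct task_model *t, unsigned int this_cpu)
{
    assert(this_cpu < NR_CPUS);
    flushes_done[this_cpu]++;      /* immediate flush on the current CPU */
    t->pending_flush_mask |= ((1u << NR_CPUS) - 1) & ~(1u << this_cpu);
}

/* The task is about to run on 'cpu'. Returns true if a flush was performed. */
static bool xpfo_model_switch_in(struct task_model *t, unsigned int cpu)
{
    uint32_t bit = 1u << cpu;

    assert(cpu < NR_CPUS);
    if (!(t->pending_flush_mask & bit))
        return false;              /* nothing pending for this CPU */
    t->pending_flush_mask &= ~bit; /* clear before flushing */
    flushes_done[cpu]++;           /* deferred full flush happens now */
    return true;
}
```

In this model a CPU the task never runs on never pays for a flush, which
is where the savings on the 96-core machine would come from. It also
shows the limitation noted above: the pending flush is always a full
one, because the mask records only that a flush is owed, not which
entries are stale.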

I am not entirely sure if switch_mm_irqs_off() is indeed the right
place to perform the pending TLB flush for a CPU. Any feedback on
that will be very helpful. Delaying full TLB flushes on other CPUs
seems to help tremendously, so if there is a better way to implement
the same thing than what I have done in patch 16, I am open to
ideas.

Performance with this patch set is good enough to use it as a
starting point for further refinement before we merge it into the
main kernel, hence RFC.

Since not flushing stale TLB entries creates a false sense of
security, I would recommend making TLB flush mandatory and eliminating
the "xpfotlbflush" kernel parameter (patch "mm, x86: omit TLB
flushing by default for XPFO page table modifications").

What remains to be done beyond this patch series:

1. Performance improvements
2. Remove xpfotlbflush parameter
3. Re-evaluate the patch "arm64/mm: Add support for XPFO to swiotlb"
from Juerg. I dropped it for now since swiotlb code for ARM has
changed a lot in 4.20.
4. Extend the patch "xpfo, mm: Defer TLB flushes for non-current
CPUs" to other architectures besides x86.


---------------------------------------------------------

Juerg Haefliger (5):
mm, x86: Add support for eXclusive Page Frame Ownership (XPFO)
swiotlb: Map the buffer if it was unmapped by XPFO
arm64/mm: Add support for XPFO
arm64/mm, xpfo: temporarily map dcache regions
lkdtm: Add test for XPFO

Julian Stecklina (4):
mm, x86: omit TLB flushing by default for XPFO page table
modifications
xpfo, mm: remove dependency on CONFIG_PAGE_EXTENSION
xpfo, mm: optimize spinlock usage in xpfo_kunmap
EXPERIMENTAL: xpfo, mm: optimize spin lock usage in xpfo_kmap

Khalid Aziz (2):
xpfo, mm: Fix hang when booting with "xpfotlbflush"
xpfo, mm: Defer TLB flushes for non-current CPUs (x86 only)

Tycho Andersen (5):
mm: add MAP_HUGETLB support to vm_mmap
x86: always set IF before oopsing from page fault
xpfo: add primitives for mapping underlying memory
arm64/mm: disable section/contiguous mappings if XPFO is enabled
mm: add a user_virt_to_phys symbol

.../admin-guide/kernel-parameters.txt | 2 +
arch/arm64/Kconfig | 1 +
arch/arm64/mm/Makefile | 2 +
arch/arm64/mm/flush.c | 7 +
arch/arm64/mm/mmu.c | 2 +-
arch/arm64/mm/xpfo.c | 58 ++++
arch/x86/Kconfig | 1 +
arch/x86/include/asm/pgtable.h | 26 ++
arch/x86/include/asm/tlbflush.h | 1 +
arch/x86/mm/Makefile | 2 +
arch/x86/mm/fault.c | 10 +
arch/x86/mm/pageattr.c | 23 +-
arch/x86/mm/tlb.c | 27 ++
arch/x86/mm/xpfo.c | 171 ++++++++++++
drivers/misc/lkdtm/Makefile | 1 +
drivers/misc/lkdtm/core.c | 3 +
drivers/misc/lkdtm/lkdtm.h | 5 +
drivers/misc/lkdtm/xpfo.c | 194 ++++++++++++++
include/linux/highmem.h | 15 +-
include/linux/mm.h | 2 +
include/linux/mm_types.h | 8 +
include/linux/page-flags.h | 13 +
include/linux/sched.h | 9 +
include/linux/xpfo.h | 90 +++++++
include/trace/events/mmflags.h | 10 +-
kernel/dma/swiotlb.c | 3 +-
mm/Makefile | 1 +
mm/mmap.c | 19 +-
mm/page_alloc.c | 3 +
mm/util.c | 32 +++
mm/xpfo.c | 247 ++++++++++++++++++
security/Kconfig | 29 ++
32 files changed, 974 insertions(+), 43 deletions(-)
create mode 100644 arch/arm64/mm/xpfo.c
create mode 100644 arch/x86/mm/xpfo.c
create mode 100644 drivers/misc/lkdtm/xpfo.c
create mode 100644 include/linux/xpfo.h
create mode 100644 mm/xpfo.c


So this seems to blow up immediately on my arm64 box with a config
based on Fedora:

[ 11.008243] Unable to handle kernel paging request at virtual address ffff8003f8602f9b
[ 11.016133] Mem abort info:
[ 11.018926] ESR = 0x96000007
[ 11.021967] Exception class = DABT (current EL), IL = 32 bits
[ 11.027858] SET = 0, FnV = 0
[ 11.030904] EA = 0, S1PTW = 0
[ 11.034030] Data abort info:
[ 11.036896] ISV = 0, ISS = 0x00000007
[ 11.040718] CM = 0, WnR = 0
[ 11.043672] swapper pgtable: 4k pages, 48-bit VAs, pgdp = (____ptrval____)
[ 11.050523] [ffff8003f8602f9b] pgd=00000043ffff7803, pud=00000043fe113803, pmd=00000043fc376803, pte=00e80043f8602f13
[ 11.061094] Internal error: Oops: 96000007 [#3] SMP
[ 11.065948] Modules linked in: xfs libcrc32c sdhci_of_arasan sdhci_pltfm sdhci i2c_xgene_slimpro cqhci gpio_dwapb xhci_plat_hcd gpio_xgene_sb gpio_keys
[ 11.079454] CPU: 3 PID: 577 Comm: systemd-getty-g Tainted: G D 4.20.0-xpfo+ #9
[ 11.087936] Hardware name: www.apm.com American Megatrends/American Megatrends, BIOS 3.07.06 20/03/2015
[ 11.097285] pstate: 00400005 (nzcv daif +PAN -UAO)
[ 11.102057] pc : __memcpy+0x20/0x180
[ 11.105616] lr : __access_remote_vm+0x7c/0x1f0
[ 11.110036] sp : ffff000011cb3c20
[ 11.113333] x29: ffff000011cb3c20 x28: ffff8003f8602000
[ 11.118619] x27: 0000000000000f9b x26: 0000000000001000
[ 11.123904] x25: 000083ffffffffff x24: cccccccccccccccd
[ 11.129189] x23: ffff8003d7c53000 x22: 0000000000000044
[ 11.134474] x21: 0000fffff0591f9b x20: 0000000000000044
[ 11.139759] x19: 0000000000000044 x18: 0000000000000000
[ 11.145044] x17: 0000000000000002 x16: 0000000000000000
[ 11.150329] x15: 0000000000000000 x14: 0000000000000000
[ 11.155614] x13: 0000000000000000 x12: 0000000000000000
[ 11.160899] x11: 0000000000000000 x10: 0000000000000000
[ 11.166184] x9 : 0000000000000000 x8 : 0000000000000000
[ 11.171469] x7 : 0000000000000000 x6 : ffff8003d7c53000
[ 11.176754] x5 : 00e00043f8602fd3 x4 : 0000000000000005
[ 11.182038] x3 : 00000003f8602000 x2 : 000000000000003f
[ 11.187323] x1 : ffff8003f8602f9b x0 : ffff8003d7c53000
[ 11.192609] Process systemd-getty-g (pid: 577, stack limit = 0x(____ptrval____))
[ 11.199967] Call trace:
[ 11.202400] __memcpy+0x20/0x180
[ 11.205611] access_remote_vm+0x4c/0x60
[ 11.209428] environ_read+0x12c/0x260
[ 11.213071] __vfs_read+0x48/0x158
[ 11.216454] vfs_read+0x94/0x150
[ 11.219665] ksys_read+0x54/0xb0
[ 11.222875] __arm64_sys_read+0x24/0x30
[ 11.226691] el0_svc_handler+0x94/0x110
[ 11.230508] el0_svc+0x8/0xc
[ 11.233375] Code: f2400c84 540001c0 cb040042 36000064 (38401423)
[ 11.239439] ---[ end trace 4132d3416fb70591 ]---
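
For anyone reading the oops, the ESR value can be decoded by hand. The
sketch below splits it into the fields defined for ESR_ELx data aborts
in the Arm architecture (EC in bits 31:26, IL in bit 25, ISS in bits
24:0, with WnR in ISS bit 6 and DFSC in ISS bits 5:0). The struct and
function names are made up for this example and are not kernel APIs.

```c
#include <stdbool.h>
#include <stdint.h>

struct esr_fields {
    unsigned int ec;    /* ESR_ELx[31:26], exception class */
    unsigned int il;    /* ESR_ELx[25], instruction length bit */
    unsigned int iss;   /* ESR_ELx[24:0], instruction specific syndrome */
    unsigned int wnr;   /* ISS[6], 1 = write, 0 = read (data aborts) */
    unsigned int dfsc;  /* ISS[5:0], data fault status code */
};

/* Split a raw ESR_ELx value into the fields used for data aborts. */
static struct esr_fields esr_decode(uint32_t esr)
{
    struct esr_fields f;

    f.ec   = (esr >> 26) & 0x3f;
    f.il   = (esr >> 25) & 0x1;
    f.iss  = esr & 0x1ffffff;
    f.wnr  = (f.iss >> 6) & 0x1;
    f.dfsc = f.iss & 0x3f;
    return f;
}

/* DFSC 0b0001LL is a translation fault at level LL; report the level. */
static bool dfsc_translation_fault(unsigned int dfsc, unsigned int *level)
{
    if ((dfsc & 0x3c) != 0x04)
        return false;
    *level = dfsc & 0x3;
    return true;
}
```

For 0x96000007 this gives EC = 0x25 (data abort taken without a change
in exception level), IL = 1, WnR = 0 and DFSC = 0x07, i.e. a level 3
translation fault on a read. That matches the trace: the faulting
address is the value in x1, the source pointer of the memcpy.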

I'll see if I get some time tomorrow to dig into this unless
someone spots a problem sooner.

Thanks,
Laura