Re: [PATCH V3 09/30] arm/mm: Enable ARCH_HAS_VM_GET_PAGE_PROT

From: Christophe Leroy
Date: Wed Mar 02 2022 - 02:05:44 EST




Le 02/03/2022 à 04:22, Anshuman Khandual a écrit :
>
>
> On 3/1/22 1:46 PM, Christophe Leroy wrote:
>>
>>
>> Le 01/03/2022 à 01:31, Russell King (Oracle) a écrit :
>>> On Tue, Mar 01, 2022 at 05:30:41AM +0530, Anshuman Khandual wrote:
>>>> On 2/28/22 4:27 PM, Russell King (Oracle) wrote:
>>>>> On Mon, Feb 28, 2022 at 04:17:32PM +0530, Anshuman Khandual wrote:
>>>>>> This defines and exports a platform specific custom vm_get_page_prot() via
>>>>>> subscribing ARCH_HAS_VM_GET_PAGE_PROT. Subsequently all __SXXX and __PXXX
>>>>>> macros can be dropped which are no longer needed.
>>>>>
>>>>> What I would really like to know is why having to run _code_ to work out
>>>>> what the page protections need to be is better than looking it up in a
>>>>> table.
>>>>>
>>>>> Not only is this more expensive in terms of CPU cycles, it also brings
>>>>> additional code size with it.
>>>>>
>>>>> I'm struggling to see what the benefit is.
>>>>
>>>> Currently vm_get_page_prot() is also being _run_ to fetch required page
>>>> protection values. Although that is being run in the core MM and from a
>>>> platform perspective __SXXX, __PXXX are just being exported for a table.
>>>> Looking it up in a table (and applying more constructs there after) is
>>>> not much different than a clean switch case statement in terms of CPU
>>>> usage. So this is not more expensive in terms of CPU cycles.
>>>
>>> I disagree.
>>
>> So do I.
>>
>>>
>>> However, let's base this disagreement on some evidence. Here is the
>>> present 32-bit ARM implementation:
>>>
>>> 00000048 <vm_get_page_prot>:
>>> 48: e200000f and r0, r0, #15
>>> 4c: e3003000 movw r3, #0
>>> 4c: R_ARM_MOVW_ABS_NC .LANCHOR1
>>> 50: e3403000 movt r3, #0
>>> 50: R_ARM_MOVT_ABS .LANCHOR1
>>> 54: e7930100 ldr r0, [r3, r0, lsl #2]
>>> 58: e12fff1e bx lr
>>>
>>> That is five instructions long.
>>
>> On ppc32 I get:
>>
>> 00000094 <vm_get_page_prot>:
>> 94: 3d 20 00 00 lis r9,0
>> 96: R_PPC_ADDR16_HA .data..ro_after_init
>> 98: 54 84 16 ba rlwinm r4,r4,2,26,29
>> 9c: 39 29 00 00 addi r9,r9,0
>> 9e: R_PPC_ADDR16_LO .data..ro_after_init
>> a0: 7d 29 20 2e lwzx r9,r9,r4
>> a4: 91 23 00 00 stw r9,0(r3)
>> a8: 4e 80 00 20 blr
>>
>>
>>>
>>> Please show that your new implementation is not more expensive on
>>> 32-bit ARM. Please do so by building a 32-bit kernel, and providing
>>> the disassembly.
>>
>> With your series I get:
>>
>> 00000000 <vm_get_page_prot>:
>> 0: 3d 20 00 00 lis r9,0
>> 2: R_PPC_ADDR16_HA .rodata
>> 4: 39 29 00 00 addi r9,r9,0
>> 6: R_PPC_ADDR16_LO .rodata
>> 8: 54 84 16 ba rlwinm r4,r4,2,26,29
>> c: 7d 49 20 2e lwzx r10,r9,r4
>> 10: 7d 4a 4a 14 add r10,r10,r9
>> 14: 7d 49 03 a6 mtctr r10
>> 18: 4e 80 04 20 bctr
>> 1c: 39 20 03 15 li r9,789
>> 20: 91 23 00 00 stw r9,0(r3)
>> 24: 4e 80 00 20 blr
>> 28: 39 20 01 15 li r9,277
>> 2c: 91 23 00 00 stw r9,0(r3)
>> 30: 4e 80 00 20 blr
>> 34: 39 20 07 15 li r9,1813
>> 38: 91 23 00 00 stw r9,0(r3)
>> 3c: 4e 80 00 20 blr
>> 40: 39 20 05 15 li r9,1301
>> 44: 91 23 00 00 stw r9,0(r3)
>> 48: 4e 80 00 20 blr
>> 4c: 39 20 01 11 li r9,273
>> 50: 4b ff ff d0 b 20 <vm_get_page_prot+0x20>
>>
>>
>> That is definitely more expensive, it implements a table of branches.
>
> Okay, will split out the PPC32 implementation that retains existing
> table look up method. Also planning to keep that inside same file
> (arch/powerpc/mm/mmap.c), unless you have a difference preference.

My point was not to get something specific for PPC32, but to amplify on
Russell's objection.

As this is bad for ARM and bad for PPC32, do we have any evidence that
your change is good for any other architecture ?

I checked PPC64 and there is exactly the same drawback. With the current
implementation it is a small function performing table read then a few
adjustment. After your change it is a bigger function implementing a
table of branches.

So, as requested by Russell, could you look at the disassembly for other
architectures and show us that ARM and POWERPC are the only ones for
which your change is not optimal ?

See below the difference for POWERPC64.

Current implementation:

0000000000000a60 <.vm_get_page_prot>:
a60: 3d 42 00 00 addis r10,r2,0
a62: R_PPC64_TOC16_HA .data..ro_after_init
a64: 78 89 1e 68 rldic r9,r4,3,57
a68: 39 4a 00 00 addi r10,r10,0
a6a: R_PPC64_TOC16_LO .data..ro_after_init
a6c: 74 88 01 00 andis. r8,r4,256
a70: 7d 2a 48 2a ldx r9,r10,r9
a74: 41 82 00 1c beq a90 <.vm_get_page_prot+0x30>
a78: 60 00 00 00 nop
a7c: 60 00 00 00 nop
a80: 48 00 00 18 b a98 <.vm_get_page_prot+0x38>
a84: 60 00 00 00 nop
a88: 60 00 00 00 nop
a8c: 60 00 00 00 nop
a90: 60 00 00 00 nop
a94: 60 00 00 00 nop
a98: 0f e0 00 00 twui r0,0
a9c: 60 00 00 00 nop
aa0: 38 80 00 10 li r4,16
aa4: 7d 29 23 78 or r9,r9,r4
aa8: f9 23 00 00 std r9,0(r3)
aac: 4e 80 00 20 blr
ab0: 78 84 d9 04 rldicr r4,r4,27,4
ab4: 78 84 e8 c2 rldicl r4,r4,61,3
ab8: 60 84 00 10 ori r4,r4,16
abc: 4b ff ff e8 b aa4 <.vm_get_page_prot+0x44>
ac0: 78 84 d9 04 rldicr r4,r4,27,4
ac4: 78 84 e8 c2 rldicl r4,r4,61,3
ac8: 4b ff ff dc b aa4 <.vm_get_page_prot+0x44>


With your series:

00000000000005b0 <.vm_get_page_prot>:
5b0: 3d 22 00 00 addis r9,r2,0
5b2: R_PPC64_TOC16_HA .toc+0x10
5b4: e9 49 00 00 ld r10,0(r9)
5b6: R_PPC64_TOC16_LO_DS .toc+0x10
5b8: 78 89 16 a8 rldic r9,r4,2,58
5bc: 7d 2a 4a aa lwax r9,r10,r9
5c0: 7d 29 52 14 add r9,r9,r10
5c4: 7d 29 03 a6 mtctr r9
5c8: 3d 20 80 00 lis r9,-32768
5cc: 79 29 07 c6 rldicr r9,r9,32,31
5d0: 4e 80 04 20 bctr
5d4: 00 00 00 ec .long 0xec
5d8: 00 00 00 6c .long 0x6c
5dc: 00 00 00 6c .long 0x6c
5e0: 00 00 00 6c .long 0x6c
5e4: 00 00 00 4c .long 0x4c
5e8: 00 00 00 4c .long 0x4c
5ec: 00 00 00 4c .long 0x4c
5f0: 00 00 00 4c .long 0x4c
5f4: 00 00 00 ec .long 0xec
5f8: 00 00 00 6c .long 0x6c
5fc: 00 00 00 cc .long 0xcc
600: 00 00 00 cc .long 0xcc
604: 00 00 00 4c .long 0x4c
608: 00 00 00 4c .long 0x4c
60c: 00 00 00 dc .long 0xdc
610: 00 00 00 dc .long 0xdc
614: 60 00 00 00 nop
618: 60 00 00 00 nop
61c: 60 00 00 00 nop
620: 61 29 01 05 ori r9,r9,261
624: 74 8a 01 00 andis. r10,r4,256
628: 41 82 00 24 beq 64c <.vm_get_page_prot+0x9c>
62c: 60 00 00 00 nop
630: 60 00 00 00 nop
634: 0f e0 00 00 twui r0,0
638: 60 00 00 00 nop
63c: 60 00 00 00 nop
640: 74 8a 01 00 andis. r10,r4,256
644: 61 29 01 04 ori r9,r9,260
648: 40 82 ff e4 bne 62c <.vm_get_page_prot+0x7c>
64c: 60 00 00 00 nop
650: 60 00 00 00 nop
654: 4b ff ff e0 b 634 <.vm_get_page_prot+0x84>
658: 60 00 00 00 nop
65c: 60 00 00 00 nop
660: 78 84 d9 04 rldicr r4,r4,27,4
664: 78 84 e8 c2 rldicl r4,r4,61,3
668: 7d 29 23 78 or r9,r9,r4
66c: f9 23 00 00 std r9,0(r3)
670: 4e 80 00 20 blr
674: 60 00 00 00 nop
678: 60 00 00 00 nop
67c: 60 00 00 00 nop
680: 38 80 00 10 li r4,16
684: 4b ff ff e4 b 668 <.vm_get_page_prot+0xb8>
688: 60 00 00 00 nop
68c: 60 00 00 00 nop
690: 78 84 d9 04 rldicr r4,r4,27,4
694: 78 84 e8 c2 rldicl r4,r4,61,3
698: 60 84 00 10 ori r4,r4,16
69c: 4b ff ff cc b 668 <.vm_get_page_prot+0xb8>
6a0: 61 29 01 06 ori r9,r9,262
6a4: 4b ff ff 80 b 624 <.vm_get_page_prot+0x74>
6a8: 60 00 00 00 nop
6ac: 60 00 00 00 nop
6b0: 61 29 01 07 ori r9,r9,263
6b4: 4b ff ff 70 b 624 <.vm_get_page_prot+0x74>
6b8: 60 00 00 00 nop
6bc: 60 00 00 00 nop
6c0: 61 29 01 08 ori r9,r9,264
6c4: 4b ff ff 60 b 624 <.vm_get_page_prot+0x74>

Thanks
Christophe