[PATCH] doc: code generation style

From: Alexey Dobriyan
Date: Thu Mar 05 2020 - 14:03:00 EST


I wonder if it would be useful to have something like this in tree.

It states trivial things for anyone who has looked at disassembly a few times,
but still...

Signed-off-by: Alexey Dobriyan <adobriyan@xxxxxxxxx>
---

Documentation/process/code-generation.rst | 196 ++++++++++++++++++++++++++++++
1 file changed, 196 insertions(+)

new file mode 100644
--- /dev/null
+++ b/Documentation/process/code-generation.rst
@@ -0,0 +1,196 @@
+Code generation
+===============
+
+1) Generic techniques
+---------------------
+
+### a) Inlining/uninlining function calls ###
+
+An external function call is serious business from a code generation point of
+view. ABIs require that specific arguments are placed into specific registers
+before the call, forcing spills and register shuffling to accommodate ABI rules.
+Registers which the ABI treats as clobbered but the callee never touches are
+wasted. Declaring a function as ``static inline`` in a header gives the compiler
+more information to work with.
+
+However, excessive inlining often leads to code bloat for no measurable
+performance gain. In such cases it is probably better to save on generated code
+to reduce icache, disk I/O and network bandwidth costs.
+
+Use the ``noinline`` attribute to prevent inlining inside a translation unit and
+see what happens:
+
+.. code-block:: c
+
+ noinline
+ int f(void)
+ {
+ ...
+ }
+
+It is hard to advise any more than that, as modern compilers generate code in
+mysterious ways.
+
+
+### b) Appending arguments ###
+
+Some functions are thin wrappers appending an argument or two to another
+function which actually does the job:
+
+.. code-block:: c
+
+ int g(int, int, flag_t);
+ int f(int a, int b)
+ {
+ return g(a, b, FLAG_C);
+ }
+
+Appending an argument at the end adds a minimal amount of code:
+
+.. code-block:: none
+
+ f:
+ mov edx, FLAG_C
+ jmp g
+
+Inserting an argument in the middle or at the beginning will generate
+a reshuffle sequence:
+
+.. code-block:: none
+
+ f:
+ mov edx, esi
+ mov esi, edi
+ mov edi, FLAG_C
+ jmp g
+
+Do not enforce this rule religiously, as there may be other reasons for a
+specific argument order, most notably keeping related arguments together
+at source level.
+
+
+2) Architecture specific issues (i386/x86_64)
+---------------------------------------------
+
+### a) Member placement ###
+
+The first member of any structure is special on i386/x86_64: the compiler will
+use the ``[r32]`` or ``[r64]`` addressing mode, which has the shortest encoding.
+After laying out members of a structure into cachelines for performance,
+move the most often used member of the first cacheline to the very beginning.
+
+That done, pay attention to bytes 1--127. Members placed there will be accessed
+with ``[r64+disp8]`` addressing (or ``[r32+disp8]`` on i386). This is only 1 byte
+longer than the encoding used for the first member but 3 bytes *shorter* than
+``[r64+disp32]`` used for all other members. Try to shift more often used
+members into the first 2 cachelines (128 bytes with 64-byte cachelines).
+
+"Refugee" members living in byte 128 and beyond can be placed in any order.
+
+
+### b) Implicit 32/64-bit casts ###
+
+Avoid casts which change signedness and/or bitness of a value.
+
+If some piece of data appears in the code, it should generally be kept in its
+original type unless there are specific reasons to do otherwise (packing, etc).
+With C's seemingly arcane implicit and explicit conversion rules, this is good
+advice from a programming language point of view as well.
+
+Given the code:
+
+.. code-block:: c
+
+ void f(size_t);
+
+ int len = strlen(s);
+ f(len);
+
+if the compiler doesn't or can't track value ranges through casts, it has no
+choice but to assume that all ``size_t`` values are possible, so ``len`` may be
+negative after truncation, and emit a sign-extending MOVSX instruction:
+
+.. code-block:: none
+
+ mov rdi, ...
+ call strlen
+ movsx rdi, eax
+ call f
+
+MOVSX by itself is not a problem, but a) it may be 1 byte longer than a MOV
+instruction with the same arguments and b) it won't be handled by register
+renaming, lengthening dependency chains by 1 instruction.
+
+
+### c) 64-bitness ###
+
+64-bit instructions are typically 1 byte longer than their 32-bit equivalents
+on x86_64 because they need a REX prefix.
+
+There is one big 64-bit enabler, which is dynamic memory allocation: all
+kmalloc variants accept ``size_t`` and the ``sizeof`` operator yields ``size_t``.
+
+Do not use 64-bit/``size_t`` unless strictly necessary (pointer-to-integer
+conversion, syscall ABI interfaces, integers which can be genuinely big on
+big machines, statistics).
+
+Use 32-bit ``unsigned int``. The kernel simply doesn't do individual 4+ GB
+allocations, and if it does, they probably go via the page allocator. Such huge
+amounts of memory simply aren't needed: the network doesn't do gigabyte packets,
+VFS caps IO at 2 GB minus a little and interacting with userspace via
+``copy_from_user``/``copy_to_user`` is capped at ``INT_MAX`` as well.
+
+.. code-block:: c
+
+ #define MAX_RW_COUNT (INT_MAX & PAGE_MASK)
+
+The only exceptional case is a ``size_t`` value being passed directly into
+a standard function accepting ``size_t`` (``memset``, ``memcpy``, ...).
+Truncating the value to 32 bits won't do anything useful in this case.
+
+
+### d) 16-bitness ###
+
+16-bit instructions need a 1-byte operand-size override prefix (0x66),
+which again bloats an instruction by 1 byte. Unlike REX prefixes, this is
+unavoidable.
+
+It is better to use 16-bit types at ABI/protocol/memory level, convert
+to plain ``int``/``unsigned int`` as soon as possible and work with that.
+
+Preferred order of bitness on x86_64 is:
+
+ 32/8-bit > 64-bit > 16-bit.
+
+3) Architecture specific issues (arm/arm64)
+-------------------------------------------
+
+### Constant flags value selection ###
+
+"Tight" constants can be loaded into a register in 1 instruction on arm and
+other RISC architectures:
+
+.. code-block:: c
+
+ int f(void)
+ {
+ return 1;
+ }
+
+.. code-block:: none
+
+ 00000000 <f>:
+ 0: e3a00001 mov r0, #1
+ 4: e12fff1e bx lr
+
+Constants which don't fit into the 12-bit immediate field on arm will be loaded
+from memory or constructed with 2 instructions:
+
+.. code-block:: none
+
+ 00000000 <f>:
+ 0: e59f0000 ldr r0, [pc] ; 8 <f+0x8>
+ 4: e12fff1e bx lr
+ 8: 00000801 .word 0x00000801 ; <=== 2049
+
+After settling on flags/constants, pack often used values into adjacent bits so that their combinations stay "tight".
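
For what it's worth, here is a minimal userspace sketch of the member placement
rule from 2a), kept outside the patch. The structure, its member names and
disp_bytes() are made up for illustration. The 0/1/4 displacement sizes assume
a base register other than rsp/rbp/r12/r13, which need extra encoding bytes
regardless of the offset.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical structure; names and sizes are made up for illustration. */
struct conn {
	uint32_t state;		/* hottest member: offset 0, no displacement */
	uint32_t flags;		/* offset 4: disp8 */
	uint64_t rx_bytes;	/* offset 8: disp8 */
	uint64_t tx_bytes;	/* offset 16: disp8 */
	char warm[104];		/* stand-in for other warm members, bytes 24..127 */
	char name[64];		/* cold: offset 128, disp32 */
};

/* Displacement bytes needed to address a member at offset 'off'. */
static inline unsigned int disp_bytes(size_t off)
{
	if (off == 0)
		return 0;
	return off <= 127 ? 1 : 4;
}

/* Break the build if a later change moves members out of their class. */
static_assert(offsetof(struct conn, state) == 0,
	      "hottest member must come first");
static_assert(offsetof(struct conn, tx_bytes) <= 127,
	      "warm members must stay within disp8 range");
static_assert(offsetof(struct conn, name) >= 128,
	      "cold members belong past the disp8 range");
```

The compile-time checks are just one way to keep such a layout from silently
regressing; inspecting the structure with pahole does the job as well.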