Re: [PATCH v2 07/13] stackleak: rework poison scanning

From: Alexander Popov
Date: Thu May 26 2022 - 19:25:25 EST


On 24.05.2022 16:31, Mark Rutland wrote:
On Sun, May 15, 2022 at 08:33:01PM +0300, Alexander Popov wrote:
On 10.05.2022 16:13, Mark Rutland wrote:
On Mon, May 09, 2022 at 04:51:35PM +0300, Alexander Popov wrote:
Hello Mark!

On 27.04.2022 20:31, Mark Rutland wrote:
Currently we over-estimate the region of stack which must be erased.

To determine the region to be erased, we scan downards for a contiguous
block of poison values (or the low bound of the stack). There are a few
minor problems with this today:

* When we find a block of poison values, we include this block within
the region to erase.

As this is included within the region to erase, this causes us to
redundantly overwrite 'STACKLEAK_SEARCH_DEPTH' (128) bytes with
poison.

Right, this can be improved.

* As the loop condition checks 'poison_count <= depth', it will run an
additional iteration after finding the contiguous block of poison,
decrementing 'erase_low' once more than necessary.

Actually, I think the current code is correct.

I'm intentionally searching one poison value more than
STACKLEAK_SEARCH_DEPTH to avoid the corner case. See the BUILD_BUG_ON
assertion in stackleak_track_stack() that describes it:

/*
* Having CONFIG_STACKLEAK_TRACK_MIN_SIZE larger than
* STACKLEAK_SEARCH_DEPTH makes the poison search in
* stackleak_erase() unreliable. Let's prevent that.
*/
BUILD_BUG_ON(CONFIG_STACKLEAK_TRACK_MIN_SIZE > STACKLEAK_SEARCH_DEPTH);

I had read that, but as written that doesn't imply that it's necessary to scan
one more element than STACKLEAK_SEARCH_DEPTH, nor why. I'm more than happy to
change the logic, but I think we need a very clear explanation as to why we
need to scan the specific number of bytes we scan, and we should account for
that *within* STACKLEAK_SEARCH_DEPTH for clarity.

I'll try to explain.

The stackleak gcc plugin instruments the kernel code inserting the
'stackleak_track_stack()' calls for the functions with a stack frame size
greater than or equal to CONFIG_STACKLEAK_TRACK_MIN_SIZE.

The kernel functions with a smaller stack frame are not instrumented (the
'lowest_stack' value is not updated for them).

I understood these points.

It's also worth noting that `noinstr` code will also not be instrumented
regardless of frame size -- we might want some build-time assertion for those.

I developed a trick that shows noinstr functions that stackleak would like to instrument:

diff --git a/scripts/gcc-plugins/stackleak_plugin.c b/scripts/gcc-plugins/stackleak_plugin.c
index 42f0252ee2a4..6db748d44957 100644
--- a/scripts/gcc-plugins/stackleak_plugin.c
+++ b/scripts/gcc-plugins/stackleak_plugin.c
@@ -397,6 +397,9 @@ static unsigned int stackleak_cleanup_execute(void)
const char *fn = DECL_NAME_POINTER(current_function_decl);
bool removed = false;

+ if (verbose)
+ fprintf(stderr, "stackleak: I see noinstr function %s()\n", fn);
+
/*
* Leave stack tracking in functions that call alloca().
* Additional case:
@@ -464,12 +467,12 @@ static bool stackleak_gate(void)
if (STRING_EQUAL(section, ".meminit.text"))
return false;
if (STRING_EQUAL(section, ".noinstr.text"))
- return false;
+ return true;
if (STRING_EQUAL(section, ".entry.text"))
return false;
}

- return track_frame_size >= 0;
+ return false;
}

/* Build the function declaration for stackleak_track_stack() */
@@ -589,8 +592,6 @@ __visible int plugin_init(struct plugin_name_args *plugin_info,
build_for_x86 = true;
} else if (!strcmp(argv[i].key, "disable")) {
disable = true;
- } else if (!strcmp(argv[i].key, "verbose")) {
- verbose = true;
} else {
error(G_("unknown option '-fplugin-arg-%s-%s'"),
plugin_name, argv[i].key);
@@ -598,6 +599,8 @@ __visible int plugin_init(struct plugin_name_args *plugin_info,
}
}

+ verbose = true;
+
if (disable) {
if (verbose)
fprintf(stderr, "stackleak: disabled for this translation unit\n");


Building defconfig for x86_64 gives this:

stackleak: I see noinstr function __do_fast_syscall_32()
stackleak: instrument __do_fast_syscall_32(): calls_alloca
--
stackleak: I see noinstr function do_syscall_64()
stackleak: instrument do_syscall_64(): calls_alloca
--
stackleak: I see noinstr function do_int80_syscall_32()
stackleak: instrument do_int80_syscall_32(): calls_alloca
--
stackleak: I see noinstr function do_machine_check()
stackleak: instrument do_machine_check()
--
stackleak: I see noinstr function exc_general_protection()
stackleak: instrument exc_general_protection()
--
stackleak: I see noinstr function fixup_bad_iret()
stackleak: instrument fixup_bad_iret()


The cases with calls_alloca are caused by CONFIG_RANDOMIZE_KSTACK_OFFSET=y.
Kees knows about that peculiarity.

Other cases are noinstr functions with large stack frame:
do_machine_check(), exc_general_protection(), and fixup_bad_iret().

I think adding a build-time assertion is not possible, since it would break
building the kernel.

Any kernel function may leave uninitialized data on its stack frame. The
poison scanning must handle that correctly. The uninitialized groups of
poison values must be smaller than the search depth, otherwise
'stackleak_erase()' is unreliable.

I had understood this, but I had understood that for a caller->callee pair,
*something* would be pushed onto the stack. On arm64 that'd be a frame record
in the caller's stack frame, and on x86 that would be the return address
between the stack frames of the caller and callee. Since any unrecorded frame
is less than CONFIG_STACKLEAK_TRACK_MIN_SIZE, the non-poison bytes would fall
within CONFIG_STACKLEAK_TRACK_MIN_SIZE bytes.

Yes, exactly.

There is a potential edge case on arm64, since the frame record is permitted to
be placed anywhere within the stack frame, and in theory it could be placed
high on the caller and low on the callee. If we wanted to handle that, we'd
need to scan 2 times the tracking size. In practice compilers consistently
place the frame record at one end (usually the low end, as part of constructing
the frame).

Good to hear that.

So with this BUILD_BUG_ON I control that
CONFIG_STACKLEAK_TRACK_MIN_SIZE <= STACKLEAK_SEARCH_DEPTH.

To be sure and avoid mistakes in the edge cases, 'stackleak_erase()' is
searching one poison value more than STACKLEAK_SEARCH_DEPTH.

Just to check my understanding, did you have a specific edge-case in mind, or
was that "just in case"?

I didn't have a specific edge-case, but I wanted to avoid possible weird problems.

It would be really nice if we had an example.

If you don't like this one additional poison value in the search, I would
propose to change the assertion.

If we clearly document *why*, then changing the assertion is fine by me.
However, as above I don't think that this is necessary.

As an aside, why is it possible to configure CONFIG_STACKLEAK_TRACK_MIN_SIZE to
be bigger than STACKLEAK_SEARCH_DEPTH in the first place?

In security/Kconfig.hardening we have:

| config STACKLEAK_TRACK_MIN_SIZE
| int "Minimum stack frame size of functions tracked by STACKLEAK"
| default 100
| range 0 4096
| depends on GCC_PLUGIN_STACKLEAK
| help
| The STACKLEAK gcc plugin instruments the kernel code for tracking
| the lowest border of the kernel stack (and for some other purposes).
| It inserts the stackleak_track_stack() call for the functions with
| a stack frame size greater than or equal to this parameter.
| If unsure, leave the default value 100.

... where the vast majority of that range is going to lead to a BUILD_BUG().

Honestly, I don't like the idea of having the STACKLEAK_TRACK_MIN_SIZE option in the Kconfig.

I was forced by the maintainers to introduce it when I was working on the stackleak patchset.

How about dropping CONFIG_STACKLEAK_TRACK_MIN_SIZE from Kconfig?

That would also allow to drop this build-time assertion.

What do you think?

As this is included within the region to erase, this causes us to
redundantly overwrite an additional unsigned long with poison.

* As we always decrement 'erase_low' after checking an element on the
stack, we always include the element below this within the region to
erase.

As this is included within the region to erase, this causes us to
redundantly overwrite an additional unsigned long with poison.

Note that this is not a functional problem. As the loop condition
checks 'erase_low > task_stack_low', we'll never clobber the
STACK_END_MAGIC. As we always decrement 'erase_low' after this, we'll
never fail to erase the element immediately above the STACK_END_MAGIC.

Right, I don't see any bug in the current erasing code.

When I wrote the current code, I carefully checked all the corner cases. For
example, on the first stack erasing, the STACK_END_MAGIC was not
overwritten, but the memory next to it was erased. Same for the beginning of
the stack: I carefully checked that no unpoisoned bytes were left on the
thread stack.

In total, this can cause us to erase `128 + 2 * sizeof(unsigned long)`
bytes more than necessary, which is unfortunate.

This patch reworks the logic to find the address immediately above the
poisoned region, by finding the lowest non-poisoned address. This is
factored into a stackleak_find_top_of_poison() helper both for clarity
and so that this can be shared with the LKDTM test in subsequent
patches.

You know, I wrote stackleak_erase() in very plain C. I wanted a compiler to
generate assembly that is very close to the original asm version. I worried
that compilers might do weird stuff with the local variables and the stack
pointer.

So I checked stackleak for gcc versions 4.8, 5, 6, 7, 8, 9, and 10 on
x86_64, i386 and arm64. This is my project that helped with this work:
https://github.com/a13xp0p0v/kernel-build-containers

I've used the kernel.org cross toolchains, as published at:

https://mirrors.edge.kernel.org/pub/tools/crosstool/

Mark, in this patch series you use many local variables and helper functions.
Honestly, this worries me. For example, compilers can (and usually do)
ignore the presence of the 'inline' specifier for the purpose of
optimization.

I've deliberately used `__always_inline` rather than regular `inline` to
prevent code being placed out-of-line. As mentioned in oether replies it has a
stronger semantic.

Thanks,
Mark.


Thanks!

Signed-off-by: Mark Rutland <mark.rutland@xxxxxxx>
Cc: Alexander Popov <alex.popov@xxxxxxxxx>
Cc: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
Cc: Andy Lutomirski <luto@xxxxxxxxxx>
Cc: Kees Cook <keescook@xxxxxxxxxxxx>
---
include/linux/stackleak.h | 26 ++++++++++++++++++++++++++
kernel/stackleak.c | 18 ++++--------------
2 files changed, 30 insertions(+), 14 deletions(-)

diff --git a/include/linux/stackleak.h b/include/linux/stackleak.h
index 467661aeb4136..c36e7a3b45e7e 100644
--- a/include/linux/stackleak.h
+++ b/include/linux/stackleak.h
@@ -42,6 +42,32 @@ stackleak_task_high_bound(const struct task_struct *tsk)
return (unsigned long)task_pt_regs(tsk);
}
+/*
+ * Find the address immediately above the poisoned region of the stack, where
+ * that region falls between 'low' (inclusive) and 'high' (exclusive).
+ */
+static __always_inline unsigned long
+stackleak_find_top_of_poison(const unsigned long low, const unsigned long high)
+{
+ const unsigned int depth = STACKLEAK_SEARCH_DEPTH / sizeof(unsigned long);
+ unsigned int poison_count = 0;
+ unsigned long poison_high = high;
+ unsigned long sp = high;
+
+ while (sp > low && poison_count < depth) {
+ sp -= sizeof(unsigned long);
+
+ if (*(unsigned long *)sp == STACKLEAK_POISON) {
+ poison_count++;
+ } else {
+ poison_count = 0;
+ poison_high = sp;
+ }
+ }
+
+ return poison_high;
+}
+
static inline void stackleak_task_init(struct task_struct *t)
{
t->lowest_stack = stackleak_task_low_bound(t);
diff --git a/kernel/stackleak.c b/kernel/stackleak.c
index ba346d46218f5..afd54b8e10b83 100644
--- a/kernel/stackleak.c
+++ b/kernel/stackleak.c
@@ -74,20 +74,10 @@ static __always_inline void __stackleak_erase(void)
{
const unsigned long task_stack_low = stackleak_task_low_bound(current);
const unsigned long task_stack_high = stackleak_task_high_bound(current);
- unsigned long erase_low = current->lowest_stack;
- unsigned long erase_high;
- unsigned int poison_count = 0;
- const unsigned int depth = STACKLEAK_SEARCH_DEPTH / sizeof(unsigned long);
-
- /* Search for the poison value in the kernel stack */
- while (erase_low > task_stack_low && poison_count <= depth) {
- if (*(unsigned long *)erase_low == STACKLEAK_POISON)
- poison_count++;
- else
- poison_count = 0;
-
- erase_low -= sizeof(unsigned long);
- }
+ unsigned long erase_low, erase_high;
+
+ erase_low = stackleak_find_top_of_poison(task_stack_low,
+ current->lowest_stack);
#ifdef CONFIG_STACKLEAK_METRICS
current->prev_lowest_stack = erase_low;