Re: [PATCH 2/2] kdb: Call vkdb_printf() from vprintk_default() only when wanted

From: Daniel Thompson
Date: Tue Nov 22 2016 - 09:32:47 EST


On 22/11/16 12:45, Petr Mladek wrote:
On Mon 2016-11-07 10:24:22, Daniel Thompson wrote:
On 21/10/16 13:50, Petr Mladek wrote:
kdb_trap_printk allows to pass normal printk() messages to kdb via
vkdb_printk(). For example, it is used to get backtrace using
the classic show_stack(), see kdb_show_stack().

vkdb_printf() tries to avoid a potential infinite loop by disabling
the trap. But this approach is racy, for example:

CPU1 CPU2

vkdb_printf()
// assume that kdb_trap_printk == 0
saved_trap_printk = kdb_trap_printk;
kdb_trap_printk = 0;

kdb_show_stack()
kdb_trap_printk++;

When kdb is running any of the commands that use kdb_trap_printk
there is a single active CPU and the other CPUs should be in a
holding pen inside kgdb_cpu_enter().

The only time this is violated is when there is a timeout waiting
for the other CPUs to report to the holding pen.

It means that the race window is small but it is there. Do I get
it correctly, please?

I don't think the size of the race window is particularly interesting; it does exist.

It is more important to be clear that the circumstances when the race exists are when an attempt to stop-the-world has failed. So rather than close the race it might be more useful to make the race unreachable by bringing the CPU that has failed to report under control.


Thanks a lot for explanation. I was not sure how exactly this worked.
I only saw the games with kdb_printf_cpu in vkdb_printf(). Therefore
I expected that some parallelism was possible.

Yes. I'm coming round to the point-of-view that if the code looks like it expected to run in parallel if should be correct for it to run in parallel! Otherwise its leads to broken thinking when changing things.


diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index d5e397315473..db73e33811e7 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -1941,7 +1941,9 @@ int vprintk_default(const char *fmt, va_list args)
int r;

#ifdef CONFIG_KGDB_KDB
- if (unlikely(kdb_trap_printk)) {
+ /* Allow to pass printk() to kdb but avoid a recursion. */
+ if (unlikely(kdb_trap_printk &&
+ kdb_printf_cpu != smp_processor_id())) {

Firstly, why !=?

Having re-read things, this comment was spurious.


Secondly, if kdb_trap_printk is set and the "wrong" CPU calls printk
then we have an opportunity to trap a rouge processor in the holding
pen meaning the test should probably be part of vkdb_printk()
anyway.

I agree that it is confusing:

On one hand, vkdb_printf() explicitly allows recursion on the same
CPU. See the handling of kdb_printf_lock before the 1st patch from
this series. Also it mentioned by the comment:

/* Serialize kdb_printf if multiple cpus try to write at once.
* But if any cpu goes recursive in kdb, just print the output,
* even if it is interleaved with any other text.
*/
>
On the other hand. The lines

saved_trap_printk = kdb_trap_printk;
kdb_trap_printk = 0;

means that someone wanted to explicitly disable recursion via
the generic printk(). This is the reason why I used "!=" and why
I added this check into vprintk_default().

vkdb_printf()'s rough job is to:

1. Issue the string to the kdb console (not the same as kernel
console)
2. Store the string in the kernel logs.

Of the above, #2 is achieved by temporarily setting the console log level to 0, calling printk() and restoring the log level. This is why vkdb_printf() needs the silly dance to avoid the recursion.


By other words, we allow recursion caused by kdb internal messages
that are printed directly by kdb_printf(). But we disable recursion
caused by all other messages that are printed using the generic
printk(). This patch keeps the logic. It might make some sense.
But it is hard for me to judge.

Yes, I think it makes sense.

Right now I'm toying with the idea of using the printk_func code for printk() interception.

Firstly printk_func is per-cpu which removes a huge chain of complex reasoning (for a case that doesn't really exist).

Secondly we could call the saved_printk_func directly making recursion impossible.

Sadly, if we accept that a rouge CPU might still be calling printk() then the approach above just opens up a race on the console log level...

Sigh...


Daniel.
Best Regards,
Petr