Re: [PATCH] x86/dumpstack: Walk frames when built with frame pointers

From: Richard Yao
Date: Sun Apr 27 2014 - 15:52:28 EST


I happened to write this with an email client that used Rich Text, which LKML rejected, so I am resending it in plain text. My future replies will be in plain text, which is what my normal development system’s email client sends.

On Apr 27, 2014, at 8:08 AM, Ingo Molnar <mingo@xxxxxxxxxx> wrote:

>
> * Richard Yao <ryao@xxxxxxxxxx> wrote:
>
>> Stack traces are generated by scanning the stack and interpreting
>> anything that looks like it could be a pointer to something. We do
>> not need to do this when we have frame pointers, but we do it
>> anyway, with the distinction that we use the return pointers to mark
>> actual frames by the absence of a question mark.
>>
>> The additional verbosity of stack scanning tends to bombard us with
>> walls of text for no gain in practice, so let's switch to printing
>> only stack frames when frame pointers are available. That way we can
>> spend less time reading stack traces and more time looking at code.
>>
>> Signed-off-by: Richard Yao <ryao@xxxxxxxxxx>
>> ---
>> arch/x86/kernel/dumpstack.c | 4 ++++
>> 1 file changed, 4 insertions(+)
>>
>> diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
>> index d9c12d3..94ffe06 100644
>> --- a/arch/x86/kernel/dumpstack.c
>> +++ b/arch/x86/kernel/dumpstack.c
>> @@ -162,7 +162,11 @@ static void print_trace_address(void *data, unsigned long addr, int reliable)
>> static const struct stacktrace_ops print_trace_ops = {
>> .stack = print_trace_stack,
>> .address = print_trace_address,
>> +#ifdef CONFIG_FRAME_POINTER
>> + .walk_stack = print_context_stack_bp,
>> +#else
>> .walk_stack = print_context_stack,
>> +#endif
>> };
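
(For anyone following along who is not familiar with the two callbacks: print_context_stack scans every word on the stack and prints anything that resolves to kernel text, marking entries that are not part of the frame chain with a '?', while print_context_stack_bp only follows the saved frame pointer chain. A simplified sketch of the idea, written from memory rather than copied from arch/x86/kernel/dumpstack.c, which also has to handle irq/exception stack switching and the ops indirection:

    #include <linux/kernel.h>	/* __kernel_text_address() */

    /* Sketch only, not the actual kernel implementation. */
    struct stack_frame {
            struct stack_frame *next_frame;	/* saved %ebp/%rbp of the caller */
            unsigned long return_address;
    };

    /*
     * Scan every word on the stack: anything that looks like kernel text
     * is printed, and only words that coincide with the frame-pointer
     * chain are marked reliable (everything else gets the '?').
     */
    static void scan_stack(unsigned long *stack, unsigned long *end,
                           unsigned long bp,
                           void (*print)(unsigned long addr, int reliable))
    {
            struct stack_frame *frame = (struct stack_frame *)bp;

            while (stack < end) {
                    unsigned long addr = *stack;

                    if (__kernel_text_address(addr)) {
                            int reliable = (stack ==
                                    (unsigned long *)&frame->return_address);

                            print(addr, reliable);
                            if (reliable)
                                    frame = frame->next_frame;
                    }
                    stack++;
            }
    }

    /*
     * Follow only the frame-pointer chain: every entry is a real return
     * address, so nothing needs a '?'.
     */
    static void walk_frames(unsigned long bp,
                            void (*print)(unsigned long addr, int reliable))
    {
            struct stack_frame *frame = (struct stack_frame *)bp;

            while (frame && __kernel_text_address(frame->return_address)) {
                    print(frame->return_address, 1);
                    frame = frame->next_frame;
            }
    }

The patch simply selects the second walker whenever CONFIG_FRAME_POINTER guarantees that the chain exists.)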
>
> I don't really like this patch, and this question comes up often, so
> let me try to explain it more verbosely.
>
> Let's take a real-life example of a BP-only trace about a lockup:
>
> Call Trace:
> [<79018d24>] arch_trigger_all_cpu_backtrace+0x3c/0x64
> [<7916a34c>] do_raw_spin_lock+0xb7/0xe8
> [<792b9412>] _raw_spin_lock_irqsave+0x35/0x3e
> [<790391e8>] prepare_to_wait+0x18/0x57
> [<792b7158>] __wait_on_bit+0x20/0x5f
> [<792b71eb>] out_of_line_wait_on_bit+0x54/0x5c
> [<790e4eff>] __wait_on_buffer+0x1b/0x1e
> [<7910f615>] __ext3_get_inode_loc+0x1ef/0x254
> [<7911015f>] ext3_iget+0x45/0x346
> [<79114e03>] ext3_lookup+0x97/0xa6
> [<790c9c0e>] lookup_real+0x20/0x35
> [<790ca1d2>] __lookup_hash+0x2a/0x31
> [<790caa0e>] lookup_slow+0x36/0x8c
> [<790cbdda>] path_lookupat+0xf9/0x5c5
> [<790cc2c5>] filename_lookup+0x1f/0x84
> [<790cde98>] user_path_at_empty+0x3f/0x65
> [<790cdecb>] user_path_at+0xd/0xf
> [<790c6585>] vfs_fstatat+0x40/0x88
> [<790c65f8>] vfs_stat+0x13/0x15
> [<790c67e2>] sys_stat64+0x11/0x22
> [<792b9d04>] syscall_call+0x7/0xb
>
> What does this call trace tell us? That we locked up somewhere in the
> ext3 code. That's all that we know.

Actually, we know some other things:

1. Something should have set the bit and started IO.
2. The callback meant to clear that bit has not been executed yet.
3. We want to find out who is responsible for that callback, i.e. we want stack traces of all the other threads.

It is possible that the thread handling the callback deadlocked. In that case, the backtraces of the other threads should suggest the possibility. Reviewing those backtraces is *much* easier without the added noise of the historical stack frames, and in all likelihood the person doing the review would catch the problem faster when the traces do reveal the issue. There is also no chance of the issue being present in those traces yet being missed because of the noise from these frames.
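
To make points (1) and (2) concrete, the pattern behind __wait_on_buffer() is roughly the following (a sketch along the lines of __bread_slow(), written from memory for illustration rather than quoted from fs/buffer.c):

    #include <linux/buffer_head.h>
    #include <linux/errno.h>

    /* Hypothetical helper, for illustration only. */
    static int read_block_and_wait(struct buffer_head *bh)
    {
            lock_buffer(bh);                        /* (1) set BH_Lock and claim the buffer */
            get_bh(bh);
            bh->b_end_io = end_buffer_read_sync;    /* (2) completion that should unlock it */
            submit_bh(READ, bh);                    /* start the IO */
            wait_on_buffer(bh);                     /* the __wait_on_buffer frame in the trace */
            return buffer_uptodate(bh) ? 0 : -EIO;
    }

The thread in the trace is stuck at the wait_on_buffer() step, so whatever was supposed to run the completion and clear BH_Lock has not done so yet, which is exactly why the other threads' backtraces are the interesting ones.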

>
> Now let's take a look at a 'verbose' entry of the same lockup, one that
> also outputs other entries that 'look like' kernel text addresses:
>
> Call Trace:
> [<79018d24>] arch_trigger_all_cpu_backtrace+0x3c/0x64
> [<7916a34c>] do_raw_spin_lock+0xb7/0xe8
> [<792b9412>] _raw_spin_lock_irqsave+0x35/0x3e
> [<790391e8>] ? prepare_to_wait+0x18/0x57
> [<790391e8>] prepare_to_wait+0x18/0x57
> [<7914a320>] ? generic_make_request+0x80/0xb5
> [<790e4f30>] ? unmap_underlying_metadata+0x2e/0x2e
> [<792b7158>] __wait_on_bit+0x20/0x5f
> [<7914a427>] ? submit_bio+0xd2/0xfb
> [<792b71eb>] out_of_line_wait_on_bit+0x54/0x5c
> [<790e4f30>] ? unmap_underlying_metadata+0x2e/0x2e
> [<79039086>] ? autoremove_wake_function+0x31/0x31
> [<790e4eff>] __wait_on_buffer+0x1b/0x1e
> [<7910f615>] __ext3_get_inode_loc+0x1ef/0x254
> [<7911015f>] ext3_iget+0x45/0x346
> [<79114e03>] ext3_lookup+0x97/0xa6
> [<790c9c0e>] lookup_real+0x20/0x35
> [<790ca1d2>] __lookup_hash+0x2a/0x31
> [<790caa0e>] lookup_slow+0x36/0x8c
> [<790cbdda>] path_lookupat+0xf9/0x5c5
> [<790cc2c5>] filename_lookup+0x1f/0x84
> [<790cde98>] user_path_at_empty+0x3f/0x65
> [<790cdecb>] user_path_at+0xd/0xf
> [<790c6585>] vfs_fstatat+0x40/0x88
> [<7903f844>] ? lg_local_unlock+0x31/0x47
> [<790c65f8>] vfs_stat+0x13/0x15
> [<790c67e2>] sys_stat64+0x11/0x22
> [<790c3c47>] ? __fput+0x187/0x1a0
> [<792b9d37>] ? restore_all+0xf/0xf
> [<79165bb4>] ? trace_hardirqs_on_thunk+0xc/0x10
> [<792b9d04>] syscall_call+0x7/0xb
>
> Firstly, it's still easy to tell the true backtrace: ignore all the
> '?' entries.
>
> But the '?' entries also tell us something more: that recently this
> CPU submitted IO. That might or might not be suspected from the
> original trace, but from the 'extended' trace it's really obvious that
> IO activity happened before the lockup as well.
>
> There were many other examples in the past where '?' entries gave us
> clues about the nature of a bug: they represent a poor man's trace of
> what happened recently on that CPU.
>
> So it's useful information for hairy bugs and it would be sad to
> remove them.

It is true that this thread did block IO at some point in the past. It is also possible that the block IO operation is related to this stack trace, but the many ways those frames could have gotten there make it impossible to conclude that the operation is related. A few ways that occur to me are:

1. This could have gone onto the stack as a result of directory reclaim.
2. A previous syscall by user land could have triggered block IO on something unrelated.
3. If this had been on a kernel thread with no user land stack, like kswapd, then this information could be expected to be there anyway.

Upon seeing the stack trace you provided, I would want to look at all threads on the system under the assumption that some thread started IO and some other thread (or maybe even an interrupt handler) should be responsible for finishing it. I suspect that is a more effective way to get a handle on the problem. If that does not work, then I would likely want to attach kgdb or, at the very least, start looking at code paths that could have started the IO. In such a case, being told that the current thread might have started IO is not really that useful to me. I still need to consider all possible points where this IO could have started.

In this context, having the historical stack frames in the backtrace seems to be a hindrance rather than a help. They serve to slow down manual inspection of the actual stacks and can obscure useful information that would have been caught had they never been printed. They can also waste valuable screen space and cause the loss of useful information by overflowing scrollback buffers.

Reviewing the historical stack frames under the assumption that they might be useful encourages one to think that the absence of such frames would be informative, given that their presence might be useful, but that could not be further from the truth. Whether they appear depends on the kernel .config, the toolchain used, the amount of memory pressure (direct reclaim), the call paths taken, et cetera. In short, it is mostly a matter of luck, and I find their usefulness when they do appear in backtraces to be very questionable.

In addition, becoming accustomed to non-useful information in backtraces makes it more difficult to catch issues with the debug information itself. This is often the case on binary distributions, which often strip kernel modules by passing --strip-unneeded to strip. That creates modules that `file` reports as not stripped, but which are in actuality stripped of all static symbols. In such cases, the stack frames are often garbage, but someone trying to make sense of historical stack frames might not notice, because they are used to seeing strange things. Trying to interpret these frames is then definitely a waste of time, but there is nothing (e.g. a taint message?) to inform us of this.

While I believe that the historical stack frames have been useful in the past for some difficult bugs (when the stars aligned properly) and can still be useful in such instances, I consider the consequences of printing them to outweigh the benefits. In the common case where these frames are not useful, their presence wastes time that could have gone to things that would actually be useful. In the rare cases where they are useful, people relying on them spend their time on one-off solutions. Spending time on many of these one-off solutions is a less efficient use of time than spending it on tools that would catch such issues out of necessity.

In both cases, there seems to be plenty of time that would be better spent by not printing these frames. Some possibilities that occur to me are better static analysis tools, more widespread adoption of crash dumps, and more time spent on dynamic tracing tools. Any one of them should ultimately be beneficial to the overall bug churn rate and, consequently, get more eyeballs on difficult-to-debug problems, such as the rare ones where these historical stack frames are useful.

> Having said that, your complaint that '?' entries can make reading of
> back traces more difficult is valid as well - so maybe we can do
> something about that.
>
> For example if we used more horizontal separation in the output
> format:
>
> Call Trace:
> [<79018d24>] arch_trigger_all_cpu_backtrace+0x3c/0x64
> [<7916a34c>] do_raw_spin_lock+0xb7/0xe8
> [<792b9412>] _raw_spin_lock_irqsave+0x35/0x3e
> [<790391e8>] # ? prepare_to_wait+0x18/0x57
> [<790391e8>] prepare_to_wait+0x18/0x57
> [<7914a320>] # ? generic_make_request+0x80/0xb5
> [<790e4f30>] # ? unmap_underlying_metadata+0x2e/0x2e
> [<792b7158>] __wait_on_bit+0x20/0x5f
> [<7914a427>] # ? submit_bio+0xd2/0xfb
> [<792b71eb>] out_of_line_wait_on_bit+0x54/0x5c
> [<790e4f30>] # ? unmap_underlying_metadata+0x2e/0x2e
> [<79039086>] # ? autoremove_wake_function+0x31/0x31
> [<790e4eff>] __wait_on_buffer+0x1b/0x1e
> [<7910f615>] __ext3_get_inode_loc+0x1ef/0x254
> [<7911015f>] ext3_iget+0x45/0x346
> [<79114e03>] ext3_lookup+0x97/0xa6
> [<790c9c0e>] lookup_real+0x20/0x35
> [<790ca1d2>] __lookup_hash+0x2a/0x31
> [<790caa0e>] lookup_slow+0x36/0x8c
> [<790cbdda>] path_lookupat+0xf9/0x5c5
> [<790cc2c5>] filename_lookup+0x1f/0x84
> [<790cde98>] user_path_at_empty+0x3f/0x65
> [<790cdecb>] user_path_at+0xd/0xf
> [<790c6585>] vfs_fstatat+0x40/0x88
> [<7903f844>] # ? lg_local_unlock+0x31/0x47
> [<790c65f8>] vfs_stat+0x13/0x15
> [<790c67e2>] sys_stat64+0x11/0x22
> [<790c3c47>] # ? __fput+0x187/0x1a0
> [<792b9d37>] # ? restore_all+0xf/0xf
> [<79165bb4>] # ? trace_hardirqs_on_thunk+0xc/0x10
> [<792b9d04>] syscall_call+0x7/0xb
>
> then the information would still be there, it would still be in the
> right place, but it would also be much easier to ignore the right side
> column visually. (The '# ?' prefix would also make it clear that
> this information is not so important as the left hand side column.)
>
> Or something like that. Assuming this doesn't confuse tools and such.
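
For what it is worth, that '# ?' column could presumably be produced entirely inside the .address callback from the hunk above, along these lines (an untested sketch; the exact printk format and helpers differ between kernel versions):

    static void print_trace_address(void *data, unsigned long addr, int reliable)
    {
            touch_nmi_watchdog();
            /* data is the log level prefix; unreliable entries get the "# ? " marker */
            printk("%s [<%p>] %s%pB\n", (char *)data, (void *)addr,
                   reliable ? "" : "# ? ", (void *)addr);
    }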

If we do something like this, it would be best to make it configurable via a sysctl. We could have settings for something like the behavior you proposed, the original behavior we have now and the behavior I would like to see.
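
For concreteness, the kind of knob I have in mind would look something like this (the name and values are invented for illustration; this is not an existing interface):

    #include <linux/sysctl.h>

    static int zero;
    static int two = 2;
    /* 0 = frame-pointer walk only, 1 = current '?' behaviour, 2 = '# ?' column */
    static int kstack_trace_verbosity = 1;

    static struct ctl_table kstack_trace_table[] = {
            {
                    .procname       = "kstack_trace_verbosity",
                    .data           = &kstack_trace_verbosity,
                    .maxlen         = sizeof(int),
                    .mode           = 0644,
                    .proc_handler   = proc_dointvec_minmax,
                    .extra1         = &zero,
                    .extra2         = &two,
            },
            { }
    };

    /* registered under /proc/sys/kernel/ via register_sysctl("kernel", kstack_trace_table) */

The stack walker would then pick print_context_stack, print_context_stack_bp, or a '# ?'-printing variant based on that value, instead of (or in addition to) the CONFIG_FRAME_POINTER check in my patch.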

That being said, I would like to know if you still feel this information is useful after hearing my thoughts on it.