[BUG REPORT] Soft Lockup in smp_call_function_single+0xD8

From: Jeff Merkey
Date: Sun Jan 24 2016 - 19:49:49 EST

Next message: Krzysztof Kozlowski: "[RESEND v2 0/2] arm64: EXYNOS: Consolidate Exynos7 symbol"
Previous message: Prarit Bhargava: "[PATCH 3/3] powercap, intel_rapl, Add ignore_max_time_window_check module parameter for broken BIOSes [v3]"
Next in thread: Jeff Merkey: "Re: [BUG REPORT] Soft Lockup in smp_call_function_single+0xD8"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

If I single step with either kgdb, kgdb, or mdb kernel debuggers over
a sysret instruction anywhere in the OS, the system hard hangs in
smp_call_function_single after the debugger releases the system and it
resumes normal operation. The specific place the kernel hangs is in
the loop below. The softlockup detector will periodically detect
this condition when it occurs, but not always, most of the time the
system is just hung and unresponsive.

(2)> u smp_call_function_single+d8
<<<< hard hang in this loop with EDX=3
0xffffffff810fce48 8B55E0 mov edx,DWORD PTR [rbp-32]=0xCE037DC0
0xffffffff810fce4b 83E201 and edx,0x1
0xffffffff810fce4e 75F6 jne
smp_call_function_single+0xd6 (0xffffffff810fce46) (up)
<<<<<
0xffffffff810fce50 EBC3 jmp
smp_call_function_single+0xa5 (0xffffffff810fce15) (up)
0xffffffff810fce52 8B05E08EC700 mov eax,[oops_in_progress]=0x0
0xffffffff810fce58 85C0 test eax,eax
0xffffffff810fce5a 7585 jne
smp_call_function_single+0x71 (0xffffffff810fcde1) (up)
0xffffffff810fce5c 803D8E0C9D0000 cmp [__warned.20610]=0x00,0x0
0xffffffff810fce63 0F8578FFFFFF jne
smp_call_function_single+0x71 (0xffffffff810fcde1) (up)
0xffffffff810fce69 BE24010000 mov esi,0x124
0xffffffff810fce6e 48C7C796B08C81 mov rdi,0xffffffff818cb096
0xffffffff810fce75 894DBC mov DWORD PTR [rbp-68]=0x0,ecx
0xffffffff810fce78 488955C0 mov QWORD PTR
[rbp-64]=0xFFFFFFFFFFFFFF10,rdx
0xffffffff810fce7c E8FF21F8FF call warn_slowpath_null
0xffffffff810fce81 C605690C9D0001 mov [__warned.20610]=0x00,0x1
0xffffffff810fce88 8B4DBC mov ecx,DWORD PTR [rbp-68]=0x0
0xffffffff810fce8b 488B55C0 mov rdx,QWORD PTR
[rbp-64]=0xFFFFFFFFFFFFFF10
0xffffffff810fce8f E94DFFFFFF jmp
smp_call_function_single+0x71 (0xffffffff810fcde1) (up)
0xffffffff810fce94 E8A71EF8FF call __stack_chk_fail
0xffffffff810fce99 0F1F8000000000 nop DWORD PTR [rax]=0x0
(2)> g

The stack backtrace when the bug occurs is:

smp_call_function_single+0xd8
unmap_page_range+0x613
flush_tlb_func+0x0
smp_call_function_many+215
native_flush_tlb_others+0x118
flush_tlb_mm_range+0x61
tlb_flush_mmu_tlbonly+0x6b
tlb_finish_mmu+0x14
unmap_region+0xe2
vma_rb_erase+0x10f
do_unmap+0x217
vm_unmap+0x41
SyS_munmap+0x22
entry_SYSCALL_64_fastpath+0x12

I traced through this code a bunch of times in just normal operations
without triggering the bug to get a feel for what it normally sees in
EDX and it looks like someone has coded a looping function that always
has EDX=0 in every case I saw in the except for when this bug occurs.

So the exact C code this maps fro objdump of kernel/smp.o is:

469: e8 62 fe ff ff callq 2d0 <generic_exec_single>
46e: 8b 55 e0 mov -0x20(%rbp),%edx
* previous function call. For multi-cpu calls its even more interesting
* as we'll have to ensure no other cpu is observing our csd.
*/
static void csd_lock_wait(struct call_single_data *csd)
{
while (smp_load_acquire(&csd->flags) & CSD_FLAG_LOCK)
471: 83 e2 01 and $0x1,%edx
474: 74 cf je 445 <smp_call_function_single+0xa5>
476: f3 90 pause
<<<<<<<<<<
478: 8b 55 e0 mov -0x20(%rbp),%edx
47b: 83 e2 01 and $0x1,%edx
47e: 75 f6 jne 476 <smp_call_function_single+0xd6>
<<<<<<<<<<<
480: eb c3 jmp 445 <smp_call_function_single+0xa5>
* Can deadlock when called with interrupts disabled.
* We allow cpu's that are not yet online though, as no one else can
* send smp call function interrupt to this cpu and as such deadlocks
* can't happen.
*/
WARN_ON_ONCE(cpu_online(this_cpu) && irqs_disabled()

Each time this bug occurs csd->flags is always set to a value of 3 and
never changes. When the system is just running normally, it seems to
be 0 the rest of the time. Setting EDX=0 from the debugger console
clears the hang condition and the system seems to recover except the
system reports this error from the console when you attempt to load
programs, indicating the ability of the system to load shared objects
is fritzed.

#
# ls -l
/lib64/libc.so.6 version GLI not found << this error and no shared
objects will load
#
#

Jeff

Next message: Krzysztof Kozlowski: "[RESEND v2 0/2] arm64: EXYNOS: Consolidate Exynos7 symbol"
Previous message: Prarit Bhargava: "[PATCH 3/3] powercap, intel_rapl, Add ignore_max_time_window_check module parameter for broken BIOSes [v3]"
Next in thread: Jeff Merkey: "Re: [BUG REPORT] Soft Lockup in smp_call_function_single+0xD8"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]