Re: frequent lockups in 3.18rc4

From: Linus Torvalds
Date: Thu Dec 18 2014 - 22:49:46 EST


On Thu, Dec 18, 2014 at 6:45 PM, Dave Jones <davej@xxxxxxxxxx> wrote:
>
> Example of the spew-o-rama below.

Hmm. Not only does it apparently stay up for you now, the traces seem
to be improving in quality.

There's a decided pattern of "copy_page_range()" and "zap_page_range()" here.

Now, what's *also* intriguing is how many "_raw_spin_lock_nested"
things there are in there. Which makes me suspect that you are
actually hitting some really nasty spinlock contention, and that your
22s lockups could be due to lock hold times going exponential.

So I don't think that it's the copy_page_range() itself that is
necessarily all that horribly expensive (although it's certainly not a
lightweight function), but the fact that you get contention on some
lock inside that loop, and when you have every single CPU hammering on
it things just go to hell in a handbasket.

And when spinlocks start getting contention, *nested* spinlocks
really, really hurt. And you've got all the spinlock debugging on, etc.,
don't you? Which just makes the locks really expensive, and much
easier to become contended (and there's *another* level of
nesting, because I think the lockdep stuff has its own locking
inside). So you have three levels of spinlock nesting, and the outer
one will be completely hammered.

So I think the locks you have are from copy_pte_range:

	dst_pte = pte_alloc_map_lock(dst_mm, dst_pmd, addr, &dst_ptl);
	if (!dst_pte)
		return -ENOMEM;
	src_pte = pte_offset_map(src_pmd, addr);
	src_ptl = pte_lockptr(src_mm, src_pmd);
	spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);

and we do have some mitigation in place for horrible horrible
contention (we try to release the locks every few entries), but with
all CPUs hammering on these locks, and things being slow due to all
the debugging, I think we may finally be hitting the right place here.

Also, you do have this:

sched: RT throttling activated

so there's something going on with RT scheduling too. I'd consider all
the softlockups after that point suspect - the softlockup thread has
presumably used so much CPU spewing out the debug messages that things
aren't really working any more RT-wise.

Lookie here (the "soft lockup" grep is to skip all the cross-CPU
traces from other CPUs that weren't necessarily the problem case):

[torvalds@i7 linux]$ grep -5 "soft lockup" ~/0.txt | grep RIP
RIP: 0010: lock_acquire+0xb4/0x120
RIP: 0010: lock_acquire+0xb4/0x120
RIP: 0010: generic_exec_single+0xee/0x1b0
RIP: 0010: lock_acquire+0xb4/0x120
RIP: 0010: lock_acquire+0xb4/0x120
RIP: 0010: lock_acquire+0xb4/0x120
RIP: 0010: lock_acquire+0xb4/0x120
RIP: 0010: lock_acquire+0xb4/0x120
RIP: 0010: lock_acquire+0xb4/0x120
RIP: 0010: shmem_write_end+0x65/0xf0
RIP: 0010: _raw_spin_unlock_irqrestore+0x38/0x60
RIP: 0010: copy_user_enhanced_fast_string+0x5/0x10
RIP: 0010: copy_user_enhanced_fast_string+0x5/0x10
RIP: 0010: __slab_alloc+0x52f/0x58f
RIP: 0010: map_id_up+0x9/0x80
RIP: 0010: cpuidle_enter_state+0x79/0x190
RIP: 0010: unmap_single_vma+0x7d9/0x900
RIP: 0010: cpuidle_enter_state+0x79/0x190

and notice the pattern? All those early RIP cases are the page table
locks for copy_page_range, and that one TLB flush for zap_page_range.

So your printouts are finally starting to make sense. But I'm also
starting to suspect strongly that the problem is that with all your
lock debugging and other overheads (does this still have
DEBUG_PAGEALLOC?) you really are getting into a "real" softlockup
because things are scaling so horribly badly.

If you now disable spinlock debugging and lockdep, hopefully that page
table lock now doesn't always get hung up on the lockdep locking, so
it starts scaling much better, and maybe you'd not see this...
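Concretely, that would mean rebuilding with something like the following .config fragment (the option names below are my reading of the debug features discussed above, in the usual "is not set" .config form):

```
# Lock debugging: fattens every spinlock and adds lockdep's own locking
# CONFIG_DEBUG_SPINLOCK is not set
# CONFIG_PROVE_LOCKING is not set
# CONFIG_DEBUG_LOCK_ALLOC is not set
# CONFIG_LOCK_STAT is not set
# Page allocation debugging mentioned above
# CONFIG_DEBUG_PAGEALLOC is not set
```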

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/