Re: Crash with PREEMPT_RT on aarch64 machine

From: Sebastian Andrzej Siewior
Date: Fri Nov 04 2022 - 12:31:29 EST


On 2022-11-03 12:54:44 [+0100], Jan Kara wrote:
> Hello,
Hi,

> I was tracking down the following crash with 6.0 kernel with
> patch-6.0.5-rt14.patch applied:
>
> [ T6611] ------------[ cut here ]------------
> [ T6611] kernel BUG at fs/inode.c:625!

seems like an off-by-one ;)

> The machine is aarch64 architecture, kernel config is attached. I have seen
> the crashes also with 5.14-rt kernel so it is not a new thing. The crash is
> triggered relatively reliably (on two different aarch64 machines) by our
> performance testing framework when running dbench benchmark against an XFS
> filesystem.

Different aarch64 machines as in different SoCs, or the same CPU twice?
And no trouble on x86-64, I guess?

> Now originally I thought this is some problem with XFS or writeback code
> but after debugging this for some time I don't think that anymore.
> clear_inode() complains about inode->i_wb_list being non-empty. In fact
> looking at the list_head, I can see it is corrupted. In all the occurrences
> of the problem ->prev points back to the list_head itself but ->next points
> to some list_head that used to be part of the sb->s_inodes_wb list (or
> actually that list spliced in wait_sb_inodes() because I've seen a pointer to
> the stack as ->next pointer as well).

So you assume a delete and an add operation running in parallel?

> This is not just some memory ordering issue with the check in
> clear_inode(). If I add sb->s_inode_wblist_lock locking around the check in
> clear_inode(), the problem still reproduces.

What about dropping the list_empty() check in sb_mark_inode_writeback()
and sb_clear_inode_writeback() so that the check operation always
happens within the locked section? Either way, missing an add/delete
should result in consistent pointers.
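
To illustrate what I mean by keeping the check inside the locked section,
here is a minimal userspace sketch. It is not the fs/fs-writeback.c code:
struct list_node, struct wb_list, wb_mark() and wb_clear() are hypothetical
names, and a pthread mutex stands in for sb->s_inode_wblist_lock. The only
point is that the "is it on the list?" test is made with the lock held and
never outside of it.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Minimal stand-in for the kernel's struct list_head (assumption). */
struct list_node {
	struct list_node *next, *prev;
};

static void node_init(struct list_node *n)
{
	n->next = n;
	n->prev = n;
}

/* Same idea as list_empty() on an entry: it points back at itself. */
static bool node_unlinked(const struct list_node *n)
{
	return n->next == n;
}

static void node_add_tail(struct list_node *n, struct list_node *head)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

static void node_del_init(struct list_node *n)
{
	/* Basic consistency check on the neighbours before unlinking. */
	assert(n->prev->next == n && n->next->prev == n);
	n->prev->next = n->next;
	n->next->prev = n->prev;
	node_init(n);
}

/* Hypothetical model of the per-superblock writeback list and its lock. */
struct wb_list {
	pthread_mutex_t lock;
	struct list_node head;
};

static void wb_list_init(struct wb_list *wb)
{
	pthread_mutex_init(&wb->lock, NULL);
	node_init(&wb->head);
}

/* Add the node unless it is already queued; test only under the lock. */
static bool wb_mark(struct wb_list *wb, struct list_node *n)
{
	bool added = false;

	pthread_mutex_lock(&wb->lock);
	if (node_unlinked(n)) {
		node_add_tail(n, &wb->head);
		added = true;
	}
	pthread_mutex_unlock(&wb->lock);
	return added;
}

/* Remove the node if it is queued; test only under the lock. */
static bool wb_clear(struct wb_list *wb, struct list_node *n)
{
	bool removed = false;

	pthread_mutex_lock(&wb->lock);
	if (!node_unlinked(n)) {
		node_del_init(n);
		removed = true;
	}
	pthread_mutex_unlock(&wb->lock);
	return removed;
}
```

This costs a lock acquisition even when there is nothing to do, so it is
meant as a debugging step to rule out the unlocked check, not as a proposed
fix.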

> If I enable CONFIG_DEBUG_LIST or if I convert sb->s_inode_wblist_lock to
> raw_spinlock_t, the problem disappears.
>
> Finally, I'd note that the list is modified from three places which makes
> audit relatively simple. sb_mark_inode_writeback(),
> sb_clear_inode_writeback(), and wait_sb_inodes(). All these places hold
> sb->s_inode_wblist_lock when modifying the list. So at this point I'm at
> a loss what could be causing this. As unlikely as it seems to me I've started
> wondering whether it is not some subtle issue with RT spinlocks on aarch64
> possibly in combination with interrupts (because sb_clear_inode_writeback()
> may be called from an interrupt).

This should be modified from a threaded interrupt so interrupts and
preemption should be enabled at this point.
If preemption and/or interrupts are disabled at some point then
CONFIG_DEBUG_ATOMIC_SLEEP should complain about it.

spinlock_t and raw_spinlock_t differ slightly in terms of locking.
rt_spin_lock() has the fast path via try_cmpxchg_acquire(). If you
enable CONFIG_DEBUG_RT_MUTEXES then you would force the slow path which
always acquires the rt_mutex_base::wait_lock (which is a raw_spinlock_t)
while the actual lock is modified via cmpxchg.
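
As a rough illustration of the fast path versus slow path split, here is a
userspace model using C11 atomics. It is a sketch under assumptions, not the
kernel implementation: struct model_lock and the model_*() helpers are
made-up names, an atomic_flag spinlock stands in for
rt_mutex_base::wait_lock, and the real slow path (waiter enqueueing,
priority inheritance, blocking) is reduced to a single retry that reports
failure instead of sleeping.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of a lock with an owner word and an inner wait_lock. */
struct model_lock {
	_Atomic uintptr_t owner;	/* 0 = unlocked, otherwise the owner's id */
	atomic_flag wait_lock;		/* stand-in for the raw wait_lock */
};

static void model_lock_init(struct model_lock *l)
{
	atomic_init(&l->owner, 0);
	atomic_flag_clear(&l->wait_lock);
}

/* Fast path: a single cmpxchg with acquire ordering, nothing else taken. */
static bool model_try_fast(struct model_lock *l, uintptr_t me)
{
	uintptr_t expected = 0;

	return atomic_compare_exchange_strong_explicit(&l->owner, &expected, me,
						       memory_order_acquire,
						       memory_order_relaxed);
}

/*
 * Slow path model: always take the inner wait_lock first, then modify the
 * owner word via cmpxchg. The real code would enqueue a waiter and block
 * on contention; this model only reports failure.
 */
static bool model_try_slow(struct model_lock *l, uintptr_t me)
{
	bool ok;

	while (atomic_flag_test_and_set_explicit(&l->wait_lock,
						 memory_order_acquire))
		;	/* spin until the inner lock is free */
	ok = model_try_fast(l, me);
	atomic_flag_clear_explicit(&l->wait_lock, memory_order_release);
	return ok;
}

/* Unlock: only the current owner may release, with release ordering. */
static bool model_unlock(struct model_lock *l, uintptr_t me)
{
	uintptr_t expected = me;

	return atomic_compare_exchange_strong_explicit(&l->owner, &expected, 0,
						       memory_order_release,
						       memory_order_relaxed);
}
```

In this model the two paths differ only in whether the inner lock is taken
around the cmpxchg, which is why forcing the slow path is a useful
experiment: if the problem goes away, the ordering provided by the fast
path alone becomes the suspect.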

> Any ideas?
>
> Honza

Sebastian