Re: [PATCH v3 1/2] mbcache: decoupling the locking of local fromglobal data

From: Thavatchai Makphaibulchoke
Date: Wed Oct 30 2013 - 19:37:40 EST


On 10/30/2013 08:42 AM, Theodore Ts'o wrote:
> I tried running xfstests with this patch, and it blew up on
> generic/020 test:
>
> generic/020 [10:21:50][ 105.170352] ------------[ cut here ]------------
> [ 105.171683] kernel BUG at /usr/projects/linux/ext4/include/linux/bit_spinlock.h:76!
> [ 105.173346] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
> [ 105.173346] Modules linked in:
> [ 105.173346] CPU: 1 PID: 8519 Comm: attr Not tainted 3.12.0-rc5-00008-gffbe1d7-dirty #1492
> [ 105.173346] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
> [ 105.173346] task: f5abe560 ti: f2274000 task.ti: f2274000
> [ 105.173346] EIP: 0060:[<c026b464>] EFLAGS: 00010246 CPU: 1
> [ 105.173346] EIP is at hlist_bl_unlock+0x7/0x1c
> [ 105.173346] EAX: f488d360 EBX: f488d360 ECX: 00000000 EDX: f2998800
> [ 105.173346] ESI: f29987f0 EDI: 6954c848 EBP: f2275cc8 ESP: f2275cb8
> [ 105.173346] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
> [ 105.173346] CR0: 80050033 CR2: b76bcf54 CR3: 34844000 CR4: 000006f0
> [ 105.173346] Stack:
> [ 105.173346] c026bc78 f2275d48 6954c848 f29987f0 f2275d24 c02cd7a9 f2275ce4 c02e2881
> [ 105.173346] f255d8c8 00000000 f1109020 f4a67f00 f2275d54 f2275d08 c02cd020 6954c848
> [ 105.173346] f4a67f00 f1109000 f2b0eba8 f2ee3800 f2275d28 f4f811e8 f2275d38 00000000
> [ 105.173346] Call Trace:
> [ 105.173346] [<c026bc78>] ? mb_cache_entry_find_first+0x4b/0x55
> [ 105.173346] [<c02cd7a9>] ext4_xattr_block_set+0x248/0x6e7
> [ 105.173346] [<c02e2881>] ? jbd2_journal_put_journal_head+0xe2/0xed
> [ 105.173346] [<c02cd020>] ? ext4_xattr_find_entry+0x52/0xac
> [ 105.173346] [<c02ce307>] ext4_xattr_set_handle+0x1c7/0x30f
> [ 105.173346] [<c02ce4f4>] ext4_xattr_set+0xa5/0xe1
> [ 105.173346] [<c02ceb36>] ext4_xattr_user_set+0x46/0x5f
> [ 105.173346] [<c024a4da>] generic_setxattr+0x4c/0x5e
> [ 105.173346] [<c024a48e>] ? generic_listxattr+0x95/0x95
> [ 105.173346] [<c024ab0f>] __vfs_setxattr_noperm+0x56/0xb6
> [ 105.173346] [<c024abd2>] vfs_setxattr+0x63/0x7e
> [ 105.173346] [<c024ace8>] setxattr+0xfb/0x139
> [ 105.173346] [<c01b200a>] ? __lock_acquire+0x540/0xca6
> [ 105.173346] [<c01877a3>] ? lg_local_unlock+0x1b/0x34
> [ 105.173346] [<c01af8dd>] ? trace_hardirqs_off_caller+0x2e/0x98
> [ 105.173346] [<c0227e69>] ? kmem_cache_free+0xd4/0x149
> [ 105.173346] [<c01b2c2b>] ? lock_acquire+0xdd/0x107
> [ 105.173346] [<c023225e>] ? __sb_start_write+0xee/0x11d
> [ 105.173346] [<c0247383>] ? mnt_want_write+0x1e/0x3e
> [ 105.173346] [<c01b3019>] ? trace_hardirqs_on_caller+0x12a/0x17e
> [ 105.173346] [<c0247353>] ? __mnt_want_write+0x4e/0x60
> [ 105.173346] [<c024af3b>] SyS_lsetxattr+0x6a/0x9f
> [ 105.173346] [<c078d0e8>] syscall_call+0x7/0xb
> [ 105.173346] Code: 00 00 00 00 5b 5d c3 55 89 e5 53 3e 8d 74 26 00 8b 58 08 89 c2 8b 43 18 e8 3f c9 fb ff f0 ff 4b 0c 5b 5d c3 8b 10 80 e2 01 75 02 <0f> 0b 55 89 e5 0f ba 30 00 89 e0 25 00 e0 ff ff ff 48 14 5d c3
> [ 105.173346] EIP: [<c026b464>] hlist_bl_unlock+0x7/0x1c SS:ESP 0068:f2275cb8
> [ 105.273781] ---[ end trace 1ee45ddfc1df0935 ]---
>
> When I tried to find a potential problem, I immediately ran into this.
> I'm not entirely sure it's the problem, but it's raised a number of
> red flags for me in terms of (a) how much testing you've employed with
> this patch set, and (b) how maintaining and easy-to-audit the code
> will be with this extra locking. The comments are good start, but
> some additional comments about exactly what assumptions a function
> assumes about locks that are held on function entry, or especially if
> the locking is different on function entry and function exit, might
> make it easier for people to audit this patch.
>
> Or maybe this commit needs to be split up with first a conversion from
> using list_head to hlist_hl_node, and the changing the locking? The
> bottom line is that we need to somehow make this patch easier to
> validate/review.
>

Thanks for the comemnts. Yes, I did run through xfstests. My guess is that you probably ran into a race condition that I did not.

I will try to port the patch to a more recent kernel, including the mb_cache_shrink_scan() sent earlier (BTW, it looks good) and debug the problem.

Yes, those are good suggestions. Once I find the problem, I will resubmit with more comments and also split it into two patches, as suggested.

>> @@ -520,18 +647,23 @@ __mb_cache_entry_find(struct list_head *l, struct list_head *head,
>> ce->e_queued++;
>> prepare_to_wait(&mb_cache_queue, &wait,
>> TASK_UNINTERRUPTIBLE);
>> - spin_unlock(&mb_cache_spinlock);
>> + hlist_bl_unlock(head);
>> schedule();
>> - spin_lock(&mb_cache_spinlock);
>> + hlist_bl_lock(head);
>> + mb_assert(ce->e_index_hash_p == head);
>> ce->e_queued--;
>> }
>> + hlist_bl_unlock(head);
>> finish_wait(&mb_cache_queue, &wait);
>>
>> - if (!__mb_cache_entry_is_hashed(ce)) {
>> + hlist_bl_lock(ce->e_block_hash_p);
>> + if (!__mb_cache_entry_is_block_hashed(ce)) {
>> __mb_cache_entry_release_unlock(ce);
>> - spin_lock(&mb_cache_spinlock);
>> + hlist_bl_lock(head);
>
> <--- are we missing a "hlist_bl_unlock(ce->e_block_hash_p);" here?
>
> <---- Why is it that we call "hlist_bl_lock(head);" but not in the else clause?

The function __mb_cache_entry_release_unlock() releases both the e_block_hash and e_index_hash locks. So we have to reacquire the e_index_hash (head) lock in the then part, and release the e_block_hash lock in the else part.
>
>> return ERR_PTR(-EAGAIN);
>> - }
>> + } else
>> + hlist_bl_unlock(ce->e_block_hash_p);
>> + mb_assert(ce->e_index_hash_p == head);
>> return ce;
>> }
>> l = l->next;
>
>
> Cheers,
>
> - Ted
>

Thanks,
Mak.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/