Re: [BUGFIX][mm][PATCH] fix migration race in rmap_walk

From: Minchan Kim
Date: Fri Apr 23 2010 - 03:53:57 EST


On Fri, Apr 23, 2010 at 4:17 PM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@xxxxxxxxxxxxxx> wrote:
> On Fri, 23 Apr 2010 16:00:31 +0900
> Minchan Kim <minchan.kim@xxxxxxxxx> wrote:
>
>> On Fri, Apr 23, 2010 at 2:27 PM, KAMEZAWA Hiroyuki
>> <kamezawa.hiroyu@xxxxxxxxxxxxxx> wrote:
>> > On Fri, 23 Apr 2010 14:11:37 +0900
>> > Minchan Kim <minchan.kim@xxxxxxxxx> wrote:
>> >
>> >> On Fri, Apr 23, 2010 at 12:01 PM, KAMEZAWA Hiroyuki
>> >> <kamezawa.hiroyu@xxxxxxxxxxxxxx> wrote:
>> >> >
>> >> > This patch itself is for -mm... but it may need to go to the -stable
>> >> > tree for memory hotplug. (But we've got no report of hitting this race...)
>> >> >
>> >> > This one is the simplest, I think, and it works well on my test set.
>> >> > ==
>> >> > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@xxxxxxxxxxxxxx>
>> >> >
>> >> > In rmap.c, when checking rmaps in the vma chain of page->mapping,
>> >> > anon_vma->lock or mapping->i_mmap_lock is held and we enter the following loop.
>> >> >
>> >> >         for_each_vma_in_this_rmap_link(list from page->mapping) {
>> >> >                 unsigned long address = vma_address(page, vma);
>> >> >                 if (address == -EFAULT)
>> >> >                         continue;
>> >> >                 ....
>> >> >         }
>> >> >
>> >> > vma_address() checks vma's [start, end, pgoff] against page->index.
>> >> >
>> >> > But vma's [start, end, pgoff] is updated without locks, so vma_address()
>> >> > can hit a race and may return a wrong result.
>> >> >
>> >> > This behavior is no problem in the usual routines such as try_to_unmap().
>> >> > But for page migration, rmap_walk() has to find all the migration_ptes
>> >> > with which the migration code overwrote valid ptes. This race is critical
>> >> > and causes a BUG in which a migration_pte is sometimes not removed.
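
For reference, vma_address() in mm/rmap.c looks roughly like this (a
sketch from memory, so details may differ). Note that it reads
vm_start/vm_end/vm_pgoff with nothing ordering those reads against a
concurrent vma_adjust():

    static unsigned long
    vma_address(struct page *page, struct vm_area_struct *vma)
    {
            pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
            unsigned long address;

            /* vm_start/vm_end/vm_pgoff can change under us here */
            address = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
            if (unlikely(address < vma->vm_start || address >= vma->vm_end))
                    return -EFAULT; /* page falls outside this vma */
            return address;
    }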
>> >> >
>> >> > Apr 21 17:27:47 localhost kernel: ------------[ cut here ]------------
>> >> > Apr 21 17:27:47 localhost kernel: kernel BUG at include/linux/swapops.h:105!
>> >> > Apr 21 17:27:47 localhost kernel: invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
>> >> > Apr 21 17:27:47 localhost kernel: last sysfs file: /sys/devices/virtual/net/br0/statistics/collisions
>> >> > Apr 21 17:27:47 localhost kernel: CPU 3
>> >> > Apr 21 17:27:47 localhost kernel: Modules linked in: fuse sit tunnel4 ipt_MASQUERADE iptable_nat nf_nat bridge stp llc sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf xt_physdev ip6t_REJECT nf_conntrack_ipv6 ip6table_filter ip6_tables ipv6 dm_multipath uinput ioatdma ppdev parport_pc i5000_edac bnx2 iTCO_wdt edac_core iTCO_vendor_support shpchp parport e1000e kvm_intel dca kvm i2c_i801 i2c_core i5k_amb pcspkr megaraid_sas [last unloaded: microcode]
>> >> > Apr 21 17:27:47 localhost kernel:
>> >> > Apr 21 17:27:47 localhost kernel: Pid: 27892, comm: cc1 Tainted: G        W  2.6.34-rc4-mm1+ #4 D2519/PRIMERGY
>> >> > Apr 21 17:27:47 localhost kernel: RIP: 0010:[<ffffffff8114e9cf>]  [<ffffffff8114e9cf>] migration_entry_wait+0x16f/0x180
>> >> > Apr 21 17:27:47 localhost kernel: RSP: 0000:ffff88008d9efe08  EFLAGS: 00010246
>> >> > Apr 21 17:27:47 localhost kernel: RAX: ffffea0000000000 RBX: ffffea0000241100 RCX: 0000000000000001
>> >> > Apr 21 17:27:47 localhost kernel: RDX: 000000000000a4e0 RSI: ffff880621a4ab00 RDI: 000000000149c03e
>> >> > Apr 21 17:27:47 localhost kernel: RBP: ffff88008d9efe38 R08: 0000000000000000 R09: 0000000000000000
>> >> > Apr 21 17:27:47 localhost kernel: R10: 0000000000000000 R11: 0000000000000001 R12: ffff880621a4aae8
>> >> > Apr 21 17:27:47 localhost kernel: R13: 00000000bf811000 R14: 000000000149c03e R15: 0000000000000000
>> >> > Apr 21 17:27:47 localhost kernel: FS:  00007fe6abc90700(0000) GS:ffff880005a00000(0000) knlGS:0000000000000000
>> >> > Apr 21 17:27:47 localhost kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> >> > Apr 21 17:27:47 localhost kernel: CR2: 00007fe6a37279a0 CR3: 000000008d942000 CR4: 00000000000006e0
>> >> > Apr 21 17:27:47 localhost kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> >> > Apr 21 17:27:47 localhost kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
>> >> > Apr 21 17:27:47 localhost kernel: Process cc1 (pid: 27892, threadinfo ffff88008d9ee000, task ffff8800b23ec820)
>> >> > Apr 21 17:27:47 localhost kernel: Stack:
>> >> > Apr 21 17:27:47 localhost kernel: ffffea000101aee8 ffff880621a4aae8 ffff88008d9efe38 00007fe6a37279a0
>> >> > Apr 21 17:27:47 localhost kernel: <0> ffff8805d9706d90 ffff880621a4aa00 ffff88008d9efef8 ffffffff81126d05
>> >> > Apr 21 17:27:47 localhost kernel: <0> ffff88008d9efec8 0000000000000246 0000000000000000 ffffffff81586533
>> >> > Apr 21 17:27:47 localhost kernel: Call Trace:
>> >> > Apr 21 17:27:47 localhost kernel: [<ffffffff81126d05>] handle_mm_fault+0x995/0x9b0
>> >> > Apr 21 17:27:47 localhost kernel: [<ffffffff81586533>] ? do_page_fault+0x103/0x330
>> >> > Apr 21 17:27:47 localhost kernel: [<ffffffff8104bf40>] ? finish_task_switch+0x0/0xf0
>> >> > Apr 21 17:27:47 localhost kernel: [<ffffffff8158659e>] do_page_fault+0x16e/0x330
>> >> > Apr 21 17:27:47 localhost kernel: [<ffffffff81582f35>] page_fault+0x25/0x30
>> >> > Apr 21 17:27:47 localhost kernel: Code: 53 08 85 c9 0f 84 32 ff ff ff 8d 41 01 89 4d d8 89 45 d4 8b 75 d4 8b 45 d8 f0 0f b1 32 89 45 dc 8b 45 dc 39 c8 74 aa 89 c1 eb d7 <0f> 0b eb fe 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5
>> >> > Apr 21 17:27:47 localhost kernel: RIP  [<ffffffff8114e9cf>] migration_entry_wait+0x16f/0x180
>> >> > Apr 21 17:27:47 localhost kernel: RSP <ffff88008d9efe08>
>> >> > Apr 21 17:27:47 localhost kernel: ---[ end trace 4860ab585c1fcddb ]---
>> >> >
>> >> >
>> >> >
>> >> > This patch adds vma_address_safe() and updates [start, end, pgoff]
>> >> > under a seq counter.
>> >> >
>> >> > Cc: Mel Gorman <mel@xxxxxxxxx>
>> >> > Cc: Minchan Kim <minchan.kim@xxxxxxxxx>
>> >> > Cc: Christoph Lameter <cl@xxxxxxxxxxxxxxxxxxxx>
>> >> > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@xxxxxxxxxxxxxx>
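
To make the idea concrete, here is a minimal sketch of how I read the
patch; the seqcount field name (vm_seq) is my assumption, not
necessarily what the patch uses:

    #include <linux/seqlock.h>

    /*
     * Retry the unlocked computation whenever a writer updated the vma
     * in the meantime. The write side (vma_adjust() etc.) would wrap
     * its update of [vm_start, vm_end, vm_pgoff] in
     * write_seqcount_begin()/write_seqcount_end() on vma->vm_seq.
     */
    static unsigned long
    vma_address_safe(struct page *page, struct vm_area_struct *vma)
    {
            unsigned long address;
            unsigned seq;

            do {
                    seq = read_seqcount_begin(&vma->vm_seq);
                    address = vma_address(page, vma);
            } while (read_seqcount_retry(&vma->vm_seq, seq));

            return address;
    }

The write side stays cheap (two counter increments per update), so only
the rmap walkers pay a retry cost when they actually race with an update.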
>> >>
>> >> That's exactly what I have in mind. :)
>> >> But I am hesitating, because AFAIR we are trying to remove seqlocks. Right?
>> >
>> > Ah... is "don't use seqlock" the trend?
>> >
>> >> But in this case, seqlock is good, I think. :)
>> >>
>> > BTW, this isn't a seqlock but a seq counter :)
>> >
>> > I'm still testing. What I suspect, other than vma_address(), is fork().
>> > At fork(), the following _may_ happen (but I'm not sure):
>> >
>> >        chain vma.
>> >        copy page table.
>> >           -> migration entry is copied, too.
>> >
>> > At remap,
>> >        for each vma
>> >            look into page table and replace.
>> >
>> > Then,
>> >                                                rmap_walk().
>> >        fork(parent, child)
>> >                                                look into child's page table.
>> >                                                => we find nothing.
>> >        spin_lock(child's page table);
>> >        spin_lock(parent's page table);
>> >        copy migration entry
>> >        spin_unlock(parent's page table)
>> >        spin_unlock(child's page table)
>> >                                                update parent's page table
>> >
>> > If we always find the parent's page table before the child's, there is no race.
>> > But I can't get a clear image of prio_tree's list order. Hmm.
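
For context, the relevant part of copy_one_pte() in mm/memory.c is
something like the sketch below (simplified and from memory, so details
may differ): a non-present pte holding a migration entry is copied into
the child's page table as-is, so a child pte created after rmap_walk()
has already scanned the child keeps the stale migration entry.

    /* pte contains position in swap or file, so copy. */
    if (unlikely(!pte_present(pte))) {
            swp_entry_t entry = pte_to_swp_entry(pte);

            if (is_write_migration_entry(entry) &&
                is_cow_mapping(vm_flags)) {
                    /* COW requires both copies to be read-only */
                    make_migration_entry_read(&entry);
                    pte = swp_entry_to_pte(entry);
                    set_pte_at(src_mm, addr, src_pte, pte);
            }
            goto out_set_pte;       /* the child inherits the entry */
    }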
>> >
>> > Thanks,
>> > -Kame
>> >
>>
>> That's a good point, Kame.
>> I looked into prio_tree quickly.
>> If I understand it right, the list order is backward.
>>
>> dup_mmap calls vma_prio_tree_add.
>>
>>  * prio_tree_root
>>  *      |
>>  *      A       vm_set.head
>>  *     / \      /
>>  *    L   R -> H-I-J-K-M-N-O-P-Q-S
>>  *    ^   ^    <-- vm_set.list -->
>>  *  tree nodes
>>  *
>>
>> Maybe the parent's and child's vmas are H~S.
>> Then, the comment says:
>>
>> "vma->shared.vm_set.parent != NULL    ==> a tree node"
>> So vma_prio_tree_add() calls not list_add_tail() but list_add().
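
For reference, vma_prio_tree_add() in mm/prio_tree.c is roughly the
following (a sketch from memory, details may differ):

    void vma_prio_tree_add(struct vm_area_struct *vma,
                           struct vm_area_struct *old)
    {
            vma->shared.vm_set.head = NULL;
            vma->shared.vm_set.parent = NULL;

            if (!old->shared.vm_set.parent)
                    /* old is a plain list entry (not a tree node):
                     * the new vma goes right after it */
                    list_add(&vma->shared.vm_set.list,
                             &old->shared.vm_set.list);
            else if (old->shared.vm_set.head)
                    /* old is a tree node with a list: append at tail */
                    list_add_tail(&vma->shared.vm_set.list,
                                  &old->shared.vm_set.head->shared.vm_set.list);
            else {
                    /* old is a tree node with no sharers yet */
                    INIT_LIST_HEAD(&vma->shared.vm_set.list);
                    vma->shared.vm_set.head = vma;
                    old->shared.vm_set.head = vma;
            }
    }

dup_mmap() calls this with the parent's vma as @old, so in the
list_add() case each new child lands right after the parent, ahead of
earlier children, i.e. newest first.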
>>
> Ah, thank you for the explanation.
>
>> Anyway, I think the order isn't mixed.
>> So, could we traverse it backward in rmap?
>>
> Doesn't that make the prio_tree code dirty?
>
> Here is another idea... but, hmm, does this make fork() slow in some cases?

Yes, I think this idea is good. :)
Great, Kame.

But as you said, migration is rare,
so we wouldn't lose much performance in most cases.

Actually, if I understand prio_tree right, I think backward walking of
the prio_tree is not bad.
I don't think it makes the code dirty. :)
I admit that's a matter of taste, though.

I like both ideas.
I'll pass the decision to others. :)

--
Kind regards,
Minchan Kim