Re: Radeon RS780 - BUG: unable to handle kernel NULL pointer dereference

From: Jerome Glisse
Date: Mon Nov 08 2010 - 17:01:54 EST


On Mon, Nov 8, 2010 at 3:58 PM, Rafael J. Wysocki <rjw@xxxxxxx> wrote:
> On Monday, November 08, 2010, Jerome Glisse wrote:
>> On Mon, Nov 8, 2010 at 2:02 PM, Markus Trippelsdorf
>> <markus@xxxxxxxxxxxxxxx> wrote:
>> > On Mon, Nov 08, 2010 at 07:43:02PM +0100, Markus Trippelsdorf wrote:
>> >> On Mon, Nov 08, 2010 at 06:07:37PM +0100, Markus Trippelsdorf wrote:
>> >> > On Mon, Nov 08, 2010 at 06:02:21PM +0100, Markus Trippelsdorf wrote:
>> >> > > I can trigger a kernel crash on my system by simply loading this png
>> >> > > image with firefox:
>> >> > > http://mediaarchive.cern.ch/MediaArchive/Photo/Public/2010/1011251/1011251_01/1011251_01-A4-at-144-dpi.jpg
>> >> >
>> >> > Sorry the above link is wrong, this is the right one (that triggers the
>> >> > crash):
>> >> > http://cdsweb.cern.ch/record/1305179/files/HI-150431-630470-huge.png
>> >>
>> >> I triggered it a few more times and took the attached picture.
>> >> It points to the BUG() call at drivers/gpu/drm/ttm/ttm_bo.c:1628 .
>> >> (Sorry for the bad picture quality)
>> >
>> > And here the same BUG in plaintext (should be a bit easier to read):
>> >
>> > Nov  8 19:28:23 arch kernel: ------------[ cut here ]------------
>> > Nov  8 19:28:23 arch kernel: kernel BUG at drivers/gpu/drm/ttm/ttm_bo.c:1628!
>> > Nov  8 19:28:23 arch kernel: invalid opcode: 0000 [#1] PREEMPT SMP
>> > Nov  8 19:28:23 arch kernel: last sysfs file: /sys/devices/pci0000:00/0000:00:18.3/temp1_input
>> > Nov  8 19:28:23 arch kernel: CPU 1
>> > Nov  8 19:28:23 arch kernel: Pid: 1541, comm: X Not tainted 2.6.37-rc1-00116-g151f52f-dirty #31 M4A78T-E/System Product Name
>> > Nov  8 19:28:23 arch kernel: RIP: 0010:[<ffffffff8121f0ff>]  [<ffffffff8121f0ff>] ttm_bo_init+0x30f/0x340
>> > Nov  8 19:28:23 arch kernel: RSP: 0018:ffff88011b0fbbe8  EFLAGS: 00010246
>> > Nov  8 19:28:23 arch kernel: RAX: ffff8800da881778 RBX: ffff8800da881620 RCX: ffff88011b15ed78
>> > Nov  8 19:28:23 arch kernel: RDX: ffff8800c1556040 RSI: ffff88011ff22770 RDI: 000000000017adfb
>> > Nov  8 19:28:23 arch kernel: RBP: ffff8800da881648 R08: 0000000000000000 R09: ffff8800c1556040
>> > Nov  8 19:28:23 arch kernel: R10: 000000000ff85205 R11: ffff8800dae19200 R12: 0000000000000001
>> > Nov  8 19:28:23 arch kernel: R13: ffff88011ff22528 R14: ffff88011ff22778 R15: 0000000000000000
>> > Nov  8 19:28:23 arch kernel: FS:  00007f2043043700(0000) GS:ffff8800dfc80000(0000) knlGS:0000000000000000
>> > Nov  8 19:28:23 arch kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> > Nov  8 19:28:23 arch kernel: CR2: 00007f203d057000 CR3: 000000011b12b000 CR4: 00000000000006e0
>> > Nov  8 19:28:23 arch kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> > Nov  8 19:28:23 arch kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
>> > Nov  8 19:28:23 arch kernel: Process X (pid: 1541, threadinfo ffff88011b0fa000, task ffff88011c959c20)
>> > Nov  8 19:28:23 arch kernel: Stack:
>> > Nov  8 19:28:23 arch kernel: 0000000000000000 ffff8800da881648 ffff88011b0fbd00 ffff8800da881600
>> > Nov  8 19:28:23 arch kernel: ffff88011ff22000 0000000000000000 0000000000000001 00000000fffffff4
>> > Nov  8 19:28:23 arch kernel: ffff88011b0fbd00 ffffffff8125294d 0000000000000000 ffffffff00000001
>> > Nov  8 19:28:23 arch kernel: Call Trace:
>> > Nov  8 19:28:23 arch kernel: [<ffffffff8125294d>] ? radeon_bo_create+0x14d/0x250
>> > Nov  8 19:28:23 arch kernel: [<ffffffff812526c0>] ? radeon_ttm_bo_destroy+0x0/0xb0
>> > Nov  8 19:28:23 arch kernel: [<ffffffff812671cc>] ? radeon_gem_object_create+0x8c/0x130
>> > Nov  8 19:28:23 arch kernel: [<ffffffff81267634>] ? radeon_gem_create_ioctl+0x54/0xd0
>> > Nov  8 19:28:23 arch kernel: [<ffffffff813ab26d>] ? sock_aio_read+0x10d/0x120
>> > Nov  8 19:28:23 arch kernel: [<ffffffff8120963c>] ? drm_ioctl+0x39c/0x450
>> > Nov  8 19:28:23 arch kernel: [<ffffffff812675e0>] ? radeon_gem_create_ioctl+0x0/0xd0
>> > Nov  8 19:28:23 arch kernel: [<ffffffff810dd2c9>] ? do_vfs_ioctl+0xa9/0x610
>> > Nov  8 19:28:23 arch kernel: [<ffffffff810dd879>] ? sys_ioctl+0x49/0x80
>> > Nov  8 19:28:23 arch kernel: [<ffffffff810ce24e>] ? sys_read+0x4e/0x90
>> > Nov  8 19:28:23 arch kernel: [<ffffffff8102dc2b>] ? system_call_fastpath+0x16/0x1b
>> > Nov  8 19:28:23 arch kernel: Code: e8 fb ff ff 85 c0 0f 85 68 ff ff ff 48 8b 7c 24 08 89 04 24 e8 83 d9 ff ff 8b 04 24 48 83 c4 18 5b 5d 41 5c 41 5d 41 5e 41 5f c3 <0f> 0b 48 c7 c7 60 a4 55 81 31 c0 e8 14 80 22 00 b8 ea ff ff ff
>> > Nov  8 19:28:23 arch kernel: RIP  [<ffffffff8121f0ff>] ttm_bo_init+0x30f/0x340
>> > Nov  8 19:28:23 arch kernel: RSP <ffff88011b0fbbe8>
>> > Nov  8 19:28:23 arch kernel: ---[ end trace 328a9acba7691d6e ]---
>> > Nov  8 19:28:23 arch kernel: note: X[1541] exited with preempt_count 1
>> > Nov  8 19:28:23 arch kernel: BUG: scheduling while atomic: X/1541/0x10000002
>> > Nov  8 19:28:23 arch kernel: Pid: 1541, comm: X Tainted: G      D     2.6.37-rc1-00116-g151f52f-dirty #31
>> > Nov  8 19:28:23 arch kernel: Call Trace:
>> > Nov  8 19:28:23 arch kernel: [<ffffffff81447ad9>] ? schedule+0x639/0x850
>> > Nov  8 19:28:23 arch kernel: [<ffffffff8105826d>] ? __cond_resched+0x1d/0x30
>> > Nov  8 19:28:23 arch kernel: [<ffffffff81447f2f>] ? _cond_resched+0x2f/0x40
>> > Nov  8 19:28:23 arch kernel: [<ffffffff810b57fc>] ? unmap_vmas+0x82c/0x9c0
>> > Nov  8 19:28:23 arch kernel: [<ffffffff810bcb62>] ? exit_mmap+0xe2/0x1a0
>> > Nov  8 19:28:23 arch kernel: [<ffffffff8105a705>] ? mmput+0x25/0xc0
>> > Nov  8 19:28:23 arch kernel: [<ffffffff8105e734>] ? exit_mm+0x104/0x130
>> > Nov  8 19:28:23 arch kernel: [<ffffffff81079ebf>] ? hrtimer_try_to_cancel+0x3f/0x80
>> > Nov  8 19:28:23 arch kernel: [<ffffffff81089d0a>] ? acct_collect+0x9a/0x1a0
>> > Nov  8 19:28:23 arch kernel: [<ffffffff8106045a>] ? do_exit+0x5aa/0x760
>> > Nov  8 19:28:23 arch kernel: [<ffffffff81447163>] ? printk+0x40/0x45
>> > Nov  8 19:28:23 arch kernel: [<ffffffff8105e33c>] ? kmsg_dump+0x7c/0x150
>> > Nov  8 19:28:23 arch kernel: [<ffffffff81031fda>] ? oops_end+0x9a/0xe0
>> > Nov  8 19:28:23 arch kernel: [<ffffffff8102ee74>] ? do_invalid_op+0x84/0xa0
>> > Nov  8 19:28:23 arch kernel: [<ffffffff8121f0ff>] ? ttm_bo_init+0x30f/0x340
>> > Nov  8 19:28:23 arch kernel: [<ffffffff810ddf50>] ? __pollwait+0x0/0x110
>> > Nov  8 19:28:23 arch kernel: [<ffffffff8102e7d5>] ? invalid_op+0x15/0x20
>> > Nov  8 19:28:23 arch kernel: [<ffffffff8121f0ff>] ? ttm_bo_init+0x30f/0x340
>> > Nov  8 19:28:23 arch kernel: [<ffffffff8121efe3>] ? ttm_bo_init+0x1f3/0x340
>> > Nov  8 19:28:23 arch kernel: [<ffffffff8125294d>] ? radeon_bo_create+0x14d/0x250
>> > Nov  8 19:28:23 arch kernel: [<ffffffff812526c0>] ? radeon_ttm_bo_destroy+0x0/0xb0
>> > Nov  8 19:28:23 arch kernel: [<ffffffff812671cc>] ? radeon_gem_object_create+0x8c/0x130
>> > Nov  8 19:28:23 arch kernel: [<ffffffff81267634>] ? radeon_gem_create_ioctl+0x54/0xd0
>> > Nov  8 19:28:23 arch kernel: [<ffffffff813ab26d>] ? sock_aio_read+0x10d/0x120
>> > Nov  8 19:28:23 arch kernel: [<ffffffff8120963c>] ? drm_ioctl+0x39c/0x450
>> > Nov  8 19:28:23 arch kernel: [<ffffffff812675e0>] ? radeon_gem_create_ioctl+0x0/0xd0
>> > Nov  8 19:28:23 arch kernel: [<ffffffff810dd2c9>] ? do_vfs_ioctl+0xa9/0x610
>> > Nov  8 19:28:23 arch kernel: [<ffffffff810dd879>] ? sys_ioctl+0x49/0x80
>> > Nov  8 19:28:23 arch kernel: [<ffffffff810ce24e>] ? sys_read+0x4e/0x90
>> > Nov  8 19:28:23 arch kernel: [<ffffffff8102dc2b>] ? system_call_fastpath+0x16/0x1b
>> > Nov  8 19:28:23 arch kernel: BUG: scheduling while atomic: X/1541/0x10000002
>> > Nov  8 19:28:23 arch kernel: Pid: 1541, comm: X Tainted: G      D     2.6.37-rc1-00116-g151f52f-dirty #31
>> > Nov  8 19:28:23 arch kernel: Call Trace:
>> > Nov  8 19:28:23 arch kernel: [<ffffffff81447ad9>] ? schedule+0x639/0x850
>> > Nov  8 19:28:23 arch kernel: [<ffffffff8105826d>] ? __cond_resched+0x1d/0x30
>> > Nov  8 19:28:23 arch kernel: [<ffffffff81447f2f>] ? _cond_resched+0x2f/0x40
>> > Nov  8 19:28:23 arch kernel: [<ffffffff810b57fc>] ? unmap_vmas+0x82c/0x9c0
>> > Nov  8 19:28:23 arch kernel: [<ffffffff810bcb62>] ? exit_mmap+0xe2/0x1a0
>> > Nov  8 19:28:23 arch kernel: [<ffffffff8105a705>] ? mmput+0x25/0xc0
>> > Nov  8 19:28:23 arch kernel: [<ffffffff8105e734>] ? exit_mm+0x104/0x130
>> > Nov  8 19:28:23 arch kernel: [<ffffffff81079ebf>] ? hrtimer_try_to_cancel+0x3f/0x80
>> > Nov  8 19:28:23 arch kernel: [<ffffffff81089d0a>] ? acct_collect+0x9a/0x1a0
>> > Nov  8 19:28:23 arch kernel: [<ffffffff8106045a>] ? do_exit+0x5aa/0x760
>> > Nov  8 19:28:23 arch kernel: [<ffffffff81447163>] ? printk+0x40/0x45
>> > Nov  8 19:28:23 arch kernel: [<ffffffff8105e33c>] ? kmsg_dump+0x7c/0x150
>> > Nov  8 19:28:23 arch kernel: [<ffffffff81031fda>] ? oops_end+0x9a/0xe0
>> > Nov  8 19:28:23 arch kernel: [<ffffffff8102ee74>] ? do_invalid_op+0x84/0xa0
>> > Nov  8 19:28:23 arch kernel: [<ffffffff8121f0ff>] ? ttm_bo_init+0x30f/0x340
>> > Nov  8 19:28:23 arch kernel: [<ffffffff810ddf50>] ? __pollwait+0x0/0x110
>> > Nov  8 19:28:23 arch kernel: [<ffffffff8102e7d5>] ? invalid_op+0x15/0x20
>> > Nov  8 19:28:23 arch kernel: [<ffffffff8121f0ff>] ? ttm_bo_init+0x30f/0x340
>> > Nov  8 19:28:23 arch kernel: [<ffffffff8121efe3>] ? ttm_bo_init+0x1f3/0x340
>> > Nov  8 19:28:23 arch kernel: [<ffffffff8125294d>] ? radeon_bo_create+0x14d/0x250
>> > Nov  8 19:28:23 arch kernel: [<ffffffff812526c0>] ? radeon_ttm_bo_destroy+0x0/0xb0
>> > Nov  8 19:28:23 arch kernel: [<ffffffff812671cc>] ? radeon_gem_object_create+0x8c/0x130
>> > Nov  8 19:28:23 arch kernel: [<ffffffff81267634>] ? radeon_gem_create_ioctl+0x54/0xd0
>> > Nov  8 19:28:23 arch kernel: [<ffffffff813ab26d>] ? sock_aio_read+0x10d/0x120
>> > Nov  8 19:28:23 arch kernel: [<ffffffff8120963c>] ? drm_ioctl+0x39c/0x450
>> > Nov  8 19:28:23 arch kernel: [<ffffffff812675e0>] ? radeon_gem_create_ioctl+0x0/0xd0
>> > Nov  8 19:28:23 arch kernel: [<ffffffff810dd2c9>] ? do_vfs_ioctl+0xa9/0x610
>> > Nov  8 19:28:23 arch kernel: [<ffffffff810dd879>] ? sys_ioctl+0x49/0x80
>> > Nov  8 19:28:23 arch kernel: [<ffffffff810ce24e>] ? sys_read+0x4e/0x90
>> > Nov  8 19:28:23 arch kernel: [<ffffffff8102dc2b>] ? system_call_fastpath+0x16/0x1b
>> >
>>
>> Thomas, this bug seems to point to a case where we end up trying to add
>> an entry at the same offset in the rb tree for addr_space_mm. After
>> carefully reviewing the locking around the rb tree modification and
>> addr_space_mm, I am fairly confident that no race can occur. Would you
>> have any idea what might be going wrong here? I guess I would ultimately
>> need to dump the mm and rb tree state when the BUG gets triggered to try
>> to understand the state of things.
>
> Hmm, why are you using BUG in there in the first place?  Would it be _so_
> dangerous to continue that we just have to crash here?
>
> Rafael
>

This case should _never_ happen. I guess we could return an error
and refuse to create the bo, but to me it seems that this case is the
result of a corrupted rb tree or mm structure, so things might fall
apart in more subtle ways if we bail out on this error.
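
To make the trade-off concrete, here is a rough userspace sketch, not the
ttm code. It assumes a plain unbalanced binary search tree keyed by offset
in place of the kernel rb tree, and every name in it (struct node,
tree_insert, tree_dump, tree_count) is made up for illustration. On a
duplicate offset it returns -EEXIST to the caller rather than crashing, and
it has a dump helper of the kind that would be useful for inspecting the
tree state when the duplicate is hit.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for a buffer object keyed by its offset. */
struct node {
	unsigned long offset;
	struct node *left, *right;
};

/*
 * Insert into a plain (unbalanced) binary search tree keyed by offset.
 * Returns 0 on success, -EEXIST if the offset is already present and
 * -ENOMEM if allocation fails. The duplicate branch is the spot where
 * the alternative would be to BUG().
 */
static int tree_insert(struct node **root, unsigned long offset)
{
	struct node **cur = root;

	assert(root != NULL);

	while (*cur) {
		if (offset < (*cur)->offset)
			cur = &(*cur)->left;
		else if (offset > (*cur)->offset)
			cur = &(*cur)->right;
		else
			return -EEXIST;	/* same offset already in the tree */
	}

	*cur = calloc(1, sizeof(**cur));
	if (!*cur)
		return -ENOMEM;
	(*cur)->offset = offset;
	return 0;
}

/* In-order dump of every offset, for inspecting the tree on failure. */
static void tree_dump(const struct node *n, FILE *out)
{
	if (!n)
		return;
	tree_dump(n->left, out);
	fprintf(out, "offset 0x%lx\n", n->offset);
	tree_dump(n->right, out);
}

/* Number of nodes currently in the tree. */
static int tree_count(const struct node *n)
{
	return n ? 1 + tree_count(n->left) + tree_count(n->right) : 0;
}
```

A caller would check the return value, call tree_dump(root, stderr) on
-EEXIST, and fail the object creation. That avoids the crash, but if the
duplicate really comes from a corrupted structure, the corruption is still
there after the error is returned.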

Jerome