Re: [git pull] drm merge for 3.9-rc1

From: Alex Deucher
Date: Thu Feb 28 2013 - 10:10:14 EST


On Thu, Feb 28, 2013 at 8:44 AM, Josh Boyer <jwboyer@xxxxxxxxx> wrote:
> On Thu, Feb 28, 2013 at 8:38 AM, Alex Deucher <alexdeucher@xxxxxxxxx> wrote:
>> On Wed, Feb 27, 2013 at 8:14 PM, Josh Boyer <jwboyer@xxxxxxxxx> wrote:
>>> On Wed, Feb 27, 2013 at 7:01 PM, Josh Boyer <jwboyer@xxxxxxxxx> wrote:
>>>> On Wed, Feb 27, 2013 at 3:20 PM, Josh Boyer <jwboyer@xxxxxxxxx> wrote:
>>>>> On Wed, Feb 27, 2013 at 11:34 AM, Josh Boyer <jwboyer@xxxxxxxxx> wrote:
>>>>>> On Mon, Feb 25, 2013 at 7:05 PM, Dave Airlie <airlied@xxxxxxxx> wrote:
>>>>>>> Alex Deucher (29):
>>>>>>> drm/radeon: halt engines before disabling MC (6xx/7xx)
>>>>>>> drm/radeon: halt engines before disabling MC (evergreen)
>>>>>>> drm/radeon: halt engines before disabling MC (cayman/TN)
>>>>>>> drm/radeon: halt engines before disabling MC (si)
>>>>>>> drm/radeon: use the reset mask to determine if rings are hung
>>>>>>
>>>>>> Something in this series of commits is causing the GPU to hang on reboot
>>>>>> on my Dell XPS 8300 machine. That has a:
>>>>>>
>>>>>> 01:00.0 VGA compatible controller: Advanced Micro Devices [AMD] nee
>>>>>> ATI Caicos [Radeon HD 6450]
>>>>>>
>>>>>> card in it. After reboots, I get a screen that looks like this:
>>>>>>
>>>>>> http://t.co/tPnT6xQZUK
>>>>>>
>>>>>> I can hit it fairly consistently after a few reboots, so I tried doing a
>>>>>> git bisect on the radeon driver and it came down to:
>>>>>>
>>>>>> ca57802e521de54341efc8a56f70571f79ffac72 is the first bad commit
>>>>>
>>>>> So I don't think that's actually the cause of the problem. Or at least
>>>>> not that alone. I reverted it on top of Linus' latest tree and I still
>>>>> get the lockups.
>>>>
>>>> Actually, git bisect does seem to have gotten it correct. Once I
>>>> actually tested the revert of just that on top of Linus' tree (commit
>>>> d895cb1af1), things seem to be working much better. I've rebooted a
>>>> dozen times without a lockup. The most I've seen it take on a kernel
>>>> with that commit included is 3 reboots, so that's definitely at least an
>>>> improvement.
>>>
>>> I give up. GPU issues are not my thing. 2 reboots after I sent that it
>>> gave me pretty rainbow static again. So it might have been an
>>> improvement, but reverting it is not a solution.
>>>
>>> Looking at the rest of the commits, the whole GPU reset rework might be
>>> suspect, but I clearly have no clue.
>>
>> GPUs are tricky beasts :)
>
> Understatement ;).
>
>> ca57802e521de54341efc8a56f70571f79ffac72 most likely wasn't the
>> problem anyway since it only affects 6xx/7xx and your card is handled
>> by the evergreen code. I'll put together some patches to help narrow
>> down the problem.
>
> Yeah, that's the biggest problem I have, not knowing which functions are
> actually being executed for this card. It looks like a combination of
> stuff in evergreen.c and ni.c, but I have no idea.
>
> Patches would be great. If nothing else, I'm really good at building
> kernels and rebooting by now.

Two possible fixes attached. The first attempts a full reset of all
blocks if the MC (memory controller) is hung. That may work better
than just resetting the MC. The second just disables MC reset. I'm
not sure we can reliably tell whether the MC is hung or merely busy,
since display requests hit it periodically. If it is only busy, we end
up resetting it needlessly, which may lead to failures like the one
you are seeing.

Alex

From 9a648b04474ed230601c3c3e816cb281ebaad604 Mon Sep 17 00:00:00 2001
From: Alex Deucher <alexander.deucher@xxxxxxx>
Date: Thu, 28 Feb 2013 09:56:48 -0500
Subject: [PATCH] drm/radeon: XXX try a full reset if the MC is busy

See if this helps.

Signed-off-by: Alex Deucher <alexander.deucher@xxxxxxx>
---
drivers/gpu/drm/radeon/evergreen.c | 6 ++++++
1 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/drivers/gpu/drm/radeon/evergreen.c b/drivers/gpu/drm/radeon/evergreen.c
index 3c38ea4..bbcac11 100644
--- a/drivers/gpu/drm/radeon/evergreen.c
+++ b/drivers/gpu/drm/radeon/evergreen.c
@@ -2438,6 +2438,12 @@ static u32 evergreen_gpu_check_soft_reset(struct radeon_device *rdev)
 	if (tmp & L2_BUSY)
 		reset_mask |= RADEON_RESET_VMC;
 
+	/* reset everything if we attempt to reset the MC */
+	if (reset_mask & RADEON_RESET_MC) {
+		dev_info(rdev->dev, "MC busy: 0x%08X, resetting ALL\n", reset_mask);
+		reset_mask = 0xffffffff;
+	}
+
 	return reset_mask;
 }

--
1.7.7.5

From 834c26ab02e3581ea97b39a90fc0637e7becfa67 Mon Sep 17 00:00:00 2001
From: Alex Deucher <alexander.deucher@xxxxxxx>
Date: Thu, 28 Feb 2013 10:03:08 -0500
Subject: [PATCH] drm/radeon: XXX skip MC reset as it's probably not hung

The MC is most likely busy (e.g., display requests), not hung,
so no need to reset it. Doing an MC reset is tricky and not
particularly reliable.

Signed-off-by: Alex Deucher <alexander.deucher@xxxxxxx>
---
drivers/gpu/drm/radeon/evergreen.c | 6 ++++++
1 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/drivers/gpu/drm/radeon/evergreen.c b/drivers/gpu/drm/radeon/evergreen.c
index 3c38ea4..0f15ada 100644
--- a/drivers/gpu/drm/radeon/evergreen.c
+++ b/drivers/gpu/drm/radeon/evergreen.c
@@ -2438,6 +2438,12 @@ static u32 evergreen_gpu_check_soft_reset(struct radeon_device *rdev)
 	if (tmp & L2_BUSY)
 		reset_mask |= RADEON_RESET_VMC;
 
+	/* Skip MC reset as it's most likely not hung, just busy */
+	if (reset_mask & RADEON_RESET_MC) {
+		dev_info(rdev->dev, "MC busy: 0x%08X, clearing.\n", reset_mask);
+		reset_mask &= ~RADEON_RESET_MC;
+	}
+
 	return reset_mask;
 }

--
1.7.7.5
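
For comparison, the two patches differ only in what they do once the MC
bit is set in the reset mask. Below is a minimal standalone C sketch of
the two policies, not the driver code. The RESET_* bit values and the
names escalate_on_mc, skip_mc and self_check are made up for
illustration; the real RADEON_RESET_* definitions live in the driver
headers.

```c
#include <assert.h>
#include <stdint.h>

/* Placeholder bit positions, chosen for illustration only. */
#define RESET_GFX (1u << 0)
#define RESET_VMC (1u << 1)
#define RESET_MC  (1u << 2)

/* First patch's policy: a flagged MC escalates to resetting every block. */
static uint32_t escalate_on_mc(uint32_t reset_mask)
{
	if (reset_mask & RESET_MC)
		reset_mask = 0xffffffffu;
	return reset_mask;
}

/* Second patch's policy: treat a flagged MC as merely busy and drop its bit. */
static uint32_t skip_mc(uint32_t reset_mask)
{
	reset_mask &= ~RESET_MC;
	return reset_mask;
}

/* Small self-check showing how the two policies diverge on the same input. */
void self_check(void)
{
	uint32_t mask = RESET_GFX | RESET_MC;

	assert(escalate_on_mc(mask) == 0xffffffffu);
	assert(skip_mc(mask) == RESET_GFX);
	/* Without the MC bit, both policies leave the mask untouched. */
	assert(escalate_on_mc(RESET_VMC) == RESET_VMC);
	assert(skip_mc(RESET_VMC) == RESET_VMC);
}
```

With these placeholder values, a mask of RESET_GFX | RESET_MC becomes
0xffffffff under the first policy and plain RESET_GFX under the second,
so testing each patch separately should show whether the MC reset itself
is what triggers the hang.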