Re: Radeon RS780 - BUG: unable to handle kernel NULL pointer dereference

From: Michel Dänzer
Date: Tue Nov 09 2010 - 05:41:16 EST


On Tue, 2010-11-09 at 11:07 +0100, Thomas Hellstrom wrote:
> On 11/09/2010 10:53 AM, Thomas Hellstrom wrote:
> > On 11/09/2010 10:29 AM, Markus Trippelsdorf wrote:
> >> OK I've found the buggy commit by bisection:
> >>
> >> e376573f7267390f4e1bdc552564b6fb913bce76 is the first bad commit
> >> commit e376573f7267390f4e1bdc552564b6fb913bce76
> >> Author: Michel Dänzer <daenzer@xxxxxxxxxx>
> >> Date: Thu Jul 8 12:43:28 2010 +1000
> >>
> >> drm/radeon: fall back to GTT if bo creation/validation in VRAM
> >> fails.
> >>
> >> This fixes a problem where on low VRAM cards we'd run out of
> >> space for validation.
> >>
> >> [airlied: Tested on my M7, Thinkpad T42, compiz works with no
> >> problems.]
> >>
> >> Signed-off-by: Michel Dänzer <daenzer@xxxxxxxxxx>
> >> Cc: stable@xxxxxxxxxx
> >> Signed-off-by: Dave Airlie <airlied@xxxxxxxxxx>
> >>
> >> Please note that this is an old commit from 2.6.36-rc. When I revert
> >> it, the kernel no longer crashes. Instead I see the following in my
> >> dmesg:
> >>
> >
> > Hmm, so this sounds like something in the Radeon eviction error path
> > is causing corruption.
> > I had a similar problem with vmwgfx, when I tried to unref a BO
> > _after_ ttm_bo_init() failed.
> > ttm_bo_init() is really supposed to call unref itself for various
> > reasons, so calling unref() or kfree() after a failed ttm_bo_init()
> > will cause corruption.
> >
> > In any case, the error below also suggests something is a bit fragile
> > in the Radeon driver:
> >
> > First, an accelerated eviction may fail, like in the message below,
> > but then there must always be a backup plan, like unaccelerated
> > eviction to system. On BO creation, there are a number of placement
> > strategies, but if all else fails, it should be possible to initially
> > place the BO in system memory.
> >
> > Second, if bo validation fails during a command submission, due to
> > insufficient VRAM / TT, then the driver should retry the complete
> > validation cycle after first blocking all other validators and then
> > evicting everything not pinned, to avoid failures due to fragmentation.
> >
> > /Thomas
> >
>
> Indeed, it seems like the commit you mention just retries ttm_bo_init()
> after it previously failed. At that point the bo has been destroyed, so
> that is probably what's causing the BUG you are seeing.
>
> Admittedly, ttm_bo_init() calling unref on failure is not properly
> documented in the function description. The reason for doing so is to
> have a single path for freeing all BO resources already allocated at the
> point of failure.

Does the patch below fix the problem?


commit e224472eedbda391ddb6d8b88f26e82e1c3b036b
Author: Michel Dänzer <daenzer@xxxxxxxxxx>
Date: Tue Nov 9 11:30:41 2010 +0100

drm/radeon/kms: Fix retrying ttm_bo_init() after it failed once.

If ttm_bo_init() returns failure, it already destroyed the BO, so we need to
retry from scratch.

Signed-off-by: Michel Dänzer <daenzer@xxxxxxxxxx>
Cc: stable@xxxxxxxxxx

diff --git a/drivers/gpu/drm/radeon/radeon_object.c b/drivers/gpu/drm/radeon/radeon_object.c
index 1b9004e..bbe92d5 100644
--- a/drivers/gpu/drm/radeon/radeon_object.c
+++ b/drivers/gpu/drm/radeon/radeon_object.c
@@ -102,6 +102,8 @@ int radeon_bo_create(struct radeon_device *rdev, struct drm_gem_object *gobj,
 		type = ttm_bo_type_device;
 	}
 	*bo_ptr = NULL;
+
+retry:
 	bo = kzalloc(sizeof(struct radeon_bo), GFP_KERNEL);
 	if (bo == NULL)
 		return -ENOMEM;
@@ -109,8 +111,6 @@ int radeon_bo_create(struct radeon_device *rdev, struct drm_gem_object *gobj,
 	bo->gobj = gobj;
 	bo->surface_reg = -1;
 	INIT_LIST_HEAD(&bo->list);
-
-retry:
 	radeon_ttm_placement_from_domain(bo, domain);
 	/* Kernel allocation are uninterruptible */
 	mutex_lock(&rdev->vram_mutex);
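
For anyone who wants to see the ownership rule in isolation, here is a
minimal userspace sketch. It is not kernel code: demo_bo, demo_bo_init(),
demo_bo_create(), the DOMAIN_* constants and live_objects are all
hypothetical names made up for illustration, and the "VRAM is exhausted"
condition is simply hard-coded. The only thing it is meant to show is that
when the init function frees the object on failure, the retry label has to
sit above the allocation.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a buffer object; not the real struct radeon_bo. */
struct demo_bo {
	int domain;
};

enum { DOMAIN_VRAM = 1, DOMAIN_GTT = 2 };

/* Counts allocations that have not been freed yet (illustration only). */
static int live_objects;

/*
 * Hypothetical init function mimicking the contract discussed in this
 * thread: on failure it releases the object itself, so the caller must
 * not use or free the pointer afterwards.
 */
static int demo_bo_init(struct demo_bo *bo, int domain)
{
	if (domain == DOMAIN_VRAM) {	/* assumption: pretend VRAM is full */
		free(bo);
		live_objects--;
		return -1;
	}
	bo->domain = domain;
	return 0;
}

static int demo_bo_create(int domain, struct demo_bo **bo_ptr)
{
	struct demo_bo *bo;
	int r;

	assert(bo_ptr != NULL);
	*bo_ptr = NULL;

retry:
	/* Allocate inside the retry path: a failed init already freed bo. */
	bo = calloc(1, sizeof(*bo));
	if (bo == NULL)
		return -1;
	live_objects++;

	r = demo_bo_init(bo, domain);
	if (r != 0) {
		/* bo is gone here; only the placement choice is reused. */
		if (domain == DOMAIN_VRAM) {
			domain = DOMAIN_GTT;
			goto retry;
		}
		return r;
	}

	*bo_ptr = bo;
	return 0;
}
```

Calling demo_bo_create(DOMAIN_VRAM, &bo) in this sketch fails once in the
pretend VRAM domain, allocates a fresh object, and succeeds in the fallback
domain. With the label placed after the allocation instead, the second
demo_bo_init() call would operate on memory that was already freed.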


--
Earthling Michel Dänzer | http://www.vmware.com
Libre software enthusiast | Debian, X and DRI developer