Re: [PATCH] drm/radeon: add a force flush to delay work when radeon

From: Christian König
Date: Wed Aug 17 2022 - 07:40:48 EST


Am 17.08.22 um 09:31 schrieb 李真能:

在 2022/8/15 21:12, Christian König 写道:
Am 15.08.22 um 09:34 schrieb 李真能:

在 2022/8/12 18:55, Christian König 写道:
Am 11.08.22 um 09:25 schrieb Zhenneng Li:
Although radeon card fence and wait for gpu to finish processing current batch rings,
there is still a corner case that radeon lockup work queue may not be fully flushed,
and meanwhile the radeon_suspend_kms() function has called pci_set_power_state() to
put device in D3hot state.

If I'm not completely mistaken the reset worker uses the suspend/resume functionality as well to get the hardware into a working state again.

So if I'm not completely mistaken this here would lead to a deadlock, please double check that.

We have tested many times, there are no deadlock.

Testing doesn't tells you anything, you need to audit the call paths.

In which situation, there would lead to a deadlock?

GPU resets.

Although flush_delayed_work(&rdev->fence_drv[i].lockup_work) will wait for a lockup_work to finish executing the last queueing,  but this kernel func haven't get any lock, and lockup_work will run in another kernel thread, so I think flush_delayed_work could not lead to a deadlock.

Therefor if radeon_gpu_reset is called in another thread when radeon_suspend_kms is blocked on flush_delayed_work, there could not lead to a deadlock.

Ok sounds like you didn't go what I wanted to say.

The key problem is that radeon_gpu_reset() calls radeon_suspend() which in turn calls rdev->asic->suspend().

And this function in turn could end up in radeon_suspend_kms() again, but I'm not 100% sure about that.

Just double check the order of function called here (e.g. if radeon_suspend_kms() call radeon_suspend() or the other way around).  Apart from that your patch looks correct to me as well.

Regards,
Christian.



Regards,
Christian.



Regards,
Christian.

Per PCI spec rev 4.0 on 5.3.1.4.1 D3hot State.
Configuration and Message requests are the only TLPs accepted by a Function in
the D3hot state. All other received Requests must be handled as Unsupported Requests,
and all received Completions may optionally be handled as Unexpected Completions.
This issue will happen in following logs:
Unable to handle kernel paging request at virtual address 00008800e0008010
CPU 0 kworker/0:3(131): Oops 0
pc = [<ffffffff811bea5c>]  ra = [<ffffffff81240844>]  ps = 0000 Tainted: G        W
pc is at si_gpu_check_soft_reset+0x3c/0x240
ra is at si_dma_is_lockup+0x34/0xd0
v0 = 0000000000000000  t0 = fff08800e0008010  t1 = 0000000000010000
t2 = 0000000000008010  t3 = fff00007e3c00000  t4 = fff00007e3c00258
t5 = 000000000000ffff  t6 = 0000000000000001  t7 = fff00007ef078000
s0 = fff00007e3c016e8  s1 = fff00007e3c00000  s2 = fff00007e3c00018
s3 = fff00007e3c00000  s4 = fff00007fff59d80  s5 = 0000000000000000
s6 = fff00007ef07bd98
a0 = fff00007e3c00000  a1 = fff00007e3c016e8  a2 = 0000000000000008
a3 = 0000000000000001  a4 = 8f5c28f5c28f5c29  a5 = ffffffff810f4338
t8 = 0000000000000275  t9 = ffffffff809b66f8  t10 = ff6769c5d964b800
t11= 000000000000b886  pv = ffffffff811bea20  at = 0000000000000000
gp = ffffffff81d89690  sp = 00000000aa814126
Disabling lock debugging due to kernel taint
Trace:
[<ffffffff81240844>] si_dma_is_lockup+0x34/0xd0
[<ffffffff81119610>] radeon_fence_check_lockup+0xd0/0x290
[<ffffffff80977010>] process_one_work+0x280/0x550
[<ffffffff80977350>] worker_thread+0x70/0x7c0
[<ffffffff80977410>] worker_thread+0x130/0x7c0
[<ffffffff80982040>] kthread+0x200/0x210
[<ffffffff809772e0>] worker_thread+0x0/0x7c0
[<ffffffff80981f8c>] kthread+0x14c/0x210
[<ffffffff80911658>] ret_from_kernel_thread+0x18/0x20
[<ffffffff80981e40>] kthread+0x0/0x210
  Code: ad3e0008  43f0074a  ad7e0018  ad9e0020  8c3001e8 40230101
  <88210000> 4821ed21
So force lockup work queue flush to fix this problem.

Signed-off-by: Zhenneng Li <lizhenneng@xxxxxxxxxx>
---
  drivers/gpu/drm/radeon/radeon_device.c | 3 +++
  1 file changed, 3 insertions(+)

diff --git a/drivers/gpu/drm/radeon/radeon_device.c b/drivers/gpu/drm/radeon/radeon_device.c
index 15692cb241fc..e608ca26780a 100644
--- a/drivers/gpu/drm/radeon/radeon_device.c
+++ b/drivers/gpu/drm/radeon/radeon_device.c
@@ -1604,6 +1604,9 @@ int radeon_suspend_kms(struct drm_device *dev, bool suspend,
          if (r) {
              /* delay GPU reset to resume */
              radeon_fence_driver_force_completion(rdev, i);
+        } else {
+            /* finish executing delayed work */
+ flush_delayed_work(&rdev->fence_drv[i].lockup_work);
          }
      }