[PATCH 17/30] habanalabs: there is no kernel TDR in future ASICs

From: Oded Gabbay
Date: Sat Jan 22 2022 - 14:58:31 EST


In future ASICs, there is no kernel TDR for new workloads that are
submitted directly from user-space to the device.

Therefore, the driver can NEVER know that a workload has timed-out.

So, when the user asks us to wait for interrupt on the workload's
completion, and the wait has timed-out, it doesn't mean the workload
has timed-out. It only means the wait has timed-out, which is NOT an
error from driver's perspective.

Signed-off-by: Oded Gabbay <ogabbay@xxxxxxxxxx>
---
.../misc/habanalabs/common/command_submission.c | 17 +++++++++++++----
1 file changed, 13 insertions(+), 4 deletions(-)

diff --git a/drivers/misc/habanalabs/common/command_submission.c b/drivers/misc/habanalabs/common/command_submission.c
index 96ec1fd3882a..0f41c283082c 100644
--- a/drivers/misc/habanalabs/common/command_submission.c
+++ b/drivers/misc/habanalabs/common/command_submission.c
@@ -2941,11 +2941,14 @@ static int _hl_interrupt_wait_ioctl(struct hl_device *hdev, struct hl_ctx *ctx,
rc = -EIO;
*status = HL_WAIT_CS_STATUS_ABORTED;
} else {
- dev_err_ratelimited(hdev->dev, "Waiting for interrupt ID %d timedout\n",
- interrupt->interrupt_id);
- rc = -ETIMEDOUT;
+ /* The wait has timed-out. We don't know anything beyond that
+ * because the workload wasn't submitted through the driver.
+ * Therefore, from driver's perspective, the workload is still
+ * executing.
+ */
+ rc = 0;
+ *status = HL_WAIT_CS_STATUS_BUSY;
}
- *status = HL_WAIT_CS_STATUS_BUSY;
}
}

@@ -3058,6 +3061,12 @@ static int _hl_interrupt_wait_ioctl_user_addr(struct hl_device *hdev, struct hl_
interrupt->interrupt_id);
rc = -EINTR;
} else {
+ /* The wait has timed-out. We don't know anything beyond that
+ * because the workload wasn't submitted through the driver.
+ * Therefore, from driver's perspective, the workload is still
+ * executing.
+ */
+ rc = 0;
*status = HL_WAIT_CS_STATUS_BUSY;
}

--
2.25.1