[PATCH 02/30] habanalabs: fix race when waiting on encaps signal

From: Oded Gabbay
Date: Sat Jan 22 2022 - 14:57:46 EST


From: Dani Liberman <dliberman@xxxxxxxxx>

Scenario:
1. CS which is part of encaps signal has been completed and now
executing kref_put to its encaps signal handle. The refcount of the
handle decremented to 0, and called the encaps signal handle
release function - hl_encaps_handle_do_release.

2. At this point the user starts waiting on the signal, and finds the
encaps signal handle in the handlers list and increment the habdle
refcount to 1.

3. Immediately after, hl_encaps_handle_do_release removed the handle
from the list and free its memory.

4. Wait function using the handle although it has been freed.

This scenario caused the slab area which was previously allocated
for the handle to be poison overwritten which triggered kernel bug
the next time the OS needed to allocate this slab.

Fixed by getting the refcount of the handle and validate it only if
the refcount was not zero before.

Signed-off-by: Dani Liberman <dliberman@xxxxxxxxx>
Reviewed-by: Oded Gabbay <ogabbay@xxxxxxxxxx>
Signed-off-by: Oded Gabbay <ogabbay@xxxxxxxxxx>
---
drivers/misc/habanalabs/common/command_submission.c | 11 ++++++++++-
1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/drivers/misc/habanalabs/common/command_submission.c b/drivers/misc/habanalabs/common/command_submission.c
index 2f40b937c59f..96ec1fd3882a 100644
--- a/drivers/misc/habanalabs/common/command_submission.c
+++ b/drivers/misc/habanalabs/common/command_submission.c
@@ -2063,13 +2063,22 @@ static int cs_ioctl_signal_wait(struct hl_fpriv *hpriv, enum hl_cs_type cs_type,
idp = &ctx->sig_mgr.handles;
idr_for_each_entry(idp, encaps_sig_hdl, id) {
if (encaps_sig_hdl->cs_seq == signal_seq) {
- handle_found = true;
/* get refcount to protect removing
* this handle from idr, needed when
* multiple wait cs are used with offset
* to wait on reserved encaps signals.
*/
kref_get(&encaps_sig_hdl->refcount);
+ /*
+ * Since kref_put of this handle is executed outside the
+ * current lock, it is possible that the handle refcount
+ * is 0 but it yet to be removed from the list. In this case
+ * need to consider the handle as not valid. To ensure
+ * that the handle is valid, its refcount must be bigger
+ * than 1.
+ */
+ if (kref_read(&encaps_sig_hdl->refcount) > 1)
+ handle_found = true;
break;
}
}
--
2.25.1