Re: [RFC PATCH net v2 1/2] net/smc: Resolve the race between link group access and termination

From: Wen Gu
Date: Wed Jan 05 2022 - 03:27:58 EST


Thanks for your reply.

On 2022/1/3 6:36 pm, Karsten Graul wrote:
On 31/12/2021 10:44, Wen Gu wrote:
On 2021/12/29 8:56 pm, Karsten Graul wrote:
On 28/12/2021 16:13, Wen Gu wrote:
We encountered some crashes caused by the race between the access
and the termination of link groups.
What do you think about it?


Hi Wen,

thank you, and I also wish you and your family a happy New Year!

Thanks for your detailed explanation, you convinced me of your idea to use
reference counting! I think it's a good solution for the various problems you describe.

I still think that, even if you saw no problems when conn->lgr is not NULL after the lgr
has already been terminated, more attention should be paid to the places where conn->lgr is checked.

Thank you for the reminder. I agree with the concern.

It should be improved to avoid potential issues that we haven't found yet.

For example, in smc_cdc_get_slot_and_msg_send() there is a check for !conn->lgr with the intention
to avoid working with a terminated link group.
Should all checks for !conn->lgr now be replaced by a check for conn->freed? Does this make sense?

In my humble opinion, we can replace !conn->lgr with !conn->alert_token_local.

If an SMC connection is successfully registered to a link group by smc_lgr_register_conn(),
conn->alert_token_local is set to a non-zero value. From that moment on, conn->lgr is ready to be used.

And if the link group is terminated, conn->alert_token_local is reset to zero in smc_lgr_unregister_conn(),
meaning that the link group the connection was registered to shouldn't be used anymore.

So I think checking conn->alert_token_local has the same effect as checking conn->lgr:
it identifies whether the link group pointed to by conn->lgr is still healthy and usable.
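
To make the idea concrete, here is a minimal user-space sketch of the check. It is
not the net/smc code: struct conn_model, struct lgr_model and the model_*() helpers
are hypothetical stand-ins that only carry the two fields discussed above, and the
token value is passed in by the caller rather than generated. The sketch assumes
that conn->lgr stays non-NULL after unregistration, as discussed earlier in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a link group; not the kernel structure. */
struct lgr_model {
	int id;
};

/* Hypothetical stand-in for a connection, reduced to the two fields discussed. */
struct conn_model {
	struct lgr_model *lgr;       /* assumed to stay set after unregistration */
	uint32_t alert_token_local;  /* non-zero only while registered */
};

/* Models the effect of smc_lgr_register_conn(): attach the link group
 * and store a non-zero token. */
static void model_register_conn(struct conn_model *conn,
				struct lgr_model *lgr, uint32_t token)
{
	assert(token != 0);          /* zero is reserved for "not registered" */
	conn->lgr = lgr;
	conn->alert_token_local = token;
}

/* Models the effect of smc_lgr_unregister_conn(): reset the token to zero.
 * The lgr pointer is deliberately left untouched in this model. */
static void model_unregister_conn(struct conn_model *conn)
{
	conn->alert_token_local = 0;
}

/* The proposed check: the token, not the pointer, says whether the
 * link group may still be used. */
static bool model_conn_lgr_usable(const struct conn_model *conn)
{
	return conn->alert_token_local != 0;
}
```

In this model, a caller that previously bailed out on !conn->lgr would bail out on
!model_conn_lgr_usable(conn) instead, so a stale but non-NULL conn->lgr is never used.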

What do you think about it? :)

Thanks,
Wen Gu