Re: [PATCH V2 rdma-next 3/4] RDMA/hns: Add reset process for RoCE in hip08

From: Jason Gunthorpe
Date: Fri May 25 2018 - 10:56:21 EST


On Fri, May 25, 2018 at 01:54:31PM +0800, Wei Hu (Xavier) wrote:
>
>
> On 2018/5/25 5:31, Jason Gunthorpe wrote:
> >> static const struct hnae3_client_ops hns_roce_hw_v2_ops = {
> >> .init_instance = hns_roce_hw_v2_init_instance,
> >> .uninit_instance = hns_roce_hw_v2_uninit_instance,
> >> + .reset_notify = hns_roce_hw_v2_reset_notify,
> >> };
> >>
> >> static struct hnae3_client hns_roce_hw_v2_client = {
> >> diff --git a/drivers/infiniband/hw/hns/hns_roce_main.c b/drivers/infiniband/hw/hns/hns_roce_main.c
> >> index 1b79a38..ac51372 100644
> >> +++ b/drivers/infiniband/hw/hns/hns_roce_main.c
> >> @@ -332,6 +332,9 @@ static struct ib_ucontext *hns_roce_alloc_ucontext(struct ib_device *ib_dev,
> >> struct hns_roce_ib_alloc_ucontext_resp resp = {};
> >> struct hns_roce_dev *hr_dev = to_hr_dev(ib_dev);
> >>
> >> + if (!hr_dev->active)
> >> + return ERR_PTR(-EAGAIN);
> > This still doesn't make sense, ib_unregister_device already makes sure
> > that hns_roce_alloc_ucontext isn't running and won't be called before
> > returning, don't need another flag to do that.
> >
> > Since this is the only place the active flag is tested it can just be deleted
> > entirely.
> Hi, Jason
>
> roce reset process:
> 1. hr_dev->active = false; //make sure no any process call
> ibv_open_device.
> 2. call ib_dispatch_event() function to report IB_EVENT_DEVICE_FATAL
> event.
> 3. msleep(100); // for some app to free resources
> 4. call ib_unregister_device().
> 5. ...
> 6. ...
>
> There are 2 steps as above before calling ib_unregister_device(), we
> evaluate
> hr_dev->active with false to avoid no any process call
> ibv_open_device.

If you think this is the right flow then it is core issue to block new
opens, not an individual driver issue, send a core patch - eg add a
'ib_driver_fatal_error()' call that does the dispatch and call it from
all the drivers using this IB_EVENT_DEVICE_FATAL

I'm not completely sure this makes sense though, it might be better
for the core code to force stuff a IB_EVENT_DEVICE_FATAL to contexts
that open after the fatal event..

Jason