Re: [PATCH] ena: Speed up initialization 90x by reducing poll delays

From: Josh Triplett
Date: Fri Mar 13 2020 - 08:28:30 EST


On Wed, Mar 11, 2020 at 01:24:17PM +0000, Jubran, Samih wrote:
> Hi Josh,
>
> Thanks for taking the time to write this patch. I hit a bug while testing it. I haven't pinpointed the root cause yet, but it looks to me like a race in the netlink infrastructure.
>
> Here is the bug scenario:
> 1. Create a c5.24xlarge instance in AWS in the v_virginia region using the default Amazon Linux 2 AMI
> 2. Apply your patch on top of net-next v5.2 and install the kernel (currently I'm able to boot net-next v5.2 only; higher versions of net-next suffer from errors during boot time)
> 3. Run "rmmod ena && insmod ena.ko" twice
>
> Result:
> The interface is not in up state
>
> Expected result:
> The interface should be in up state
>
> What I know so far:
> * ena_probe() seems to finish with no errors whatsoever
> * adding prints / delays to ena_probe() causes the bug to vanish or become less likely to occur, depending on how much delay I add
> * ena_up() is not called at all when the bug occurs, so it's something to do with netlink not invoking dev_open()
>
> Did you face such issues? Do you have any idea what might be causing this?

I haven't observed anything like this. I didn't test with Amazon Linux
2, though.

To rule out some possibilities, could you try disabling *all* userspace
networking bits, so that userspace does nothing with a newly discovered
interface, and then testing again? (The interface wouldn't be "up" in
that case, but it should still have a link detected.)
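
One way to tell "administratively up" apart from "link detected" without
any userspace networking tools is to read the interface's sysfs
attributes directly. The following is only a rough sketch, not something
from the patch: the helper names and the default interface name "eth0"
are hypothetical, and it assumes the standard `flags`, `carrier`, and
`operstate` attributes under /sys/class/net/<iface>/.

```python
import os
import sys

IFF_UP = 0x1  # administrative "up" bit, as defined in include/uapi/linux/if.h


def read_attr(iface_dir, name):
    """Return the stripped contents of a sysfs attribute, or None if unreadable."""
    try:
        with open(os.path.join(iface_dir, name)) as f:
            return f.read().strip()
    except OSError:
        # Reading 'carrier' can fail while the interface is administratively
        # down, so an unreadable attribute is reported as "unknown" (None).
        return None


def interface_state(iface, sysfs_root="/sys/class/net"):
    """Summarize the admin state, carrier, and operstate of one interface.

    Returns None if the interface does not exist (e.g. probe never
    registered a netdev).
    """
    iface_dir = os.path.join(sysfs_root, iface)
    if not os.path.isdir(iface_dir):
        return None
    flags = read_attr(iface_dir, "flags")      # hex string such as "0x1003"
    carrier = read_attr(iface_dir, "carrier")  # "1", "0", or unreadable
    return {
        "admin_up": flags is not None and bool(int(flags, 16) & IFF_UP),
        "carrier": None if carrier is None else carrier == "1",
        "operstate": read_attr(iface_dir, "operstate"),
    }


if __name__ == "__main__":
    # "eth0" is a placeholder; pass the real interface name as an argument.
    print(interface_state(sys.argv[1] if len(sys.argv) > 1 else "eth0"))
```

Running this after each "rmmod ena && insmod ena.ko" cycle would show
whether the device reappears at all and whether anything has set it up,
independently of whatever the distribution's network scripts report.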

If that works, then I wonder if the userspace used in Amazon Linux 2
might have some kind of race where it's still using the previous
incarnation of the device when you rmmod and insmod? Perhaps the
previous delays made it difficult or impossible to trigger that race?

- Josh Triplett