Re: [Regression] stress-ng udp-flood causes kernel panic on Ampere Altra

From: Will Deacon
Date: Tue Jul 05 2022 - 06:59:04 EST


On Tue, Jul 05, 2022 at 11:53:22AM +0100, Kajetan Puchalski wrote:
> On Mon, Jul 04, 2022 at 10:22:24AM +0100, Kajetan Puchalski wrote:
> > On Sat, Jul 02, 2022 at 10:56:51PM +0200, Florian Westphal wrote:
> > > > That would make sense, from further experiments I ran it somehow seems
> > > > to be related to the number of workers being spawned by stress-ng along
> > > > with the CPUs/cores involved.
> > > >
> > > > For instance, running the test with <=25 workers (--udp-flood 25 etc.)
> > > > results in the test running fine for at least 15 minutes.
> > >
> > > Ok. I will let it run for longer on the machines I have access to.
> > >
> > > In mean time, you could test attached patch, its simple s/refcount_/atomic_/
> > > in nf_conntrack.
> > >
> > > If mainline (patch vs. HEAD 69cb6c6556ad89620547318439) crashes for you
> > > but works with attached patch someone who understands aarch64 memory ordering
> > > would have to look more closely at refcount_XXX functions to see where they
> > > might differ from atomic_ ones.
> >
> > I can confirm that the patch seems to solve the issue.
> > With it applied on top of the 5.19-rc5 tag the test runs fine for at
> > least 15 minutes which was not the case before so it looks like it is
> > that aarch64 memory ordering problem.
>
> I'm CCing some people who should be able to help with aarch64 memory
> ordering, maybe they could take a look.
>
> (re-sending due to a typo in CC, sorry for duplicate emails!)

Sorry, but I have absolutely no context here. We have a handy document
describing the differences between atomic_t and refcount_t:

Documentation/core-api/refcount-vs-atomic.rst

What else do you need to know?

Will