Re: [Bug, sched, 5.8-rc2]: PREEMPT kernels crashing in check_preempt_wakeup() running fsx on XFS

From: Dave Chinner
Date: Mon Jun 29 2020 - 19:55:44 EST


On Sat, Jun 27, 2020 at 08:30:42PM +0200, Peter Zijlstra wrote:
> On Sat, Jun 27, 2020 at 08:32:54AM +1000, Dave Chinner wrote:
> > Observation from the outside:
> >
> > "However I'm having trouble convincing myself that's actually
> > possible on x86_64.... "
>
> Using the weaker rules of LKMM (as relevant to Power) I could in fact
> make it happen, the 'problem' is that it's being observed on the much
> stronger x86_64.

Yes, I understand just enough about the LKMM(*) for this statement
to scare the crap out of me. :(

(*) "understand just enough" == write litmus tests to attempt to
validate memory barrier constructs I dream up....
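
For the curious, a minimal example of the sort of thing I mean - a
message-passing litmus test in the style of the ones shipped in
tools/memory-model/litmus-tests/ (the name and layout here are my
own sketch, not a copy of a file from the tree):

C MP-sketch

{}

P0(int *x, int *y)
{
	/* publish the data, then raise the flag */
	WRITE_ONCE(*x, 1);
	WRITE_ONCE(*y, 1);
}

P1(int *x, int *y)
{
	int r0;
	int r1;

	/* see the flag, then read the data */
	r0 = READ_ONCE(*y);
	r1 = READ_ONCE(*x);
}

exists (1:r0=1 /\ 1:r1=0)

Run that through herd7 with the LKMM and it says the "exists"
outcome is allowed, because READ_ONCE()/WRITE_ONCE() impose no
ordering - Power really can observe the flag without the data
unless the test is upgraded to smp_store_release() and
smp_load_acquire(). On x86-64's much stronger TSO model that
outcome can't occur at all, which is exactly the weak-LKMM
vs. strong-x86_64 gap Peter describes above.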

> > Having looked at this code over the past 24 hours and the recent
> > history, I know that understanding it - let alone debugging and
> > fixing problem in it - is way beyond my capabilities. And I say
> > that as an experienced kernel developer with a pretty good grasp
> > of concurrent programming and a record of implementing a fair
> > number of non-trivial lockless algorithms over the years....
>
> All in the name of making it go fast, I suppose. It used to be
> much simpler... like much of the kernel.

Yup, and we're creating our own code maintenance nightmare as we go.

> The biggest problem I had with this thing was that the reproduction case
> we had (Paul's rcutorture) wouldn't readily trigger on my machines
> (although it did, but at a much lower rate, just twice in a week's worth
> of runtime).
>
> Also; I'm sure you can spot a problem in the I/O layer much faster than
> I possibly could :-)

Sure, but that misses the point I was making.

I regularly have to look deep into other subsystems to work out
what problem the filesystem is tripping over. I find myself digging
into parts of the IO stack, memory management, page allocators,
locking and atomics, workqueues, the scheduler, etc. because XFS
makes extensive (and complex) use of the infrastructure they
provide. That means to debug filesystem issues, I have to be able
to understand what that infrastructure is trying to do and make
judgements as to whether that code is behaving correctly or not.

And so when I find a reproducer for a bug that takes 20s to
reproduce and it points me at code that I honestly have no hope of
understanding well enough to determine if it is working correctly or
not, then we have a problem. A lot of my time is spent doing root
cause analysis proving that such issues are -not- filesystem
problems (they just had "xfs" in the stack trace), hence being able
to read and understand the code in related core subsystems is
extremely important to performing my day job.

If more kernel code falls off the memory barrier cliff like this,
then the ability of people like me to find the root cause of complex
issues is going to be massively reduced. Writing code so smart
that almost no-one else can understand it has always been a bad
thing, and memory barriers only make this problem worse... :(

> Anyway, let me know if you still observe any problems.

Seems to be solid so far. Thanks Peter!

Cheers,

Dave.

--
Dave Chinner
david@xxxxxxxxxxxxx