Re: [REGRESSION] lxc-stop hang on 5.17.x kernels

From: Pavel Begunkov
Date: Tue May 03 2022 - 10:14:33 EST


On 5/3/22 08:37, Daniel Harding wrote:
[Resend with a smaller trace]

On 5/3/22 02:14, Pavel Begunkov wrote:
On 5/2/22 19:49, Daniel Harding wrote:
On 5/2/22 20:40, Pavel Begunkov wrote:
On 5/2/22 18:00, Jens Axboe wrote:
On 5/2/22 7:59 AM, Jens Axboe wrote:
On 5/2/22 7:36 AM, Daniel Harding wrote:
On 5/2/22 16:26, Jens Axboe wrote:
On 5/2/22 7:17 AM, Daniel Harding wrote:
I use lxc-4.0.12 on Gentoo, built with io-uring support
(--enable-liburing), targeting liburing-2.1.  My kernel config is a
very lightly modified version of Fedora's generic kernel config. After
moving from the 5.16.x series to the 5.17.x kernel series, I started
noticed frequent hangs in lxc-stop.  It doesn't happen 100% of the
time, but definitely more than 50% of the time. Bisecting narrowed
down the issue to commit aa43477b040251f451db0d844073ac00a8ab66ee:
io_uring: poll rework. Testing indicates the problem is still present
in 5.18-rc5. Unfortunately I do not have the expertise with the
codebases of either lxc or io-uring to try to debug the problem
further on my own, but I can easily apply patches to any of the
involved components (lxc, liburing, kernel) and rebuild for testing or
validation.  I am also happy to provide any further information that
would be helpful with reproducing or debugging the problem.
Do you have a recipe to reproduce the hang? That would make it
significantly easier to figure out.

I can reproduce it with just the following:

     sudo lxc-create --n lxc-test --template download --bdev dir --dir /var/lib/lxc/lxc-test/rootfs -- -d ubuntu -r bionic -a amd64
     sudo lxc-start -n lxc-test
     sudo lxc-stop -n lxc-test

The lxc-stop command never exits and the container continues running.
If that isn't sufficient to reproduce, please let me know.

Thanks, that's useful! I'm at a conference this week and hence have
limited amount of time to debug, hopefully Pavel has time to take a look
at this.

Didn't manage to reproduce. Can you try, on both the good and bad
kernel, to do:

Same here, it doesn't reproduce for me
OK, sorry it wasn't something simple.
# echo 1 > /sys/kernel/debug/tracing/events/io_uring/enable

run lxc-stop

# cp /sys/kernel/debug/tracing/trace ~/iou-trace

so we can see what's going on? Looking at the source, lxc is just using
plain POLL_ADD, so I'm guessing it's not getting a notification when it
expects to, or it's POLL_REMOVE not doing its job. If we have a trace
from both a working and broken kernel, that might shed some light on it.
It's late in my timezone, but I'll try to work on getting those traces tomorrow.

I think I got it, I've attached a trace.

What's interesting is that it issues a multi shot poll but I don't
see any kind of cancellation, neither cancel requests nor task/ring
exit. Perhaps have to go look at lxc to see how it's supposed
to work

Yes, that looks exactly like my bad trace.  I've attached good trace (captured with linux-5.16.19) and a bad trace (captured with linux-5.17.5).  These are the differences I noticed with just a visual scan:

* Both traces have three io_uring_submit_sqe calls at the very beginning, but in the good trace, there are further io_uring_submit_sqe calls throughout the trace, while in the bad trace, there are none.
* The good trace uses a mask of c3 for io_uring_task_add much more often than the bad trace:  the bad trace uses a mask of c3 only for the very last call to io_uring_task_add, but a mask of 41 for the other calls.
* In the good trace, many of the io_uring_complete calls have a result of 195, while in the bad trace, they all have a result of 1.

I don't know whether any of those things are significant or not, but that's what jumped out at me.

I have also attached a copy of the script I used to generate the traces.  If there is anything further I can to do help debug, please let me know.

Good observations! thanks for traces.

It sounds like multi-shot poll requests were getting downgraded
to one-shot, which is a valid behaviour and was so because we
didn't fully support some cases. If that's the reason, than
the userspace/lxc is misusing the ABI. At least, that's the
working hypothesis for now, need to check lxc.

--
Pavel Begunkov