Re: Intermittent performance regression related to ipset between 5.10 and 5.15

From: Jozsef Kadlecsik
Date: Sat Jul 02 2022 - 13:40:53 EST


Hi,

On Thu, 30 Jun 2022, Jakub Kicinski wrote:

> Sounds like you're pretty close to figuring this out! Can you check
> if the user space is intentionally setting IPSET_ATTR_INITVAL?
> Either that or IPSET_ATTR_GC was not as "unused" as initially thought.

IPSET_ATTR_GC was really unused. It was an old remnant from the time when
ipset userspace-kernel communication was through set/getsockopt. However,
when it was migrated to netlink, just the symbol was kept but it was not
used either with the userspace tool or the kernel.

Aaron, could you send me how to reproduce the issue? I have no idea how
that patch could be the reason. Setting/getting/using IPSET_ATTR_INITVAL
is totally independent from listing iptables rules. But if you have got a
reproducer then I can dig into it.

Best regards,
Jozsef

> Testing something like this could be a useful data point:
>
> diff --git a/include/uapi/linux/netfilter/ipset/ip_set.h b/include/uapi/linux/netfilter/ipset/ip_set.h
> index 6397d75899bc..7caf9b53d2a7 100644
> --- a/include/uapi/linux/netfilter/ipset/ip_set.h
> +++ b/include/uapi/linux/netfilter/ipset/ip_set.h
> @@ -92,7 +92,7 @@ enum {
> /* Reserve empty slots */
> IPSET_ATTR_CADT_MAX = 16,
> /* Create-only specific attributes */
> - IPSET_ATTR_INITVAL, /* was unused IPSET_ATTR_GC */
> + IPSET_ATTR_GC,
> IPSET_ATTR_HASHSIZE,
> IPSET_ATTR_MAXELEM,
> IPSET_ATTR_NETMASK,
> @@ -104,6 +104,8 @@ enum {
> IPSET_ATTR_REFERENCES,
> IPSET_ATTR_MEMSIZE,
>
> + IPSET_ATTR_INITVAL,
> +
> __IPSET_ATTR_CREATE_MAX,
> };
> #define IPSET_ATTR_CREATE_MAX (__IPSET_ATTR_CREATE_MAX - 1)
>
>
> On Thu, 30 Jun 2022 14:59:14 +0000 U'ren, Aaron wrote:
> > Thorsten / Jozsef -
> >
> > Thanks for continuing to follow up! I'm sorry that this has moved so slow, it has taken us a bit to find the time to fully track this issue down, however, I think that we have figured out enough to make some more forward progress on this issue.
> >
> > Jozsef, thanks for your insight into what is happening between those system calls. In regards to your question about wait/wound mutex debugging possibly being enabled, I can tell you that we definitely don't have that enabled on any of our regular machines. While we were debugging we did turn on quite a few debug options to help us try and track this issue down and it is very possible that the strace that was taken that started off this email was taken on a machine that did have that debug option enabled. Either way though, the root issue occurs on hosts that definitely do not have wait/wound mutex debugging enabled.
> >
> > The good news is that we finally got one of our development environments into a state where we could reliably reproduce the performance issue across reboots. This was a win because it meant that we were able to do a full bisect of the kernel and were able to tell relatively quickly whether or not the issue was present in the test kernels.
> >
> > After bisecting for 3 days, I have been able to narrow it down to a single commit: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=3976ca101990ca11ddf51f38bec7b86c19d0ca6f (netfilter: ipset: Expose the initval hash parameter to userspace)
> >
> > I'm at a bit of a loss as to why this would cause such severe performance regressions, but I've proved it out multiple times now. I've even checked out a fresh version of the 5.15 kernel that we've been deploying with just this single commit reverted and found that the performance problems are completely resolved.
> >
> > I'm hoping that maybe Jozsef will have some more insight into why this seemingly innocuous commit causes such larger performance issues for us? If you have any additional patches or other things that you would like us to test I will try to leave our environment in its current state for the next couple of days so that we can do so.
> >
> > -Aaron
> >
> > From: Thorsten Leemhuis <regressions@xxxxxxxxxxxxx>
> > Date: Monday, June 20, 2022 at 2:16 AM
> > To: U'ren, Aaron <Aaron.U'ren@xxxxxxxx>
> > Cc: McLean, Patrick <Patrick.Mclean@xxxxxxxx>, Pablo Neira Ayuso <pablo@xxxxxxxxxxxxx>, netfilter-devel@xxxxxxxxxxxxxxx <netfilter-devel@xxxxxxxxxxxxxxx>, Brown, Russell <Russell.Brown@xxxxxxxx>, Rueger, Manuel <manuel.rueger@xxxxxxxx>, linux-kernel@xxxxxxxxxxxxxxx <linux-kernel@xxxxxxxxxxxxxxx>, regressions@xxxxxxxxxxxxxxx <regressions@xxxxxxxxxxxxxxx>, Florian Westphal <fw@xxxxxxxxx>, netdev@xxxxxxxxxxxxxxx <netdev@xxxxxxxxxxxxxxx>, Jozsef Kadlecsik <kadlec@xxxxxxxxxxxxx>
> > Subject: Re: Intermittent performance regression related to ipset between 5.10 and 5.15
> > On 31.05.22 09:41, Jozsef Kadlecsik wrote:
> > > On Mon, 30 May 2022, Thorsten Leemhuis wrote:
> > >> On 04.05.22 21:37, U'ren, Aaron wrote:
> > [...]
> > >
> > > Every set lookups behind "iptables" needs two getsockopt() calls: you can
> > > see them in the strace logs. The first one check the internal protocol
> > > number of ipset and the second one verifies/gets the processed set (it's
> > > an extension to iptables and therefore there's no internal state to save
> > > the protocol version number).
> >
> > Hi Aaron! Did any of the suggestions from Jozsef help to track down the
> > root case? I have this issue on the list of tracked regressions and
> > wonder what the status is. Or can I mark this as resolved?
> >
> > Side note: this is not a "something breaks" regressions and it seems to
> > progress slowly, so I'm putting it on the backburner:
> >
> > #regzbot backburner: performance regression where the culprit is hard to
> > track down
> >
> > Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
> >
> > P.S.: As the Linux kernel's regression tracker I deal with a lot of
> > reports and sometimes miss something important when writing mails like
> > this. If that's the case here, don't hesitate to tell me in a public
> > reply, it's in everyone's interest to set the public record straight.
> >
> > [...]
> > >
> > > In your strace log
> > >
> > > 0.000024 getsockopt(4, SOL_IP, 0x53 /* IP_??? */, "\0\1\0\0\7\0\0\0", [8]) = 0 <0.000024>
> > > 0.000046 getsockopt(4, SOL_IP, 0x53 /* IP_??? */, "\7\0\0\0\7\0\0\0KUBE-DST-VBH27M7NWLDOZIE"..., [40]) = 0 <0.1$
> > > 0.109456 close(4)                  = 0 <0.000022>
> > >
> > > the only things which happen in the second sockopt function are to lock
> > > the NFNL_SUBSYS_IPSET mutex, walk the array of the sets, compare the
> > > setname, save the result in the case of a match and unlock the mutex.
> > > Nothing complicated, no deep, multi-level function calls. Just a few line
> > > of codes which haven't changed.
> > >
> > > The only thing which can slow down the processing is the mutex handling.
> > > Don't you have accidentally wait/wound mutex debugging enabled in the
> > > kernel? If not, then bisecting the mutex related patches might help.
> > >
> > > You wrote that flushing tables or ipsets didn't seem to help. That
> > > literally meant flushing i.e. the sets were emptied but not destroyed? Did
> > > you try both destroying or flushing?
> > >
> > >> Jozsef, I still have this issue on my list of tracked regressions and it
> > >> looks like nothing happens since above mail (or did I miss it?). Could
> > >> you maybe provide some guidance to Aaron to get us all closer to the
> > >> root of the problem?
> > >
> > > I really hope it's an accidentally enabled debugging option in the kernel.
> > > Otherwise bisecting could help to uncover the issue.
> > >
> > > Best regards,
> > > Jozsef
> > >
> > >> P.S.: As the Linux kernel's regression tracker I deal with a lot of
> > >> reports and sometimes miss something important when writing mails like
> > >> this. If that's the case here, don't hesitate to tell me in a public
> > >> reply, it's in everyone's interest to set the public record straight.
>

-
E-mail : kadlec@xxxxxxxxxxxxxxxxx, kadlecsik.jozsef@xxxxxxxxx
PGP key : https://wigner.hu/~kadlec/pgp_public_key.txt
Address : Wigner Research Centre for Physics
H-1525 Budapest 114, POB. 49, Hungary