Re: [PATCH 4.19 000/306] 4.19.87-stable review

From: Lukas Bulwahn
Date: Mon Jan 27 2020 - 16:16:29 EST




On Mon, 27 Jan 2020, Jouni HÃgander wrote:

> Lukas Bulwahn <lukas.bulwahn@xxxxxxxxx> writes:
>
> > On Wed, 22 Jan 2020, Jouni HÃgander wrote:
> >
> >> Greg Kroah-Hartman <gregkh@xxxxxxxxxxxxxxxxxxx> writes:
> >> >> > Now queued up, I'll push out -rc2 versions with this fix.
> >> >> >
> >> >> > greg k-h
> >> >>
> >> >> We have also been informed about another regression these two commits
> >> >> are causing:
> >> >>
> >> >> https://lore.kernel.org/lkml/ace19af4-7cae-babd-bac5-cd3505dcd874@xxxxxxxxxxxxxxxxxxx/
> >> >>
> >> >> I suggest to drop these two patches from this queue, and give us a
> >> >> week to shake out the regressions of the change, and once ready, we
> >> >> can include the complete set of fixes to stable (probably in a week or
> >> >> two).
> >> >
> >> > Ok, thanks for the information, I've now dropped them from all of the
> >> > queues that had them in them.
> >> >
> >> > greg k-h
> >>
> >> I have now run more extensive Syzkaller testing on following patches:
> >>
> >> cb626bf566eb net-sysfs: Fix reference count leak
> >> ddd9b5e3e765 net-sysfs: Call dev_hold always in rx_queue_add_kobject
> >> e0b60903b434 net-sysfs: Call dev_hold always in netdev_queue_add_kobje
> >> 48a322b6f996 net-sysfs: fix netdev_queue_add_kobject() breakage
> >> b8eb718348b8 net-sysfs: Fix reference count leak in rx|netdev_queue_add_kobject
> >>
> >> These patches are fixing couple of memory leaks including this one found
> >> by Syzbot: https://syzkaller.appspot.com/bug?extid=ad8ca40ecd77896d51e2
> >>
> >> I can reproduce these memory leaks in following stable branches: 4.14,
> >> 4.19, and 5.4.
> >>
> >> These are all now merged into net/master tree and based on my testing
> >> they are ready to be taken into stable branches as well.
> >>
> >
> > + syzkaller list
> > Jouni et. al, please drop Linus in further responses; Linus, it was wrong
> > to add you to this thread in the first place (reason is explained below)
> >
> > Jouni, thanks for investigating.
> >
> > It raises the following questions and comments:
> >
> > - Does the memory leak NOT appear on 4.9 and earlier LTS branches (or did
> > you not check that)? If it does not appear, can you bisect it with the
> > reproducer to the commit between 4.14 and 4.9?
>
> I tested and these memory leaks are not reproucible in 4.9 and earlier.
>
> >
> > - Do the reproducers you found with your syzkaller testing show the same
> > behaviour (same bisection) as the reproducers from syzbot?
>
> Yes, they are same.
>
> >
> > - I fear syzbot's automatic bisection on is wrong, and Linus' commit
> > 0e034f5c4bc4 ("iwlwifi: fix mis-merge that breaks the driver") is not to
> > blame here; that commit did not cause the memory leak, but fixed some
> > unrelated issue that simply confuses syzbot's automatic bisection.
> >
> > Just FYI: Dmitry Vyukov's evaluation of the syzbot bisection shows that
> > about 50% are wrong, e.g., due to multiple bugs being triggered with one
> > reproducer and the difficulty of automatically identifying them of being
> > different due to different root causes (despite the smart heuristics of
> > syzkaller & syzbot). So, to identify the actual commit on which the memory
> > leak first appeared, you need to bisect manually with your own judgement
> > if the reported bug stack trace fits to the issue you investigating. Or
> > you use syzbot's automatic bisection but then with a reduced kernel config
> > that cannot be confused by other issues. You might possibly also hit a
> > "beginning of time" in your bisection, where KASAN was simply not
> > supported, then the initially causing commit can simply not determined by
> > bisection with the reproducer and needs some code inspection and
> > archaeology with git. Can you go ahead try to identify the correct commit
> > for this issue?
>
> These two commits (that are not in 4.9 and earlier) are intorducing these leaks:
>
> commit e331c9066901dfe40bea4647521b86e9fb9901bb
> Author: YueHaibing <yuehaibing@xxxxxxxxxx>
> Date: Tue Mar 19 10:16:53 2019 +0800
>
> net-sysfs: call dev_hold if kobject_init_and_add success
>
> [ Upstream commit a3e23f719f5c4a38ffb3d30c8d7632a4ed8ccd9e ]
>
> In netdev_queue_add_kobject and rx_queue_add_kobject,
> if sysfs_create_group failed, kobject_put will call
> netdev_queue_release to decrease dev refcont, however
> dev_hold has not be called. So we will see this while
> unregistering dev:
>
> unregister_netdevice: waiting for bcsh0 to become free. Usage count = -1
>
> Reported-by: Hulk Robot <hulkci@xxxxxxxxxx>
> Fixes: d0d668371679 ("net: don't decrement kobj reference count on init fail
> ure")
> Signed-off-by: YueHaibing <yuehaibing@xxxxxxxxxx>
> Signed-off-by: David S. Miller <davem@xxxxxxxxxxxxx>
> Signed-off-by: Greg Kroah-Hartman <gregkh@xxxxxxxxxxxxxxxxxxx>
>
> commit d0d6683716791b2a2761a1bb025c613eb73da6c3
> Author: stephen hemminger <stephen@xxxxxxxxxxxxxxxxxx>
> Date: Fri Aug 18 13:46:19 2017 -0700
>
> net: don't decrement kobj reference count on init failure
>
> If kobject_init_and_add failed, then the failure path would
> decrement the reference count of the queue kobject whose reference
> count was already zero.
>
> Fixes: 114cf5802165 ("bql: Byte queue limits")
> Signed-off-by: Stephen Hemminger <sthemmin@xxxxxxxxxxxxx>
> Signed-off-by: David S. Miller <davem@xxxxxxxxxxxxx>
>

But, it seems that we now have just a long sequences of fix patches.

This commit from 2011 seems to be the initial buggy one:

commit 114cf5802165ee93e3ab461c9c505cd94a08b800
Author: Tom Herbert <therbert@xxxxxxxxxx>
Date: Mon Nov 28 16:33:09 2011 +0000

bql: Byte queue limits

And then we just have fixes over fixes:

114cf5802165ee93e3ab461c9c505cd94a08b800
fixed by d0d6683716791b2a2761a1bb025c613eb73da6c3
fixed by a3e23f719f5c4a38ffb3d30c8d7632a4ed8ccd9e
fixed by the sequence of your five patches, mentioned above


If that is right, we should be able to find a reproducer with syzkaller on
the versions before d0d668371679 ("net: don't decrement kobj reference
count on init failure") with fault injection enabled or some manually
injected fault by modifying the source code to always fail on init to
really trigger the init failure, and see the reference count go below
zero.

All further issues should also have reproducers found with syzkaller.
If we have a good feeling on the reproducers and this series of fixes
really fixed the issue now here for all cases, we should suggest to
backport all of the fixes to 4.4 and 4.9.

We should NOT just have Greg pick up a subset of the patches and backport
them to 4.4 and 4.9, that will likely break more than it fixes.

Jouni, did you see Greg's bot inform you that he would pick up your latest
patch for 4.4 and 4.9? Please respond to those emails to make sure a
complete set of patches is picked up, which we tested with all those
intermediate reproducers and an extensive syzkaller run hitting the
net-sysfs interface (e.g., by configuring the corpus and check coverage).

If you cannot do this testing for 4.4 and 4.9 now quickly (you
potentially have less than 24 hours), we should hold those new patches
back for 4.4 and 4.9, as none of the fixes seem to be applied at all right
now and the users have not complained yet on 4.4 and 4.9.
Once testing of the whole fix sequence is done, we request to backport all
patches at once for 4.4 and 4.9.

Lukas