Re: [PATCH v3 0/1] aacraid: Host adapter Adaptec 6405 constantly resets under high io load
From: James Hilliard
Date: Wed Nov 16 2022 - 16:56:34 EST
On Tue, Nov 15, 2022 at 10:05 AM <Sagar.Biradar@xxxxxxxxxxxxx> wrote:
>
> Hi James,
> I have looked into the patch thoroughly.
> We suspect this change might expose an old legacy interrupt issue on some processors.
I did see this error once with this patch when a drive was having issues:
[ 4306.357531] aacraid: Host adapter abort request.
               aacraid: Outstanding commands on (0,1,41,0):
[ 4335.030025] aacraid: Host adapter abort request.
               aacraid: Outstanding commands on (0,1,41,0):
[ 4335.030111] aacraid: Host adapter abort request.
               aacraid: Outstanding commands on (0,1,41,0):
[ 4335.030172] aacraid: Host adapter abort request.
               aacraid: Outstanding commands on (0,1,41,0):
[ 4335.189886] aacraid: Host bus reset request. SCSI hang ?
[ 4335.189951] aacraid 0000:81:00.0: outstanding cmd: midlevel-0
[ 4335.189989] aacraid 0000:81:00.0: outstanding cmd: lowlevel-0
[ 4335.190101] aacraid 0000:81:00.0: outstanding cmd: error handler-3
[ 4335.190141] aacraid 0000:81:00.0: outstanding cmd: firmware-0
[ 4335.190177] aacraid 0000:81:00.0: outstanding cmd: kernel-0
[ 4335.274070] aacraid 0000:81:00.0: Controller reset type is 3
[ 4335.274142] aacraid 0000:81:00.0: Issuing IOP reset
[ 4365.862127] aacraid 0000:81:00.0: IOP reset succeeded
[ 4365.895079] aacraid: Comm Interface type2 enabled
[ 4374.938119] aacraid 0000:81:00.0: Scheduling bus rescan
[ 4387.022913] sd 0:1:41:0: [sdi] 27344764928 512-byte logical blocks: (14.0 TB/12.7 TiB)
[ 4387.022988] sd 0:1:41:0: [sdi] 4096-byte physical blocks
[ 5643.714301] aacraid: Host adapter abort request.
               aacraid: Outstanding commands on (0,1,41,0):
[ 5672.349423] BUG: kernel NULL pointer dereference, address: 0000000000000018
[ 5672.351532] #PF: supervisor read access in kernel mode
[ 5672.353262] #PF: error_code(0x0000) - not-present page
[ 5672.354860] PGD 8000007ad6ac7067 P4D 8000007ad6ac7067 PUD 7af0892067 PMD 0
[ 5672.356444] Oops: 0000 [#1] SMP PTI
[ 5672.358075] CPU: 9 PID: 644201 Comm: cc1plus Tainted: P           O      5.15.64-1-pve #1
[ 5672.359749] Hardware name: Supermicro Super Server/X10DRC, BIOS 3.4 05/21/2021
[ 5672.361465] RIP: 0010:dma_direct_unmap_sg+0x49/0x1a0
[ 5672.363223] Code: ec 20 89 4d d4 4c 89 45 c8 85 d2 0f 8e bb 00 00 00 49 89 fe 49 89 f7 89 d3 45 31 ed 4c 8b 05 ae fd b0 01 49 8b be 60 02 00 00 <45> 8b 4f 18 49 8b 77 10 49 f7 d0 48 85 ff 0f 84 06 01 00 00 4c 8b
[ 5672.367024] RSP: 0000:ffffa4ff58c7cde0 EFLAGS: 00010046
[ 5672.369020] RAX: 0000000000000000 RBX: 0000000000000003 RCX: 0000000000000001
[ 5672.371073] RDX: 0000000000000003 RSI: 0000000000000000 RDI: 0000000000000000
[ 5672.373007] RBP: ffffa4ff58c7ce28 R08: 0000000000000000 R09: 0000000000000001
[ 5672.374795] R10: 0000000000000000 R11: ffffa4ff58c7cff8 R12: 0000000000000000
[ 5672.376418] R13: 0000000000000000 R14: ffff88968e1ec0d0 R15: 0000000000000000
[ 5672.378136] FS:  00007ff103d25ac0(0000) GS:ffff89547fac0000(0000) knlGS:0000000000000000
[ 5672.379760] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 5672.381402] CR2: 0000000000000018 CR3: 0000007ae90cc004 CR4: 00000000001706e0
[ 5672.383023] Call Trace:
[ 5672.384673]  <IRQ>
[ 5672.386282]  ? task_tick_fair+0x88/0x530
[ 5672.386469] aacraid: Host adapter abort request.
               aacraid: Outstanding commands on (0,1,41,0):
[ 5672.387921]  dma_unmap_sg_attrs+0x32/0x50
[ 5672.391431] aacraid: Host adapter abort request.
               aacraid: Outstanding commands on (0,1,41,0):
[ 5672.393273]  scsi_dma_unmap+0x3b/0x50
[ 5672.397079] aacraid: Host adapter abort request.
               aacraid: Outstanding commands on (0,1,41,0):
[ 5672.398180]  aac_srb_callback+0x88/0x3c0 [aacraid]
Does that look related?
>
> We are currently debugging and digging into further details so that we can explain it in more detail.
> I will keep the thread posted as soon as we have something interesting.
>
> Sagar
>
> -----Original Message-----
> From: James Hilliard <james.hilliard1@xxxxxxxxx>
> Sent: Monday, November 14, 2022 12:13 AM
> To: Sagar Biradar - C34249 <Sagar.Biradar@xxxxxxxxxxxxx>
> Cc: martin.petersen@xxxxxxxxxx; khorenko@xxxxxxxxxxxxx; christian@xxxxxxxxxxxxxx; aacraid@xxxxxxxxxxxxx; Don Brace - C33706 <Don.Brace@xxxxxxxxxxxxx>; Tom White - C33503 <Tom.White@xxxxxxxxxxxxx>; linux-scsi@xxxxxxxxxxxxxxx; Linux Kernel Mailing List <linux-kernel@xxxxxxxxxxxxxxx>
> Subject: Re: [PATCH v3 0/1] aacraid: Host adapter Adaptec 6405 constantly resets under high io load
>
> On Thu, Oct 27, 2022 at 1:17 PM <Sagar.Biradar@xxxxxxxxxxxxx> wrote:
> >
> > Hi James and Konstantin,
> >
> > *Limiting the audience to avoid spamming*
> >
> > Sorry for the delayed response; I was on vacation.
> > This one got missed somehow, as the person who was looking into it is no longer with the company.
> >
> > I will look into this. Meanwhile, I wanted to check whether you (or someone else you know) had a chance to test this thoroughly with the latest kernel.
> > I will get back to you with more questions or a confirmation within a day or two at most.
>
> Did this ever get looked at?
>
> As this exact patch was merged into the vendor aacraid driver a while ago, I'm not sure why it wouldn't be good to merge to mainline as well.
>
> Vendor aacraid release with this patch merged:
> https://download.adaptec.com/raid/aac/linux/aacraid-linux-src-1.2.1-60001.tgz
>
> >
> >
> > Thanks for your patience.
> > Sagar
> >
> >
> > -----Original Message-----
> > From: James Hilliard <james.hilliard1@xxxxxxxxx>
> > Sent: Thursday, October 27, 2022 1:40 AM
> > To: Martin K. Petersen <martin.petersen@xxxxxxxxxx>
> > Cc: Konstantin Khorenko <khorenko@xxxxxxxxxxxxx>; Christian Großegger
> > <christian@xxxxxxxxxxxxxx>; linux-scsi@xxxxxxxxxxxxxxx; Adaptec OEM
> > Raid Solutions <aacraid@xxxxxxxxxxxxx>; Sagar Biradar - C34249
> > <Sagar.Biradar@xxxxxxxxxxxxx>; Linux Kernel Mailing List
> > <linux-kernel@xxxxxxxxxxxxxxx>; Don Brace - C33706
> > <Don.Brace@xxxxxxxxxxxxx>
> > Subject: Re: [PATCH v3 0/1] aacraid: Host adapter Adaptec 6405
> > constantly resets under high io load
> >
> > On Wed, Oct 19, 2022 at 2:03 PM Konstantin Khorenko <khorenko@xxxxxxxxxxxxx> wrote:
> > >
> > > On 10.10.2022 14:31, James Hilliard wrote:
> > > > On Tue, Feb 22, 2022 at 10:41 PM Martin K. Petersen
> > > > <martin.petersen@xxxxxxxxxx> wrote:
> > > >>
> > > >>
> > > >> Christian,
> > > >>
> > > >>> The faulty patch (Commit: 395e5df79a9588abf) from 2017 should be
> > > >>> repaired with Konstantin Khorenko (1):
> > > >>>
> > > >>>    scsi: aacraid: resurrect correct arc ctrl checks for Series-6
> > > >>
> > > >> It would be great to get this patch resubmitted by Konstantin and
> > > >> acked by Microchip.
> >
> > Can we merge this as is, since Microchip does not appear to be maintaining this driver any more or responding?
> >
> > > >
> > > > Does the patch need to be rebased?
> > >
> > > James, I have just checked: the old patch (v3) applies cleanly onto the latest master branch.
> > >
> > > > Based on this it looks like someone at Microchip may have already reviewed:
> > > > v3 changes:
> > > >   * introduced another wrapper to check for devices except for Series 6
> > > >     controllers upon request from Sagar Biradar (Microchip)
> > >
> > > Well, back in 2019 I created a bug in the Red Hat Bugzilla:
> > > https://bugzilla.redhat.com/show_bug.cgi?id=1724077
> > > (the bug is private, which is the default for Red Hat bugs)
> > >
> > > In this bug Sagar Biradar (with an @microchip.com email address) suggested
> > > that I rework the patch; I did that and sent v3.
> > >
> > > Nothing happened after that, but about a year later (2020-06-19) the bug
> > > was closed with the resolution NOTABUG and a comment that S6 users will find the patch useful.
> > >
> > > I suppose S6 is so old that Red Hat simply has no customers
> > > using it, and Microchip itself is not that interested in handling issues on such old hardware.
> > >
> > > Sorry, I was unable to get a final ack from Microchip. I wrote
> > > direct emails to the addresses I could find on the internet and tried
> > > to connect via LinkedIn, with no luck.
> > >
> > > --
> > > Konstantin Khorenko