RE: [bisected] 5.12-rc1 hpsa regression: "scsi: hpsa: Correct dev cmds outstanding for retried cmds" breaks hpsa P600

From: Don.Brace
Date: Wed Mar 03 2021 - 14:07:11 EST


-----Original Message-----
From: Sergei Trofimovich [mailto:slyich@xxxxxxxxx]
Sent: Wednesday, March 3, 2021 2:56 AM
To: John Paul Adrian Glaubitz <glaubitz@xxxxxxxxxxxxxxxxxxx>; Don Brace - C33706 <Don.Brace@xxxxxxxxxxxxx>; storagedev <storagedev@xxxxxxxxxxxxx>; linux-scsi@xxxxxxxxxxxxxxx
Cc: linux-ia64@xxxxxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx; Joe Szczypek <jszczype@xxxxxxxxxx>; Scott Benesh - C33703 <Scott.Benesh@xxxxxxxxxxxxx>; Scott Teel - C33730 <Scott.Teel@xxxxxxxxxxxxx>; Tomas Henzl <thenzl@xxxxxxxxxx>; Martin K. Petersen <martin.petersen@xxxxxxxxxx>
Subject: Re: [bisected] 5.12-rc1 hpsa regression: "scsi: hpsa: Correct dev cmds outstanding for retried cmds" breaks hpsa P600

EXTERNAL EMAIL: Do not click links or open attachments unless you know the content is safe

On Wed, 3 Mar 2021 00:22:36 +0000
Sergei Trofimovich <slyich@xxxxxxxxx> wrote:

> On Tue, 2 Mar 2021 23:31:32 +0100
> John Paul Adrian Glaubitz <glaubitz@xxxxxxxxxxxxxxxxxxx> wrote:
>
> > Hi Sergei!
> >
> > On 3/2/21 11:26 PM, Sergei Trofimovich wrote:
> > > Gave v5.12-rc1 a try today and got a similar boot failure around
> > > hpsa queue initialization, but my failure is later:
> > > https://dev.gentoo.org/~slyfox/configs/guppy-dmesg-5.12-rc1
> > > Maybe I get different error because I flipped on most debugging
> > > kernel options :)
> > >
> > > Looks like 'ERROR: Invalid distance value range' while being very
> > > scary are harmless. It's just a new spammy way for kernel to
> > > report lack of NUMA config on the machine (no SRAT and SLIT ACPI
> > > tables).
> > >
> > > At least I get hpsa detected on PCI bus. But I guess it's
> > > discovered configuration is very wrong as I get unaligned accesses:
> > > [ 19.811570] kernel unaligned access to 0xe000000105dd8295, ip=0xa000000100b874d1

Running pahole before the patch:

struct CommandList {
struct CommandListHeader Header; /* 0 20 */
struct RequestBlock Request; /* 20 20 */
struct ErrDescriptor ErrDesc; /* 40 12 */
struct SGDescriptor SG[32]; /* 52 512 */
/* --- cacheline 8 boundary (512 bytes) was 52 bytes ago --- */
u32 busaddr; /* 564 4 */
struct ErrorInfo * err_info; /* 568 8 */
/* --- cacheline 9 boundary (576 bytes) --- */
struct ctlr_info * h; /* 576 8 */
int cmd_type; /* 584 4 */
long int cmdindex; /* 588 8 */
struct completion * waiting; /* 596 8 */
struct scsi_cmnd * scsi_cmd; /* 604 8 */
struct work_struct work; /* 612 32 */
/* --- cacheline 10 boundary (640 bytes) was 4 bytes ago --- */
struct hpsa_scsi_dev_t * phys_disk; /* 644 8 */
int abort_pending; /* 652 4 */
struct hpsa_scsi_dev_t * device; /* 656 8 */
atomic_t refcount; /* 664 4 */

/* size: 768, cachelines: 12, members: 16 */
/* padding: 100 */
} __attribute__((__aligned__(128)));

Pahole after the patch:

struct CommandList {
struct CommandListHeader Header; /* 0 20 */
struct RequestBlock Request; /* 20 20 */
struct ErrDescriptor ErrDesc; /* 40 12 */
struct SGDescriptor SG[32]; /* 52 512 */
/* --- cacheline 8 boundary (512 bytes) was 52 bytes ago --- */
u32 busaddr; /* 564 4 */
struct ErrorInfo * err_info; /* 568 8 */
/* --- cacheline 9 boundary (576 bytes) --- */
struct ctlr_info * h; /* 576 8 */
int cmd_type; /* 584 4 */
long int cmdindex; /* 588 8 */
struct completion * waiting; /* 596 8 */
struct scsi_cmnd * scsi_cmd; /* 604 8 */
struct work_struct work; /* 612 32 */
/* --- cacheline 10 boundary (640 bytes) was 4 bytes ago --- */
struct hpsa_scsi_dev_t * phys_disk; /* 644 8 */
struct hpsa_scsi_dev_t * device; /* 652 8 */
bool retry_pending; /* 660 1 */
atomic_t refcount; /* 661 4 */

/* size: 768, cachelines: 12, members: 16 */
/* padding: 103 */
} __attribute__((__aligned__(128)));

So, I did replace abort_pending field (an int) with retry_pending (bool)
Can I send you a patch to just rename abort_pending to retry_pending using the same type and position?
It will mean some minor code changes in the driver...

With the above changes, pahole is the same
struct CommandList {
struct CommandListHeader Header; /* 0 20 */
struct RequestBlock Request; /* 20 20 */
struct ErrDescriptor ErrDesc; /* 40 12 */
struct SGDescriptor SG[32]; /* 52 512 */
/* --- cacheline 8 boundary (512 bytes) was 52 bytes ago --- */
u32 busaddr; /* 564 4 */
struct ErrorInfo * err_info; /* 568 8 */
/* --- cacheline 9 boundary (576 bytes) --- */
struct ctlr_info * h; /* 576 8 */
int cmd_type; /* 584 4 */
long int cmdindex; /* 588 8 */
struct completion * waiting; /* 596 8 */
struct scsi_cmnd * scsi_cmd; /* 604 8 */
struct work_struct work; /* 612 32 */
/* --- cacheline 10 boundary (640 bytes) was 4 bytes ago --- */
struct hpsa_scsi_dev_t * phys_disk; /* 644 8 */
int retry_pending; /* 652 4 */
struct hpsa_scsi_dev_t * device; /* 656 8 */
atomic_t refcount; /* 664 4 */

/* size: 768, cachelines: 12, members: 16 */
/* padding: 100 */
} __attribute__((__aligned__(128)));


> > >
> > > Bisecting now.
> >
> > Sounds good. I guess we should get Jens' fix for the signal
> > regression merged as well as your two fixes for strace.
>
> "bisected" (cheated halfway through) and verified that reverting
> f749d8b7a9896bc6e5ffe104cc64345037e0b152 makes rx3600 boot again.
>
> CCing authors who might be able to help us here.
>
> commit f749d8b7a9896bc6e5ffe104cc64345037e0b152
> Author: Don Brace <don.brace@xxxxxxxxxxxxx>
> Date: Mon Feb 15 16:26:57 2021 -0600
>
> scsi: hpsa: Correct dev cmds outstanding for retried cmds
>
> Prevent incrementing device->commands_outstanding for ioaccel command
> retries that are driver initiated. If the command goes through the retry
> path, the device->commands_outstanding counter has already accounted for
> the number of commands outstanding to the device. Only commands going
> through function hpsa_cmd_resolve_events decrement this counter.
>
> - ioaccel commands go to either HBA disks or to logical volumes comprised
> of SSDs.
>
> The extra increment is causing device resets to hang.
>
> - Resets wait for all device outstanding commands to complete before
> returning.
>
> Replace unused field abort_pending with retry_pending. This is a
> maintenance driver so these changes have the least impact/risk.
>
> Link: https://lore.kernel.org/r/161342801747.29388.13045495968308188518.stgit@brunhilda
> Tested-by: Joe Szczypek <jszczype@xxxxxxxxxx>
> Reviewed-by: Scott Benesh <scott.benesh@xxxxxxxxxxxxx>
> Reviewed-by: Scott Teel <scott.teel@xxxxxxxxxxxxx>
> Reviewed-by: Tomas Henzl <thenzl@xxxxxxxxxx>
> Signed-off-by: Don Brace <don.brace@xxxxxxxxxxxxx>
> Signed-off-by: Martin K. Petersen <martin.petersen@xxxxxxxxxx>
>
> Don, do you happen to know why this patch caused some controller init
> failure for device
> 14:01.0 RAID bus controller: Hewlett-Packard Company Smart Array
> P600 ?
>
> Boot failure:
> https://dev.gentoo.org/~slyfox/configs/guppy-dmesg-5.12-rc1
> Boot success:
> https://dev.gentoo.org/~slyfox/configs/guppy-dmesg-5.12-rc1-good
>
> The difference between the two boots is
> f749d8b7a9896bc6e5ffe104cc64345037e0b152 reverted on top of 5.12-rc1
> in -good case.
>
> Looks like hpsa controller fails to initialize in bad case (could be a race?).

Also CCing hpsa maintainer mailing lists.

Looking more into the suspect commit
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=f749d8b7a9896bc6e5ffe104cc64345037e0b152
it roughly does the:

@@ -448,7 +448,7 @@ struct CommandList {
*/
struct hpsa_scsi_dev_t *phys_disk;

- int abort_pending;
+ bool retry_pending;
struct hpsa_scsi_dev_t *device;
atomic_t refcount; /* Must be last to avoid memset in hpsa_cmd_init() */ } __aligned(COMMANDLIST_ALIGNMENT); ...
@@ -1151,7 +1151,10 @@ static void __enqueue_cmd_and_start_io(struct ctlr_info *h, {
dial_down_lockup_detection_during_fw_flash(h, c);
atomic_inc(&h->commands_outstanding);
- if (c->device)
+ /*
+ * Check to see if the command is being retried.
+ */
+ if (c->device && !c->retry_pending)
atomic_inc(&c->device->commands_outstanding);

But I don't immediately see anything wrong with it.

--

Sergei

Attachment: hpsa-correct-dev-cmd-outstanding-for-retried-cmds-alignment-update
Description: hpsa-correct-dev-cmd-outstanding-for-retried-cmds-alignment-update