Re: [PATCH v5 0/7] libsas and drivers: NCQ error handling

From: John Garry
Date: Wed Oct 05 2022 - 04:53:54 EST


On 04/10/2022 15:04, John Garry wrote:
Notes unrelated to this patch:

Both before and after this series, this driver prints:
[ 215.845053] ata21.00: exception Emask 0x0 SAct 0xfc0000 SErr 0x0 action 0x6
[ 215.852308] ata21.00: failed command: WRITE FPDMA QUEUED
[ 215.857801] ata21.00: cmd 61/00:00:00:3a:d3/01:00:b3:04:00/40 tag 18 ncq dma 131072 out
res 43/04:00:ff:3a:d3/00:00:b3:04:00/40 Emask 0x400 (NCQ error) <F>
[ 215.874396] ata21.00: status: { DRDY SENSE ERR }
[ 215.879192] ata21.00: error: { ABRT }
[ 215.882997] ata21.00: failed command: WRITE FPDMA QUEUED
[ 215.888479] ata21.00: cmd 61/00:00:00:3b:d3/01:00:b3:04:00/40 tag 19 ncq dma 131072 out
res 00/00:00:00:00:00/00:00:00:00:00/00 Emask 0x2 (HSM violation)
[ 215.904814] ata21.00: failed command: WRITE FPDMA QUEUED
[ 215.910311] ata21.00: cmd 61/00:00:00:3c:d3/01:00:b3:04:00/40 tag 20 ncq dma 131072 out
res 00/00:00:00:00:00/00:00:00:00:00/00 Emask 0x2 (HSM violation)
[ 215.932679] ata21.00: failed command: WRITE FPDMA QUEUED
[ 215.941203] ata21.00: cmd 61/00:00:00:3d:d3/01:00:b3:04:00/40 tag 21 ncq dma 131072 out
res 00/00:00:00:00:00/00:00:00:00:00/00 Emask 0x2 (HSM violation)
[ 215.963616] ata21.00: failed command: WRITE FPDMA QUEUED
[ 215.972150] ata21.00: cmd 61/00:00:00:3e:d3/01:00:b3:04:00/40 tag 22 ncq dma 131072 out
res 00/00:00:00:00:00/00:00:00:00:00/00 Emask 0x2 (HSM violation)
[ 215.994532] ata21.00: failed command: WRITE FPDMA QUEUED
[ 216.003124] ata21.00: cmd 61/00:00:00:3f:d3/01:00:b3:04:00/40 tag 23 ncq dma 131072 out
res 00/00:00:00:00:00/00:00:00:00:00/00 Emask 0x2 (HSM violation)

HSM (Host State Machine) violation errors.

For the same SATA drive connected via AHCI this will instead give:

[ 3796.944923] ata14.00: exception Emask 0x0 SAct 0x80800003 SErr 0xc0000 action 0x0
[ 3796.959375] ata14.00: irq_stat 0x40000008
[ 3796.970140] ata14: SError: { CommWake 10B8B }
[ 3796.981231] ata14.00: failed command: WRITE FPDMA QUEUED
[ 3796.993237] ata14.00: cmd 61/00:08:00:7e:73/02:00:8e:08:00/40 tag 1 ncq dma 262144 out
res 43/04:01:00:00:00/00:00:00:00:00/40 Emask 0x1 (device error)
[ 3797.017984] ata14.00: status: { DRDY SENSE ERR }
[ 3797.026833] ata14.00: error: { ABRT }
[ 3797.034664] ata14.00: failed command: WRITE FPDMA QUEUED
[ 3797.043015] ata14.00: cmd 61/00:b8:00:60:73/0a:00:8e:08:00/40 tag 23 ncq dma 1310720 out
res 43/04:00:df:67:73/00:00:8e:08:00/40 Emask 0x400 (NCQ error) <F>
[ 3797.065224] ata14.00: status: { DRDY SENSE ERR }
[ 3797.072914] ata14.00: error: { ABRT }
[ 3797.079598] ata14.00: failed command: WRITE FPDMA QUEUED
[ 3797.087920] ata14.00: cmd 61/00:f8:00:6a:73/0a:00:8e:08:00/40 tag 31 ncq dma 1310720 out
res 43/04:00:00:00:00/00:00:00:00:00/00 Emask 0x1 (device error)
[ 3797.109800] ata14.00: status: { DRDY SENSE ERR }
[ 3797.117451] ata14.00: error: { ABRT }

device error errors.


Except for the I/O that caused the NCQ error, the remaining outstanding I/Os,
regardless if they were aborted by the drive, as a side-effect of reading the
NCQ error log (see 13.7.4 Queued Error Log (10h) in SATA 3.5a spec),
or if they were aborted by the host (by sas_ata_device_link_abort()),
I don't think it is correct to report these as HSM violation errors.

HSM violation errors are e.g. when you try to issue a command to a drive
that has ATA_BUSY bit set.
We had a similar issue for hisi_sas and solved in patch 4/7: don't set
ATA_ERR in the fis for those IO which complete with error, but abort the
IO via sas_abort_task().

For pm80xx the IO is either rejected (actually completes with rejection)
or is aborted via internal abort command. Maybe we can do similar for
pm8001 as we allow the IO to complete in both cases with error. I'll check.

Hi Niklas,

Could you try a change like this on top:

void sas_ata_device_link_abort(struct domain_device *device, bool force_reset)
{
struct ata_port *ap = device->sata_dev.ap;
struct ata_link *link = &ap->link;

+ device->sata_dev.fis[2] = ATA_ERR | ATA_DRDY;
+ device->sata_dev.fis[3] = 0x04;

link->eh_info.err_mask |= AC_ERR_DEV;
if (force_reset)
link->eh_info.action |= ATA_EH_RESET;
ata_link_abort(link);
}
EXPORT_SYMBOL_GPL(sas_ata_device_link_abort);

I tried it myself and it looked to work ok, except I have a problem with my arm64 system in that the read log ext times-out and all TF show "device error", like:

[ 350.257870] ata1.00: qc timeout (cmd 0x47)
[ 350.262054] pm80xx0:: mpi_sata_completion 2293: task null, freeing CCB tag 2
[ 350.269128] ata1.00: Read log 0x10 page 0x00 failed, Emask 0x40

[ 350.281581] ata1: failed to read log page 10h (errno=-5)
[ 350.577181] ata1.00: exception Emask 0x1 SAct 0xffffffff SErr 0x0 action 0x6 frozen
[ 350.584836] ata1.00: failed command: READ FPDMA QUEUED
[ 350.589970] ata1.00: cmd 60/00:00:80:26:00/01:00:00:00:00/40 tag 0 ncq dma 131072 in
res 41/04:00:00:00:00/00:00:00:00:00/00 Emask 0x1 (device error)
[ 350.605533] ata1.00: status: { DRDY ERR }
[ 350.609541] ata1.00: error: { ABRT }
[ 350.613115] ata1.00: failed command: READ FPDMA QUEUED
[ 350.618248] ata1.00: cmd 60/00:00:80:26:00/01:00:00:00:00/40 tag 1 ncq dma 131072 in
res 41/04:00:00:00:00/00:00:00:00:00/00 Emask 0x1 (device error)
[ 350.633809] ata1.00: status: { DRDY ERR
[ 350.637813] ata1.00: error: { ABRT }
[ 350.641384] ata1.00: failed command: READ FPDMA QUEUED
[ 350.646515] ata1.00: cmd 60/00:00:80:26:00/01:00:00:00:00/40 tag 2 ncq dma 131072 in
res 41/04:00:00:00:00/00:00:00:00:00/00 Emask 0x1 (device error)
[ 350.662076] ata1.00: status: { DRDY ERR
[ 350.666080] ata1.00: error: { ABRT }
[ 350.669652] ata1.00: failed command: READ FPDMA QUEUED
[ 350.674784] ata1.00: cmd 60/00:00:d8:26:00/01:00:00:00:00/40 tag 3 ncq dma 131072 in
res 41/04:00:00:00:00/00:00:00:00:00/00 Emask 0x1 (device error)
[ 350.690344] ata1.00: status: { DRDY ERR
[ 350.694348] ata1.00: error: { ABRT }
[ 350.697919] ata1.00: failed command: READ FPDMA QUEUED
[ 350.703051] ata1.00: cmd 60/00:00:e0:26:00/01:00:00:00:00/40 tag 4 ncq dma 131072 in
res 41/04:00:00:00:00/00:00:00:00:00/00 Emask 0x1 (device error)
[ 350.718612] ata1.00: status: { DRDY ERR
[ 350.722623] ata1.00: error: { ABRT }
[ 350.726196] ata1.00: failed command: READ FPDMA QUEUED
[ 350.731329] ata1.00: cmd 60/00:00:c8:26:00/01:00:00:00:00/40 tag 5 ncq dma 131072 in
res 41/04:00:00:00:00/00:00:00:00:00/00 Emask 0x1 (device error)

...


Thanks,
John