Re: failed to set xfermode (err_mask=0x40): READ LOG DMA EXT failed, trying PIO

From: Paul Menzel
Date: Sun May 05 2019 - 04:41:54 EST


Dear Linux folks,


On 01.05.19 11:34, Paul Menzel wrote:

On an MSI B350M MORTAR (MS-7A37) with all firmwares, and the AHCI SSD Crucial MX500, *sometimes* boot and resume are delayed by four to five seconds, and the log contains the message *failed to set xfermode*.

[ÂÂÂ 0.474150] ata9: SATA link up 6.0 Gbps (SStatus 133 SControl 0)
[ÂÂÂ 0.474267] microcode: Microcode Update Driver: v2.2.
[ÂÂÂ 0.474294] AVX2 version of gcm_enc/dec engaged.
[ÂÂÂ 0.474295] AES CTR mode by8 optimization enabled
[ÂÂÂ 0.474304] ata9.00: supports DRM functions and may not be fully accessible
[ÂÂÂ 0.474368] ata9.00: ATA-10: CT1000MX500SSD4, M3CR020, max UDMA/133
[ÂÂÂ 0.474370] ata9.00: 1953525168 sectors, multi 1: LBA48 NCQ (depth 32), AA
[ÂÂÂ 0.474569] ata9.00: READ LOG DMA EXT failed, trying PIO
[ÂÂÂ 0.474571] ata9.00: failed to get Identify Device Data, Emask 0x40
[ÂÂÂ 0.474572] ata9.00: ATA Identify Device Log not supported
[ÂÂÂ 0.474573] ata9.00: Security Log not supported
[ÂÂÂ 0.474576] ata9.00: failed to set xfermode (err_mask=0x40)
[ÂÂÂ 0.493193] sched_clock: Marking stable (504159603, -10991650)->(615393192, -122225239)
[ÂÂÂ 0.493768] registered taskstats version 1
[ÂÂÂ 0.493770] Loading compiled-in X.509 certificates
[ÂÂÂ 0.497894] Loaded X.509 cert 'Build time autogenerated kernel key: 4682756457c0b7b167f0b726943cf53ade93bd78'
[ÂÂÂ 0.497984] zswap: loaded using pool lzo/zbud
[ÂÂÂ 0.498237] kmemleak: Kernel memory leak detector initialized
[ÂÂÂ 0.498243] kmemleak: Automatic memory scanning thread started
[ÂÂÂ 0.506469] Key type big_key registered
[ÂÂÂ 0.508864]ÂÂ Magic number: 11:580:189
[ÂÂÂ 0.508914] acpi device:48: hash matches
[ÂÂÂ 0.508922] acpi device:1b: hash matches
[ÂÂÂ 0.509017] rtc_cmos 00:02: setting system clock to 2019-03-29T17:09:41 UTC (1553879381)
[ÂÂÂ 0.509485] After kernel_init_freeable
[ÂÂÂ 1.364358] tsc: Refined TSC clocksource calibration: 3499.982 MHz
[ÂÂÂ 1.364370] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x32734013f67, max_idle_ns: 440795323134 ns
[ÂÂÂ 1.364419] clocksource: Switched to clocksource tsc
[ÂÂÂ 5.593141] ata9: SATA link up 6.0 Gbps (SStatus 133 SControl 0)
[ÂÂÂ 5.593270] ata9.00: supports DRM functions and may not be fully accessible
[ÂÂÂ 5.594006] ata9.00: supports DRM functions and may not be fully accessible
[ÂÂÂ 5.594600] ata9.00: configured for UDMA/133
[ÂÂÂ 5.595893] scsi 8:0:0:0: Direct-AccessÂÂÂÂ ATAÂÂÂÂÂ CT1000MX500SSD4Â 020Â PQ: 0 ANSI: 5
[ÂÂÂ 5.596611] sd 8:0:0:0: [sda] 1953525168 512-byte logical blocks: (1.00 TB/932 GiB)
[ÂÂÂ 5.596613] sd 8:0:0:0: [sda] 4096-byte physical blocks
[ÂÂÂ 5.596626] sd 8:0:0:0: [sda] Write Protect is off
[ÂÂÂ 5.596629] sd 8:0:0:0: [sda] Mode Sense: 00 3a 00 00
[ÂÂÂ 5.596650] sd 8:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[ÂÂÂ 5.596837] sd 8:0:0:0: Attached scsi generic sg0 type 0
[ÂÂÂ 5.597329]Â sda: sda1 sda2 sda3
[ÂÂÂ 5.599079] sd 8:0:0:0: [sda] Attached SCSI disk

This happens with all Linux kernel versions â tested since 4.14. Searching for this error on the Web, several people seem to be affected with different boards and drives [1].

In a lot of those reports, I also see the message *Security Log not supported*.

Do you have an idea, why this is only happening sometimes? Is it likely a drive firmware issue? As it affects several vendors, can Linux work around this problem (if it can be pinpointed)?

The first error seems to be `READ LOG DMA EXT failed, trying PIO`. The code looks like below [2].

retry:
ata_tf_init(dev, &tf);
if (dev->dma_mode && ata_id_has_read_log_dma_ext(dev->id) &&
!(dev->horkage & ATA_HORKAGE_NO_DMA_LOG)) {
tf.command = ATA_CMD_READ_LOG_DMA_EXT;
tf.protocol = ATA_PROT_DMA;
dma = true;
} else {
tf.command = ATA_CMD_READ_LOG_EXT;
tf.protocol = ATA_PROT_PIO;
dma = false;
}
tf.lbal = log;
tf.lbam = page;
tf.nsect = sectors;
tf.hob_nsect = sectors >> 8;
tf.flags |= ATA_TFLAG_ISADDR | ATA_TFLAG_LBA48 | ATA_TFLAG_DEVICE;

err_mask = ata_exec_internal(dev, &tf, NULL, DMA_FROM_DEVICE,
buf, sectors * ATA_SECT_SIZE, 0);

if (err_mask && dma) {
dev->horkage |= ATA_HORKAGE_NO_DMA_LOG;
ata_dev_warn(dev, "READ LOG DMA EXT failed, trying PIO\n");
goto retry;
}

DPRINTK("EXIT, err_mask=%x\n", err_mask);
return err_mask;

So when trying PIO, that also seems to fail, and the error mask 0x40 is propagated up.

enum ata_completion_errors {
AC_ERR_DEV = (1 << 0), /* device reported error */
AC_ERR_HSM = (1 << 1), /* host state machine violation */
AC_ERR_TIMEOUT = (1 << 2), /* timeout */
AC_ERR_MEDIA = (1 << 3), /* media error */
AC_ERR_ATA_BUS = (1 << 4), /* ATA bus error */
AC_ERR_HOST_BUS = (1 << 5), /* host bus error */
AC_ERR_SYSTEM = (1 << 6), /* system error */
AC_ERR_INVALID = (1 << 7), /* invalid argument */
AC_ERR_OTHER = (1 << 8), /* unknown */
AC_ERR_NODEV_HINT = (1 << 9), /* polling device detection hint */
AC_ERR_NCQ = (1 << 10), /* marker for offending NCQ qc */
};

So 0x40 is AC_ERR_HOST_BUS, right? Any idea, what might cause this?

Would it be helpful to enable the debug messages?

#undef ATA_DEBUG /* debugging output */


Kind regards,

Paul


[1]: https://bugzilla.kernel.org/show_bug.cgi?id=195895[2]:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/ata/libata-core.c?id=1daa0449d287a109b93c4516914eddeff4baff65#n2075
[3]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/linux/libata.h?id=1daa0449d287a109b93c4516914eddeff4baff65#n508