Re: [Possible REGRESSION, 4.16-rc4] Error updating SMART data during runtime and could not connect to lvmetad at some boot attempts

From: Hans de Goede
Date: Mon Mar 19 2018 - 05:33:06 EST


Hi,

On 18-03-18 23:06, Martin Steigerwald wrote:
Hi Hans.

Hans de Goede - 18.03.18, 22:34:
On 14-03-18 13:48, Martin Steigerwald wrote:
Hans de Goede - 14.03.18, 12:05:
Hi,

On 14-03-18 12:01, Martin Steigerwald wrote:
Hans de Goede - 11.03.18, 15:37:
Hi Martin,

On 11-03-18 09:20, Martin Steigerwald wrote:
Hello.

Since 4.16-rc4 (upgraded from 4.15.2 which worked) I have an issue
with SMART checks occasionally failing like this:

smartd[28017]: Device: /dev/sdb [SAT], is in SLEEP mode, suspending checks
udisksd[24408]: Error performing housekeeping for drive
/org/freedesktop/UDisks2/drives/INTEL_SSDSA2CW300G3_[…]: Error updating
SMART data: Error sending ATA command CHECK POWER MODE: Unexpected sense
data returned:#0120000: 0e 09 0c 00 00 00 ff 00 00 00 00 00 00 00 50 00
..............P.#0120010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
................#012 (g-io-error-quark, 0)
merkaba udisksd[24408]: Error performing housekeeping for drive
/org/freedesktop/UDisks2/drives/Crucial_CT480M500SSD3_[…]: Error updating
SMART data: Error sending ATA command CHECK POWER MODE: Unexpected sense
data returned:#0120000: 01 00 1d 00 00 00 0e 09 0c 00 00 00 ff 00 00 00
................#0120010: 00 00 00 00 50 00 00 00 00 00 00 00 00 00 00 00
....P...........#012 (g-io-error-quark, 0)

(Intel SSD is connected via SATA, Crucial via mSATA in a ThinkPad
T520)
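
To make the hex dumps in the udisksd messages above easier to read, here is a minimal sketch that decodes them. It assumes the bytes 09 0c mark an ATA Status Return descriptor as laid out in the SCSI/ATA Translation (SAT) standard, and that CHECK POWER MODE reports the power state in the count field (0xFF meaning active or idle). The function and constant names are made up for illustration; this is not how udisksd itself parses the data.

```python
from typing import Optional

# Sense bytes copied from the two udisksd messages (ASCII column dropped).
DUMP_INTEL = (
    "0e 09 0c 00 00 00 ff 00 00 00 00 00 00 00 50 00 "
    "00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00"
)
DUMP_CRUCIAL = (
    "01 00 1d 00 00 00 0e 09 0c 00 00 00 ff 00 00 00 "
    "00 00 00 00 50 00 00 00 00 00 00 00 00 00 00 00"
)


def parse_hex(dump: str) -> bytes:
    """Turn a whitespace-separated hex dump into raw bytes."""
    return bytes(int(token, 16) for token in dump.split())


def decode_ata_status(sense: bytes) -> Optional[dict]:
    """Find a 09h/0Ch descriptor and return the ATA registers it carries.

    Assumed layout (SAT): byte 3 = error, bytes 4-5 = count,
    byte 12 = device, byte 13 = status.
    """
    for i in range(len(sense) - 13):
        if sense[i] == 0x09 and sense[i + 1] == 0x0C:
            d = sense[i:i + 14]
            return {
                "error": d[3],
                "count": (d[4] << 8) | d[5],
                "device": d[12],
                "status": d[13],
            }
    return None


if __name__ == "__main__":
    for name, dump in (("Intel", DUMP_INTEL), ("Crucial", DUMP_CRUCIAL)):
        print(name, decode_ata_status(parse_hex(dump)))
```

Under those assumptions both dumps decode to error 0x00, count 0xFF and status 0x50, i.e. the same register values for both drives.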

However, when I then check manually with smartctl -a | -x | -H, the
device reports SMART data just fine.

As smartd correctly detects that the device is in sleep mode, this may
be a userspace issue in udisksd.

Also, at some boot attempts the boot hangs with a message like "could
not connect to lvmetad, scanning manually for devices". I use BTRFS
RAID 1 on two LVs (each on one of the SSDs), a configuration that
requires a manual adaptation of the initramfs in order to boot
(basically vgchange -ay before btrfs device scan).

I wonder whether that has to do with the new SATA LPM policy stuff,
but as I had issues with

3 => Medium power with Device Initiated PM enabled

(machine did not boot, which could also have been caused by me
accidentally removing all TCP/IP network support in the kernel with
that setting)

I set it back to

CONFIG_SATA_MOBILE_LPM_POLICY=0

(firmware settings)

Right, so with that setting the LPM policy changes are effectively
disabled and cannot explain your SMART issues.
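
To double-check which policy is actually in effect at runtime, something like the following sketch may help. It assumes the per-host sysfs attribute link_power_management_policy under /sys/class/scsi_host and the policy names libata uses for the five Kconfig values; the helper name is made up for illustration.

```python
import glob
import os

# CONFIG_SATA_MOBILE_LPM_POLICY value -> policy name reported in sysfs
# (assumed mapping; 0 and 3 are the two values discussed in this thread).
LPM_POLICIES = {
    0: "keep_firmware_settings",
    1: "max_performance",
    2: "medium_power",
    3: "med_power_with_dipm",
    4: "min_power",
}


def read_lpm_policies(sysfs_root="/sys/class/scsi_host"):
    """Return {host name: current LPM policy} for every host exposing one."""
    policies = {}
    pattern = os.path.join(sysfs_root, "host*", "link_power_management_policy")
    for path in sorted(glob.glob(pattern)):
        host = os.path.basename(os.path.dirname(path))
        with open(path) as attr:
            policies[host] = attr.read().strip()
    return policies


if __name__ == "__main__":
    for host, policy in read_lpm_policies().items():
        print(f"{host}: {policy}")
```

Note that userspace tools can change the policy after boot, so the value read here need not match the Kconfig default.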

Still I would like to zoom in on this part of your bug report, because
for Fedora 28 we are planning to ship with
CONFIG_SATA_MOBILE_LPM_POLICY=3
and AFAIK Ubuntu has similar plans.

I suspect that the issues you were seeing with
CONFIG_SATA_MOBILE_LPM_POLICY=3 were with the Crucial disk? I've
attached a patch for you to test, which disables LPM for your model of
Crucial SSD (but keeps it on for the Intel disk). If you can confirm
that with that patch you can run with CONFIG_SATA_MOBILE_LPM_POLICY=3
without issues, that would be great.

With 4.16-rc5 with CONFIG_SATA_MOBILE_LPM_POLICY=3 the system
successfully booted three times in a row. So feel free to add
tested-by.

Thanks.

To be clear, you're talking about 4.16-rc5 with the patch I made to
blacklist the Crucial disk I assume, not just plain 4.16-rc5, right ?

4.16-rc5 with your

0001-libata-Apply-NOLPM-quirk-to-Crucial-M500-480GB-SSDs.patch

I was about to submit this upstream and was planning on extending it to
also cover the 960GB version, which led me to do a quick Google search.
Judging from the results it seems that there are multiple firmware
versions of this SSD out there and I wonder if you are perhaps running
an older version of the firmware. If you do:

dmesg | grep Crucial_CT480M500

You should see something like this:

ata2.00: ATA-9: Crucial_CT480M500SSD3, MU03, max UDMA/133

I'm interested in the "MU03" part, what is that in your case?

Although I never updated the firmware, I do have MU03:

% lsscsi | grep Crucial
[2:0:0:0] disk ATA Crucial_CT480M50 MU03 /dev/sdb

% dmesg | grep Crucial_CT480M500
[ 2.424537] ata3.00: ATA-9: Crucial_CT480M500SSD3, MU03, max UDMA/133
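
For anyone who needs to collect this from several machines, here is a minimal sketch that pulls the model and firmware revision out of such a dmesg line. The regular expression is only an assumption based on the two lines quoted in this thread, and the function name is made up.

```python
import re

# Matches lines like:
#   ata3.00: ATA-9: Crucial_CT480M500SSD3, MU03, max UDMA/133
ATA_IDENT = re.compile(
    r"(?P<port>ata\d+\.\d+): ATA-\d+: "
    r"(?P<model>[^,]+), (?P<firmware>[^,]+), max "
)


def parse_ata_ident(line):
    """Return port, model and firmware from a libata identify line, or None."""
    match = ATA_IDENT.search(line)
    return match.groupdict() if match else None


if __name__ == "__main__":
    line = "[ 2.424537] ata3.00: ATA-9: Crucial_CT480M500SSD3, MU03, max UDMA/133"
    print(parse_ata_ident(line))
```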

Thanks. So there is an MU05 update:

www.crucial.com/wcsstore/CrucialSAS/firmware/M500/MU05/crucial-m500-iso-firmware-update-mu05-en.pdf

Which according to its changelog features:
"Improved drive latency performance in applications with SMART polling"

Which is not relevant to the LPM issues you are seeing, but seems relevant to
the other issues you are seeing.

Unfortunately the MU05 update does not seem to specifically address any
LPM issues, so I'm just going to do the blacklist for all 480GB+ models
for now (my experience with other Crucial models is that smaller variants
seem to not suffer from LPM issues).
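
As a rough illustration of what blacklisting by model means here, the sketch below matches drive model strings against glob patterns. The two patterns are hypothetical stand-ins for the 480GB and 960GB models, not the entries from the actual patch, and the kernel uses its own glob matcher and quirk flags rather than anything like this Python code.

```python
from fnmatch import fnmatchcase

# Hypothetical model patterns for which LPM would be turned off.
NOLPM_MODEL_PATTERNS = [
    "Crucial_CT480M500*",
    "Crucial_CT960M500*",
]


def lpm_disabled(model):
    """Return True if the model string matches any NOLPM pattern."""
    return any(fnmatchcase(model, pattern) for pattern in NOLPM_MODEL_PATTERNS)


if __name__ == "__main__":
    for model in ("Crucial_CT480M500SSD3", "INTEL SSDSA2CW300G3"):
        print(model, "-> LPM disabled" if lpm_disabled(model) else "-> LPM kept")
```

With these patterns the Crucial drive from this report matches, while the Intel drive and smaller M500 variants do not.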

Regards,

Hans