Re: Drives missing at boot

From: Tejun Heo
Date: Wed Jul 07 2010 - 11:48:46 EST


Hello,

On 07/07/2010 05:34 PM, Mark Knecht wrote:
> OK - I don't know if this was your intention but since adding this
> patch I've not had a single drive missing failure. I've cold booted
> about 8 times and warm booted at least 20 times. Every one has come up
> fine. I've even gone so far as to turn off the UPS and sit for 5
> minutes before cold booting. Still nothing fails right now.
>
> I've had this sort of statistical thing happen before where it hasn't
> failed for days, maybe even weeks, but then it starts failing and
> fails every time for a while. Over the past few days working with you
> I've never had to reboot more than twice to get you a file. Now I've
> tried 30 times this morning and I've come up with nothing.
>
> I will continue to watch the machine and send you the failing dmesg
> file whenever I finally get it. For now I can only attach the passing
> file showing the patch is now included.

It seems SIDPR is a bit more unreliable than the current code can
handle, and the added delay and extra read could have affected the
result. Eh... weird. Can you please apply the attached patch instead?
The only difference is that it prints two SControl values instead of
one, i.e. "XXX SControl after resume = AAA BBB, tries=T". Can you
please boot multiple times and see whether AAA and BBB ever differ?
If that happens, please attach the boot log. Also, if you see one
with T > 1, please attach that one too.
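
If checking many boot logs by eye gets tedious, something along these
lines might help. It is only a rough userspace sketch, not part of the
patch: the names parse_scontrol_line(), sample_is_interesting() and
scan_log() are made up for illustration, and it assumes the message
text is exactly the format string the patch below prints.

```c
#include <stdio.h>
#include <string.h>

/* One occurrence of the debug message (hypothetical helper type). */
struct scontrol_sample {
	unsigned int first;	/* SControl read at the end of the resume loop */
	unsigned int second;	/* SControl re-read after the extra 100ms delay */
	int tries;		/* number of resume attempts reported */
};

/*
 * Parse one log line.  Returns 1 and fills *s if the line carries the
 * "XXX SControl after resume" message, 0 otherwise.  Any prefix in
 * front of the message (timestamp, link name) is skipped.
 */
int parse_scontrol_line(const char *line, struct scontrol_sample *s)
{
	static const char marker[] = "XXX SControl after resume = ";
	const char *p = strstr(line, marker);

	if (!p)
		return 0;
	p += sizeof(marker) - 1;
	return sscanf(p, "%X %X, tries=%d",
		      &s->first, &s->second, &s->tries) == 3;
}

/* Worth reporting if the two reads differ or more than one try was needed. */
int sample_is_interesting(const struct scontrol_sample *s)
{
	return s->first != s->second || s->tries > 1;
}

/* Copy every interesting line from 'in' to 'out'; return how many matched. */
int scan_log(FILE *in, FILE *out)
{
	char line[1024];
	struct scontrol_sample s;
	int hits = 0;

	while (fgets(line, sizeof(line), in)) {
		if (parse_scontrol_line(line, &s) &&
		    sample_is_interesting(&s)) {
			fputs(line, out);
			hits++;
		}
	}
	return hits;
}
```

Calling scan_log(stdin, stdout) from a trivial main() and feeding it a
saved boot log should print only the lines where AAA and BBB differ or
T > 1, which are the boots worth attaching.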

Thanks.

--
tejun
diff --git a/drivers/ata/libata-core.c b/drivers/ata/libata-core.c
index 2984e45..ce87bfe 100644
--- a/drivers/ata/libata-core.c
+++ b/drivers/ata/libata-core.c
@@ -3712,7 +3712,7 @@ int sata_link_resume(struct ata_link *link, const unsigned long *params,
 		     unsigned long deadline)
 {
 	int tries = ATA_LINK_RESUME_TRIES;
-	u32 scontrol, serror;
+	u32 scontrol, scontrol1, serror;
 	int rc;

 	if ((rc = sata_scr_read(link, SCR_CONTROL, &scontrol)))
@@ -3739,6 +3739,14 @@ int sata_link_resume(struct ata_link *link, const unsigned long *params,
 			return rc;
 	} while ((scontrol & 0xf0f) != 0x300 && --tries);

+	/* check once more */
+	msleep(100);
+	if ((rc = sata_scr_read(link, SCR_CONTROL, &scontrol1)))
+		return rc;
+	ata_link_printk(link, KERN_ERR,
+			"XXX SControl after resume = %X %X, tries=%d\n",
+			scontrol, scontrol1, ATA_LINK_RESUME_TRIES - tries + 1);
+
 	if ((scontrol & 0xf0f) != 0x300) {
 		ata_link_printk(link, KERN_ERR,
 				"failed to resume link (SControl %X)\n",
@@ -6007,7 +6015,7 @@ static void async_port_probe(void *data, async_cookie_t cookie)

 		ehi->probe_mask |= ATA_ALL_DEVICES;
 		ehi->action |= ATA_EH_RESET | ATA_EH_LPM;
-		ehi->flags |= ATA_EHI_NO_AUTOPSY | ATA_EHI_QUIET;
+		ehi->flags |= ATA_EHI_NO_AUTOPSY/* | ATA_EHI_QUIET*/;

 		ap->pflags &= ~ATA_PFLAG_INITIALIZING;
 		ap->pflags |= ATA_PFLAG_LOADING;