Re: 2.6.29 regression: ATA bus errors on resume

From: Jeff Garzik
Date: Fri Apr 03 2009 - 16:09:30 EST

In reply to: Niel Lambrechts: "Re: 2.6.29 regression: ATA bus errors on resume"
Next in thread: Niel Lambrechts: "Re: 2.6.29 regression: ATA bus errors on resume"

Niel Lambrechts wrote:

On 03/30/2009 04:40 PM, Jeff Garzik wrote:
Niel Lambrechts wrote:
On 03/30/2009 11:00 AM, Tejun Heo wrote:
Hello,

For some reason, I can't find the original thread, so replying here.

Niel Lambrechts wrote:
The ext4 errors are interleaved with hardware errors, and the ext4
errors are about I/O errors.

EXT4-fs error (device sda6): __ext4_get_inode_loc: unable to read inode block - inode=2346519
EXT4-fs error (device sda6) in ext4_reserve_inode_write: IO failure

This looks more like a hibernation problem than an ext4 problem.
Looks like the hard drive is being left in some inconsistent state
after resuming from hibernation.

Yeap, ext4 is just the victim here.

ata1.00: irq_stat 0x00400008, PHY RDY changed
ata1: SError: { PHYRdyChg CommWake }

Your SATA hardware flags a connect-or-disconnect event ("PHY
RDY"), which requires us to abort a bunch of queued commands:

ata1.00: cmd 60/18:00:77:88:6f/00:00:0e:00:00/40 tag 0 ncq 12288 in
res 50/00:30:07:b3:10/00:00:0c:00:00/40 Emask 0x10 (ATA bus error)
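For anyone reading along, the "cmd" line above can be unpacked by hand. Below is a rough sketch that does it in Python. The field layout (command/feature:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device) is my assumption about how libata prints the taskfile, not something stated in this thread, and decode_ncq_cmd() is a made-up helper name. It also assumes that for an NCQ command the sector count is carried in the feature fields.

```python
import re

# Log line copied from the report above.
CMD_LINE = "ata1.00: cmd 60/18:00:77:88:6f/00:00:0e:00:00/40 tag 0 ncq 12288 in"

# Assumed layout (not stated in the thread):
# command/feature:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device
PATTERN = re.compile(
    r"cmd ([0-9a-f]{2})/([0-9a-f:]+)/([0-9a-f:]+)/([0-9a-f]{2})"
    r" tag (\d+) ncq (\d+) (in|out)"
)


def decode_ncq_cmd(line):
    """Hypothetical helper: split an NCQ 'cmd' log line into its fields."""
    m = PATTERN.search(line)
    if m is None:
        raise ValueError("not an NCQ cmd line")
    feature, nsect, lbal, lbam, lbah = (int(x, 16) for x in m.group(2).split(":"))
    hob_feature, hob_nsect, hob_lbal, hob_lbam, hob_lbah = (
        int(x, 16) for x in m.group(3).split(":")
    )
    # 48-bit LBA: the "hob" (high order) bytes sit above the low three bytes.
    lba = (hob_lbah << 40 | hob_lbam << 32 | hob_lbal << 24
           | lbah << 16 | lbam << 8 | lbal)
    return {
        "command": int(m.group(1), 16),
        # Assumption: NCQ commands carry the sector count in the feature fields.
        "sectors": hob_feature << 8 | feature,
        "lba": lba,
        "tag": int(m.group(5)),
        "reported_bytes": int(m.group(6)),
        "direction": m.group(7),
    }


if __name__ == "__main__":
    print(decode_ncq_cmd(CMD_LINE))
```

If the layout assumption holds, the line describes a read of 0x18 = 24 sectors, and 24 sectors of 512 bytes each is 12288 bytes, which agrees with the "ncq 12288 in" printed on the same line.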

[...]
...
The SCSI subsystem aborts each of the queued commands.
No .. what happens is that the SCSI subsystem receives an ABORTED COMMAND
return in sense data for each of the outstanding I/Os.

The only place these are generated is in ata_to_sense_error(), which is
only reached if there's some type of ATA error.

If I had to theorise, I'd say the system suspended with commands
outstanding to the device. On resume, the device gets reset and returns
some type of ATA error which gets translated to ABORTED COMMAND which
causes a failure.

In the mid layer, we translate ABORTED_COMMAND into a retry until the
command runs out of them ... could it be there's a race readying the
device and we run through the retries before it can accept the
command?

When libata-eh thinks that the problem isn't worth retrying, it sets
scmd->retries to scmd->allowed so that it gets aborted immediately.
The code is in ata_eh_qc_complete().

Whether a command is to be retried or not is determined by
ATA_QCFLAG_RETRY, which is set in ata_eh_link_autopsy() for each failed
command. The immediate-failure criteria are pretty strict: only driver
software errors (AC_ERR_INVALID) and PC or other special commands
that were aborted by the device get the immediate pink
slip. In this case, the commands are from the FS and failed with
AC_ERR_ATA_BUS, so they definitely don't fit the criteria.
Strange.
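To make the criteria above concrete, here is a toy model of the decision in Python. It is only a sketch of the rule as described in this mail, not the libata code: should_retry() and retries_after_eh() are hypothetical names, and the AC_ERR_DEV and AC_ERR_INVALID bit values are assumptions. Only AC_ERR_ATA_BUS = 0x10 is confirmed by the "Emask 0x10 (ATA bus error)" line in the log.

```python
# Error-mask bits. AC_ERR_ATA_BUS matches "Emask 0x10 (ATA bus error)" in the
# log; the other two values are assumptions, not taken from this thread.
AC_ERR_DEV = 1 << 0
AC_ERR_ATA_BUS = 1 << 4
AC_ERR_INVALID = 1 << 7


def should_retry(err_mask, is_fs_io):
    """Toy version of the rule described above (hypothetical helper)."""
    if err_mask & AC_ERR_INVALID:
        # Driver software error: fail immediately.
        return False
    if not is_fs_io and err_mask == AC_ERR_DEV:
        # PC or other special command aborted by the device: fail immediately.
        return False
    return True


def retries_after_eh(retries, allowed, retry):
    """Model of 'sets scmd->retries to scmd->allowed' when EH gives up."""
    # With retries == allowed the upper layer has no attempts left, so the
    # command is failed at once instead of being requeued.
    return retries if retry else allowed


if __name__ == "__main__":
    # The failing commands in this report: FS I/O with an ATA bus error.
    print(should_retry(AC_ERR_ATA_BUS, is_fs_io=True))
```

Under this model an FS read that failed with AC_ERR_ATA_BUS is retried, which is why the immediate failures seen in the report look strange.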

How reproducible is the problem? Are you interested in trying out
some debug patches?

Hi Tejun,

I think I should be able to reproduce it when actively using X with 2.6.29,
and I have an external disk that I could back up to / boot from if the
corruption became a problem.

These issues are keeping me from 2.6.29 so I'll gladly help where I can,
if you can please provide me the patches and the .config settings that
may be required?

Niel
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Any chance you could use bisect to narrow down the problem commit?

http://kernel.org/pub/software/scm/git/docs/v1.4.4.4/howto/isolate-bugs-with-bisect.txt

This should identify which patch caused your problems, if you have a
known good starting point (such as 2.6.28).

I'm struggling with this - my good kernel is 2.6.28.9 and as far as I
can tell the closest good kernel I can point git at is the 2.6.28
base itself. So now what happens is that resume fails entirely during
some of the bisect steps due to unrelated regressions that are present
in kernels older and newer than mine, so I can't test the real issue! :(

"git help bisect" or "man git-bisect" has a wealth of information.

Most notably, you can use "git bisect skip" if the current commit cannot be tested, and thus cannot be declared good or bad.
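As a rough illustration of how the good/bad/skip decision could be written down for each bisect step, here is a small sketch in Python. The three inputs (booted, resumed, ata_bus_errors) are hypothetical observations you would note by hand after testing each kernel; nothing here is a git API, it only picks the word to pass to "git bisect".

```python
def bisect_verdict(booted, resumed, ata_bus_errors):
    """Hypothetical helper: pick the 'git bisect' subcommand for one step.

    booted         -- the kernel built from this commit booted at all
    resumed        -- it came back from hibernation
    ata_bus_errors -- 'ATA bus error' lines showed up in dmesg after resume
    """
    if not booted or not resumed:
        # An unrelated breakage hides the bug we are chasing, so this
        # commit can be declared neither good nor bad.
        return "skip"
    return "bad" if ata_bus_errors else "good"


if __name__ == "__main__":
    # Example: resume itself failed on this commit, so it cannot be judged.
    print("git bisect", bisect_verdict(booted=True, resumed=False,
                                       ata_bus_errors=False))
```

A session would start with "git bisect start", "git bisect bad v2.6.29" and "git bisect good v2.6.28", and each tested kernel is then answered with "git bisect good", "git bisect bad" or "git bisect skip" according to the verdict.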

Jeff
