Huge unreliability - does Linux have something to do with it?

From: jerome lacoste
Date: Fri Feb 04 2005 - 04:09:25 EST


[Sorry for the sensational title]

I have had this laptop for three years. It ran Linux (Debian unstable)
from the start and its hardware has been very unreliable: I changed
hard disks twice and the motherboard thrice. My DVD drive started
failing some days ago (this one is 'original', 3 years old). But I
don't mind as I am not under warranty anymore... This morning the
machine booted with fsck errors on my hard disk. I am not sure if I
did the right thing, but I said clear the inodes, and I ended up
loosing some programs(*) (du, dircolors, etc..). The day starts well
isn't it? Sounds like I will have to switch disks again...

I halted the machine correctly yesterday night. I never dropped the
box in 3 years. Am I just being unlucky? Or could the fact that I am
using Linux on the box affect the reliability in some ways on that
particular hardware (Dell Inspiron 8100)? I run Linux on 3 other
computers and never had single problems with them.

How can the file system (ext3) be messed up the way it was this
morning after I stopped the machine correctly yesterday?
Could a hardware failure look like bad sectors to fsck?

Attached the output of smartctl -a /dev/hda, whatever that helps.

Jerome

(*) I accept tips on discovering and maybe recovering which files have
been taken out of my system...
smartctl version 5.32 Copyright (C) 2002-4 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model: HITACHI_DK23FB-60
Serial Number: 1ZX822
Firmware Version: 00M0A0C1
Device is: In smartctl database [for details use: -P show]
ATA Version is: 5
ATA Standard is: ATA/ATAPI-5 T13 1321D revision 3
Local Time is: Fri Feb 4 09:53:50 2005 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x80) Offline data collection activity
was never started.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: (2150) seconds.
Offline data collection
capabilities: (0x5b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
No General Purpose Logging support.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 37) minutes.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000d 091 090 050 Pre-fail Offline - 412316862542
2 Throughput_Performance 0x0005 100 092 050 Pre-fail Offline - 3140
3 Spin_Up_Time 0x0007 100 100 050 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 388
5 Reallocated_Sector_Ct 0x0033 095 095 010 Pre-fail Always - 142
7 Seek_Error_Rate 0x000f 100 100 050 Pre-fail Always - 651
8 Seek_Time_Performance 0x0005 100 100 050 Pre-fail Offline - 1125
9 Power_On_Minutes 0x0032 095 095 000 Old_age Always - 2512h+02m
10 Spin_Retry_Count 0x0013 100 100 050 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 361
191 G-Sense_Error_Rate 0x000a 100 068 000 Old_age Always - 17536255
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 25
193 Load_Cycle_Count 0x0032 092 092 000 Old_age Always - 53398/53372
194 Temperature_Celsius 0x0022 072 036 000 Old_age Always - 54 (Lifetime Min/Max 72/13)
195 Hardware_ECC_Recovered 0x001a 100 001 000 Old_age Always - 4189
196 Reallocated_Event_Count 0x0032 086 086 000 Old_age Always - 142
197 Current_Pending_Sector 0x0032 097 094 000 Old_age Always - 3
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0013 100 100 050 Pre-fail Always - 0
201 Soft_Read_Error_Rate 0x0012 100 100 000 Old_age Always - 1
223 Load_Retry_Count 0x0012 100 100 000 Old_age Always - 0
230 Head_Amplitude 0x0032 096 096 000 Old_age Always - 141669
250 Read_Error_Retry_Rate 0x000a 100 001 000 Old_age Always - 837

SMART Error Log Version: 1
ATA Error Count: 108 (device log contains only the most recent five errors)
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 108 occurred at disk power-on lifetime: 2511 hours (104 days + 15 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 07 60 08 fb e1 Error: UNC 7 sectors at LBA = 0x01fb0860 = 33228896

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
c8 00 08 5f 08 fb e1 00 00:02:32.020 READ DMA
c8 00 08 57 08 fb e1 00 00:02:32.010 READ DMA
c8 00 08 4f 08 fb e1 00 00:02:32.010 READ DMA
c8 00 08 47 08 fb e1 00 00:02:31.980 READ DMA
c8 00 08 3f 08 fb e1 00 00:02:31.880 READ DMA

Error 107 occurred at disk power-on lifetime: 2511 hours (104 days + 15 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 01 3d 08 fb e1 Error: UNC 1 sectors at LBA = 0x01fb083d = 33228861

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
c8 00 06 38 08 fb e1 00 00:01:33.860 READ DMA
c8 00 07 37 08 fb e1 00 00:01:31.960 READ DMA
c8 00 01 3e 08 fb e1 00 00:01:31.830 READ DMA
c8 00 02 3d 08 fb e1 00 00:01:30.120 READ DMA
c8 00 03 3c 08 fb e1 00 00:01:28.350 READ DMA

Error 106 occurred at disk power-on lifetime: 2511 hours (104 days + 15 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 07 37 08 fb e1 Error: UNC 7 sectors at LBA = 0x01fb0837 = 33228855

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
c8 00 07 37 08 fb e1 00 00:01:31.960 READ DMA
c8 00 01 3e 08 fb e1 00 00:01:31.830 READ DMA
c8 00 02 3d 08 fb e1 00 00:01:30.120 READ DMA
c8 00 03 3c 08 fb e1 00 00:01:28.350 READ DMA
c8 00 04 3b 08 fb e1 00 00:01:26.090 READ DMA

Error 105 occurred at disk power-on lifetime: 2511 hours (104 days + 15 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 02 3d 08 fb e1 Error: UNC 2 sectors at LBA = 0x01fb083d = 33228861

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
c8 00 02 3d 08 fb e1 00 00:01:30.120 READ DMA
c8 00 03 3c 08 fb e1 00 00:01:28.350 READ DMA
c8 00 04 3b 08 fb e1 00 00:01:26.090 READ DMA
c8 00 05 3a 08 fb e1 00 00:01:24.160 READ DMA
c8 00 06 39 08 fb e1 00 00:01:21.870 READ DMA

Error 104 occurred at disk power-on lifetime: 2511 hours (104 days + 15 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 02 3d 08 fb e1 Error: UNC 2 sectors at LBA = 0x01fb083d = 33228861

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
c8 00 03 3c 08 fb e1 00 00:01:28.350 READ DMA
c8 00 04 3b 08 fb e1 00 00:01:26.090 READ DMA
c8 00 05 3a 08 fb e1 00 00:01:24.160 READ DMA
c8 00 06 39 08 fb e1 00 00:01:21.870 READ DMA
c8 00 07 38 08 fb e1 00 00:01:19.790 READ DMA

SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 2488 -
# 2 Short offline Completed without error 00% 2464 -
# 3 Short offline Completed without error 00% 2441 -
# 4 Extended offline Completed: read failure 20% 2408 92491576
# 5 Short offline Completed without error 00% 2407 -
# 6 Short offline Completed without error 00% 2383 -
# 7 Short offline Completed without error 00% 2296 -
# 8 Extended offline Completed: read failure 20% 2273 92491576
# 9 Short offline Completed without error 00% 2272 -
#10 Short offline Completed without error 00% 2237 -
#11 Short offline Completed without error 00% 2221 -
#12 Short offline Completed without error 00% 2190 -
#13 Short offline Completed without error 00% 2171 -
#14 Short offline Completed without error 00% 2137 -
#15 Short offline Completed without error 00% 2113 -
#16 Short offline Completed without error 00% 2090 -
#17 Extended offline Completed: read failure 20% 2067 92491576
#18 Short offline Completed without error 00% 2066 -
#19 Short offline Completed without error 00% 2044 -
#20 Short offline Completed without error 00% 2020 -
#21 Short offline Completed without error 00% 1996 -

SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.