lo and behold, i found that this was all caused by improper scsi termination,
which i'm sort of surprised at because the controller didn't bitch about it
at boot while initializing. <shrug>
-kelley
On Mon, Feb 11, 2002 at 12:02:27PM -0600, kelley eicher wrote:
> i'm having some problems with the 2.4.17 linux kernel on a dual athlon
> system that i was hoping someone could shed some light on. there seem
> to be multiple problems so i'm not quite sure which path to follow at this
> point.
>
> the scenario is that i have a system with the following hardware:
>
> # awk '/\(/' /proc/pci
> Host bridge: PCI device 1022:700c (Advanced Micro Devices [AMD]) (rev 17).
> PCI bridge: PCI device 1022:700d (Advanced Micro Devices [AMD]) (rev 0).
> ISA bridge: Advanced Micro Devices [AMD] AMD-768 [??] ISA (rev 4).
> IDE interface: Advanced Micro Devices [AMD] AMD-768 [??] IDE (rev 4).
> Bridge: Advanced Micro Devices [AMD] AMD-768 [??] ACPI (rev 3).
> SCSI storage controller: Adaptec 7892A (rev 2).
> PCI bridge: Advanced Micro Devices [AMD] AMD-768 [??] PCI (rev 4).
> VGA compatible controller: Matrox Graphics, Inc. MGA G400 AGP (rev 133).
> Ethernet controller: 3Com Corporation 3c905C-TX [Fast Etherlink] (rev 116).
>
> while this machine is at work i see aic7xxx_abort errors in dmesg from time
> to time during heavy i/o. interestingly, this does not crash the machine or
> the devices in question but recovers with an aic7xxx_dev_reset instruction
> after 1-2 minutes of abort attempts.
>
> during the time between the apparent scsi failures i see a few error
> messages in the form of 'eth0: Too much work in interrupt, status e401.'
>
> looking at /proc/interrupts i see that indeed, the eth0 device is hard at
> work.
>
> # cat /proc/interrupts
>            CPU0       CPU1
>   0:   33669019   33490567    IO-APIC-edge  timer
>   1:      19117      19797    IO-APIC-edge  keyboard
>   2:          0          0          XT-PIC  cascade
>  10:   32635348   32632811   IO-APIC-level  eth0
>  11:     381721     381874   IO-APIC-level  aic7xxx
>  14:     236054     249997    IO-APIC-edge  ide0
>  15:          0          6    IO-APIC-edge  ide1
> NMI:          0          0
> LOC:   67153036   67153815
> ERR:          0
> MIS:         32
>
> i have researched the 'eth0: Too much work in interrupt, status e401.' a bit
> and found that it is possible to raise the threshold at which these errors
> are printed. i did not attempt this because it seems less like a solution to
> this problem and more like a crutch, i.e. bad things still happen, you just
> don't see them.
>
> another reason i refrained from making any adjustments to settings for the
> driver is that i have an almost identical system in a very similar load
> and role that exhibits *none* of the problems mentioned.
>
> # awk '/\)/' /proc/pci
> Host bridge: PCI device 1022:700c (Advanced Micro Devices [AMD]) (rev 17).
> PCI bridge: PCI device 1022:700d (Advanced Micro Devices [AMD]) (rev 0).
> ISA bridge: Advanced Micro Devices [AMD] AMD-765 [Viper] ISA (rev 2).
> IDE interface: Advanced Micro Devices [AMD] AMD-765 [Viper] IDE (rev 1).
> Bridge: Advanced Micro Devices [AMD] AMD-765 [Viper] ACPI (rev 1).
> USB Controller: Advanced Micro Devices [AMD] AMD-765 [Viper] USB (rev 7).
> SCSI storage controller: Adaptec 7892A (rev 2).
> SCSI storage controller: Adaptec 7892A (#2) (rev 2).
> Ethernet controller: 3Com Corporation 3c905C-TX [Fast Etherlink] (rev 116).
> VGA compatible controller: nVidia Corporation Riva TnT2 [NV5] (rev 17).
>
> this second machine runs an identical 2.4.17 kernel to that of the first.
>
> the most significant difference i see here is that the second machine's
> chipset is the amd 760mp rather than the 760mpx in the first, and the mpx
> is supposedly nothing more than an improved south bridge (765 -> 768).
>
> so before i go tearing machines apart in hopes of debugging which piece of
> hardware is the cause of this less than optimal behavior, would anyone care
> to wager what the cause is?
>
> -kelley
>
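For anyone comparing two boxes the way the quoted message does, the per-CPU columns of /proc/interrupts can be summed per device so the totals line up side by side. The following is only a sketch: `irq_totals` is a hypothetical helper name, not an existing tool, and it assumes the 2.4-style layout quoted above (numeric IRQ label, one count per CPU, then the controller type and the device name in the last field).

```shell
# irq_totals: hypothetical helper, not part of any package.
# Reads /proc/interrupts-style text from the named files (or stdin)
# and prints "<device> <total>" for every numbered IRQ line.
irq_totals() {
    awk '
        # Only numbered IRQ lines such as "10:"; this skips the CPU header
        # row and the NMI/LOC/ERR/MIS summary lines.
        $1 ~ /^[0-9]+:$/ {
            sum = 0
            # Add up the per-CPU counters, stopping at the first
            # non-numeric field (the controller type, e.g. IO-APIC-level).
            for (i = 2; i <= NF && $i ~ /^[0-9]+$/; i++)
                sum += $i
            # Assumes the device name is the last field on the line.
            print $NF, sum
        }
    ' "$@"
}
```

Run as `irq_totals /proc/interrupts` on each machine. Fed the numbers quoted above, it reports eth0 at 65268159 and aic7xxx at 763595, which only restates that the NIC interrupt is far busier than the SCSI one; it says nothing about the cause, which turned out to be termination.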
This archive was generated by hypermail 2b29 : Fri Feb 15 2002 - 21:00:56 EST