Re: [Intel-gfx] kernel 3.11.6 general protection fault

From: Borislav Petkov
Date: Sun Nov 17 2013 - 07:08:22 EST


On Sun, Nov 17, 2013 at 12:35:16PM +0100, MPhil. Emanoil Kotsev wrote:
> After doing all of this I was able to reproduce the issue by
> overloading the system with the following simple steps:
> 1. start a compilation of something (ex. kernel)
> 2. run another process hungry application (flashplayer in firefox)
> => system locks in about 3-5mins

Ha, so we're getting somewhere :)

> I also noticed that the board gets pretty hot, so in my opinion it
> locks because of thermal issue.

The symptoms we're seeing so far are very much consistent with a thermal
issue.

> I think this also would explain why I see errors at different
> processes (mostly Xorg), but with 3.12 I do not get any trace message
> in the log files. Could you advise which option should be enabled in
> the kernel or how I could log/trace if system locks.

Try enabling CONFIG_LOCKUP_DETECTOR, that could tell us where we're
hanging.
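
To see whether the running kernel already has it, something along
these lines might help. It is only a sketch: the function name is made
up, and the /boot/config-$(uname -r) location is an assumption that
depends on the distro (the /proc/config.gz fallback only exists when
the kernel was built with CONFIG_IKCONFIG_PROC).

```shell
#!/bin/sh
# Hypothetical helper: print the lockup detector lines from a kernel config.
# Usage: check_lockup_config [config-file]
check_lockup_config() {
    # Assumed default location; many distros install the config here.
    cfg="${1:-/boot/config-$(uname -r)}"
    if [ -r "$cfg" ]; then
        # Matches both "CONFIG_...=y" and "# CONFIG_... is not set" lines.
        grep 'LOCKUP_DETECTOR' "$cfg"
    elif [ -r /proc/config.gz ]; then
        # Only present when the kernel exposes its own config.
        zcat /proc/config.gz | grep 'LOCKUP_DETECTOR'
    else
        echo "no kernel config found" >&2
        return 1
    fi
}
```

Calling check_lockup_config with no argument checks the running
kernel; a line like CONFIG_LOCKUP_DETECTOR=y means it is built in,
while "is not set" or no output means a rebuild is needed.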

But, make sure to be on a console and not in X in order to get a chance
to see the message. What I do is reroute all log messages to /dev/tty8,
i.e. have

*.* |/dev/tty8

in syslog.conf and switch to it with Ctrl-Alt-F8.

> How can I make sure that the cooling/temp works properly?
>
> Perhaps after upgrading in september the system is working under

What kind of upgrade exactly did you do to a laptop?

> heavier load and therefore I started having the issue, or something
> broke in software or hardware and it can not cool down properly. I
> don't think the kernel is the issue, because I had the same with older
> kernels that were working fine before.
>
> The fan looks clean and there is no dust or whatever in the cooling
> area, that would prevent cooling. The physical position of the
> notebook (docking station) also did not change.

Does the issue happen if the laptop is not in the docking station?

In any case, you need to retrace the steps of the upgrade to get at
least a clue about what causes the overheating.

Can you revert the upgrade and see whether it still happens?

Also, do you have sensors support for your hardware? IOW, can you
monitor the temperature of some hardware elements by running

$ sensors

?

For example, I see this on my box here:

$ sensors
fam15h_power-pci-00c4
Adapter: PCI adapter
power1: 45.64 W (crit = 125.19 W)

k10temp-pci-00c3
Adapter: PCI adapter
temp1: +19.2°C (high = +70.0°C)
(crit = +90.0°C, hyst = +87.0°C)

radeon-pci-0100
Adapter: PCI adapter
temp1: +80.0°C

so when something overheats, running "watch -n 1 sensors" could give
some hints.
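
Since the box locks up hard, the last readings on screen are lost, so
it may be worth logging them to a file as well. A rough sketch,
assuming the hardware exposes zones under /sys/class/thermal (not all
machines do; the function name and log path are made up for
illustration). The kernel reports those values in millidegrees
Celsius:

```shell
#!/bin/sh
# Hypothetical helper: print one timestamped line per thermal zone.
# Usage: log_temps [base-dir]   (base-dir defaults to /sys/class/thermal)
log_temps() {
    base="${1:-/sys/class/thermal}"
    now=$(date '+%F %T')
    for zone in "$base"/thermal_zone*; do
        # Skip when no zone matched the glob or the file is unreadable.
        [ -r "$zone/temp" ] || continue
        milli=$(cat "$zone/temp")
        # Convert millidegrees to whole degrees Celsius.
        echo "$now ${zone##*/} $((milli / 1000)) C"
    done
}
```

It could be run in a loop such as

  while :; do log_temps >> /var/tmp/temps.log; sync; sleep 1; done

where the sync is there to give the last lines a chance to reach the
disk before the machine hangs.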

Also, what does

$ grep . -EriIn /sys/devices/system/cpu/cpu0/cpufreq

give?

Also, can you connect your laptop to a serial or netconsole to collect
dmesg before and while the lockup happens?
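
For netconsole, a minimal sketch of the module parameter, following
the netconsole=[src-port]@[src-ip]/[dev],[tgt-port]@<tgt-ip>/[tgt-mac]
format from the kernel documentation. The helper name, the interface
and the IP address below are placeholders, not values from this
thread; leaving the source fields and the MAC empty lets the kernel
pick defaults and broadcast on the local segment.

```shell
#!/bin/sh
# Hypothetical helper: build the netconsole module parameter string.
# $1 = local network interface, $2 = IP of the machine receiving the logs.
netconsole_param() {
    # Empty src-port/src-ip and empty target MAC fall back to defaults;
    # 6666 is the documented default target port, spelled out for clarity.
    echo "netconsole=@/$1,6666@$2/"
}
```

On the laptop, as root, this would be used roughly like

  modprobe netconsole "$(netconsole_param eth0 192.168.1.20)"

with something like "nc -u -l -p 6666" (or "nc -u -l 6666", depending
on the netcat variant) listening on the other machine. Raising the
console log level with "dmesg -n 8" beforehand should make sure all
messages are sent.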

Basically, we're looking for a hint about which part of the hw causes
the overheating...

HTH.

--
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.