Re: S4 resume broken since 2.6.39 (3.1, too)

From: Takashi Iwai
Date: Wed Sep 28 2011 - 12:19:46 EST


At Wed, 28 Sep 2011 16:45:45 +0200,
Takashi Iwai wrote:
>
> At Wed, 28 Sep 2011 16:45:48 +0200,
> Rafael J. Wysocki wrote:
> >
> > On Wednesday, September 28, 2011, Takashi Iwai wrote:
> > > At Wed, 28 Sep 2011 15:28:12 +0200,
> > > Takashi Iwai wrote:
> > > >
> > > > At Wed, 28 Sep 2011 15:26:19 +0200,
> > > > Rafael J. Wysocki wrote:
> > > > >
> > > > > On Wednesday, September 28, 2011, Takashi Iwai wrote:
> > > > > > At Tue, 27 Sep 2011 18:56:44 +0200,
> > > > > > Rafael J. Wysocki wrote:
> > > > > > >
> > > > > > > On Tuesday, September 27, 2011, Takashi Iwai wrote:
> > > > > > > > At Mon, 26 Sep 2011 19:26:44 -0700,
> > > > > > > > Yinghai Lu wrote:
> > > > > > > > >
> > > > > > > > > On 09/22/2011 11:11 AM, Takashi Iwai wrote:
> > > > > > > > >
> > > > > > > > > > At Thu, 22 Sep 2011 07:33:17 -0700,
> > > > > > > > > > Yinghai Lu wrote:
> > > > > > > > > >>
> > > > > > > > > >> On Wed, Sep 21, 2011 at 11:48 AM, Rafael J. Wysocki <rjw@xxxxxxx> wrote:
> > > > > > > > > >>> It looks like init_memory_mapping() is sometimes called with "end"
> > > > > > > > > >>> beyond the last mapped PFN and it explodes when we try to write stuff to
> > > > > > > > > >>> that address during image restoration.
> > > > > > > > > >>>
> > > > > > > > > >>> IOW, the Yinghai's assumption that init_memory_mapping() would always be
> > > > > > > > > >>> called with a "good end" on x86_64 was overomptimistic.
> > > > > > > > > >>
> > > > > > > > > >> for 64bit x86, kernel_physical_mapping_init() will use
> > > > > > > > > >> map_low_page()/call early_memmap() to access ram for page_table that is above
> > > > > > > > > >> rather last mapped PFN.
> > > > > > > > > >>
> > > > > > > > > >> the point is:
> > > > > > > > > >> on system with 64g, usable ram will be [0,2048m), [4g, 64g)
> > > > > > > > > >> init_memory_mapping will be called two times for them.
> > > > > > > > > >> before putting page_table high,
> > > > > > > > > >> page table will be two parts: one is just below 512M, and one below 2048m.
> > > > > > > > > >> after putting page_table high,
> > > > > > > > > >> page table will be two parts: one is just below 2048M, and one below 64G.
> > > > > > > > > >
> > > > > > > > > > So, how can this change break S4 resume?
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > not sure.
> > > > > > > > >
> > > > > > > > > seems resume has it's own page table during transition...
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > Any hint for further debugging?
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > you may try to insert dead loop in arch/x86/power/hibernate_asm_64.S::restore_image or core_restore_code
> > > > > > > > >
> > > > > > > > > to see which part cause reset.
> > > > > > > >
> > > > > > > > That's the answer I was afraid of :)
> > > > > > >
> > > > > > > I wonder. Does hibernation work on the machine in question with
> > > > > > > acpi_sleep=nonvs and if so, is the problem reproducible with this
> > > > > > > setting?
> > > > > >
> > > > > > It still shows the same problem. Reboots after a few S4 cycles.
> > > > > > I'll test kernel S4.
> > > > >
> > > > > This most likely means that NVS is not involved.
> > > >
> > > > Indeed, kernel S4 also failed both with and without acpi_sleep=nonvs.
> > > >
> > > > > Have you tried to revert commit d1ee433 (x86, trampoline: Use the unified
> > > > > trampoline setup for ACPI wakeup)?
> > > >
> > > > Not yet. I'll check it later.
> > >
> > > Unfortunately the commit can't be reverted cleanly.
> > > The commits around it seem mixed up together.
> > >
> > > If you have a patch applicable to 3.0 or 3.1 kernel, let me know.
> >
> > No, I don't, but can you simply go back to that commit and see
> > if the issue is reproducible with HEAD = d1ee433, and if it is,
> > see if it still will be reproducible after going one commit back from there?
>
> Ah, that should be easy. I'll give it a try.

It turned out to be non-trivial. At the commit d1ee4335
[x86, trampoline: Use the unified trampoline setup for ACPI wakeup],
I can build by cherry-picking the binutils fix commit 2ae9d293
[x86: Fix binutils-2.21 symbol related build failures],
then merge Yinghai's patch series, the commit d2137d5af.
The test failed, as expected; S4 resume reboots.

Then going back to the commit 4822b7fc
[x86, trampoline: Common infrastructure for low memory trampolines]
it's no longer buildable. This commit alone breaks too many things
and requires the stuff in the commit above d1ee4335.

Backing off one would go to 2.6.38-rc5. Hmm.

Anyway I try to check whether 2.6.38-rc5 + merge d2137d5af would cause
any issue.


thanks,

Takashi
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/