Re: [PATCH][v8] PM / hibernate: Verify the consistent of e820 memory map by md5 value

From: Rafael J. Wysocki
Date: Tue Aug 30 2016 - 07:58:23 EST


On Monday, August 29, 2016 05:13:34 PM Pavel Machek wrote:
> On Mon 2016-08-29 15:41:34, Rafael J. Wysocki wrote:
> > On Mon, Aug 29, 2016 at 6:59 AM, Borislav Petkov <bp@xxxxxxxxx> wrote:
> > > On Mon, Aug 29, 2016 at 12:35:40AM +0800, Chen Yu wrote:
> > >> On some platforms, there is occasional panic triggered when trying to
> > >> resume from hibernation, a typical panic looks like:
> > >>
> > >> "BUG: unable to handle kernel paging request at ffff880085894000
> > >> IP: [<ffffffff810c5dc2>] load_image_lzo+0x8c2/0xe70"
> > >>
> > >> This is because e820 map has been changed by BIOS across
> > >> hibernation, and one of the page frames from first kernel
> > >> is right located in second kernel's unmapped region, so panic
> > >> comes out when accessing unmapped kernel address.
> > >>
> > >> In order to expose this issue earlier, the md5 hash of e820 map
> > >> is passed from suspend kernel to resume kernel, and the system will
> > >> trigger panic once it finds the md5 value of previous kernel is not
> > >> the same as current resume kernel.
> > >
> > > ... so basically now even the cases where it managed to resume would
> > > panic because the digests differ, even if the original panic condition
> > > doesn't trigger the bug, i.e. your Note 1 below.
> > >
> > > The more important question IMHO would be, can we resume our system
> > > successfully *even* if BIOS fiddled with the e820 map?
> > >
> > > We'd still warn the hell out of it and even make that the md5 digest
> > > comparison a default-enabled thing without even having a config option
> > > to disable it but can we try harder not to panic and deal with this next
> > > BIOS f*ckup more intelligently than throwing our hands in the air and
> > > giving up?
> >
> > We need not panic in principle and I wouldn't do that.
> >
> > I would warn and try to continue regardless (which was the original
> > plan here AFAICS), or we change a possible data loss into a guaranteed
> > one.
> >
> > IMO it is sufficient to give up when a PFN we have image data for is
> > not pfn_valid() during resume, which we do already.
>
> Well... can you guarantee what will be effect of resuming with
> different memory map?
>
> Because there's big difference between panic and trying to continue
> with corrupted memory.

If all of the page frames the image kernel used before hibernation are
available during resume as well, memory won't really get corrupted, at least
not right away.

There may be problems going forward, but whether or not they actually happen
depends on what the differences are. So while an e820 mismatch indicates that
things may go wrong, it doesn't necessarily mean that they will.
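
To make the distinction concrete, here is a minimal user-space sketch in C of
the policy described above: a changed memory map only triggers a warning, and
resume is abandoned only when a page frame the image needs is not usable. This
is not the kernel code. struct ram_range, toy_pfn_valid() and check_image() are
hypothetical names made up for illustration, and toy_pfn_valid() is only a
stand-in for the real pfn_valid().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical: one usable RAM range in the resume kernel's memory map. */
struct ram_range {
	unsigned long start_pfn;	/* first usable page frame */
	unsigned long end_pfn;		/* one past the last usable page frame */
};

/* Toy stand-in for pfn_valid(): true if pfn lies in any usable range. */
static bool toy_pfn_valid(const struct ram_range *map, size_t n,
			  unsigned long pfn)
{
	for (size_t i = 0; i < n; i++)
		if (pfn >= map[i].start_pfn && pfn < map[i].end_pfn)
			return true;
	return false;
}

/*
 * Return 0 if every page frame the image needs is usable on resume,
 * -1 otherwise.  A memory map mismatch alone only produces a warning.
 */
static int check_image(const struct ram_range *map, size_t n,
		       const unsigned long *image_pfns, size_t count,
		       bool map_changed)
{
	assert(map && image_pfns);

	if (map_changed)
		fprintf(stderr, "warning: memory map changed across hibernation\n");

	for (size_t i = 0; i < count; i++) {
		if (!toy_pfn_valid(map, n, image_pfns[i])) {
			fprintf(stderr, "error: image PFN %#lx is not usable\n",
				image_pfns[i]);
			return -1;
		}
	}
	return 0;
}
```

With a map of { 0x100-0x200, 0x300-0x400 }, an image using PFNs 0x100, 0x1ff
and 0x350 passes even when map_changed is true, while an image that needs PFN
0x250 is rejected whether or not a mismatch was flagged.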

Also, that panic() may make hibernation fail hard on systems where it used to
work flawlessly, and that would be a regression, so it is not really
acceptable.

Thanks,
Rafael