Re: [PATCH qemu v3] x86: don't let decompressed kernel image clobber setup_data

From: Jason A. Donenfeld
Date: Wed Feb 08 2023 - 13:10:06 EST


On Wed, Feb 8, 2023 at 2:54 PM Jason A. Donenfeld <Jason@xxxxxxxxx> wrote:
>
> Hi Nathan (and MST),
>
> On Wed, Feb 8, 2023 at 2:45 PM Nathan Chancellor <nathan@xxxxxxxxxx> wrote:
> >
> > Hi Jason,
> >
> > On Fri, Dec 30, 2022 at 11:07:25PM +0100, Jason A. Donenfeld wrote:
> > > The setup_data links are appended to the compressed kernel image. Since
> > > the kernel image is typically loaded at 0x100000, setup_data lives at
> > > `0x100000 + compressed_size`, which does not get relocated during the
> > > kernel's boot process.
> > >
> > > The kernel typically decompresses the image starting at address
> > > 0x1000000 (note: there's one more zero there than the compressed image
> > > above). This usually is fine for most kernels.
> > >
> > > However, if the compressed image is actually quite large, then
> > > setup_data will live at a `0x100000 + compressed_size` that extends into
> > > the decompressed zone at 0x1000000. In other words, if compressed_size
> > > is larger than `0x1000000 - 0x100000`, then the decompression step will
> > > clobber setup_data, resulting in crashes.
> > >
> > > Visually, what happens now is that QEMU appends setup_data to the kernel
> > > image:
> > >
> > >            kernel image            setup_data
> > >    |--------------------------||----------------|
> > > 0x100000                 0x100000+l1     0x100000+l1+l2
> > >
> > > The problem is that this decompresses to 0x1000000 (one more zero). So
> > > if l1 is > (0x1000000-0x100000), then this winds up looking like:
> > >
> > >            kernel image            setup_data
> > >    |--------------------------||----------------|
> > > 0x100000                 0x100000+l1     0x100000+l1+l2
> > >
> > >                                  d e c o m p r e s s e d   k e r n e l
> > >                     |-------------------------------------------------------------|
> > >                 0x1000000                                                   0x1000000+l3
> > >
> > > The decompressed kernel seemingly overwriting the compressed kernel
> > > image isn't a problem, because that gets relocated to a higher address
> > > early on in the boot process, at the end of startup_64. setup_data,
> > > however, stays in the same place, since those links are self referential
> > > and nothing fixes them up. So the decompressed kernel clobbers it.
> > >
> > > Fix this by appending setup_data to the cmdline blob rather than the
> > > kernel image blob, which remains at a lower address that won't get
> > > clobbered.
> > >
> > > This could have been done by overwriting the initrd blob instead, but
> > > that poses big difficulties, such as no longer being able to use memory
> > > mapped files for initrd, hurting performance, and, more importantly, the
> > > initrd address calculation is hard coded in qboot, and it always grows
> > > down rather than up, which means lots of brittle semantics would have to
> > > be changed around, incurring more complexity. In contrast, using cmdline
> > > is simple and doesn't interfere with anything.
> > >
> > > The microvm machine has a gross hack where it fiddles with fw_cfg data
> > > after the fact. So this hack is updated to account for this appending,
> > > by reserving some bytes.
> > >
> > > Cc: x86@xxxxxxxxxx
> > > Cc: Philippe Mathieu-Daudé <philmd@xxxxxxxxxx>
> > > Cc: H. Peter Anvin <hpa@xxxxxxxxx>
> > > Cc: Borislav Petkov <bp@xxxxxxxxx>
> > > Cc: Eric Biggers <ebiggers@xxxxxxxxxx>
> > > Signed-off-by: Jason A. Donenfeld <Jason@xxxxxxxxx>
> >
> > I apologize if this has already been reported/fixed (I did a brief
> > search on lore.kernel.org) or if my terminology is not as precise
> > as it could be; I am a little out of my element here :)
> >
> > After this change as commit eac7a7791b ("x86: don't let decompressed
> > kernel image clobber setup_data") in QEMU master, I am no longer able to
> > boot a kernel directly through OVMF using '-append' + '-initrd' +
> > '-kernel'. Instead, systemd-boot tries to start the distribution's
> > kernel, which fails with:
> >
> > Error registering initrd: Already started
> >
> > This can be reproduced with just a defconfig Linux kernel (I used
> > 6.2-rc7), the simple buildroot images that ClangBuiltLinux uses for
> > boot testing [1], and OVMF. Prior to this change, the kernel would start
> > up but after, I am dumped into the UEFI shell (as there is no
> > bootloader).
> >
> > The QEMU command I used was:
> >
> > $ qemu-system-x86_64 \
> > -kernel arch/x86_64/boot/bzImage \
> > -append "console=ttyS0 earlycon=uart8250,io,0x3f8" \
> > -drive if=pflash,format=raw,file=/usr/share/edk2/x64/OVMF_CODE.fd,readonly=on \
> > -drive if=pflash,format=raw,file=../boot-utils/images/x86_64/OVMF_VARS.fd \
>
> Oh no... Without jumping into it, at first glance, I have absolutely
> no idea. I suppose I could start debugging it and probably come up
> with a solution, but...
>
> @mst - I'm beginning to think that this whole setup_data route is
> cursed. This is accumulating hacks within hacks within hacks. What
> would you think if I just send a patch *removing* all use of
> setup_data (the rng seed and the dtb thing), and then we can gradually
> add that back with an actual overarching design? For example, it'd
> probably make sense to have a separate fwcfg file for setup_data
> rather than trying to mangle an existing one, etc. This way, we
> unbreak the tree, and let the new approach be reviewed more
> reasonably.

Sent as https://lore.kernel.org/qemu-devel/20230208180835.234638-1-Jason@xxxxxxxxx/
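
For reference, the overlap condition from the quoted commit message can be sketched in C. This is only an illustration, not QEMU code: the function and macro names are made up here, l1, l2 and l3 follow the diagram (compressed image size, setup_data size, decompressed size), and the two addresses are the typical values mentioned in the commit message rather than anything guaranteed. It uses a general interval-overlap test, which also covers setup_data merely straddling 0x1000000; the commit message's simpler condition, l1 > 0x1000000 - 0x100000, is the case where setup_data starts inside the decompression zone.

```c
#include <stdbool.h>
#include <stdint.h>

/* Typical addresses from the commit message; assumed, not guaranteed. */
#define COMPRESSED_LOAD_ADDR 0x100000ULL   /* where the bzImage payload is loaded */
#define DECOMPRESS_ADDR      0x1000000ULL  /* where the kernel decompresses to */

/*
 * Hypothetical helper: returns true if setup_data, appended directly
 * after a compressed image of l1 bytes, overlaps the region that the
 * decompressed kernel (l3 bytes) will occupy.
 *
 *   setup_data:   [0x100000 + l1, 0x100000 + l1 + l2)
 *   decompressed: [0x1000000,     0x1000000 + l3)
 */
static bool setup_data_clobbered(uint64_t l1, uint64_t l2, uint64_t l3)
{
    uint64_t sd_start  = COMPRESSED_LOAD_ADDR + l1;
    uint64_t sd_end    = sd_start + l2;
    uint64_t dec_start = DECOMPRESS_ADDR;
    uint64_t dec_end   = dec_start + l3;

    /* Empty regions cannot overlap anything. */
    if (l2 == 0 || l3 == 0)
        return false;

    /* Two half-open intervals overlap iff each starts before the other ends. */
    return sd_start < dec_end && sd_end > dec_start;
}
```

For example, assuming a 4 KiB setup_data blob and a 32 MiB decompressed kernel, an 8 MiB compressed image keeps setup_data well below 0x1000000 and the helper returns false, while a 16 MiB compressed image puts setup_data at 0x1100000 and the helper returns true.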