Re: [ANNOUNCE] Native Linux KVM tool

From: Anthony Liguori
Date: Fri Apr 08 2011 - 10:00:51 EST


On 04/08/2011 12:14 AM, Pekka Enberg wrote:
> Hey, feel free to help out! ;-)
>
> I don't agree that a working 2500 LOC program is 'repeating the same
> architectural mistakes' as QEMU. I hope you realize that we've gotten
> here with just three part-time hackers working from their proverbial
> basements. So what you call mistakes, we call features for the sake of
> simplicity.

And by all means, it's a good accomplishment.

But the mistakes I'm referring to aren't missing bits of code. It's that the current code makes really bad assumptions.

An example is ioport_ops. This maps directly to ioport_{read,write}_table in QEMU. Then you use ioport__register() to register entries in this table, similar to register_ioport_{read,write}() in QEMU.

The use of a struct is a small improvement but the fundamental design is flawed because it models a view of hardware where all devices are directly connected to the CPU. This is not how hardware works at all.

On the PC QEMU tries to emulate, a PIO operation flows from the CPU to the i440fx. The i440fx will do the first level of decoding treating the PCI host controller ports specially and then posting any I/Os in the PCI port range to the PCI bus. If no device selects these ports, or the ports fall into the non-PCI range, the I/O request is then posted to the PIIX3.

The PIIX3 will handle a good chunk of the I/O requests (via its Super I/O chipset) and the remainder will be posted to the ISA bus. One or more ISA devices may then react to these posted I/O operations.

Really, having a flat table doesn't make sense. You should just send everything to an i440fx directly. Then the i440fx should decode what it can, and send it to the next level, and so forth.
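
As a rough illustration of the flow described above, the following C sketch posts every port access at the top of a chain and lets each level either claim it or pass it down. It is not kvm tool or QEMU code: pio_req, pio_node and pio_dispatch are hypothetical names, each level is reduced to a single port window with one backing register, and returning all ones for an unclaimed read is an assumption of the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical types for illustration; not taken from kvm tool or QEMU. */
struct pio_req {
	uint16_t port;
	bool     is_write;
	uint32_t value;    /* data to write, or filled in by a read */
};

/* One level of decoding: a chipset or bus with a single port window. */
struct pio_node {
	const char      *name;
	uint16_t         base;        /* first port this level claims */
	uint16_t         len;         /* number of ports claimed */
	struct pio_node *downstream;  /* where unclaimed requests are posted */
	uint32_t         reg;         /* stand-in for real device state */
};

/*
 * Post a request at the top of the hierarchy and let each level either
 * claim it or pass it down. Returns the node that claimed the request,
 * or NULL if nothing decoded the port.
 */
static struct pio_node *pio_dispatch(struct pio_node *top, struct pio_req *req)
{
	assert(req != NULL);

	for (struct pio_node *n = top; n != NULL; n = n->downstream) {
		if (req->port < n->base || req->port - n->base >= n->len)
			continue;  /* not ours: post to the next level */
		if (req->is_write)
			n->reg = req->value;
		else
			req->value = n->reg;
		return n;
	}

	if (!req->is_write)
		req->value = 0xffffffffu;  /* assumed: unclaimed reads return all ones */
	return NULL;
}
```

Chaining three nodes (for example an "i440fx" node claiming 0xcf8-0xcff, a "piix3" node, and an ISA device at 0x3f8) and always calling pio_dispatch() on the first one gives the CPU a single entry point. A level that needs to rewrite or filter a request before forwarding it then has an obvious place to do so, which a flat table does not offer.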

You can get 90% of the way to a working device model without modelling this type of flow, but you hit a wall pretty quickly as it's not unusual for PCI controllers to manipulate I/O requests in some fashion (particularly on non-x86 platforms). If you treat everything as directly attached to the CPU, it's impossible to model this.

Likewise, the same flow is true in the opposite direction. You use guest_flat_to_host() which assumes a linear mapping of guest memory to host memory. We used to do that too in QEMU (phys_ram_base + X). It took a long time to get rid of that assumption in QEMU.

There are multiple problems with this sort of assumption. The first is that you treat all devices as being directly attached to the memory controller. As with I/O instruction dispatch, this is not the case, and there are many PCI controllers that will munge these accesses (think IOMMU, for instance). The second is you assume that you're not doing I/O to device memory, but this does happen in practice. The cpu_physical_memory_rw() API is careful to support cases where you're writing data to I/O memory.
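
To make the second point concrete, here is a minimal C sketch of a copy-based accessor that looks up the target region on every access instead of doing pointer arithmetic on a flat mapping. It only illustrates the idea and is not how cpu_physical_memory_rw() is implemented: mem_region, guest_mem, guest_mem_rw and the demo_dev device are hypothetical names invented for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical types for illustration; not QEMU or kvm tool code. */
enum region_kind { REGION_RAM, REGION_MMIO };

struct mem_region {
	uint64_t gpa_start;
	uint64_t size;
	enum region_kind kind;
	uint8_t *host;  /* backing store, RAM regions only */
	void (*mmio_access)(void *opaque, uint64_t off,
			    uint8_t *buf, size_t len, int is_write);
	void *opaque;   /* device state, MMIO regions only */
};

struct guest_mem {
	struct mem_region *regions;
	size_t nregions;
};

static struct mem_region *find_region(struct guest_mem *m, uint64_t gpa)
{
	for (size_t i = 0; i < m->nregions; i++) {
		struct mem_region *r = &m->regions[i];
		if (gpa >= r->gpa_start && gpa - r->gpa_start < r->size)
			return r;
	}
	return NULL;
}

/* Copy to or from guest physical memory, splitting at region boundaries. */
static int guest_mem_rw(struct guest_mem *m, uint64_t gpa,
			uint8_t *buf, size_t len, int is_write)
{
	assert(m != NULL);

	while (len > 0) {
		struct mem_region *r = find_region(m, gpa);
		if (r == NULL)
			return -1;  /* hole in the guest physical map */

		uint64_t off = gpa - r->gpa_start;
		size_t chunk = len;
		if (chunk > r->size - off)
			chunk = (size_t)(r->size - off);

		if (r->kind == REGION_MMIO)
			r->mmio_access(r->opaque, off, buf, chunk, is_write);
		else if (is_write)
			memcpy(r->host + off, buf, chunk);
		else
			memcpy(buf, r->host + off, chunk);

		gpa += chunk;
		buf += chunk;
		len -= chunk;
	}
	return 0;
}

/* Toy device used to exercise the MMIO path. */
struct demo_dev {
	uint8_t  regs[16];
	unsigned writes;
};

static void demo_mmio_access(void *opaque, uint64_t off,
			     uint8_t *buf, size_t len, int is_write)
{
	struct demo_dev *d = opaque;

	if (is_write) {
		memcpy(d->regs + off, buf, len);
		d->writes++;
	} else {
		memcpy(buf, d->regs + off, len);
	}
}
```

Because callers never hold a raw host pointer, a transfer that straddles RAM and device memory is routed correctly, and a translation step (an IOMMU, say) could later be added inside the lookup without touching device code.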

The other big problem here is that if you have open access to guest memory like this, you cannot easily track dirty information. Userspace accesses to guest memory will not result in KVM updating the guest dirty bitmap. You can add another API to explicitly set dirty bits (and that's exactly what we did a few years ago) but then you'll get extremely subtle bugs in migration if you're missing a dirty update somewhere. This is exactly how our API evolved in QEMU.
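
One way to avoid missed dirty updates is to make the write accessor the only path into guest RAM and have it set the dirty bits itself. The sketch below assumes a 4 KiB page size and a bitmap with one bit per page; tracked_ram, tracked_write and page_is_dirty are hypothetical names, not an existing QEMU or kvm tool interface.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define GUEST_PAGE_SHIFT 12  /* assumed 4 KiB guest pages */

/* Hypothetical type for illustration. */
struct tracked_ram {
	uint8_t *host;   /* host backing for guest RAM */
	uint64_t size;   /* size of guest RAM in bytes */
	uint8_t *dirty;  /* one bit per guest page */
};

/* Write to guest RAM and mark every touched page dirty in the same call. */
static int tracked_write(struct tracked_ram *ram, uint64_t gpa,
			 const void *buf, size_t len)
{
	assert(ram != NULL);

	if (len == 0)
		return 0;
	if (gpa >= ram->size || len > ram->size - gpa)
		return -1;  /* out of range */

	memcpy(ram->host + gpa, buf, len);

	uint64_t first = gpa >> GUEST_PAGE_SHIFT;
	uint64_t last = (gpa + len - 1) >> GUEST_PAGE_SHIFT;
	for (uint64_t pfn = first; pfn <= last; pfn++)
		ram->dirty[pfn / 8] |= (uint8_t)(1u << (pfn % 8));
	return 0;
}

static int page_is_dirty(const struct tracked_ram *ram, uint64_t pfn)
{
	return (ram->dirty[pfn / 8] >> (pfn % 8)) & 1;
}
```

With this shape a device model cannot write to guest memory and forget the dirty update, because there is no separate call to forget. A migration loop would merge this bitmap with the one KVM maintains for writes made by the guest itself.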

As I said earlier, there are very good reasons we do the things we do in QEMU. We're a large code base and there's far too much of the code base that no one cares about enough but that users are happy with. It's far too hard to make broad sweeping changes right now (although that's something we're trying to improve).

But I'd strongly suggest taking some of the advice being offered here. Don't ignore the hard problems to start out with because as the code base grows, it'll become more difficult to fix those. That's not to say that you need to implement migration tomorrow, but at least keep the constraints in mind and make sure that you're designing interfaces that let you do things like keep an updated dirty bitmap when you do memory accesses in userspace.

> I also don't agree with this sentiment that unless we have SMP,
> migration, yadda yadda yadda, now, it's impossible to change that in
> the future. It ignores the fact that this is exactly how the Linux
> kernel evolved

Over the course of 20 years. By my count, we still have another decade of refactoring before I can get on top of my ivory tower and call every other project terrible.

> and the fact that we're aggressively trying to keep the
> code size as small and tidy as possible so that changing things is as
> easy as possible.

> I've looked at QEMU sources over the years and especially over the
> past year and I think you might be way too familiar with its inner
> workings to see how complex (even the core code) has become for
> someone who isn't familiar with it.

I have no doubts about the complexity of QEMU. But the 'goo' factor is not due to complexity, it's due to the fact that there's a lot of code that basically needs to be removed. But removing features from an existing project is never a popular thing to do, particularly when they work well enough for a lot of people.

Regards,

Anthony Liguori