Re: Hot pluggable CPUs ( was Linux 2.5 / 2.6 TODO (preliminary) )

From: Malcolm Beattie (mbeattie@sable.ox.ac.uk)
Date: Mon Jun 05 2000 - 06:24:42 EST


James Sutherland writes:
> On Mon, 5 Jun 2000, David L. Nicol wrote:
>
> > James Sutherland wrote:
> >
> > > The kernel itself would be harder, of course. Kernel modules could do
> > > something similar - just unload the old one and reload the new one, taking
> > > care to avoid anything trying to use the module in the mean time - leaving
> > > just the core code - memory management etc., which would be much more
> > > difficult.
> >
> > RTlinux if I am not mistaken takes the stance that the whole linux business
> > is a low-priority real time process. I don't know how the rtlinux project
> > has been keeping up with kernel development. But there's a partitioning
> > system for you, if the RT microkernel (or whatever it is) is running
> > RT processes and one linux kernel, it could run two.
>
> How would it handle device drivers? Having two kernel device drivers each
> thinking they are running on a physical machine could upset things...
>
> The "hypervisor kernel" would probably have to handle all the device
> driver aspects - PCI bus, memory, CPUs, plus any shared resources (NICs,
> storage, perhaps) - but this could probably be done without too much
> upheaval, I think?
>
> The alternative would be looking at the user-mode kernel, and getting that
> running as an RTLinux task without any dependence on the "real" Linux
> kernel. RTLinux isn't something I'm familiar with - any thoughts on this?

I'm in the process of writing such a hypervisor/guest kernel
combination, called SILK (Simultaneous Instances of the Linux Kernel).
I may be talking about it somewhere like the O'Reilly Open Source
conference in Monterey next month. Here's the message I sent to the
linux-390 mailing list last month:

  Alan Cox writes:
> > Think about how you do this on VM - just give him the password of his own
> > virtual machine to reboot etc. Obviously adding disks etc is not really
> > something he needs to do, but migrating from one to the other may be.
>
> Or on x86 with user mode Linux. The one thing UML can't do though is to
> allow users to do their own kernel upgrades.
  
  The one I'm doing can (er, if it all keeps going to plan). I started
  it when I saw someone on this list say "Linux can't do what VM can"
  and thought "Oh, yeah?". It's slightly different in that we have the
  source to Linux, so the hypervisor (basically, a Linux kernel with a
  PAGE_OFFSET of 0xF0000000) and the guest kernel (a Linux kernel with
  all hardware access stripped out and privileged operations replaced
  by hypervisor calls) can be tuned to work nicely together and
  perform better.
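
  To give the flavour in code (only a sketch: the hypercall number and
  register convention here are invented, not SILK's actual interface),
  a privileged operation in the guest kernel becomes an unprivileged
  trap that the hypervisor interprets as a hypercall:

    /* Hypothetical guest-side hypercall stub.  From guest kernel
     * mode the ordinary syscall trap is taken by the hypervisor
     * and treated as a hypercall rather than a guest syscall. */

    #define HCALL_WRITE_PTE 2          /* invented hypercall number */

    static inline long hcall3(long nr, long a1, long a2, long a3)
    {
            long ret;
            __asm__ __volatile__("int $0x80"   /* trap to hypervisor */
                                 : "=a" (ret)
                                 : "a" (nr), "b" (a1), "c" (a2), "d" (a3)
                                 : "memory");
            return ret;
    }

    /* Where a native kernel would write the PTE and flush the TLB
     * itself, the guest asks the hypervisor to do it. */
    static inline void guest_set_pte(unsigned long va, unsigned long pte)
    {
            hcall3(HCALL_WRITE_PTE, va, pte, 0);
    }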
  
  The guest kernels run unprivileged as (fairly) ordinary tasks except
  that the hypervisor keeps extra state in task_struct, in particular
  a flag saying whether the task is in guest user mode or guest kernel
  mode. If a syscall trap arrives in guest user mode, it is propagated
  to the guest kernel; otherwise it is handled as a hypervisor call
  (alloc/write page table, initiate I/O, return to user mode). The
  guest returns to user mode by calling the hypervisor, which then
  unmaps the guest kernel (0xC0000000 - 0xF0000000) and flips back to
  unprivileged mode in the guest user. When the hypervisor gets a
  syscall (or other trap) in guest user mode, it maps the guest kernel
  back in and hands off to the guest's handler. All guest I/O is done
  via the hypervisor. The hypervisor itself is only given a small part
  of physical memory (32MB on a small box, more on a large one) and
  the memory for the guest kernels is mapped from spare higher
  physical memory (in 4MB pages, now that I've got user-mode "big"
  pages working reasonably well).
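
  In sketch form (again with invented names; this is a toy model, not
  SILK's code), the hypervisor's trap dispatch on that flag looks
  roughly like this:

    struct guest_task {
            int in_guest_kernel;           /* the extra task_struct flag */
            void (*guest_trap_entry)(int); /* guest kernel's trap handler */
    };

    /* Stubs standing in for the real mm and scheduling work. */
    static void map_guest_kernel(struct guest_task *g)   { (void)g; }
    static void unmap_guest_kernel(struct guest_task *g) { (void)g; }
    static void handle_hypercall(struct guest_task *g, int nr)
    {
            (void)g; (void)nr;     /* page table writes, I/O, ... */
    }

    /* A trap arrives from the (unprivileged) guest task. */
    static void dispatch_trap(struct guest_task *g, int trapnr)
    {
            if (!g->in_guest_kernel) {
                    /* Guest user mode: map the guest kernel back in
                     * at 0xC0000000 - 0xF0000000 and bounce the trap
                     * to the guest's own handler. */
                    map_guest_kernel(g);
                    g->in_guest_kernel = 1;
                    g->guest_trap_entry(trapnr);
            } else {
                    /* Guest kernel mode: treat it as a hypercall. */
                    handle_hypercall(g, trapnr);
            }
    }

    /* "Return to guest user mode" is itself a hypercall: unmap the
     * guest kernel and clear the flag before resuming the task. */
    static void hcall_return_to_user(struct guest_task *g)
    {
            unmap_guest_kernel(g);
            g->in_guest_kernel = 0;
    }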
  
  I could have done the guest kernel mm using segments for better
  performance, but this way it should be more portable, including to
  S/390. It's called SILK (Simultaneous Instances of the Linux Kernel)
  and, obviously, I'll release it more widely when a guest kernel gets
  rather further through booting than it does now.

I do now have the guest kernel booting as far as trying to create
the first kernel thread for init. Now it's "just" a question of
getting the trap/fault bounces coded, the mm hypervisor calls done
and the I/O calls done. The I/O stuff can start with basic network
and NFS or nbd, unless a block-device-level hypercall(dev,inout,addr,len),
followed by a kmap and queue to the real block device with a
prod-the-guest-when-done notification (sketched below), turns out to
be easier after all. It's all good fun. It'll be at
    http://users.ox.ac.uk/~mbeattie/linux-kernel.html
with my other Linux kernel stuff when it's further along.
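
For the curious, the block device route in sketch form (every name
here is invented for illustration; the real thing, if I go this way,
will differ):

    /* Hypervisor side of a hypercall(dev,inout,addr,len) block
     * request: map the guest's buffer (the kmap step), queue the
     * request against the real block device and prod the guest
     * with a virtual interrupt when it completes. */

    #define IO_READ  0
    #define IO_WRITE 1

    struct guest_task;   /* per-guest state kept by the hypervisor */

    extern void *map_guest_pages(struct guest_task *g,
                                 unsigned long addr, unsigned long len);
    extern void queue_real_io(int dev, int inout, void *buf,
                              unsigned long len,
                              void (*done)(struct guest_task *),
                              struct guest_task *g);
    extern void prod_guest(struct guest_task *g);  /* completion "prod" */

    static long hcall_block_io(struct guest_task *g, int dev, int inout,
                               unsigned long addr, unsigned long len)
    {
            /* Make the guest's buffer addressable by the hypervisor. */
            void *buf = map_guest_pages(g, addr, len);

            /* Queue against the real device; completion arrives
             * asynchronously and prods the guest. */
            queue_real_io(dev, inout, buf, len, prod_guest, g);

            return 0;   /* the guest continues; I/O finishes later */
    }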

Oh, and on the subject of hot-swap CPUs and memory: S/390 can already
do it and has done it for years (decades, pretty much) at various
levels depending on whether you're using raw hardware, LPAR or VM
(all of which are fine with Linux) or OS/390 (which has extra O/S and
application support for *really* hairy stuff like coping with a hard
failure of a live CPU: the other CPUs notice, undo any locks that the
bad CPU held and let the running job know so it can handle it, if
it's been written to do so). Basically, as far as I recall (and I've
only been relearning S/390 stuff in the last few months after leaving
the S/370/MVS/VM world 15 years ago):
cope with soft CPU/memory failure and hot-replace CPU and memory:
  no problem (don't need LPAR or VM)
cope with hard memory failure:
  no problem (I think?). One bank of memory can die without affecting
  anything since it's error-correctable across banks. Roughly the same
  (if not *the* same?) as what IBM call "Chipkill" memory that you get
  in their Netfinity servers.
cope with hard CPU failure:
  I think VM (and maybe LPAR too these days, though you used to have
  to allocate real CPUs rather than virtual ones to LPARs) can cope
  with a hard CPU failure, though you'll lose the kernel running in
  that VM. You can reboot it straight away and, since it's only a
  virtual CPU anyway, you'll come straight back up using the hot-spare
  CPU that S/390 has. If another CPU dies before the first is
  replaced, you'll get (n-1)/n performance as your load spreads over
  the remaining real CPUs.
How well you cope with failures of other things depends on how many
of them you bought, especially for I/O. You can get multiples of
anything and multipath through different ones as much as you like.
Failure of anything I/O-related is completely transparent since even
hardware I/O instructions refer only to logical device addresses and
the microcode handles the multipathing (hmm, maybe you have to vary
the paths manually if you're not under VM or LPAR: I'm hazy). IBM
will happily sell you a new S/390 for running Linux on.

--Malcolm

-- 
Malcolm Beattie <mbeattie@sable.ox.ac.uk>
Unix Systems Programmer
Oxford University Computing Services
