Re: live kernel upgrades
From: Pavel Emelyanov
Date: Tue Feb 24 2015 - 08:19:29 EST
On 02/24/2015 03:11 PM, Jiri Slaby wrote:
> On 02/22/2015, 10:46 AM, Ingo Molnar wrote:
>> Arbitrary live kernel upgrades could be achieved by
>> starting with the 'simple method' I outlined in earlier
>> mails, using some of the methods that kpatch and kGraft are
>> both utilizing or planning to utilize:
>>
>> - implement user task and kthread parking to get the
>> kernel into quiescent state.
>>
>> - implement (optional, thus ABI-compatible)
>> system call interruptability and restartability
>> support.
>>
>> - implement task state and (limited) device state
>> snapshotting support
>>
>> - implement live kernel upgrades by:
>>
>> - snapshotting all system state transparently
>>
>> - fast-rebooting into the new kernel image without
>> shutting down and rebooting user-space, i.e. _much_
>> faster than a regular reboot.
>>
>> - restoring system state transparently within the new
>> kernel image and resuming system workloads where
>> they were left.
>>
>> Even complex external state like TCP socket state and
>> graphics state can be preserved over an upgrade. As far as
>> the user is concerned, nothing happened but a brief pause -
>> and he's now running a v3.21 kernel, not v3.20.
>>
>> Obviously one of the simplest utilizations of live kernel
>> upgrades would be to apply simple security fixes to
>> production systems. But that's just a very simple
>> application of a much broader capability.
>>
>> Note that if done right, the time to perform a live
>> kernel upgrade on a typical system could be brought to well
>> below 10 seconds of system stoppage time: adequate for the
>> vast majority of installations.
>>
>> For special installations or well optimized hardware the
>> latency could possibly be brought below 1 second stoppage
>> time.
>
> Hello,
>
> IMNSHO, you cannot.
>
> The criu-based approach you have just described is already alive as an
> external project at Parallels. It is of course a perfect solution for
> some use cases, but its use case is a distinct one. It is not our
> competitor; it is our complement. I will try to explain why.
I fully agree -- these two approaches are not replacements for one another.
The only possible way to go is to use them both, each where appropriate.
But I have some comments on Jiri's points, please find them inline.
> It is highly dependent on HW. Kexec (or any other arbitrary
> kernel-exchange mechanism) is not supported by all HW, nor by all
> drivers. There is not even a way to implement snapshotting for some
> devices, which is a real issue, obviously.
>
> Downtime is highly dependent on the scenario. If you have plenty of
> dirty memory, you have to flush it first. This might take minutes,
> especially when using a network FS. Or you can skip the flush, but a
> failure to replace the kernel is then lethal.
It's not completely true. We can leave the dirty memory in memory (flushing
only the critical metadata) and remember that it is dirty. If the node
crashes during the reboot, this would just look as if you had write()-en
into a file and then crashed w/o fsync() -- i.e. the update of the page
cache is lost, but losing it is, well, sometimes expected.
This doesn't eliminate the downtime, but it helps to reduce it severalfold.
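To illustrate, here is a minimal sketch of the contract we rely on (the
file name is made up): the update reaches the page cache only, and an
application that really needs durability must call fsync() itself:

#include <fcntl.h>
#include <unistd.h>

int main(void)
{
        /* The update lands in the page cache only. */
        int fd = open("data.db", O_WRONLY | O_CREAT, 0644);

        write(fd, "update", 6);
        /* No fsync(fd): if the node dies before writeback, the
         * update is lost -- and POSIX says that is fair game.
         * Leaving dirty pages unflushed across the kernel swap
         * relies on exactly this contract. */
        close(fd);      /* close() does not imply a flush either */
        return 0;
}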
> If you have a heap of open FD, restore time will take ages.
One trick exists here too. For disk filesystems we can pull the "relevant
parts" of the block device cache into memory before freezing the processes.
The open()-s performed on restore would then bring up dentries and inodes
without big I/O -- it would still take a while, but not ages.
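A sketch of such a warm-up pass, assuming we know the paths the dumped
tasks hold open (prefetch() and the path below are made up):

#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Warm the dentry/inode and page caches for one path so that
 * restore-time open()s don't have to do cold I/O. */
static int prefetch(const char *path)
{
        struct stat st;
        int fd;

        if (stat(path, &st))            /* populates dentry + inode */
                return -1;
        fd = open(path, O_RDONLY);
        if (fd < 0)
                return -1;
        /* Kick off asynchronous readahead of the file data. */
        posix_fadvise(fd, 0, st.st_size, POSIX_FADV_WILLNEED);
        close(fd);
        return 0;
}

int main(void)
{
        return prefetch("/vz/private/101/root.hdd");
}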
> You cannot fool any of those. It's pure I/O. You cannot
> estimate the downtime and that is a real downside.
Generally you're right. We cannot bring the downtime down to the same small
values that real live-patching achieves -- not even close. But there are
tricks that can shorten it.
> Even if you can get the criu time under one second, this is still
> unacceptable for live patching. Live patching shall be three orders of
> magnitude faster than that, otherwise it makes no sense. If you can
> afford a second, you probably already have a large enough window or
> failure handling to perform a full and mainly safer reboot/kexec anyway.
Fully agree here.
> You cannot restore everything.
> * TCP is one of the pure beasts in this. And there are indeed plenty of
> theoretical papers behind this, explaining what can and cannot be done.
> * NFS is another one.
> * Xorg. Today, we cannot even smoothly switch between the discrete and
> native GFX chips. No go.
> * There indeed are situations where NP-hard problems need to be solved
> upon restoration. No way, if you want the restore to finish within this
> century.
+1. Candidates for those are process linkage (pids, ppids, sids and pgids,
especially orphaned ones) and mountpoints (all of them -- shared, slave,
private, bind and real) across several mount namespaces.
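For the process linkage part, here is a minimal sketch of why the restore
order is constrained: a session id can only be inherited over fork(), never
assigned to an already running task, so an orphaned session can only be
recreated by forking a temporary leader and letting it exit afterwards.

#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
        pid_t leader = fork();

        if (leader == 0) {
                setsid();       /* temporary session leader */
                if (fork() == 0) {
                        /* The task being restored: it inherited the
                         * wanted sid and keeps it after the leader
                         * dies -- there is no call to join a foreign
                         * session later on. */
                        printf("sid=%d\n", (int)getsid(0));
                        _exit(0);
                }
                _exit(0);       /* leader exits: the session is now
                                 * orphaned, just as it was dumped */
        }
        waitpid(leader, NULL, 0);
        return 0;
}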
> While you cannot live-patch everything using KLP, whether you can is
> patch-dependent. Failure of restoration is condition-dependent, and the
> condition is really fuzzy. That is a huge difference.
>
> Although you present the criu-based approach as provably safe and
> correct, it is not in many cases, and cannot be by definition.
>
> That said, we are not going to start moving that way, apart from adopting
> the many good points which emerged during the discussion (fake signals,
> to pick one).
>
>> This 'live kernel upgrades' approach would have various
>> advantages:
>>
>> - it brings together various principles working towards
>> shared goals:
>>
>> - the boot time reduction folks
>> - the checkpoint/restore folks
>> - the hibernation folks
>> - the suspend/resume and power management folks
>> - the live patching folks (you)
>> - the syscall latency reduction folks
>>
>> if so many disciplines are working together then maybe
>> something really good and long-term maintainable can
>> crystallize out of that effort.
>
> I must admit, whenever I implemented something in the kernel, nobody did
> any work for me. So the above will only result in the live patching teams
> doing all the work. I am not saying we do not want to do the work. I am
> only pointing out that there is nothing like "working together with other
> teams" (unless we are paying their bills).
>
>> - it ignores the security theater that treats security
>> fixes as a separate, disproportionately more important
>> class of fixes and instead allows arbitrarily complex
>> changes over live kernel upgrades.
>
> Hmm, more changes, more regressions. Complex changes, even more
> regressions. No customer desires complex changes in such updates.
>
>> - there's no need to 'engineer' live patches separately,
>> there's no need to review them and their usage sites
>> for live patching relevant side effects. Just create a
>> 'better' kernel as defined by users of that kernel:
>
> Review is the basic process which has to be done in any way.
>
> The ABI is stable? Not so much in reality. criu has the same deficiency
> as KLP here:
> * One example is the file size of entries in /sys or /proc. That can
> change, and you have to take care of it, as processes "assume" something.
> * Return values of syscalls are standardized, but nothing prevents
> anybody from changing them in subsequent kernels. And state machines in
> processes might be confused by a different retval from two subsequent
> syscalls (provided by two kernels).
Well, this bites us even without C/R. We've seen several times that a
container with some "stabilized" distro fails on a newer kernel because
ESRCH is returned instead of ENOENT.
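A hypothetical sketch of how such a "stabilized" state machine breaks (the
pid and the messages are made up):

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
        /* Poll another task by opening its /proc entry. */
        int fd = open("/proc/12345/status", O_RDONLY);

        if (fd < 0) {
                if (errno == ENOENT)
                        /* the old kernel's answer: treat as "exited" */
                        printf("task exited, respawning\n");
                else
                        /* a newer kernel answering ESRCH lands here
                         * and turns a routine event into a failure */
                        printf("fatal: errno %d\n", errno);
        } else {
                close(fd);
        }
        return 0;
}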
>> - in the enterprise distro space create a more stable
>> kernel and allow transparent upgrades into it.
>
> This is IMHO unsupportable.
>
>> We have many of the building blocks in place and have them
>> available:
>>
>> - the freezer code already attempts to park/unpark
>> threads transparently; that could be fixed/extended.
>
> That is broken in many funny ways. It needs to be fixed in any case:
> nothing defines a good freezing point, and something of course should.
> And if we then want to use those well-defined points? No doubt. The
> freezer will of course benefit too.
BTW, we don't need the freezer. The thing is that we need to inject parasite
code into the processes we dump, so the process should remain in a runnable
state. Thus we stop and "freeze" tasks with PTRACE_SEIZE instead.
And this brings another obstacle for this kind of kernel update -- if a task
is sitting under gdb/strace, then there is no kernel update.
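A minimal sketch of the CRIU-style stop, with error handling trimmed:

#include <stdlib.h>
#include <sys/ptrace.h>
#include <sys/wait.h>

static int seize_task(pid_t pid)
{
        int status;

        /* Attach without stopping the task; unlike PTRACE_ATTACH no
         * SIGSTOP is injected. This fails if gdb/strace got there
         * first -- hence the "no update under a debugger" caveat. */
        if (ptrace(PTRACE_SEIZE, pid, NULL, NULL))
                return -1;
        /* Now request a stop that the dumper fully controls ... */
        if (ptrace(PTRACE_INTERRUPT, pid, NULL, NULL))
                return -1;
        /* ... and wait for the task to enter the ptrace-stop, from
         * which parasite code can be injected and later removed. */
        if (waitpid(pid, &status, 0) != pid)
                return -1;
        return 0;
}

int main(int argc, char **argv)
{
        return argc > 1 ? seize_task(atoi(argv[1])) : 1;
}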
>> - hibernation, regular suspend/resume and in general
>> power management has in essence already implemented
>> most building blocks needed to enumerate and
>> checkpoint/restore device state that otherwise gets
>> lost in a shutdown/reboot cycle.
>
> Not at all. A lot of suspend/resume hooks result only in
> shutdown/reset. That is not what criu wants. And in many cases,
> implementing c/r is not feasible (see above).
>
>> A feature like arbitrary live kernel upgrades would be well
>> worth the pain and would be worth the complications, and
>> it's actually very feasible technically.
>
> Yes, I like criu very much, but it is not going to save the Universe, as
> you seem to believe. Neither is KLP. Remember, they are complementary.
> Maybe Pavel can comment on this too.
I agree, these two technologies are complementary. And I'd also add that
their potential users differ. For example, hosting providers like
live-patching a lot, since they always provide their customers a kernel
with all (or most of) the recent security hot-fixes applied. But they don't
actively use CRIU-based updates, since they do major updates either when a
node crashes, or by migrating containers from one node to another and
replacing the node afterwards.
>> The goals of the current live kernel patching projects --
>> "being able to apply only the simplest of live patches" --
>> would in my opinion mostly serve the security
>> theater?
>
> No, we all pray to the KISS principle. Having something basic, which
> works and can be built upon, is everything we want as the starting point.
> Extending the functionality from there is the right way -- that is not
> our idea, it is what the current SW management points of view tell us.
> It is user-driven development that is called successful.
>
>> They are not forward looking enough, and in that
>> sense they could even be counterproductive.
>
> Being able to apply fixes for over 90 % of the CVEs in 3.20 does not
> sound bad or counterproductive to me at all, sorry.
Thanks,
Pavel