[RFC PATCH] Crashdump Accepting Active IOMMU

From: Sumner, William
Date: Thu Sep 26 2013 - 19:27:33 EST

This Request For Comment submission solicits comments on two things: a concept for how kdump can handle legacy DMA IO left over from the panicked kernel, and early prototype code that implements it. Some level of interest was noted when I proposed this concept in June; however, for generating serious discussion there is no substitute for a working prototype.

This concept modifies the behavior of the iommu in the (new) crashdump kernel:
1. to accept the iommu hardware in an active state,
2. to leave the current translations in place so that legacy DMA will continue using its current buffers until the device drivers in the crashdump kernel initialize themselves and their devices,
3. to use different portions of the iova address ranges for the device drivers in the crashdump kernel than the iova ranges that were in use at the time of the panic.
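
As a rough illustration of point 3 (a user-space sketch, not the code in the patch), the idea is that every iova range inherited from the old kernel is treated as reserved, and new allocations are placed around those ranges. The names struct iova_range, find_overlap() and alloc_new_iova() are hypothetical, ranges are modeled as inclusive page-frame numbers, and a simple first-fit search stands in for whatever allocator policy the kernel actually uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model: an inclusive range of iova page-frame numbers. */
struct iova_range {
    uint64_t lo;
    uint64_t hi;
};

/* Return the first old-kernel range that overlaps [lo, hi], or NULL. */
static const struct iova_range *find_overlap(const struct iova_range *old,
                                             size_t n, uint64_t lo, uint64_t hi)
{
    for (size_t i = 0; i < n; i++)
        if (lo <= old[i].hi && old[i].lo <= hi)
            return &old[i];
    return NULL;
}

/*
 * First-fit search for npages contiguous pfns within [0, limit] that avoid
 * every range still in use by legacy DMA. Returns the starting pfn, or
 * UINT64_MAX if nothing fits. Assumes the values are far from overflow.
 */
static uint64_t alloc_new_iova(const struct iova_range *old, size_t n,
                               uint64_t npages, uint64_t limit)
{
    uint64_t start = 0;

    assert(npages > 0);
    while (start + npages - 1 <= limit) {
        const struct iova_range *hit =
            find_overlap(old, n, start, start + npages - 1);

        if (hit == NULL)
            return start;
        start = hit->hi + 1;    /* skip past the conflicting old range */
    }
    return UINT64_MAX;
}
```

With old ranges [0x10, 0x1f] and [0x40, 0x4f], a request for 0x20 pages lands at 0x20, between the two inherited ranges, so legacy DMA and the new driver never share an iova.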

Advantages of this concept:
1. All manipulation of the IO-device is done by the Linux device-driver for that device.
2. This concept behaves in a very similar manner to operation without an active iommu.
3. Any activity between the IO-device and its RMRR areas is handled by the device-driver in the same manner as during a non-kdump boot.
4. If an IO-device has no driver in the kdump kernel, it is simply left alone. This supports the practice of creating a special kdump kernel without drivers for any devices that are not required for taking a crashdump.

About the early-prototype code in the patch below:
1. It works on one machine that reproduced the original problem. It still needs to be tested on many other machines with various IO configurations.

2. It is currently implemented for the intel-iommu architecture only.

3. It is based near top-of-tree (TOT) from kernel.org. The TOT version of 'crash' reads the dump that is produced.

4. It is definitely prototype-only and not yet ready to propose as a patch for inclusion into Linux proper.

5. Although this patch is not yet intended for incorporation into mainstream Linux, it should install and operate for anyone who wants to experiment with it. Because this patch changes low-level IO operation, and because it has had very limited testing, I strongly advise against installing this patch on any system that contains production data.

6. For this RFC, I decided to leave in all of the debugging, diagnostic, temporary, and test code so that it would be readily available. In a (future) patch submission, much of this would need to be eliminated, separated into a diagnostics area, moved under conditional compilation, or handled in some other way. We'll see what the Linux community recommends.

At a high level, this code:
* is entirely within intel-iommu.c
* operates primarily during iommu initialization and device-driver initialization

During intel-iommu hardware initialization:
In intel_iommu_init(void)
* If (This is the crash kernel)
. Set flag: crashdump_accepting_active_iommu (all changes below check this)
. Skip disabling the iommu hardware translations
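
The decision above can be modeled in a few lines. This is only a user-space sketch: crashdump_accepting_active_iommu is the flag named in this RFC, while struct model_iommu, model_iommu_init() and the is_crash_kernel parameter are hypothetical stand-ins for the real hardware state and for however the kernel detects a kdump boot.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Flag named in the RFC; all later changes would test it. */
static bool crashdump_accepting_active_iommu;

/* Hypothetical stand-in for the hardware's translation-enable state. */
struct model_iommu {
    bool translation_enabled;
};

static void model_iommu_init(struct model_iommu *hw, bool is_crash_kernel)
{
    assert(hw != NULL);

    if (is_crash_kernel) {
        /* Accept the hardware as the old kernel left it. */
        crashdump_accepting_active_iommu = true;
        return;
    }

    /* Normal boot in this model: start from a disabled state. */
    crashdump_accepting_active_iommu = false;
    hw->translation_enabled = false;
}
```

The point of the sketch is only that the crash-kernel path returns before the disable step, so in-flight legacy DMA keeps a valid translation.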

In init_dmars()
* Duplicate the intel iommu translation tables from the old kernel in the new kernel
. The root-entry table, all context-entry tables, and all page-translation-entry tables
. The duplicate tables contain updated physical addresses to link them together.
. The duplicate tables are mapped into kernel virtual addresses in the new kernel
which allows most of the existing iommu code to operate without change.
. Do some minimal sanity-checks during the copy
. Place the address of the new root-entry structure into "struct intel_iommu"

* Skip setting-up new domains for 'si', 'rmrr', 'isa'
. Translations for 'rmrr' and 'isa' ranges have been copied from the old kernel
. This prototype does not yet handle pass-through

* Existing (unchanged) code near the end of init_dmars():
. Loads the address of the (now new) root-entry structure from "struct intel_iommu"
into the iommu hardware and does the iommu hardware flushes. This changes the
active translation tables from the ones in the old kernel to the copies in the new kernel.
. This is legal because the translations in the two sets of tables are currently identical:
Intel(r) Virtualization Technology for Directed I/O. Architecture Specification,
February 2011, Rev. 1.3 (section 11.2, paragraph 2)
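
The table duplication in init_dmars() can be pictured with a toy model. This is a sketch under heavy assumptions, not the prototype's code: "physical memory" is an array of small tables addressed by index, bit 0 of an entry means "present", and a non-leaf entry stores its child's index in the remaining bits. The names struct table, struct mem and copy_table() are hypothetical. Non-leaf links are rewritten to point at the copies, while leaf entries are copied verbatim because they must keep pointing at the buffers that legacy DMA is still using.

```c
#include <assert.h>
#include <stdint.h>

#define ENTRIES    4        /* toy size; real tables hold many more entries */
#define MAX_TABLES 16
#define PRESENT    1ULL     /* bit 0 marks a valid entry in this model */

struct table {
    uint64_t e[ENTRIES];
};

/* Toy "physical memory": tables are addressed by index, not by address. */
struct mem {
    struct table t[MAX_TABLES];
    int used;
};

/*
 * Copy the table at old_idx, and everything below it, into new_mem.
 * level is the number of table levels below this one (0 means leaf).
 * Returns the index of the copy in new_mem, or -1 on a failed check.
 */
static int copy_table(const struct mem *old_mem, int old_idx,
                      struct mem *new_mem, int level)
{
    int new_idx;

    assert(level >= 0);
    /* Minimal sanity checks during the copy. */
    if (old_idx < 0 || old_idx >= old_mem->used ||
        new_mem->used >= MAX_TABLES)
        return -1;

    new_idx = new_mem->used++;
    for (int i = 0; i < ENTRIES; i++) {
        uint64_t ent = old_mem->t[old_idx].e[i];

        if (level > 0 && (ent & PRESENT)) {
            uint64_t old_child = ent >> 1;
            int child;

            if (old_child >= (uint64_t)old_mem->used)
                return -1;
            child = copy_table(old_mem, (int)old_child, new_mem, level - 1);
            if (child < 0)
                return -1;
            /* Relink the parent entry to the copied child table. */
            ent = ((uint64_t)child << 1) | PRESENT;
        }
        new_mem->t[new_idx].e[i] = ent;   /* leaf entries stay unchanged */
    }
    return new_idx;
}
```

Because only the links between tables change, a walk through the copy yields the same final translation as a walk through the original, which is the property that makes switching the root pointer safe.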

In iommu_init_domains()
* Mark as in use all domain-ids from the old kernel
. In case the new kernel contains a device that was not in the old kernel
and a new, unused domain-id is actually needed, the bitmap will give us one.
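
A small model of the domain-id bookkeeping, assuming a plain bitmap with one bit per id. NDOMAINS, struct domid_map and the helper names are hypothetical, and the real hardware reports its own number of supported domains.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define NDOMAINS      256   /* assumption for the sketch */
#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

struct domid_map {
    unsigned long bits[(NDOMAINS + BITS_PER_WORD - 1) / BITS_PER_WORD];
};

static void domid_set(struct domid_map *m, unsigned id)
{
    assert(id < NDOMAINS);
    m->bits[id / BITS_PER_WORD] |= 1UL << (id % BITS_PER_WORD);
}

static int domid_test(const struct domid_map *m, unsigned id)
{
    assert(id < NDOMAINS);
    return (int)((m->bits[id / BITS_PER_WORD] >> (id % BITS_PER_WORD)) & 1UL);
}

/* Mark every domain-id found in the old kernel's contexts as in use. */
static void reserve_old_domids(struct domid_map *m,
                               const unsigned *old_ids, size_t n)
{
    for (size_t i = 0; i < n; i++)
        domid_set(m, old_ids[i]);
}

/* Hand out the lowest id that is still free, or -1 if none is left. */
static int domid_alloc(struct domid_map *m)
{
    for (unsigned id = 0; id < NDOMAINS; id++) {
        if (!domid_test(m, id)) {
            domid_set(m, id);
            return (int)id;
        }
    }
    return -1;
}
```

If the old kernel used ids 0, 1 and 3, a device that appears only in the new kernel would receive id 2, and the next one id 4, so no inherited id is ever handed out twice.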

When a new domain is created for a device:
* If (this device has a context in the old kernel)
. Get domain-id, address-width, and IOVA ranges from the old kernel context;
. Get the address of the page-entry tables from the copy in the new kernel;
. And apply all of the above values to the new domain structure.
* Else
. Create a new domain as normal
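
The branch above might look roughly like the following. Every name here (struct old_context, struct model_domain, model_domain_init(), MODEL_DEFAULT_WIDTH) is a hypothetical stand-in, the default width of 48 is an assumed value for the sketch, and the iova ranges are left out to keep it short.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_DEFAULT_WIDTH 48  /* assumed address width for a fresh domain */

/* Hypothetical summary of what the old kernel's context entry held. */
struct old_context {
    bool present;
    unsigned domain_id;
    unsigned addr_width;
    void *pgd_copy;     /* page tables as copied into the new kernel */
};

struct model_domain {
    unsigned id;
    unsigned addr_width;
    void *pgd;
    bool inherited;
};

static void model_domain_init(struct model_domain *d,
                              const struct old_context *ctx,
                              unsigned fresh_id, void *fresh_pgd)
{
    assert(d != NULL);

    if (ctx != NULL && ctx->present) {
        /* Device had a context in the old kernel: reuse its settings. */
        d->id = ctx->domain_id;
        d->addr_width = ctx->addr_width;
        d->pgd = ctx->pgd_copy;
        d->inherited = true;
    } else {
        /* No old context: create the domain as normal. */
        d->id = fresh_id;
        d->addr_width = MODEL_DEFAULT_WIDTH;
        d->pgd = fresh_pgd;
        d->inherited = false;
    }
}
```

Note that the inherited case takes its page-table pointer from the copy made in the new kernel, never from the old kernel's memory.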

I would very much like the advice of the Linux community on how to proceed.

Signed-off-by: Bill Sumner <bill.sumner@xxxxxx>