Re: [PATCH v8 07/16] x86/virt/tdx: Use all system memory when initializing TDX module as TDX memory

From: Huang, Kai
Date: Tue Jan 10 2023 - 07:12:39 EST


On Mon, 2023-01-09 at 08:51 -0800, Dave Hansen wrote:
> On 1/9/23 03:48, Huang, Kai wrote:
> ...
> > > > > > This approach works as in practice all boot-time present DIMMs are TDX
> > > > > > convertible memory. However, if any non-TDX-convertible memory has been
> > > > > > hot-added (e.g. CXL memory via kmem driver) before initializing the TDX
> > > > > > module, the module initialization will fail.
> > > >
> > > > I really don't know what this is trying to say.
> >
> > My intention is to explain and call out that such a design (using all memory
> > regions in memblock at the time of module initialization) works in practice, as
> > long as non-CMR memory hasn't been added via memory hotplug.
> >
> > Not sure if it is necessary, but I was thinking it may help reviewers judge
> > whether such a design is acceptable.
>
> This is yet another case where you've mechanically described the "what",
> but left out the implications or the underlying basis "why".
>
> I'd take a more methodical approach to describe what is going on here.
> List the steps that must occur, or at least *one* example of those steps
> and how they interact with the code in this patch. Then, explain the
> fallout.
>
> I also don't think it's quite right to call out "CXL memory via kmem
> driver". If the CXL memory was "System RAM", it should get covered by a
> CMR and TDMR. The kmem driver can still go wild with it.
>
> > > > *How* and *why* does this module initialization failure occur?
> >
> > If we pass any non-CMR memory to the TDX module, the SEAMCALL (TDH.SYS.CONFIG)
> > will fail.
>
> I'm frankly lost now. Please go back and try to explain this better.
> Let me know if you want to iterate on this faster than resending this
> series five more times. I've got some ideas.

Let me try to do my work first. Thanks.

>
> > > > > > This can also be enhanced in the future, e.g. by allowing adding non-TDX
> > > > > > memory to a separate NUMA node. In this case, the "TDX-capable" nodes
> > > > > > and the "non-TDX-capable" nodes can co-exist, but the kernel/userspace
> > > > > > needs to guarantee memory pages for TDX guests are always allocated from
> > > > > > the "TDX-capable" nodes.
> > > >
> > > > Why does it need to be enhanced? What's the problem?
> >
> > The problem is that after TDX module initialization, no more memory can be
> > hot-added to the page allocator.
> >
> > Kirill suggested this may not be ideal. With the existing NUMA ABIs we can
> > actually have both TDX-capable and non-TDX-capable NUMA nodes online. We can
> > bind TDX workloads to TDX-capable nodes while other non-TDX workloads can
> > utilize all memory.
> >
> > But probably it is not necessary to call out in the changelog?
>
> Let's say that we add this TDX-compatible-node ABI in the future. What
> will old code do that doesn't know about this ABI?

Right. An old app that doesn't know about the new ABI will break. One
resolution, I think, is to not introduce a new userspace ABI, but hide the
"TDX-capable" and "non-TDX-capable" distinction in the kernel, and let the
kernel enforce that memory for TDX guests is always allocated from the
"TDX-capable" nodes.

Anyway, perhaps we can just delete this part from the changelog?

>
> ...
> > > > > > + list_for_each_entry(tmb, &tdx_memlist, list) {
> > > > > > + /*
> > > > > > + * The new range is TDX memory if it is fully covered by
> > > > > > + * any TDX memory block.
> > > > > > + *
> > > > > > + * Note TDX memory blocks originate from memblock
> > > > > > + * memory regions, which can only be contiguous when two
> > > > > > + * regions have different NUMA nodes or flags. Therefore
> > > > > > + * the new range cannot cross multiple TDX memory blocks.
> > > > > > + */
> > > > > > + if (start_pfn >= tmb->start_pfn && end_pfn <= tmb->end_pfn)
> > > > > > + return true;
> > > > > > + }
> > > > > > + return false;
> > > > > > +}
> > > >
> > > > I don't really like that comment. It should first state its behavior
> > > > and assumptions, like:
> > > >
> > > > This check assumes that the start_pfn<->end_pfn range does not
> > > > cross multiple tdx_memlist entries.
> > > >
> > > > Only then should it describe why that is OK:
> > > >
> > > > A single memory hotplug event across multiple memblocks (from
> > > > which tdx_memlist entries are derived) is impossible. ... then
> > > > actually explain
> > > >
> >
> > How about below?
> >
> > /*
> > * This check assumes that the start_pfn<->end_pfn range does not cross
> > * multiple tdx_memlist entries. A single memory hotplug event across
> > * multiple memblocks (from which tdx_memlist entries are derived) is
> > * impossible. That means start_pfn<->end_pfn range cannot exceed a
> > * tdx_memlist entry, and the new range is TDX memory if it is fully
> > * covered by any tdx_memlist entry.
> > */
>
> I was hoping you would actually explain why it is impossible.
>
> Is there something fundamental that keeps a memory area that spans two
> nodes from being removed and then a new area added that is comprised of
> a single node?
> Boot time:
>
> | memblock | memblock |
> <--Node=0--> <--Node=1-->
>
> Funky hotplug... nothing to see here, then:
>
> <--------Node=2-------->

I must have missed something, but how can this happen?

My recollection is that this cannot happen, because the BIOS always allocates
address ranges for all NUMA nodes during machine boot. Those address ranges
don't necessarily need to be fully populated with DIMMs, but they don't change
during the machine's runtime.

>
> I would believe that there is no current bare-metal TDX system that has
> an implementation like this. But, the comments above speak like it's
> fundamentally impossible. That should be clarified.
>
> In other words, that comment talks about memblock attributes as being
> the core underlying reason that that simplified check is OK. Is that
> it, or is it really the reduced hotplug feature set on TDX systems?

Let me do more homework and get back to you.

>
>
> ...
> > > > > > + * Build the list of "TDX-usable" memory regions which cover all
> > > > > > + * pages in the page allocator to guarantee that. Do it while
> > > > > > + * holding mem_hotplug_lock read-lock as the memory hotplug code
> > > > > > + * path reads the @tdx_memlist to reject any new memory.
> > > > > > + */
> > > > > > + get_online_mems();
> > > >
> > > > Oh, it actually uses the memory hotplug locking for list protection.
> > > > That's at least a bit subtle. Please document that somewhere in the
> > > > functions that actually manipulate the list.
> >
> > add_tdx_memblock() and free_tdx_memlist() eventually call list_add_tail() and
> > list_del() to manipulate the list, but they actually take 'struct list_head
> > *tmb_list' as an argument. 'tdx_memlist' is passed to build_tdx_memlist() as input.
> >
> > Do you mean document the locking around the implementation of add_tdx_memblock()
> > and free_tdx_memlist()?
> >
> > Or should we just mention it around the 'tdx_memlist' variable?
> >
> > /* All TDX-usable memory regions. Protected by memory hotplug locking. */
> > static LIST_HEAD(tdx_memlist);
>
> I don't think I'd hate it being in all three spots. Also "protected by
> memory hotplug locking" is pretty generic. Please be more specific.

OK will do.

>
> > > > I think it's also worth saying something here about the high-level
> > > > effects of what's going on:
> > > >
> > > > Take a snapshot of the memory configuration (memblocks). This
> > > > snapshot will be used to enable TDX support for *this* memory
> > > > configuration only. Use a memory hotplug notifier to ensure
> > > > that no other RAM can be added outside of this configuration.
> > > >
> > > > That's it, right?
> >
> > Yes. I'll include the above in the comment around get_online_mems().
> >
> > But should I move "Use a memory hotplug notifier ..." part to:
> >
> > err = register_memory_notifier(&tdx_memory_nb);
> >
> > because this is where we actually use the memory hotplug notifier?
>
> I actually want that snippet in the changelog.

Will do.

[snip]