Re: [PATCH v8 07/16] x86/virt/tdx: Use all system memory when initializing TDX module as TDX memory

From: Huang, Kai
Date: Wed Jan 11 2023 - 21:26:14 EST


On Thu, 2023-01-12 at 09:59 +0800, Huang, Ying wrote:
> "Huang, Kai" <kai.huang@xxxxxxxxx> writes:
>
> > On Thu, 2023-01-12 at 08:56 +0800, Huang, Ying wrote:
> > > "Huang, Kai" <kai.huang@xxxxxxxxx> writes:
> > >
> > > > On Tue, 2023-01-10 at 08:18 -0800, Hansen, Dave wrote:
> > > > > On 1/10/23 04:09, Huang, Kai wrote:
> > > > > > On Mon, 2023-01-09 at 08:51 -0800, Dave Hansen wrote:
> > > > > > > On 1/9/23 03:48, Huang, Kai wrote:
> > > > > > > > > > > > This can also be enhanced in the future, i.e. by allowing adding non-TDX
> > > > > > > > > > > > memory to a separate NUMA node. In this case, the "TDX-capable" nodes
> > > > > > > > > > > > and the "non-TDX-capable" nodes can co-exist, but the kernel/userspace
> > > > > > > > > > > > needs to guarantee memory pages for TDX guests are always allocated from
> > > > > > > > > > > > the "TDX-capable" nodes.
> > > > > > > > > >
> > > > > > > > > > Why does it need to be enhanced? What's the problem?
> > > > > > > >
> > > > > > > > The problem is after TDX module initialization, no more memory can be hot-added
> > > > > > > > to the page allocator.
> > > > > > > >
> > > > > > > > Kirill suggested this may not be ideal. With the existing NUMA ABIs we can
> > > > > > > > actually have both TDX-capable and non-TDX-capable NUMA nodes online. We can
> > > > > > > > bind TDX workloads to TDX-capable nodes while other non-TDX workloads can
> > > > > > > > utilize all memory.
> > > > > > > >
> > > > > > > > But probably it is not necessarily to call out in the changelog?
> > > > > > >
> > > > > > > Let's say that we add this TDX-compatible-node ABI in the future. What
> > > > > > > will old code do that doesn't know about this ABI?
> > > > > >
> > > > > > Right. The old app will break w/o knowing the new ABI. One resolution, I
> > > > > > think, is that we don't introduce a new userspace ABI, but hide "TDX-capable"
> > > > > > and "non-TDX-capable" nodes in the kernel, and let the kernel enforce always
> > > > > > allocating TDX guest memory from those "TDX-capable" nodes.
> > > > >
> > > > > That doesn't actually hide all of the behavior from users. Let's say
> > > > > they do:
> > > > >
> > > > > numactl --membind=6 qemu-kvm ...
> > > > >
> > > > > In other words, take all of this guest's memory and put it on node 6.
> > > > > There's lots of free memory on node 6, which is TDX-*IN*compatible. Then,
> > > > > they make it a TDX guest:
> > > > >
> > > > > numactl --membind=6 qemu-kvm -tdx ...
> > > > >
> > > > > What happens? Does the kernel silently ignore the --membind=6? Or does
> > > > > it return -ENOMEM somewhere and confuse the user who has *LOTS* of free
> > > > > memory on node 6?
> > > > >
> > > > > In other words, I don't think the kernel can just enforce this
> > > > > internally and hide it from userspace.
> > > >
> > > > IIUC, the kernel (for instance KVM, which knows the 'task_struct' belongs to a
> > > > TDX guest) can manually AND the "TDX-capable" node mask into the task's
> > > > mempolicy, so that the memory will always be allocated from those "TDX-capable"
> > > > nodes.  KVM can refuse to create the TDX guest if it finds the task's mempolicy
> > > > doesn't contain any "TDX-capable" node, and print a clear message to userspace.
> > > >
> > > > But I am new to the core-mm, so I might have some misunderstanding.
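
To be a bit more concrete about the "AND" above, something roughly like the
sketch below is what I had in mind. It is only an illustration:
'tdx_capable_nodes', the helper itself and where it would be hooked into KVM
are all made up here, not actual KVM or core-mm code.

/* Sketch only: 'tdx_capable_nodes' and this helper are made up. */
#include <linux/nodemask.h>
#include <linux/errno.h>

static nodemask_t tdx_capable_nodes;	/* filled during TDX init (hypothetical) */

static int tdx_restrict_mempolicy_nodes(nodemask_t *policy_nodes)
{
	nodemask_t allowed;

	/* Intersect the task's allowed nodes with the TDX-capable ones. */
	nodes_and(allowed, *policy_nodes, tdx_capable_nodes);

	/* Nothing left after the AND: refuse to create the TDX guest. */
	if (nodes_empty(allowed))
		return -EINVAL;

	/* Otherwise restrict allocations to the TDX-capable subset. */
	*policy_nodes = allowed;
	return 0;
}
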
> > >
> > > KVM here means the in-kernel KVM module?  If so, KVM can only output some
> > > message in dmesg, which isn't very good for users to digest.  It's
> > > better for the user-space QEMU to detect whether the current configuration
> > > is usable and respond to users via GUI, syslog, etc.
> >
> > I am not against this. For instance, maybe we can add some dedicated error code
> > and let KVM return it to Qemu, but I don't want to speak for the KVM guys. We can
> > discuss this more when the patches are actually sent out to the community.
>
> Error code is a kind of ABI too. :-)
>

Right. I can bring this up in the KVM TDX support series (when the time is
right) to see whether the KVM guys want such an error code in the initial
support, or prefer to extend it in the future. The worst case is that an old
Qemu may not recognize the new error code (while the error is still available
in dmesg) but can still do the right thing (stop running, etc.).
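
To illustrate that worst case, here is a purely hypothetical userspace snippet
(not Qemu code; the ioctl number and the dedicated error value below are
placeholders, not existing KVM ABI): an old Qemu which doesn't know the new
code simply takes the generic failure path and stops, while the details remain
visible in dmesg.

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>

#define TD_CREATE_IOCTL		0x4000	/* placeholder, not a real KVM ioctl */
#define ETDX_NO_MEMORY		0x1000	/* placeholder for a new dedicated error code */

static void create_td_or_die(int vm_fd, void *args)
{
	if (ioctl(vm_fd, TD_CREATE_IOCTL, args) < 0) {
		if (errno == ETDX_NO_MEMORY)
			/* New Qemu: can tell the user the precise reason. */
			fprintf(stderr, "no TDX-capable memory in the allowed nodes\n");
		else
			/* Old Qemu: generic failure; details stay in dmesg. */
			fprintf(stderr, "TD creation failed: %s\n", strerror(errno));
		exit(1);
	}
}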