Re: [PATCH net 1/9] net/mlx5: Ensure fw pages are always allocated on same NUMA

From: Moshe Shemesh
Date: Thu Jun 19 2025 - 12:32:33 EST

On 6/15/2025 5:44 PM, Zhu Yanjun wrote:

On 2025/6/14 22:55, Moshe Shemesh wrote:

On 6/13/2025 7:22 PM, Zhu Yanjun wrote:

On 2025/6/10 8:15, Mark Bloch wrote:

From: Moshe Shemesh <moshe@xxxxxxxxxx>

When firmware asks the driver to allocate more pages, using the
give_pages event, the driver should always allocate them from the same
NUMA node, the original device NUMA node. The current code uses
dev_to_node(), which can return a different NUMA node because the value
is changed by other driver flows, such as
mlx5_dma_zalloc_coherent_node(). Instead, use the saved NUMA node for
allocating firmware pages.

I'm not sure whether NUMA balancing is currently being considered or
not.

If I understand correctly, after this commit is applied, all pages will
be allocated from the same NUMA node — specifically, the original
device's NUMA node. This seems like it could lead to NUMA imbalance.

The change applies only to pages allocated for FW use. Pages that are
allocated for driver use, such as SQ/RQ/CQ/EQ etc., are not affected by
this change.

As for FW pages (allocated for FW use), we did mean to use only the
NUMA node closest to the device; we are not looking for balance here.
Even before the change, in most cases FW pages were allocated from the
NUMA node closest to the device; the fix only ensures it.

Thanks a lot. I’m fine with your explanations.

In the past, I encountered a NUMA-balancing issue where memory
allocations were dependent on the mlx5 device. Specifically, memory was
allocated only from the NUMA node closest to the mlx5 device. As a
result, during the lifetime of the process, more than 100GB of memory
was allocated from that single NUMA node, while other NUMA nodes saw no
significant allocations. This led to a NUMA imbalance problem.

According to your commit, SQ/RQ/CQ/EQ are not affected—only the firmware
(FW) pages are. These FW pages include Memory Region (MR) and On-Demand
Paging (ODP) pages. ODP pages are freed after use, and the amount of MR
pages remains fixed throughout the process lifecycle. Therefore, in
theory, this commit should not cause any NUMA imbalance. However, since
production environments can be complex, I’ll monitor for any NUMA
balancing issues after this commit is deployed in production.

Thanks for monitoring it.
Just to clarify, this change does not affect MR allocation either. It
affects only pages allocated for FW internal use, that is, handling
requests from FW using the give_pages() function and the manage_pages
command.

In short, I’m fine with both this commit and your explanations.

Thanks,
Moshe.

Thanks,

Yanjun.Zhu

By using dev_to_node, it appears that pages could be allocated from
other NUMA nodes, which might help maintain better NUMA balance.

In the past, I encountered a NUMA balancing issue caused by the mlx5
NIC, so using dev_to_node might be beneficial in addressing similar
problems.

Thanks,
Zhu Yanjun

Fixes: 311c7c71c9bb ("net/mlx5e: Allocate DMA coherent memory on reader NUMA node")
Signed-off-by: Moshe Shemesh <moshe@xxxxxxxxxx>
Reviewed-by: Tariq Toukan <tariqt@xxxxxxxxxx>
Signed-off-by: Mark Bloch <mbloch@xxxxxxxxxx>
---
drivers/net/ethernet/mellanox/mlx5/core/pagealloc.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/pagealloc.c b/drivers/net/ethernet/mellanox/mlx5/core/pagealloc.c
index 972e8e9df585..9bc9bd83c232 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/pagealloc.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/pagealloc.c
@@ -291,7 +291,7 @@ static void free_4k(struct mlx5_core_dev *dev, u64 addr, u32 function)
 static int alloc_system_page(struct mlx5_core_dev *dev, u32 function)
 {
      struct device *device = mlx5_core_dma_dev(dev);
-     int nid = dev_to_node(device);
+     int nid = dev->priv.numa_node;
      struct page *page;
      u64 zero_addr = 1;
      u64 addr;
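
To make the reasoning behind the one-line change easier to follow, here is
a minimal user-space sketch of the problem, not the driver code itself.
All names in it (model_device, model_core_dev, model_probe and so on) are
hypothetical stand-ins. It assumes that some other flow retargets the
device's node for a while and restores it afterwards, which is how the
commit message describes the effect of mlx5_dma_zalloc_coherent_node().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; not the real kernel structures. */
struct model_device {
	int numa_node;		/* mutable value, like what dev_to_node() reads */
};

struct model_core_dev {
	struct model_device device;
	int saved_numa_node;	/* models dev->priv.numa_node, written once */
};

/* Record the device's original node in both places at "probe" time. */
void model_probe(struct model_core_dev *dev, int node)
{
	assert(dev != NULL);
	dev->device.numa_node = node;
	dev->saved_numa_node = node;
}

/* Assumed behaviour of another flow: point the device at a different node. */
void model_retarget_begin(struct model_core_dev *dev, int node)
{
	dev->device.numa_node = node;
}

/* ...and put the original node back when that flow is done. */
void model_retarget_end(struct model_core_dev *dev)
{
	dev->device.numa_node = dev->saved_numa_node;
}

/* Before the patch: the node comes from the mutable device field. */
int fw_page_node_old(const struct model_core_dev *dev)
{
	return dev->device.numa_node;
}

/* After the patch: the node comes from the value saved at probe time. */
int fw_page_node_new(const struct model_core_dev *dev)
{
	return dev->saved_numa_node;
}
```

If a firmware page request arrives between model_retarget_begin() and
model_retarget_end(), fw_page_node_old() reports the temporary node while
fw_page_node_new() still reports the original one. That window is what the
patch closes.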

--
Best Regards,
Yanjun.Zhu
