Re: [PATCH net-next v4 6/7] page_pool: check for DMA sync shortcut earlier

From: Robin Murphy
Date: Thu May 02 2024 - 11:50:05 EST


On 24/04/2024 9:52 am, Alexander Lobakin wrote:
> From: Alexander Lobakin <aleksander.lobakin@xxxxxxxxx>
> Date: Tue, 23 Apr 2024 15:58:31 +0200
> 
>> We can save a couple more function calls in the Page Pool code if we
>> check for dma_need_sync() earlier, just when we test pp->p.dma_sync.
>> Move both of these checks into an inline wrapper and call the PP
>> wrapper around the generic DMA sync function only when both are true.
>> You can't cache the result of dma_need_sync() in &page_pool, as it may
>> change at any time if a SWIOTLB buffer is allocated or mapped.
>>
>> Signed-off-by: Alexander Lobakin <aleksander.lobakin@xxxxxxxxx>
>> ---
>>  net/core/page_pool.c | 31 +++++++++++++++++--------------
>>  1 file changed, 17 insertions(+), 14 deletions(-)

>> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
>> index 6cf26a68fa91..87319c6365e0 100644
>> --- a/net/core/page_pool.c
>> +++ b/net/core/page_pool.c
>> @@ -398,16 +398,24 @@ static struct page *__page_pool_get_cached(struct page_pool *pool)
>>  	return page;
>>  }
>>  
>> -static void page_pool_dma_sync_for_device(const struct page_pool *pool,
>> -					  const struct page *page,
>> -					  unsigned int dma_sync_size)
>> +static void __page_pool_dma_sync_for_device(const struct page_pool *pool,
>> +					    const struct page *page,
>> +					    u32 dma_sync_size)
>>  {
>>  	dma_addr_t dma_addr = page_pool_get_dma_addr(page);
>>  
>>  	dma_sync_size = min(dma_sync_size, pool->p.max_len);
>> -	dma_sync_single_range_for_device(pool->p.dev, dma_addr,
>> -					 pool->p.offset, dma_sync_size,
>> -					 pool->p.dma_dir);
>> +	__dma_sync_single_for_device(pool->p.dev, dma_addr + pool->p.offset,
>> +				     dma_sync_size, pool->p.dma_dir);

> Breh, this breaks builds on !DMA_NEED_SYNC systems, as this function
> isn't defined there, and my CI didn't catch it in time... I'll ifdef
> this in the next spin, once this version has collected some reviews.
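
For context on the breakage: __dma_sync_single_for_device() is only
declared when CONFIG_DMA_NEED_SYNC is set, while the non-underscored
dma_sync_single_*() wrappers exist in both configurations. Here's a
rough sketch of the gating added earlier in this series (illustrative
and from memory -- in particular the dma_skip_sync field and the exact
debug condition are assumptions, not quotes from the patches):

#ifdef CONFIG_DMA_NEED_SYNC
/* Only declared in this configuration -- calling it directly from
 * net/core/page_pool.c is what breaks !DMA_NEED_SYNC builds.
 */
void __dma_sync_single_for_device(struct device *dev, dma_addr_t addr,
                                  size_t size, enum dma_data_direction dir);

static inline bool dma_dev_need_sync(const struct device *dev)
{
        /* dma_skip_sync can be cleared again when e.g. a SWIOTLB
         * buffer gets mapped -- hence the commit message's note that
         * the result must not be cached in &page_pool.
         */
        return !dev->dma_skip_sync || IS_ENABLED(CONFIG_DMA_API_DEBUG);
}

static inline void dma_sync_single_for_device(struct device *dev,
                dma_addr_t addr, size_t size, enum dma_data_direction dir)
{
        if (dma_dev_need_sync(dev))
                __dma_sync_single_for_device(dev, addr, size, dir);
}
#else /* !CONFIG_DMA_NEED_SYNC */
static inline bool dma_dev_need_sync(const struct device *dev)
{
        return false;
}

static inline void dma_sync_single_for_device(struct device *dev,
                dma_addr_t addr, size_t size, enum dma_data_direction dir)
{
        /* no-op; calls compile away entirely */
}
#endif /* CONFIG_DMA_NEED_SYNC */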

Hmm, the way things have ended up, do we even need this change? (Other
than factoring out the pool->dma_sync check, which is clearly nice.)

Since __page_pool_dma_sync_for_device() is never called directly, the
flow we always get is:

page_pool_dma_sync_for_device()
    dma_dev_need_sync()
    __page_pool_dma_sync_for_device()
        ... // a handful of trivial arithmetic
        __dma_sync_single_for_device()

i.e. still effectively the same "if (dma_dev_need_sync()) __dma_sync()"
sequence we would get from just using the standard dma_sync() helper, so
referencing the unwrapped internals only spreads the logic out a bit
more for no real benefit. As an experiment I tried the diff below on
top, effectively undoing this problematic part, and it doesn't seem to
make any appreciable difference in a before-and-after comparison of the
object code, at least for my arm64 build.
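
That equivalence is visible in the helper itself: with the rest of this
series applied, the standard range-sync wrapper is nothing more than the
trivial arithmetic plus the same gated call -- again sketched from the
dma-mapping headers rather than quoted verbatim:

static inline void dma_sync_single_range_for_device(struct device *dev,
                dma_addr_t addr, unsigned long offset, size_t size,
                enum dma_data_direction dir)
{
        /* the "trivial arithmetic" above, inlined by the compiler */
        dma_sync_single_for_device(dev, addr + offset, size, dir);
}

Since dma_sync_single_for_device() itself expands to
"if (dma_dev_need_sync(dev)) __dma_sync_single_for_device(...)", the
branch lands in the same place whichever spelling page_pool uses.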

Thanks,
Robin.

----->8-----
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index 27f3b6db800e..b8ab7d4ca229 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -398,24 +398,20 @@ static struct page *__page_pool_get_cached(struct page_pool *pool)
 	return page;
 }
 
-static void __page_pool_dma_sync_for_device(const struct page_pool *pool,
-					    const struct page *page,
-					    u32 dma_sync_size)
-{
-	dma_addr_t dma_addr = page_pool_get_dma_addr(page);
-
-	dma_sync_size = min(dma_sync_size, pool->p.max_len);
-	__dma_sync_single_for_device(pool->p.dev, dma_addr + pool->p.offset,
-				     dma_sync_size, pool->p.dma_dir);
-}
-
 static __always_inline void
 page_pool_dma_sync_for_device(const struct page_pool *pool,
			       const struct page *page,
			       u32 dma_sync_size)
 {
-	if (pool->dma_sync && dma_dev_need_sync(pool->p.dev))
-		__page_pool_dma_sync_for_device(pool, page, dma_sync_size);
+	dma_addr_t dma_addr = page_pool_get_dma_addr(page);
+
+	if (!pool->dma_sync)
+		return;
+
+	dma_sync_size = min(dma_sync_size, pool->p.max_len);
+	dma_sync_single_range_for_device(pool->p.dev, dma_addr,
+					 pool->p.offset, dma_sync_size,
+					 pool->p.dma_dir);
 }
 
 static bool page_pool_dma_map(struct page_pool *pool, struct page *page)