Re: [PATCH 00/16] media: platform: rga: Add RGA3 support

From: Sven Püschel
Date: Fri Oct 10 2025 - 04:34:04 EST


Hi Nicolas,

On 10/7/25 8:06 PM, Nicolas Dufresne wrote:
Hi,

Le mardi 07 octobre 2025 à 10:31 +0200, Sven Püschel a écrit :
This series adds support for the Raster Graphic Acceleration 3 (RGA3)
peripheral, which is included in the RK3588 SoC. Unlike the RGA2 it
can use the existing rockchip-iommu-v2 driver to handle iommu mappings.
Also the RK3588 contains two independent RGA3 cores.
Thanks for working on this.

Only scaling and format conversions between common 8bit RGB/YUV formats
are implemented. Also the color space conversion is fixed to BT601F.
This already allows a practical usage of the RGA3.
This seems quite limiting, can we expect an update on this, can't be that hard
to fully implement.

Sorry, but currently there is no need to have a fully featured implementation from our side. As the datasheet mentions that the RGA3 should do 4 or 2 pixel/cycle depending on the operation

This was tested on a Radxa Rock 5T. With the increased clock speeds in
the devicetree around 160 fps were measured when scaling and converting
This is quite vague, I've checked the patch and you didn't extend either there.
Is that an overclock or was it miss-configured ? Does RK implement a devfreq ?
Should that be moved with a voltage adjustement ? Is there any thermal nearby we
should monitor ?

This is mainly the result of a very low performance in the initial testing. We were quite disappointed looking at 30 fps output. The datasheet mentions the core should do 2 or 4 pixel/cycle, so we looked if the clock speed could be increased. The TRM Part1 mentions that the RGA3 clock uses a default divider of 2, so I've tweaked the dtsi to avoid the clock divider and run it on the fastest clock.

But this tweaking only increased the frame rate to around 36fps. After some brainstorming and testing we found the culprit being the RGA3_WR_SW_OUTSTANDING_MAX value in the command. With this value maxed out and without the clock tweaks I've got around 80fps. As the clock increase resulted in the expected doubling of the fps and my few tests worked, I've included it in the patch set.

I haven't done any stress testing and don't mind to remove the clock speed adjustments from the dtsi.

from RGBA 480x360 to NV12 3840x2160. Without the clock speed scaling a
default clock division factor of 2 is used and only around 80 fps are
reached with one core. The v4l2-compliance tests only complain about
the already failing colorspace propagation:
Did you do any more format testing to validation all supported combinations ?
This is a tool [0] you can use to test this using GStreamer and how to use it
[1].

Thanks for the link!

I've did some simple format conversion tests with a static test pattern.

The tests mainly converts any combination of RGB/YUV formats (hope I didn't miss anything) to each other. Then I convert it back to rgba with gstreamer and compare it's hash.

For scaling I've just tested one upscale, downscale and scale to a non aligned width/height.

[0]https://gitlab.collabora.com/mediatek/aiot/lava-test-definitions/-/tree/main/avvideocompare?ref_type=heads
[1]https://gitlab.collabora.com/mediatek/aiot/linux/-/blob/mediatek-next/.gitlab-ci.yml?ref_type=heads#L282
  v4l2-compliance 1.28.1, 64 bits, 64-bit time_t
  ...
   fail: v4l2-test-formats.cpp(923): fmt_cap.g_colorspace() !=
col
   test VIDIOC_S_FMT: FAIL
  ...
  Total for rockchip-rga device /dev/video0: 47, Succeeded: 46, Failed: 1,
Warnings: 0

  v4l2-compliance 1.28.1, 64 bits, 64-bit time_t
  ...
   fail: v4l2-test-formats.cpp(923): fmt_cap.g_colorspace() !=
col
   test VIDIOC_S_FMT: FAIL
  ...
  Total for rockchip-rga device /dev/video1: 47, Succeeded: 46, Failed: 1,
Warnings: 0

  v4l2-compliance 1.28.1, 64 bits, 64-bit time_t
  ...
   fail: v4l2-test-formats.cpp(923): fmt_cap.g_colorspace() !=
col
   test VIDIOC_S_FMT: FAIL
  ...
  Total for rockchip-rga device /dev/video2: 47, Succeeded: 46, Failed: 1,
Warnings: 0

Each RGA core is a separate /dev/video device. To distinguish the RGA2
core from the RGA3 cores the Card type is set accordingly. Combining all
cores into a single device and scheduling tasks to the best core might
be a future improvement, if it is desired by upstream to handle the
scheduling and selection in kernel space.
It took me some time to understand why you spoke about multicore here. You
forgot to say here that you add RGA3 into RGA2 driver. Some information on why
you went that path instead of a separate driver.

Mostly as I've started by using the rga driver as a basis and just adjusted the command stream and register values to the RGA3. I was unsure, if I should create a separate driver.
As it didn't seem unfeasible to have the existing driver handle both units, I've decided to add it to the existing driver to avoid code duplication.

But looking at your comments about the wrong announcement of e.g. color space conversion, I now think that a new driver is probably better to avoid adding too much of the differences to the struct.

From high level view, I don't think its a good idea to multi-plex over
heterogeneous core. They may not even produce the exact same pixels for the same
operation. They also don't share the same MMU, and at first glance, the use of
rkiommu in RGA3 means it can no longer handle CPU cache (though I don't know if
this is implemented/supported in upstream RGA2 driver).

Thanks for the insight. This gives me another reason to create a separate driver. I'll probably also look into multiplexing the 2 RGA3 cores to only expose one RGA3 video device to userspace (the current implementation exposes both cores individually to the userspace)

Sincerely

    Sven

Patch 1-2 are general cleanups
Patch 3-12 prepare the rga driver for the RGA3
Patch 13 documments the RGA3 compatible value
Patch 14 adds the RGA3 cores to the rk3588 dtsi
Patch 15 increases the RGA3 core clock speeds
Patch 16 adds RGA3 support to the rga driver

Signed-off-by: Sven Püschel<s.pueschel@xxxxxxxxxxxxxx>
---
Sven Püschel (16):
      media: rockchip: rga: use clk_bulk api
      media: rockchip: rga: use stride for offset calculation
      media: rockchip: rga: align stride to 16 bytes
      media: rockchip: rga: move hw specific parts to a dedicated struct
      media: rockchip: rga: use card type to specify rga type
      media: rockchip: rga: change offset to dma_addresses
      media: rockchip: rga: support external iommus
      media: rockchip: rga: remove size from rga_frame
      media: rockchip: rga: remove stride from rga_frame
      media: rockchip: rga: move rga_fmt to rga-hw.h
      media: rockchip: rga: add iommu restore function
      media: rockchip: rga: handle error interrupt
      media: dt-bindings: media: rockchip-rga: add rockchip,rk3588-rga3
      arm64: dts: rockchip: add rga3 dt nodes
      arm64: dts: rockchip: increase rga3 clock speed
      media: rockchip: rga: add rga3 support

 .../devicetree/bindings/media/rockchip-rga.yaml    |   1 +
 arch/arm64/boot/dts/rockchip/rk3588-base.dtsi      |  50 +++
 drivers/media/platform/rockchip/rga/Makefile       |   2 +-
 drivers/media/platform/rockchip/rga/rga-buf.c      |  78 ++--
 drivers/media/platform/rockchip/rga/rga-hw.c       | 356 ++++++++++++---
 drivers/media/platform/rockchip/rga/rga-hw.h       |  15 +-
 drivers/media/platform/rockchip/rga/rga.c          | 404 ++++++-----------
 drivers/media/platform/rockchip/rga/rga.h          |  74 ++--
 drivers/media/platform/rockchip/rga/rga3-hw.c      | 490
+++++++++++++++++++++
 drivers/media/platform/rockchip/rga/rga3-hw.h      | 186 ++++++++
 10 files changed, 1246 insertions(+), 410 deletions(-)
---
base-commit: afb100a5ea7a13d7e6937dcd3b36b19dc6cc9328
change-id: 20251001-spu-rga3-8a00e018b120

Best regards,