Re: [PATCH v13 1/2] drm/tegra: dc: Support memory bandwidth management

From: Michał Mirosław
Date: Wed Mar 03 2021 - 20:10:52 EST


On Tue, Mar 02, 2021 at 03:44:44PM +0300, Dmitry Osipenko wrote:
> Display controller (DC) performs isochronous memory transfers, and thus,
> has a requirement for a minimum memory bandwidth that shall be fulfilled,
> otherwise framebuffer data can't be fetched fast enough and this results
> in a DC's data-FIFO underflow that follows by a visual corruption.
>
> The Memory Controller drivers provide facility for memory bandwidth
> management via interconnect API. Let's wire up the interconnect API
> support to the DC driver in order to fix the distorted display output
> on T30 Ouya, T124 TK1 and other Tegra devices.

I did a read on the code. I have put some thoughts and nits inbetween the
diff, but here are more general questions about the patch:

Is there a reason why the bandwidth is allocated per plane instead of just
using one peak and average for the whole configuration? Or is there a need
to expose all the planes as interconnect channels and allocate their
bandwidth individually?

>From algorithmic part I see that the plane overlaps are calculated twice
(A vs B and later B vs A). The cursor plane is ignored, but nevertheless
its overlap mask is calculated before being thrown away. The bandwidths
are also calculated twice: once for pre-commit and then again for
post-commit. Is setting bandwidth for an interconnect expensive when
re-setting a value that was already set? The code seems to avoid this
case, but maybe unnecessarily?

[...cut the big and interesting part...]

[...]
> @@ -65,7 +75,9 @@ struct tegra_dc_soc_info {
> unsigned int num_overlay_formats;
> const u64 *modifiers;
> bool has_win_a_without_filters;
> + bool has_win_b_vfilter_mem_client;
> bool has_win_c_without_vert_filter;
> + unsigned int plane_tiled_memory_bandwidth_x2;

This looks more like bool in the code using it.

[...]
> --- a/drivers/gpu/drm/tegra/plane.c
> +++ b/drivers/gpu/drm/tegra/plane.c
[...]
> +static int tegra_plane_check_memory_bandwidth(struct drm_plane_state *state)

The function's body looks more like 'update' or 'recalculate' rather
than 'check' the memory bandwidth.

> + struct tegra_plane_state *tegra_state = to_tegra_plane_state(state);
> + unsigned int i, bpp, bpp_plane, dst_w, src_w, src_h, mul;
> + const struct tegra_dc_soc_info *soc;
> + const struct drm_format_info *fmt;
> + struct drm_crtc_state *crtc_state;
> + u32 avg_bandwidth, peak_bandwidth;
> +
> + if (!state->visible)
> + return 0;
> +
> + crtc_state = drm_atomic_get_new_crtc_state(state->state, state->crtc);
> + if (!crtc_state)
> + return -EINVAL;
> +
> + src_w = drm_rect_width(&state->src) >> 16;
> + src_h = drm_rect_height(&state->src) >> 16;
> + dst_w = drm_rect_width(&state->dst);
> +
> + fmt = state->fb->format;
> + soc = to_tegra_dc(state->crtc)->soc;
> +
> + /*
> + * Note that real memory bandwidth vary depending on format and
> + * memory layout, we are not taking that into account because small
> + * estimation error isn't important since bandwidth is rounded up
> + * anyway.
> + */
> + for (i = 0, bpp = 0; i < fmt->num_planes; i++) {
> + bpp_plane = fmt->cpp[i] * 8;

Nit: you could declare the bpp_plane here.

> + /*
> + * Sub-sampling is relevant for chroma planes only and vertical
> + * readouts are not cached, hence only horizontal sub-sampling
> + * matters.
> + */
> + if (i > 0)
> + bpp_plane /= fmt->hsub;
> +
> + bpp += bpp_plane;
> + }
> +
> + /*
> + * Horizontal downscale takes extra bandwidth which roughly depends
> + * on the scaled width.
> + */
> + if (src_w > dst_w)
> + mul = (src_w - dst_w) * bpp / 2048 + 1;
> + else
> + mul = 1;

Does it really need more bandwidth to scale down? Does it read the same
data multiple times just to throw it away?

> + /* average bandwidth in bytes/s */
> + avg_bandwidth = src_w * src_h * bpp / 8 * mul;
> + avg_bandwidth *= drm_mode_vrefresh(&crtc_state->mode);
> +
> + /* mode.clock in kHz, peak bandwidth in kbit/s */
> + peak_bandwidth = crtc_state->mode.clock * bpp * mul;

I would guess that (src_w * bpp / 8) needs rounding up unless the plane
is continuous. Or you could just add the max rounding error here and
get a safe overestimated value. Maybe this would be better done when
summing per-plane widths.

> + /* ICC bandwidth in kbyte/s */
> + peak_bandwidth = kbps_to_icc(peak_bandwidth);
> + avg_bandwidth = Bps_to_icc(avg_bandwidth);

This could be merged with the assignments after the following 'if' block.

> + /*
> + * Tegra30/114 Memory Controller can't interleave DC memory requests
> + * and DC uses 16-bytes atom for the tiled windows, while DDR3 uses 32
> + * bytes atom. Hence there is x2 memory overfetch for tiled framebuffer
> + * and DDR3 on older SoCs.
> + */
> + if (soc->plane_tiled_memory_bandwidth_x2 &&
> + tegra_state->tiling.mode == TEGRA_BO_TILING_MODE_TILED) {
> + peak_bandwidth *= 2;
> + avg_bandwidth *= 2;
> + }
> +
> + tegra_state->peak_memory_bandwidth = peak_bandwidth;
> + tegra_state->avg_memory_bandwidth = avg_bandwidth;
> +
> + return 0;
> +}

[...]
> +static const char * const tegra_plane_icc_names[] = {
> + "wina", "winb", "winc", "", "", "", "cursor",
> +};
> +
> +int tegra_plane_interconnect_init(struct tegra_plane *plane)
> +{
> + const char *icc_name = tegra_plane_icc_names[plane->index];

Is plane->index guaranteed to be <= 6? I would guess so, but maybe
BUILD_BUG_ON(sizeof(icc_names)==TEGRA_DC_LEGACY_PLANES_NUM) or some
other check could document this?

[...]

Best Regards
Michał Mirosław