Re: [PATCH v15 1/2] drm/tegra: dc: Support memory bandwidth management

From: Michał Mirosław
Date: Sun Mar 14 2021 - 18:33:34 EST


On Thu, Mar 11, 2021 at 08:22:54PM +0300, Dmitry Osipenko wrote:
> Display controller (DC) performs isochronous memory transfers, and thus,
> has a requirement for a minimum memory bandwidth that shall be fulfilled,
> otherwise framebuffer data can't be fetched fast enough and this results
> in a DC's data-FIFO underflow that follows by a visual corruption.
[...]
> +static unsigned long
> +tegra_plane_overlap_mask(struct drm_crtc_state *state,
> + const struct drm_plane_state *plane_state)
> +{
> + const struct drm_plane_state *other_state;
> + const struct tegra_plane *tegra;
> + unsigned long overlap_mask = 0;
> + struct drm_plane *plane;
> + struct drm_rect rect;
> +
> + if (!plane_state->visible || !plane_state->fb)
> + return 0;
> +
> + drm_atomic_crtc_state_for_each_plane_state(plane, other_state, state) {
[...]
> + /*
> + * Data-prefetch FIFO will easily help to overcome temporal memory
> + * pressure if other plane overlaps with the cursor plane.
> + */
> + if (tegra_plane_is_cursor(plane_state) && overlap_mask)
> + return 0;
> +
> + return overlap_mask;
> +}

Since for cursor plane this always returns 0, you could test
tegra_plane_is_cursor() at the start of the function.

> +static int tegra_crtc_calculate_memory_bandwidth(struct drm_crtc *crtc,
> + struct drm_atomic_state *state)
[...]
> + /*
> + * For overlapping planes pixel's data is fetched for each plane at
> + * the same time, hence bandwidths are accumulated in this case.
> + * This needs to be taken into account for calculating total bandwidth
> + * consumed by all planes.
> + *
> + * Here we get the overlapping state of each plane, which is a
> + * bitmask of plane indices telling with what planes there is an
> + * overlap. Note that bitmask[plane] includes BIT(plane) in order
> + * to make further code nicer and simpler.
> + */
> + drm_atomic_crtc_state_for_each_plane_state(plane, plane_state, new_state) {
> + tegra_state = to_const_tegra_plane_state(plane_state);
> + tegra = to_tegra_plane(plane);
> +
> + if (WARN_ON_ONCE(tegra->index >= TEGRA_DC_LEGACY_PLANES_NUM))
> + return -EINVAL;
> +
> + plane_peak_bw[tegra->index] = tegra_state->peak_memory_bandwidth;
> + mask = tegra_plane_overlap_mask(new_state, plane_state);
> + overlap_mask[tegra->index] = mask;
> +
> + if (hweight_long(mask) != 3)
> + all_planes_overlap_simultaneously = false;
> + }
> +
> + old_state = drm_atomic_get_old_crtc_state(state, crtc);
> + old_dc_state = to_const_dc_state(old_state);
> +
> + /*
> + * Then we calculate maximum bandwidth of each plane state.
> + * The bandwidth includes the plane BW + BW of the "simultaneously"
> + * overlapping planes, where "simultaneously" means areas where DC
> + * fetches from the planes simultaneously during of scan-out process.
> + *
> + * For example, if plane A overlaps with planes B and C, but B and C
> + * don't overlap, then the peak bandwidth will be either in area where
> + * A-and-B or A-and-C planes overlap.
> + *
> + * The plane_peak_bw[] contains peak memory bandwidth values of
> + * each plane, this information is needed by interconnect provider
> + * in order to set up latency allowness based on the peak BW, see
> + * tegra_crtc_update_memory_bandwidth().
> + */
> + for (i = 0; i < ARRAY_SIZE(plane_peak_bw); i++) {
> + overlap_bw = 0;
> +
> + for_each_set_bit(k, &overlap_mask[i], 3) {
> + if (k == i)
> + continue;
> +
> + if (all_planes_overlap_simultaneously)
> + overlap_bw += plane_peak_bw[k];
> + else
> + overlap_bw = max(overlap_bw, plane_peak_bw[k]);
> + }
> +
> + new_dc_state->plane_peak_bw[i] = plane_peak_bw[i] + overlap_bw;
> +
> + /*
> + * If plane's peak bandwidth changed (for example plane isn't
> + * overlapped anymore) and plane isn't in the atomic state,
> + * then add plane to the state in order to have the bandwidth
> + * updated.
> + */
> + if (old_dc_state->plane_peak_bw[i] !=
> + new_dc_state->plane_peak_bw[i]) {
> + plane = tegra_crtc_get_plane_by_index(crtc, i);
> + if (!plane)
> + continue;
> +
> + plane_state = drm_atomic_get_plane_state(state, plane);
> + if (IS_ERR(plane_state))
> + return PTR_ERR(plane_state);
> + }
> + }
[...]

Does it matter to which channel (plane) the peak bw is attached? Would
it still work if the first channel specified max(peak_bw of overlaps)
and others only zeroes?

> + /*
> + * Horizontal downscale needs a lower memory latency, which roughly
> + * depends on the scaled width. Trying to tune latency of a memory
> + * client alone will likely result in a strong negative impact on
> + * other memory clients, hence we will request a higher bandwidth
> + * since latency depends on bandwidth. This allows to prevent memory
> + * FIFO underflows for a large plane downscales, meanwhile allowing
> + * display to share bandwidth fairly with other memory clients.
> + */
> + if (src_w > dst_w)
> + mul = (src_w - dst_w) * bpp / 2048 + 1;
> + else
> + mul = 1;
[...]

One point is unexplained yet: why is the multiplier proportional to a
*difference* between src and dst widths? Also, I would expect max (worst
case) is pixclock * read_size when src_w/dst_w >= read_size.

BTW, you could move this below and : if (src > dst_w) peak_bandwidth *= ...

> + /* average bandwidth in bytes/s */
> + avg_bandwidth = (bpp * src_w * src_h * mul + 7) / 8;
> + avg_bandwidth *= drm_mode_vrefresh(&crtc_state->mode);
> +
> + /* mode.clock in kHz, peak bandwidth in kbit/s */
> + peak_bandwidth = crtc_state->mode.clock * bpp * mul;
[...]

Best Regards
Michał Mirosław