Re: [git pull] drm for 5.8-rc1

From: James Jones
Date: Fri Jul 03 2020 - 02:01:53 EST


On 7/2/20 2:14 PM, James Jones wrote:
On 7/2/20 1:22 AM, Daniel Stone wrote:
Hi,

On Wed, 1 Jul 2020 at 20:45, James Jones <jajones@xxxxxxxxxx> wrote:
OK, I think I see what's going on. In the Xorg modesetting driver, the
logic is basically:

if (gbm_has_modifiers && DRM_CAP_ADDFB2_MODIFIERS != 0) {
ÂÂÂ drmModeAddFB2WithModifiers(..., gbm_bo_get_modifier(bo->gbm));
} else {
ÂÂÂ drmModeAddFB(...);
}

I read this thread expecting to explain the correct behaviour we
implement in Weston and how modesetting needs to be fixed, but ...
that seems OK to me? As long as `gbm_has_modifiers` is a proxy for 'we
used gbm_(bo|surface)_create_with_modifiers to allocate the buffer'.

Yes, the hazards of reporting findings before verifying. I now see modesetting does query the DRM-KMS modifiers and attempt to allocate with them if it found any. However, I still see a lot of ways things can go wrong, but I'm not going to share my speculation again until I've actually verified it, which is taking a frustratingly long time. The modesetting driver is not my friend right now.

OK, several hours of dumb build+config mistakes later, I was actually able to reproduce the failure and walk through things. There is a trivial fix for the issues in the X modesetting driver, working off Daniel Stone's claim that gbm_bo_get_modifier() should only be called when the allocation was made with gbm_bo_create_with_modifiers(). modeset doesn't respect that requirement now in the case that the atomic modesetting path is disabled, which is always the case currently because that path is broken. Respecting that requirement is a half-liner and allows X to start properly.

If I force modeset to use the atomic path, X still fails to start with the above fix, validating the second theory I'd had:

-Current Mesa nouveau code basically ignores the modifier list passed in unless it is a single modifier requesting linear layout, and goes about allocating whatever layout it sees fit, and succeeds the allocation despite being passed a list of modifiers it knows nothing about. Not great, fixed in my pending patches, obviously doesn't help existing deployed userspace.

-Current Mesa nouveau code, when asked what modifier it used for the above allocation, returns one of the "legacy" modifiers nouveau DRM-KMS knows nothing about.

-When the modeset driver tries to create an FB for that BO with the returned modifier, the nouveau kernel driver of course refuses.

I think it's probably worth fixing the modesetting driver for the reasons Daniel Vetter mentioned. Then if I get my Mesa patches in before a new modesetting driver with working Atomic support is released, there'll be no need for long-term workarounds in the kernel.

Down to the real question of what to do in the kernel to support current userspace code: I still think the best fix is to accept the old modifiers but not advertise them. However, Daniel Stone and others, if you think this will actually break userspace in other ways (Could you describe in a bit more detail or point me to test cases if so?), I suppose the only option would be to advertise & accept the old modifiers for now, and I suppose at a config option at some point to phase the old ones out, eventually drop them entirely. This would be unfortunate, because as I mentioned, it could sometimes result in situations where apps think they can share a buffer between two devices but will get garbled data in practice.

I've included an initial version of the kernel patch inline below. Needs more testing, but I wanted to share it in case anyone has feedback on the idea, wants to see the general workflow, or wants to help test.

There's no attempt to verify the DRM-KMS device supports the modifier,
but then, why would there be? GBM presumably chose a supported modifier
at buffer creation time, and we don't know which plane the FB is going
to be used with yet. GBM doesn't actually ask the kernel which
modifiers it supports here either though.

Right, it doesn't ask, because userspace tells it which modifiers to
use. The correct behaviour is to take the list from the KMS
`IN_FORMATS` property and then pass that to
`gbm_(bo|surface)_create_with_modifiers`; GBM must then select from
that list and only that list. If that call does not succeed and Xorg
falls back to `gbm_surface_create`, then it must not call
`gbm_bo_get_modifier` - so that would be a modesetting bug. If that
call does succeed and `gbm_bo_get_modifier` subsequently reports a
modifier which was not in the list, that's a Mesa driver bug.

It just goes into Mesa via
DRI and reports the modifier (unpatched) Mesa chose on its own. Mesa
just hard-codes the modifiers in its driver backends since its thinking
in terms of a device's 3D engine, not display. In theory, Mesa's DRI
drivers could query KMS for supported modifiers if allocating from GBM
using the non-modifiers path and the SCANOUT flag is set (perhaps some
drivers do this or its equivalent? Haven't checked.), but that seems
pretty gnarly and doesn't fix the modifier-based GBM allocation path
AFAIK. Bit of a mess.

Two options for GBM users:
* call gbm_*_create_with_modifiers, it succeeds, call
gbm_bo_get_modifier, pass modifier into AddFB
* call gbm_*_create (without modifiers), it succeeds, do not call
gbm_bo_get_modifier, do not pass a modifier into AddFB

Anything else is a bug in the user. Note that falling back from 1 to 2
is fine: if `gbm_*_create_with_modifiers()` fails, you can fall back
to the non-modifier path, provided you don't later try to get a
modifier back out.

For a quick userspace fix that could probably be pushed out everywhere
(Only affects Xorg server 1.20+ AFAIK), just retrying
drmModeAddFB2WithModifiers() without the DRM_MODE_FB_MODIFIERS flag on
failure should be sufficient.

This would break other drivers.

I think this could be done in a way that wouldn't, though it wouldn't be quite as simple. Let's see what the true root cause is first though.

Still need to verify as I'm having
trouble wrangling my Xorg build at the moment and I'm pressed for time.
A more complete fix would be quite involved, as modesetting isn't really
properly plumbed to validate GBM's modifiers against KMS planes, and it
doesn't seem like GBM/Mesa/DRI should be responsible for this as noted
above given the general modifier workflow/design.

Most importantly, options I've considered for fixing from the kernel side:

-Accept "legacy" modifiers in nouveau in addition to the new modifiers,
though avoid reporting them to userspace as supported to avoid further
proliferation. This is pretty straightforward. I'll need to modify
both the AddFB2 handler (nouveau_validate_decode_mod) and the mode set
plane validation logic (nv50_plane_format_mod_supported), but it should
end up just being a few lines of code.

I do think that they should also be reported to userspace if they are
accepted. Other users can and do look at the modifier list to see if
the buffer is acceptable for a given plane, so the consistency is good
here. Of course, in Mesa you would want to prioritise the new
modifiers over the legacy ones, and not allocate or return the legacy
ones unless that was all you were asked for. This would involve
tracking the used modifier explicitly through Mesa, rather than
throwing it away at alloc time and then later divining it from the
tiling mode.

Reporting them as supported is equivalent to reporting support for a memory layout the chips don't actually support (It corresponds to a valid layout on Tegra chips, but not on discrete NV chips). This is what the new modifiers are trying to avoid in the first place: Implying buffers can be shared between these Tegra chips and discrete NV GPUs.

Thanks,
-James

Cheers,
Daniel


nouveau: Accept 'legacy' format modifiers

Accept the DRM_FORMAT_MOD_NVIDIA_16BX2_BLOCK()
family of modifiers to handle broken userspace
Xorg modesetting and Mesa drivers.
---
drivers/gpu/drm/nouveau/nouveau_display.c | 26 +++++++++++++++++++++--
1 file changed, 24 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/nouveau/nouveau_display.c b/drivers/gpu/drm/nouveau/nouveau_display.c
index 496c4621cc78..31543086254b 100644
--- a/drivers/gpu/drm/nouveau/nouveau_display.c
+++ b/drivers/gpu/drm/nouveau/nouveau_display.c
@@ -191,8 +191,14 @@ nouveau_decode_mod(struct nouveau_drm *drm,
uint32_t *tile_mode,
uint8_t *kind)
{
+ struct nouveau_display *disp = nouveau_display(drm->dev);
BUG_ON(!tile_mode || !kind);

+ if ((modifier & (0xffull << 12)) == 0ull) {
+ /* Legacy modifier. Translate to this device's 'kind.' */
+ modifier |= disp->format_modifiers[0] & (0xffull << 12);
+ }
+
if (modifier == DRM_FORMAT_MOD_LINEAR) {
/* tile_mode will not be used in this case */
*tile_mode = 0;
@@ -227,6 +233,16 @@ nouveau_framebuffer_get_layout(struct drm_framebuffer *fb,
}
}

+static const u64 legacy_modifiers[] = {
+ DRM_FORMAT_MOD_NVIDIA_16BX2_BLOCK(0),
+ DRM_FORMAT_MOD_NVIDIA_16BX2_BLOCK(1),
+ DRM_FORMAT_MOD_NVIDIA_16BX2_BLOCK(2),
+ DRM_FORMAT_MOD_NVIDIA_16BX2_BLOCK(3),
+ DRM_FORMAT_MOD_NVIDIA_16BX2_BLOCK(4),
+ DRM_FORMAT_MOD_NVIDIA_16BX2_BLOCK(5),
+ DRM_FORMAT_MOD_INVALID
+};
+
static int
nouveau_validate_decode_mod(struct nouveau_drm *drm,
uint64_t modifier,
@@ -247,8 +263,14 @@ nouveau_validate_decode_mod(struct nouveau_drm *drm,
(disp->format_modifiers[mod] != modifier);
mod++);

- if (disp->format_modifiers[mod] == DRM_FORMAT_MOD_INVALID)
- return -EINVAL;
+ if (disp->format_modifiers[mod] == DRM_FORMAT_MOD_INVALID) {
+ for (mod = 0;
+ (legacy_modifiers[mod] != DRM_FORMAT_MOD_INVALID) &&
+ (legacy_modifiers[mod] != modifier);
+ mod++);
+ if (legacy_modifiers[mod] == DRM_FORMAT_MOD_INVALID)
+ return -EINVAL;
+ }

nouveau_decode_mod(drm, modifier, tile_mode, kind);

--
2.17.1