Re: [PATCH v11] drm: Add initial ci/ subdirectory

From: Daniel Stone
Date: Thu Sep 07 2023 - 11:33:19 EST


Hi,

On 04/09/2023 09:54, Daniel Vetter wrote:
> On Wed, 30 Aug 2023 at 17:14, Helen Koike <helen.koike@xxxxxxxxxxxxx>
> wrote:
>>
>> On 30/08/2023 11:57, Maxime Ripard wrote:
>>>
>>> I agree that we need a baseline, but that baseline should be
>>> defined by the tests own merits, not their outcome on a
>>> particular platform.
>>>
>>> In other words, I want all drivers to follow that baseline, and
>>> if they don't it's a bug we should fix, and we should be vocal
>>> about it. We shouldn't ignore the test because it's broken.
>>>
>>> Going back to the example I used previously,
>>> kms_hdmi_inject@inject-4k shouldn't fail on mt8173, ever. That's
>>> a bug. Ignoring it and reporting that "all tests are good" isn't
>>> ok. There's something wrong with that driver and we should fix
>>> it.
>>>
>>> Or at the very least, explain in much details what is the
>>> breakage, how we noticed it, why we can't fix it, and how to
>>> reproduce it.
>>>
>>> Because in its current state, there's no chance we'll ever go
>>> over that test list and remove some of them. Or even know if, if
>>> we ever fix a bug somewhere, we should remove a flaky or failing
>>> test.
>>>
>>> [...]
>>>
>>>> we need to have a clear view about which tests are not
>>>> corresponding to it, so we can start fixing. First we need to
>>>> be aware of the issues so we can start fixing them, otherwise
>>>> we will stay in the "no tests no failures" ground :)
>>>
>>> I think we have somewhat contradicting goals. You want to make
>>> regression testing, so whatever test used to work in the past
>>> should keep working. That's fine, but it's different from
>>> "expectations about what the DRM drivers are supposed to pass in
>>> the IGT test suite" which is about validation, ie "all KMS
>>> drivers must behave this way".
>>
>> [...]
>>
>> We could have some policy: if you want to enable a certain device
>> in the CI, you need to make sure it passes all tests first to force
>> people to go fix the issues, but maybe it would be a big barrier.
>>
>> I'm afraid that, if a test fail (and it is a clear bug), people
>> would just say "work for most of the cases, this is not a priority
>> to fix" and just start ignoring the CI, this is why I think
>> regression tests is a good way to start with.
>
> I think eventually we need to get to both goals, but currently
> driver and test quality just isn't remotely there.
>
> I think a good approach would be if CI work focuses on the pure sw
> tests first, so kunit and running igt against vgem/vkms. And then we
> could use that to polish a set of must-pass igt testcases, which
> also drivers in general are supposed to pass. Plus ideally weed out
> the bad igts that aren't reliable enough or have bad assumptions.
>
> For hardware I think it will take a very long time until we get to a
> point where CI can work without a test result list, we're nowhere
> close to that. But for virtual driver this really should be
> achievable, albeit with a huge amount of effort required to get
> there I think.

Yeah, this is what our experience with Mesa (in particular) has taught us.

Having 100% of the tests pass 100% of the time on 100% of the platforms is a great goal that everyone should aim for. But it will also never happen.

Firstly, we're just not there yet today. Every single GPU-side DRM driver has userspace-triggerable faults which cause occasional errors in GL/Vulkan tests. Every single one. We deal with these in Mesa by retrying; if we didn't retry, across the breadth of hardware we test, I'd expect 99% of should-succeed merges to fail because of these intermittent bugs in the DRM drivers. We don't have the same figure for KMS - because we don't test it - but I'd be willing to bet no driver is 100% if you run tests often enough.

Secondly, we will never be there. If we could pause for five years and sit down making all the current usecases for all the current hardware on the current kernel run perfectly, we'd probably get there. But we can't: there's new hardware, new userspace, and hundreds of new kernel trees. Even without the first two, what happens when the Arm SMMU maintainers (choosing a random target to pick on, sorry Robin) introduce subtle breakage which makes a lot of tests fail some of the time? Do we refuse to backmerge Linus into DRM until it's fixed, or do we disable all testing on Arm until it's fixed? When we've done that, what happens when we re-enable testing, and discover that a bunch of tests get broken because we haven't been testing?

Thirdly, hardware is capricious. 'This board doesn't make it to u-boot' is a clear infrastructure error, but if you test at sufficient scale, cold solder or failing caps surface way more often than you might think. And you can't really pick those out by any other means than running at scale, dealing with non-binary results, and looking at the trends over time. (Again this is something we do in Mesa - we graph test failures per DUT, look for outliers, and pull DUTs out of the rotation when they're clearly defective. But that only works if you actually run enough tests on them in the first place to discover trends - if you stop at the first failed test, it's impossible to tell the difference between 'infuriatingly infrequent kernel/test bug?' and 'cracked main board maybe?'.)
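The per-DUT trend analysis can be sketched roughly as follows. The function name, threshold, and input format are all hypothetical, assumed for illustration; Mesa's real tooling graphs failures over time rather than taking a single snapshot:

```python
# Hypothetical sketch of per-DUT outlier detection. Mesa's real tooling
# graphs trends over time; this just flags boards whose failure rate is
# far above the fleet median, which usually means bad hardware rather
# than a kernel or test bug.
from statistics import median

def flag_suspect_duts(failure_rates, threshold=3.0):
    """failure_rates: {dut_name: fraction of runs that failed}.

    Returns the DUTs failing at more than `threshold` times the fleet
    median, as candidates to pull from the rotation.
    """
    med = median(failure_rates.values())
    floor = max(med, 0.01)  # avoid flagging everything when median ~ 0
    return sorted(d for d, r in failure_rates.items()
                  if r > threshold * floor)
```

Note this only works with volume: a single failed run tells you nothing about whether the board or the kernel is at fault.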

What we do know is that we _can_ classify tests into four sets of expectations. Always-passing tests should always pass. Always-failing tests should always fail (and the expectations get updated when you make them pass). Flaky tests pass often enough that they'll succeed if you run them a couple of times, but fail often enough that you can't rely on any single run. And tests which exhibit catastrophic failure, i.e. a local DoS which takes down the whole test suite, just get skipped.
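A minimal sketch of how such expectation lists might gate a CI run, in Python. The list contents, set names, and verdict strings are all illustrative assumptions, not the actual drm-ci expectation files or their format:

```python
# Hypothetical sketch of expectation-based gating; list contents are
# illustrative, not the real drm-ci expectation files.

EXPECTED_FAILS = {"kms_hdmi_inject@inject-4k"}    # known-broken on this platform
KNOWN_FLAKES = {"kms_cursor_legacy@short-flip"}   # unreliable: result ignored
SKIPS = {"kms_selftest@dos"}                      # catastrophic: never run

def gate(test, result):
    """Map a raw test result to a CI verdict against the expectations."""
    if test in SKIPS:
        return "skip"
    if test in KNOWN_FLAKES:
        return "ignore"
    if test in EXPECTED_FAILS:
        # A pass here is good news, but the lists must be updated.
        return "update-expectations" if result == "pass" else "expected-fail"
    return "ok" if result == "pass" else "regression"
```

Only "regression" blocks a merge; "expected-fail" and "update-expectations" keep the known-broken list visible and honest instead of letting it silently rot.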

By keeping those sets of expectations, we've been able to keep Mesa pretty clear of regressions, whilst having a very clear list of things that should be fixed to point to. It would be great if that list were empty, but it just isn't. Having it is far better than the two alternatives: either not testing at all (obviously bad), or having the test results always be red so they're always ignored (might as well just not test).


Cheers,

Daniel