On 7/25/23 18:16, Faith Ekstrand wrote:
> Thanks for the detailed write-up! That would definitely explain it. If I remember, I'll try to do a single-threaded run or two. If your theory is correct, there should be no real perf difference when running single-threaded. Those runs will take a long time, though, so I'll have to run them overnight. I'll let you know in a few days once I have the results.
I can also push a separate branch where I just print out a warning whenever we run into such a condition, including the time we were waiting for things to complete. I can probably push something later today.
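Something along these lines is what I have in mind, just as a rough sketch; 'fence' and 'drm' are placeholders for whatever is in scope at the actual wait site:

    ktime_t start;
    s64 wait_us;

    /* Sketch only: measure how long an EXEC job stalls waiting for
     * preceding VM_BIND work to complete before it can be pushed.
     */
    start = ktime_get();
    dma_fence_wait(fence, false);
    wait_us = ktime_us_delta(ktime_get(), start);

    if (wait_us)
        NV_WARN(drm, "EXEC stalled for %lld us behind VM_BIND\n", wait_us);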
> If this theory holds, then I'm not concerned about the performance of the API itself. It would still be good to see if we can find a way to reduce the cross-process drag in the implementation but that's a perf optimization we can do later.
From the kernel side, I think the only thing we could really do is to temporarily run a secondary drm_gpu_scheduler instance, such that VM_BINDs and EXECs are served by separate schedulers, until we get the new page table handling in place.
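Very roughly, and assuming the current (~v6.4) drm_sched_init() signature, that would look something like the below; the ops structs, credit limit and timeout are placeholders:

    #include <drm/gpu_scheduler.h>

    /* Sketch only: give VM_BIND jobs their own scheduler instance so
     * EXEC jobs no longer queue up behind them. nouveau_exec_sched_ops
     * and nouveau_bind_sched_ops are placeholders.
     */
    struct nouveau_sched {
        struct drm_gpu_scheduler exec_sched;
        struct drm_gpu_scheduler bind_sched;
    };

    static int
    nouveau_sched_init_split(struct nouveau_sched *sched,
                             struct drm_device *drm)
    {
        int ret;

        ret = drm_sched_init(&sched->exec_sched, &nouveau_exec_sched_ops,
                             64, 0, MAX_SCHEDULE_TIMEOUT, NULL, NULL,
                             "nouveau_exec", drm->dev);
        if (ret)
            return ret;

        ret = drm_sched_init(&sched->bind_sched, &nouveau_bind_sched_ops,
                             64, 0, MAX_SCHEDULE_TIMEOUT, NULL, NULL,
                             "nouveau_vm_bind", drm->dev);
        if (ret)
            drm_sched_fini(&sched->exec_sched);

        return ret;
    }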
However, the UMD could avoid such conditions more effectively, since it controls the address space. Namely, avoid re-using the same region of the address space right away in certain cases. For instance, instead of replacing a sparse region A[0x0, 0x4000000] with a larger sparse region B[0x0, 0x8000000], replace it with B'[0x4000000, 0xC000000] if possible.
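On the UMD side that could be as simple as preferring fresh VA space in the allocator, e.g. (a sketch, not NVK code; all names made up):

    #include <stdint.h>

    /* Sketch only: bump-allocate sparse VA ranges and defer re-use of
     * freed ones, so a new sparse region doesn't land on top of a
     * range the kernel may still be unbinding.
     */
    struct va_heap {
        uint64_t base, size;
        uint64_t next; /* bump pointer, initialized to 'base' */
    };

    /* Returns 0 when the heap is exhausted; only then would we fall
     * back to recycling freed ranges, which is omitted here.
     */
    static uint64_t
    va_alloc_sparse(struct va_heap *h, uint64_t sz)
    {
        uint64_t va = h->next;

        if (va + sz > h->base + h->size)
            return 0;

        h->next = va + sz;
        return va;
    }

With that, growing A[0x0, 0x4000000] to 0x8000000 naturally yields B'[0x4000000, 0xC000000] instead of re-binding over A's base.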
However, I'm just mentioning this for completeness; the UMD probably shouldn't work around such a kernel limitation, even temporarily.
Anyway, before doing any of those, let's see if the theory holds and we're actually running into such cases.
> Does it actually matter? Yes, it kinda does. No, it probably doesn't matter for games because you're typically only running one game at a time. From a development PoV, however, if it makes CI take longer then that slows down development and that's not good for the users, either.
Fully agree.
- Danilo
>
> ~Faith
>