Re: [PATCH 2/5] rust: device: Add a minimal RawDevice trait

From: Asahi Lina
Date: Fri Feb 24 2023 - 10:42:58 EST


On 25/02/2023 00.24, Greg Kroah-Hartman wrote:
>> What do you recommend for things that want to print device-associated
>> messages, if not holding a reference to the device?
>
> If you aren't holding a reference to the device, that means you aren't
> associated to it at all, so you better not be printing out anything
> related to any device as that pointer could be invalid at any point in
> time.

The RTKit code talks to the firmware coprocessor that is part of the
device, so it definitely is associated to it... among other things it
prints out firmware logs from the device and crash logs, manages memory
buffers (which have default implementations but can be overridden by the
user driver), and more. It's essentially library code shared by all
device drivers that interact with devices with these coprocessors.

>> Or did I
>> misunderstand what you meant? Just pr_foo() isn't great because we have
>> a lot of instances of rtkit and then you wouldn't know which device the
>> messages are about...
>
> Then the rtkit code needs to be changed to properly grab the reference
> and actually use it for something other than just a log message. If it
> only wants it for a log message, then let's just drop it and have the
> rtkit code go quiet, as when kernel code is working properly, it should
> be quiet. If something goes wrong, the code that called into rtkit can
> print out a message based on the error return values.

Keep in mind rtkit does things like print out crash logs, and you
wouldn't want the caller to be responsible for that (and we definitely
want those in dmesg since these coprocessors are non-recoverable, it's
almost as bad as a kernel panic: you will have to reboot to be able to
use the machine properly again). I find those crash logs very useful to
figure out what went wrong with the GPU (especially if combined with a
memory dump which we don't expose to regular users right now, but which
I have ideas for... but even without that, just assert messages from the
coprocessor or fault instruction pointers that I can correlate with the
firmware are very useful on their own).

Right now rtkit also prints out syslogs from the coprocessors. That's
noisy for some but I think very useful, since we're dealing with reverse
engineered drivers. We'll probably want to silence those for some noisy
coprocessors at some point, but I don't think we want to do that until
things are all upstream, stable, and with a larger user base... until
then I think we'd much rather be spammy and have a better chance of
debugging rare issues, which often happen with these coprocessors
running big firmware blobs... there are a lot of subtleties in getting
the interfaces right, never mind cache coherence issues!

> I have no idea what "rtkit" is, if it's an interface to hardware, why
> doesn't it have its own struct device that it creates and manages and
> uses instead? In my quick glance, that feels like the real solution
> here instead of just "I hope this pointer is going to be valid" like it
> lives with today. Odds are you can't remove a rtkit device at runtime,
> so no one has noticed this yet...

Well, they're all embedded into the SoC, yes.

RTKit is Apple's firmware RTOS, and also the name for the
semi-standardized mailbox/shared-memory interface shared by different
firmwares using it. The Linux code to drive it doesn't create its own
"struct device" because the rtkit code is just library code that is
extended by the downstream drivers (like mine). How each driver
interacts with rtkit varies widely... NVMe almost doesn't at all other
than for power management, there is actually a downstream "rtkit-helper"
driver that is a proper standalone device wrapper for one case (MTP)
where it really doesn't need to interact at all... in my case with the
GPU, almost everything is shared memory and doorbells over the mailbox.
Other drivers like DCP actually send pointers over multiple mailbox
endpoints, or do most of their data exchange directly over messages like
that (SMC).

So in a way, if we consider it driver library code, it's not
unreasonable for RTKit to require that the device you pass it outlives
it. Certainly, if the device is getting unbound from your driver, you'd
need to tear down RTKit as part of that in any reasonable situation.

>> I know it's hard to review without examples, but I also can't just post
>> the driver and everything else as one series now, there's still a lot to
>> be improved and fixed and I'm working with the Rust folks on figuring
>> out a roadmap for that... and waiting until "everything" is ready and
>> perfect would mean we don't get anything done in the meantime and fall
>> into a pit of endless rebasing and coordinating downstream trees, which
>> also isn't good...
>
> Yeah, it's a chicken and egg issue right now, no worries, I understand.
> This is going to take some cycles to get right.
>

Thank you ^^

~~ Lina