Re: [PATCH V4 18/20] rv/monitor: Add safe watchdog monitor

From: Guenter Roeck
Date: Fri Jul 01 2022 - 10:46:02 EST


On 6/28/22 12:32, Steven Rostedt wrote:
[ ... ]
I am open to suggestions from others, but at this point I have serious doubts
that this code is maintainable in the kernel.


Let me give some background on this. Various safety critical users
(automotive, power plants, etc) are looking at using Linux in the field.
Obviously, Linux itself is too big to verify properly. In fact, even small
OSs that say they are verifiable seldom are. Part of the solution is to add
a way to check that Linux is running as expected, and be able to detect
when it is not. When that happens, you fall into a "fail-safe" mode (for
example, if you have an autonomous driving vehicle, it could sound an alarm
and inform the driver to go into manual mode, or it is going to simply
"pull over". Note, this is just a hypothetical example, not something I've
seen in reality).

But to implement this, you need to verify the verifier. In this case, the
watchdog timer is what verifies that Linux is running properly. It is
possible to verify parts of the kernel when you limit the inputs of the
kernel. This means we are not trying to verify all use cases of the
watchdog timer. That would be pretty much impossible, or take decades to
complete. We only need to verify the use cases used in safety critical
systems.


I understand the context. However, ...

Do not confuse this with static analyzers or other general purpose tooling
to find bugs. This is not the purpose of the run time verify. It is just to
prove that certain use cases will perform as expected, given a limited
input.


... this is, unfortunately, not explained in the patch. I would have much less
of a problem with the series if those details were included.

Not that I would mind such a verifier, if it was possible to define one,
but it would have to be tested with a large number of watchdog drivers
to ensure that it addresses all use cases, or at least with a substantial
percentage of use cases. It would also require that the state machine is
readable to give people a chance to fix it if turns out to be broken.
It would also have to be robust, meaning it would have to reject invalid
(unsupported) settings from the start and not only during runtime.

Thanks,
Guenter