Re: [PATCH V4 20/20] Documentation/rv: Add watchdog-monitor documentation

From: Tao Zhou
Date: Thu Jul 07 2022 - 08:40:45 EST


On Thu, Jun 16, 2022 at 10:45:02AM +0200,
Daniel Bristot de Oliveira wrote:

> Adds documentation about the safe_wtd and safe_wtd_nwo RV monitors,
> and their usage via a safety application.
>
> Cc: Wim Van Sebroeck <wim@xxxxxxxxxxxxxxxxxx>
> Cc: Guenter Roeck <linux@xxxxxxxxxxxx>
> Cc: Jonathan Corbet <corbet@xxxxxxx>
> Cc: Steven Rostedt <rostedt@xxxxxxxxxxx>
> Cc: Ingo Molnar <mingo@xxxxxxxxxx>
> Cc: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
> Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
> Cc: Will Deacon <will@xxxxxxxxxx>
> Cc: Catalin Marinas <catalin.marinas@xxxxxxx>
> Cc: Marco Elver <elver@xxxxxxxxxx>
> Cc: Dmitry Vyukov <dvyukov@xxxxxxxxxx>
> Cc: "Paul E. McKenney" <paulmck@xxxxxxxxxx>
> Cc: Shuah Khan <skhan@xxxxxxxxxxxxxxxxxxx>
> Cc: Gabriele Paoloni <gpaoloni@xxxxxxxxxx>
> Cc: Juri Lelli <juri.lelli@xxxxxxxxxx>
> Cc: Clark Williams <williams@xxxxxxxxxx>
> Cc: linux-doc@xxxxxxxxxxxxxxx
> Cc: linux-kernel@xxxxxxxxxxxxxxx
> Cc: linux-trace-devel@xxxxxxxxxxxxxxx
> Signed-off-by: Daniel Bristot de Oliveira <bristot@xxxxxxxxxx>
> ---
> Documentation/trace/rv/watchdog-monitor.rst | 250 ++++++++++++++++++++
> 1 file changed, 250 insertions(+)
> create mode 100644 Documentation/trace/rv/watchdog-monitor.rst
>
> diff --git a/Documentation/trace/rv/watchdog-monitor.rst b/Documentation/trace/rv/watchdog-monitor.rst
> new file mode 100644
> index 000000000000..2b142fb31572
> --- /dev/null
> +++ b/Documentation/trace/rv/watchdog-monitor.rst
> @@ -0,0 +1,250 @@
> +Watchdog monitor
> +----------------
> +
> +The watchdog is an essential building block for the usage of Linux in
> +safety-critical systems because it allows the system to be monitored from
> +an external element - the watchdog hardware, acting as a safety-monitor.
> +
> +A user-space application controls the watchdog device via the watchdog
> +interface. This application, hereafter safety_app, enables the watchdog
> +and periodically pets the watchdog upon correct completion of the safety
> +related processing.
> +
> +If the safety_app, for any reason, stops pinging the watchdog,
> +the watchdog hardware can set the system in a fail-safe state. For
> +example, shutting the system down.
> +
> +Given the importance of the safety_app / watchdog hardware couple,
> +the interaction between these software pieces also needs some
> +sort of monitoring. In other words, "who monitors the monitor?"
> +
> +The safe watchdog (safe_wtd) RV monitor monitors the interaction between
> +the safety_app and the watchdog device, enforcing the correct sequence of
> +events that leads the system to a safe state.
> +
> +Furthermore, the safety_app can monitor the RV monitor by collecting the
> +events generated by the RV monitor itself via tracing interface. In this way,
> +closing the monitoring loop with the safety_app.
> +
> +A diagram of the components and their interactions is::
> +
> + user-space:
> + +--------------------------------+
> + | safety_app |-----------+
> + +--------------------------------+ |
> + | ^ |
> + | Configure | Enable and |
> + | | check data |
> + ===================+====================+=============== |
> + kernel-space: | | |
> + v v |
> + +----------+ instr. +-------------+ |
> + | watchdog | ----------->| RV Monitor |----+ |
> + | device | +-------------+ | |
> + +----------+ | |
> + | | |
> + | | |
> + ================+====================================== | |
> + hardware: | | |
> + v | +-> Bring the system
> + +--------------------+ +----> to a safe state,
> + | watchdog hardware |---------------------------> e.g., halt.
> + +--------------------+
> +
> +Sample safety_app
> +-----------------
> +
> +The user-space safety_app sample code in ``tools/verification/safety_app/``
> +serves to illustrate the usage of the RV monitors for this use-case, as
> +well as the starting point to the development of a user-specific safety_app.
> +
> +Watchdog events
> +---------------
> +
> +The RV monitor observes the watchdog by using instrumentation to
> +process the events generated by the interaction between the
> +safety_app and the watchdog device layer in kernel.
> +
> +The monitored events are:
> +
> + - watchdog:watchdog_open: open the watchdog device;
> + - watchdog:watchdog_close: close the watchdog device;
> + - watchdog:watchdog_start: start the watchdog;
> + - watchdog:watchdog_stop: stop the watchdog;
> + - watchdog:watchdog_set_timeout: set the watchdog timeout;
> + - watchdog:watchdog_ping: reprogram the watchdog with the previously set
> + timeout;
> + - watchdog:watchdog_nowayout: prevents the watchdog from stopping;
> + - watchdog:watchdog_set_keep_alive: set an intermediary ping to overcome
> + the limitation of a hardware watchdog maximum timeout being shorter than
> + the timeout set by the user-space tool;
> + - watchdog:watchdog_keep_alive: the execution of the function that runs the
> + intermediary keep alive ping;
> +
> +RV monitor events
> +-----------------
> +
> +The RV monitor monitors the relevant events as an outside observer,
> +interpreting all the components (the hardware; the watchdog device
> +interface; and the safety monitor) as an integrated component.
> +
> +The events selected for the monitor are:
> +
> + - other_threads: an event generated by any thread other than the
> + one that set nowayout or open the watchdog the last time.
> + - open: a thread opens the watchdog to manipulate it;
> + - close: a thread closes the watchdog;
> + - start: starts the watchdog countdown;
> + - stop: stops the watchdog;
> + - set_safe_timeout: configures the watchdog with a given timeout;
> + - ping: resets the watchdog countdown with the previously configured timeout;
> + - nowayout: prevents the watchdog to be stopped until the system's shutdown;
> + - sched_keep_alive: schedules a kernel worker to ping the watchdog if the
> + timeout is longer than the watchdog hardware can handle.
> + - keep_alive: executes the previously scheduled watchdog ping;
> +
> +Noting that the events that does not appear in the automata models are
> +considered blocked events, and their execution will always cause the
> +RV monitor to react to an unexpected event.
> +
> +RV monitor specification
> +------------------------
> +
> +The monitor's goal is to assess a set of specifications that conducts the
> +system to a safe state.
> +
> +These specifications are:
> +
> + - 1: Once open, only one process manipulates the watchdog;
> + - 2: Following 1, the keep-alive mechanisms will not be used;
> + - 3: If required, nowayout will be set before opening the watchdog;
> + - 4: A safe timeout must be set;
> + - 5: At least one ping must be made before entering the safe/safe_nwo states
> + - 6: The RV monitor does not react if the watchdog is closed without stopping.
> + But the hardware watchdog is expected to react.
> +
> +Deterministic automata monitors
> +-------------------------------
> +
> +Following the specifications, a deterministic automata monitor
> +was developed. The monitor is modeled as Deterministic Automata model.
> +
> +The deterministic automata model for safe_wtd is::
> +
> + #==================================# other_threads
> + H H ----------------+
> + -----------> H init H |
> + H H <---------------+
> + #==================================#
> + | | ^
> + | | | close
> + | | +----------------------------------------------------+
> + | | |
> + | | open |
> + | +------------------------------------------------------+ |
> + | | |
> + | nowayout | |
> + v | |
> + nowayout +-------------------+ | |
> + other_threads | | nowayout | |
> + +---------------- | nwo |<-------------------------------------+ | |
> + | | | | | |
> + +---------------> | | <+ | | |
> + +-------------------+ | | | |
> + | | | | |
> + | open | close | | |
> + v | | | |
> + +-------------------+ | | | |
> + | opened_nwo | -+ | | |
> + +-------------------+ | | |
> + | | | |
> + | start | | |
> + v | | |
> + +-------------------+ | | |
> + +---------------> | started_nwo | -+ | | |
> + | +-------------------+ | | | |
> + | | | | | |
> + | open | set_safe_timeout | | | |
> + | v | | | |
> + | +-------------------+ | | | |
> + | | set_nwo | | | | |
> + | +-------------------+ | | | |
> + | | | | | |
> + | +-------------+ | ping | | | |
> + | | | | | | | |
> + | | ping v v | | | |
> + | | +-------------------+ | | | |
> + | +-----------| safe_nwo | | | | |
> + | +-------------------+ | | | |
> + | | | | | |
> + | | close | close | | |
> + | v v | | |
> + | +----------------------------------+ nowayout | | |
> + | | | other_threads | | |
> + | | closed_running_nwo | ----------------+ | | |
> + | | | | | | |
> + +---------------- | | <---------------+ | | |
> + +----------------------------------+ | | |
> + | nowayout ^ | | |
> + +-----------------------------+ | | |
> + | | |
> + | | |
> + +-------------------+ +--------+ | | |
> + | | | |------+---+ |
> + | started | start | opened | | |
> + +---------------- | | <-------- | |>-----+-------+
> + | +-------------------+ +--------+ | ^
> + | | | |
> + | | set_safe_timeout +-------------+-------+
> + | v | |
> + | +-------------------+ | |
> + | | | | |
> + | | set | | |
> + +----------+---------------> | | | |
> + | | +-------------------+ | |
> + | | | | |
> + | | | ping | |
> + | | v | |
> + | | +-------------------+ ping | |
> + | | | | -------+ | |
> + | | +---- | safe | | | |
> + | | | | | <------+ | |
> + | | | +-------------------+ | |
> + | | | | | |
> + | | stop | | stop | |
> + | | | v | |
> + | | | +-------------------+ close | |
> + | +-----------+---> | stopped |-------------+ |
> + | | +-------------------+ |
> + | +---+ |
> + | | close |
> + | v |
> + | other_threads +----------------------------------------+ |
> + | +--------------> | | |
> + | | | closed_running | |
> + | +--------------- | |--------------+
> + | +----------------------------------------+
> + | | ^
> + | open | | close
> + | v |
> + | set_safe_timeout +-------------------+
> + +-------------------------> | reopened |
> + +-------------------+

So I must check the model matrix in patch18 and the above graph to
check if the matrix have the not right value or something in this
graph go wrong.


open
o init state ------> opened state: there lack a direction arrow.


nwo
o closed_running state -------> closed_running_nwo state: this
is the state trasfer indicated from matrix, but the above
graph is not consistent with it. In graph:
nwo
closed_running state -------> nwo state. I prefer the state
closed_running_nwo.

nwo
other_thread
o closed_running_nwo state ---------------> closed_running_nwo state.
The graph above have a duplicate nwo event leading to state transfer.
open
closed_running_nwo state ---------------> started_nwo state. I think
it should transfer to opened_nwo state.



set_safe_timeout
o reopened state ------------------> set state: the direction is from
reopened state to set state. The graph above have two direction arrow.
stop
reopenped state ------------------> opened state. why stop event trigger
it to opened state not stoped state. I think it should transfer to stoped
state like the started state do.


o started state do not have link to closed_running state by close event in
graph. Like started_nwo state do, it should have a direct close event to
link to closed_running state.




Modified the above graph like this:

+
+ #==================================# other_threads
+ H H ----------------+
+ -----------> H init H |
+ H H <---------------+
+ #==================================#
+ | | ^
+ | | | close
+ | | +----------------------------------------------------+
+ | | |
+ | | open |
+ | +------------------------------------------------------+ |
+ | | |
+ | nowayout | |
+ v | |
+ nowayout +-------------------+ | |
+ other_threads | | | |
+ +---------------- | nwo | | |
+ | | | | |
+ +---------------> | | <+ | |
+ +-------------------+ | | |
+ | | | |
+ | open | close | |
+ v | | |
+ +-------------------+ | | |
+ +---------------> | opened_nwo | -+ | |
+ | +-------------------+ | |
+ | | | |
+ | | start | |
+ | v | |
+ | +-------------------+ | |
+ | | started_nwo | -+ | |
+ | +-------------------+ | | |
+ | | | | |
+ | open | set_safe_timeout | | |
+ | v | | |
+ | +-------------------+ | | |
+ | | set_nwo | | | |
+ | +-------------------+ | | |
+ | | | | |
+ | +-------------+ | ping | | |
+ | | | | | | |
+ | | ping v v | | |
+ | | +-------------------+ | | |
+ | +-----------| safe_nwo | | | |
+ | +-------------------+ | | |
+ | | | | |
+ | | close | close | |
+ | v v | |
+ | +----------------------------------+ nowayout | |
+ | | | other_threads | |
+ | | closed_running_nwo | ----------------+ | |
+ | | | | | |
+ +---------------- | | <---------------+ | |
+ +----------------------------------+ | |
+ ^ | |
+ | nowayout | |
+ +--------------------------------------+ | |
+ | | |
+ +-------------------+ +--------+ | | |
+ | | start | |<----+---+ |
+ | started | <--------- | opened | | |
+ +---------------- | | ---------+ | |>----+-------+
+ | +-------------------+ close | +--------+ | ^
+ | | | | |
+ | | set_safe_timeout | +-----------+-------+
+ | v | | |
+ | +-------------------+ | | |
+ | | | | | |
+ | | set | | | |
+ +----------+---------------> | | | | |
+ | | +-------------------+ | | |
+ | | | | | |
+ | | | ping | | |
+ | | v | | |
+ | | +-------------------+ ping | | |
+ | | | | -------+ | | |
+ | | +---- | safe | | | | |
+ | | | | | <------+ | | |
+ | | | +-------------------+ | | |
+ | | | | | | |
+ | | stop | | stop | | |
+ | | | v | | |
+ | | | +-------------------+ close | | |
+ | +-----------+---> | stopped |----------+----+ |
+ | | +-------------------+ | |
+ | +---+ ^ | |
+ | | close | | |
+ | | +------------------+------+ |
| v v | |
+ | other_threads +----------------------------------------+ | |
+ | +--------------> | | | |
+ | | | closed_running | | |
+ | +--------------- | |----+---------+
+ | +----------------------------------------+ |
+ | | ^ |
+ | open | | close |stop
+ | v | |
+ | set_safe_timeout +-------------------+ |
+ +-------------------------- | reopened |------------------+
+ +-------------------+



Thanks
> +It is important to note that the events sched_keep_alive and keep_alive
> +are not allowed in the monitor (they are said to be blocked events).
> +The execution of any blocked events leads the RV monitor to react.
> +
> +Additional options
> +------------------
> +
> +The RV monitor also has a set of options enabled via kernel command
> +line/module options. They are:
> +
> + - watchdog_id: the device id to monitor (default 0);
> + - dont_stop: once enabled, do not allow the RV monitor to be stopped (default off);
> + - safe_timeout: define a maximum safe value that a user-space application can
> + set as the watchdog timeout (default unlimited);
> + - check_timeout: After every ping, check if the time left in the watchdog is less
> + than or equal to the last timeout set for the watchdog. It only works for watchdog
> + devices that provide the get_timeleft() function (default off);
> --
> 2.35.1
>