Re: [PATCH v2 5/5] vfio: ccw: add traceponits for interesting error paths

From: Halil Pasic
Date: Mon Apr 30 2018 - 10:14:37 EST




On 04/30/2018 01:51 PM, Cornelia Huck wrote:
On Sat, 28 Apr 2018 13:50:23 +0800
Dong Jia Shi <bjsdjshi@xxxxxxxxxxxxxxxxxx> wrote:

* Cornelia Huck <cohuck@xxxxxxxxxx> [2018-04-27 12:13:53 +0200]:

On Mon, 23 Apr 2018 13:01:13 +0200
Dong Jia Shi <bjsdjshi@xxxxxxxxxxxxxxxxxx> wrote:

typo in subject: s/traceponits/tracepoints/
From: Halil Pasic <pasic@xxxxxxxxxxxxxxxxxx>

Add some tracepoints so we can inspect what is not working as is should.

Signed-off-by: Halil Pasic <pasic@xxxxxxxxxxxxxxxxxx>
Signed-off-by: Dong Jia Shi <bjsdjshi@xxxxxxxxxxxxxxxxxx>
---
drivers/s390/cio/Makefile | 1 +
drivers/s390/cio/vfio_ccw_fsm.c | 16 +++++++-
drivers/s390/cio/vfio_ccw_trace.h | 77 +++++++++++++++++++++++++++++++++++++++
3 files changed, 93 insertions(+), 1 deletion(-)
create mode 100644 drivers/s390/cio/vfio_ccw_trace.h

@@ -135,6 +142,8 @@ static void fsm_io_request(struct vfio_ccw_private *private,
goto err_out;
io_region->ret_code = cp_prefetch(&private->cp);
+ trace_vfio_ccw_cp_prefetch(get_schid(private),
+ io_region->ret_code);
if (io_region->ret_code) {
cp_free(&private->cp);
goto err_out;
@@ -142,11 +151,13 @@ static void fsm_io_request(struct vfio_ccw_private *private,
/* Start channel program and wait for I/O interrupt. */
io_region->ret_code = fsm_io_helper(private);
+ trace_vfio_ccw_fsm_io_helper(get_schid(private),
+ io_region->ret_code);
if (io_region->ret_code) {
cp_free(&private->cp);
goto err_out;
}
- return;
+ goto out;
} else if (scsw->cmd.fctl & SCSW_FCTL_HALT_FUNC) {
/* XXX: Handle halt. */
io_region->ret_code = -EOPNOTSUPP;
@@ -159,6 +170,9 @@ static void fsm_io_request(struct vfio_ccw_private *private,
err_out:
private->state = VFIO_CCW_STATE_IDLE;
+out:
+ trace_vfio_ccw_io_fctl(scsw->cmd.fctl, get_schid(private),
+ io_region->ret_code);
}
/*

I really don't want to bikeshed, especially as some tracepoints are
better than no tracepoints, but...

We now trace fctl/schid/ret_code unconditionally (good).

We trace the outcome of cp_prefetch() and fsm_io_helper()
unconditionally. We don't, however, trace all things that may go wrong.
We have the tracepoint at the end, but it cannot tell us where the
error came from. Should we have tracepoints in every place (in this
function) that may generate an error? Only if there is an actual error?
Are the two enough for common debug scenarios?
Trace actual error sounds like a better idea than trace unconditionally
of these two functions.
These two are not enough for common debug scenarios. For example, we
cann't tell if a -EOPNOTSUPP is a orb->tm.b problem, or error code
returned by cp_init().

Idea to improve:
1. Trace actual error.
2. Define a trace event and add error trace for cp_init().

Hm. Going from what I have done in the past when doing printk debugging:

- stick in a message that is always hit, with some information about
parameters, if it makes sense
- stick in a message "foo happened!" in the error branches
- or, alternatively, trace the called functions

So tracing on failure only might be more useful? Have all failure paths
under a common knob to turn on/off?

Opinions? We can just go ahead with this and improve things later
on, I guess.
I think it's also fine to do this - better something than nothing. We
could at least have a code base to be improved to make everybody
happier in future.

Maybe keep the patch as it is now, except trace the errors only
(keeping the fctl trace point)?

What do you mean by this sentence. Get rid of vfio_ccw_io_fctl or get
rid of vfio_ccw_cp_prefetch and vfio_ccw_fsm_io_helper, or get don't
get rid of any, but make some conditional (!errno)?


Halil, as you wrote the patch (and I presume you found it helpful):
What is your opinion?


I'm in favor of this patch (as previously stated here
https://patchwork.kernel.org/patch/10298305/). And regarding the
questions under discussion I'm mostly fine either way.

I think the naming of this fctl thing is a bit cryptic,
but if we don't see this as ABI I'm fine with it -- can be improved.
What would be a better name? I was thinking along the lines accept_request.
(Bad error code would mean that the request did not get accepted. Good
code does not mean the requested function was performed successfully.)

Also I think vfio_ccw_io_fctl with no zero error code would make sense
as dev_warn. If I were an admin looking into a problem I would very much
appreciate seeing something in the messages log (and not having to enable
tracing first). This point seems to be a good one for high level 'request gone
bad' kind of report. Opinions?

Regards,
Halil