Re: [PATCH v2] ASoC: cros_ec_codec: Log results when EC commands fail

From: Yu-Hsuan Hsu
Date: Mon Jul 06 2020 - 05:46:42 EST


Yu-Hsuan Hsu <yuhsuan@xxxxxxxxxxxx> æ 2020å7æ4æ éå äå7:34åéï
>
> Guenter Roeck <groeck@xxxxxxxxxx> æ 2020å7æ4æ éå äå3:28åéï
> >
> > On Fri, Jul 3, 2020 at 12:11 PM Yu-Hsuan Hsu <yuhsuan@xxxxxxxxxxxx> wrote:
> > >
> > > Guenter Roeck <groeck@xxxxxxxxxx> æ 2020å7æ3æ éä äå11:58åéï
> > > >
> > > > On Fri, Jul 3, 2020 at 3:56 AM Enric Balletbo i Serra
> > > > <enric.balletbo@xxxxxxxxxxxxx> wrote:
> > > > >
> > > > > Hi Yu-Hsuan,
> > > > >
> > > > > On 3/7/20 11:40, Yu-Hsuan Hsu wrote:
> > > > > > Enric Balletbo i Serra <enric.balletbo@xxxxxxxxxxxxx> æ 2020å7æ3æ éä äå5:19åéï
> > > > > >>
> > > > > >> Hi Yu-Hsuan,
> > > > > >>
> > > > > >> On 3/7/20 10:48, Yu-Hsuan Hsu wrote:
> > > > > >>> Enric Balletbo i Serra <enric.balletbo@xxxxxxxxxxxxx> æ 2020å7æ3æ éä äå4:38åéï
> > > > > >>>>
> > > > > >>>> Hi Yu-Hsuan,
> > > > > >>>>
> > > > > >>>> Thank you for your patch
> > > > > >>>>
> > > > > >>>> On 3/7/20 9:19, Yu-Hsuan Hsu wrote:
> > > > > >>>>> Log results of failed EC commands to identify a problem more easily.
> > > > > >>>>>
> > > > > >>>>> Replace cros_ec_cmd_xfer_status with cros_ec_cmd_xfer because the result
> > > > > >>>>> has already been checked in this function. The wrapper is not needed.
> > > > > >>>>>
> > > > > >>>>
> > > > > >>>> Nack, we did an effort to remove all public users of cros_ec_cmd_xfer() in
> > > > > >>>> favour of cros_ec_cmd_xfer_status() and you are reintroducing again. You can do
> > > > > >>>> the same but using cros_ec_cmd_xfer_status(). In fact, your patch will not build
> > > > > >>>> on top of the upcoming changes.
> > > > > >>> Thanks! But I have a question about implementing it. Does it look like
> > > > > >>> the one below?
> > > > > >>> ret = cros_ec_cmd_xfer_status(ec_dev, msg);
> > > > > >>> if (ret < 0) {
> > > > > >>
> > > > > >> In this case will already print an error.
> > > > > >>
> > > > > >> What are you trying to achieve?
> > > > > >>
> > > > > >> If the only reason is of this patch is print a message you should either, or
> > > > > >> enable dynamic printk and enable dev_dbg or event better use the kernel trace
> > > > > >> functionality. There is no need to be more verbose.
> > > > > >>
> > > > > >> Example:
> > > > > >> $ echo 1 > /sys/kernel/debug/tracing/events/cros_ec/enable
> > > > > >> $ cat /sys/kernel/debug/tracing/trace
> > > > > >>
> > > > > >> 369.416372: cros_ec_request_start: version: 0, command: EC_CMD_USB_PD_POWER_INFO
> > > > > >> 369.420528: cros_ec_request_done: version: 0, command:
> > > > > >> EC_CMD_USB_PD_POWER_INFO, ec result: EC_RES_SUCCESS, retval: 16
> > > > > >>
> > > > > >> Cheers,
> > > > > >> Enric
> > > > > >>
> > > > > > Thank Enric,
> > > > > >
> > > > > > The situation is that some users encountered errors on ChromeBook.
> > > > >
> > > > > And, aren't you able to reproduce the issue?
> > > > >
> > > > >
> > > > > > From their feedback reports, we only get the message like
> > > > > > 'cros-ec-codec GOOG0013:00: ASoC: Failed to set DAI format: -71'.
> > > > > > We know that -71 is -EPROTO but it is not clear enough for us to find
> > > > > > out the root cause. That's why we want the detail of the result.
> > > > >
> > > > >
> > > > > If I am not mistaken this ends calling i2s_rx_set_daifmt() into the EC firmware,
> > > > > if the result is -EPROTO that means is not returning EC_RES_SUCCESS, so there
> > > > > are few options:
> > > > >
> > > > > if (i2s_rx_enabled)
> > > > > return EC_RES_BUSY;
> > > > >
> > > > > if (daifmt >= EC_CODEC_I2S_RX_DAIFMT_COUNT)
> > > > > return EC_RES_INVALID_PARAM;
> > > > >
> > > > > if (audio_codec_i2s_rx_set_daifmt(daifmt) != EC_SUCCESS)
> > > > > return EC_RES_ERROR;
> > > > >
> > > > > > Because the situation happens on users' side, it is not possible for
> > > > > > them to enable kernel trace (ChromeOS does not allow users to touch
> > > > > > kernel).
> > > > > >
> > > > >
> > > > > Are you sure that when you know the error code you'll find the root cause
> > > > > (without adding more prints)? There is only three possibilities? You can't start
> > > > > adding prints just to debug a user issue because you don't allow to be more
> > > > > verbose. I understand that might help you but is not the way to go.
> > >
> > > Hi Enric and Guenter,
> > >
> > > Thanks for your inspiring comments.
> > > I'm not sure whether we will find the root cause if I know the error
> > > code. But I think it's not a point.
> > > We wanted to add this error log because we found that the current one
> > > is not enough. Since it is a real error, it would be better if we can
> > > make it more detailed, right?
> > > In addition, we thought it would be helpful in the future as well.
> > > That's why we chose to upstream instead of merging into our source
> > > tree only.
> > >
> >
> > Kernel log messages are almost never useful for actual users. In most
> > cases, they just clog up the kernel log, making it useless for when it
> > actually counts (such as when the system crashes). On top of that,
> > there already is an error message, only the error code doesn't mean
> > much because it is "universalized". I don't see how adding another
> > error message would improve that, even more so if that error message
> > is only added in one place. On the other side, converting EC error
> > codes to Linux kernel error codes would help every caller of
> > cros_ec_cmd_xfer_status without adding more logging noise.
>
> Thanks for the explanation. If I understand correctly, the mean point
> is this log is noisy.
> In this case, the log will only appear one time during the card
> probing(if it fails). But I do agree that it may create some potential
> noises.
> Converting EC error codes to Linux kernel error codes also work for
> me. Is it possible for you to send the corresponding patch? If not, I
> can try to work on it.
> In addition, is it possible that one error code can come from linux or from EC?
>
> >
> > > > >
> > > > > You should really reproduce the issue yourself an use actual debug
> > > > > tools/prints./traces
> > > We are trying but still unable to reproduce this issue.
> > > However, as I maintained above, it is not a main concern of this change.
> > > > >
> > > >
> > > > Another possibility would be to change cros_ec_cmd_xfer_status() to
> > > > return a more granular error code, such as -EINVAL for
> > > > EC_RES_INVALID_PARAM, -EBUSY for EC_RES_BUSY, -EINPROGRESS for
> > > > EC_RES_IN_PROGRESS, -ETIMEDOUT for EC_RES_TIMEOUT, -EOVERFLOW for
> > > > EC_RES_OVERFLOW, -ENODATA for EC_RES_UNAVAILABLE, and so on.
> > > Since there are many kinds of results from EC, why not just make users
> > > able to check on their own?
> > > For example, users can wait and try again if the result is EC_RES_BUSY.
> > >
> >
> > That is exactly what -EBUSY is for: It lets the user space application
> > decide what to do about it. A kernel log message can not and will
> > never achieve that.
> >
> > > >
> > > > However, it appears that the various low level functions already
> > > > replace various EC error codes with a blank EC_RES_ERROR. No amount of
> > > > logging will tell us what exactly went wrong in those functions. Lucky
> > > > for us, audio_codec_i2s_rx_set_daifmt() only ever returns EC_SUCCESS,
> > > > so we know that the problem is either that i2s_rx_enabled is true or
> > > > that daifmt is too large. None of those really warrants more verbose
> > > > logging.
> > > >
> > > > From the context, my personal bet is that i2s_rx_enabled is true: I
> > > > don't immediately see how disabling it is enforced before trying to
> > > > set the DAI format, and I don't see how "daifmt >=
> > > > EC_CODEC_I2S_RX_DAIFMT_COUNT" can ever be true.
> > > I totally agree. According to the source, it seems that both path are
> >
> > You are effectively saying that there is code to ensure that
> > i2s_rx_enabled is false. Granted, the code is too complex to easily
> > understand, and I may not have seen that flow. Per the same logic,
> > though, it might well be possible that _because_ the code is not easy
> > to understand there may well be a data path where i2s_rx_enabled is
> > set. One would need to ensure that the sequence of <disable rx> - <set
> > dai format> - <re-enable rx> is guaranteed, and that it is always
> > executed under a kernel lock. Maybe you can point me to that code, for
> > my education.
> >
> > On the other side, it is much easier to verify that "daifmt >=
> > EC_CODEC_I2S_RX_DAIFMT_COUNT" is never true.
> >
> > Thanks,
> > Guenter
> The complete errors are:
> ERR kernel: [ 8.568171] cros-ec-codec GOOG0013:00: ASoC: error at
> snd_soc_dai_set_fmt on GOOG0013:00: -71
> WARNING kernel: [ 8.583312] cros-ec-codec GOOG0013:00: ASoC: Failed
> to set DAI format: -71
> ERR kernel: [ 8.629388] acp3x-alc5682-max98357 AMDI5682:00:
> devm_snd_soc_register_card(acp3xalc5682m98357) failed: -71
> WARNING kernel: [ 8.645306] acp3x-alc5682-max98357: probe of
> AMDI5682:00 failed with error -71
>
> This issue happened when probing the card while the i2s_rx is only
> enabled when the device is used. Therefore, the i2s_rx should not be
> enabled before the card is probed. However, I may miss something. I
> will check the entire path later.
I finally found a possible root cause. EC does not restart when we
reboots kernel. It means if kernel crashes while i2s_rx is enabled, we
will trigger this issue. In addition, the issue will stay until the
next time rebooting EC.

I will work on the fix. Thank you for very helpful information.

> For "daifmt >= EC_CODEC_I2S_RX_DAIFMT_COUNT", it is obviously not to
> be true by looking into i2s_rx_set_fmt function.
>
> Thanks,
> Yu-Hsuan
>
> >
> > > impossible. I'm not really understand the whole path but is it
> > > possible for EC to return other results? I will do more tests and look
> > > carefully into the source. Really thanks for your suggestions.
> > >
> > > Cheers,
> > > Yu-Hsuan
> > > >
> > > > Guenter
> > > >
> > > >
> > > > Guenter
> > > >
> > > > > Cheers,
> > > > > Enric
> > > > >
> > > > > > The other way we thought is changing dev_dbg to dev_err in
> > > > > > cros_ec_cmd_xfer_status. But we are not sure whether it is also an
> > > > > > error for other usages.
> > > > > >
> > > > > >>> if (ret == -EPROTO)
> > > > > >>> dev_err(..., msg->result)
> > > > > >>> goto error;
> > > > > >>> }
> > > > > >>> I'm not sure whether it makes sense to check ret == -EPROTO here.
> > > > > >>>
> > > > > >>>>
> > > > > >>>>> Signed-off-by: Yu-Hsuan Hsu <yuhsuan@xxxxxxxxxxxx>
> > > > > >>>>> ---
> > > > > >>>>> sound/soc/codecs/cros_ec_codec.c | 9 ++++++++-
> > > > > >>>>> 1 file changed, 8 insertions(+), 1 deletion(-)
> > > > > >>>>>
> > > > > >>>>> diff --git a/sound/soc/codecs/cros_ec_codec.c b/sound/soc/codecs/cros_ec_codec.c
> > > > > >>>>> index 8d45c628e988e..a4ab62f59efa6 100644
> > > > > >>>>> --- a/sound/soc/codecs/cros_ec_codec.c
> > > > > >>>>> +++ b/sound/soc/codecs/cros_ec_codec.c
> > > > > >>>>> @@ -90,10 +90,17 @@ static int send_ec_host_command(struct cros_ec_device *ec_dev, uint32_t cmd,
> > > > > >>>>> if (outsize)
> > > > > >>>>> memcpy(msg->data, out, outsize);
> > > > > >>>>>
> > > > > >>>>> - ret = cros_ec_cmd_xfer_status(ec_dev, msg);
> > > > > >>>>> + ret = cros_ec_cmd_xfer(ec_dev, msg);
> > > > > >>>>> if (ret < 0)
> > > > > >>>>> goto error;
> > > > > >>>>>
> > > > > >>>>> + if (msg->result != EC_RES_SUCCESS) {
> > > > > >>>>> + dev_err(ec_dev->dev, "Command %d failed: %d\n", cmd,
> > > > > >>>>> + msg->result);
> > > > > >>>>> + ret = -EPROTO;
> > > > > >>>>> + goto error;
> > > > > >>>>> + }
> > > > > >>>>> +
> > > > > >>>>> if (insize)
> > > > > >>>>> memcpy(in, msg->data, insize);
> > > > > >>>>>
> > > > > >>>>>