Re: [PATCH 1/2] mailbox: switch to hrtimer for tx_complete polling

From: Sudeep Holla
Date: Mon Jul 27 2015 - 05:48:14 EST

Next message: Michal Simek: "[PATCH 1/7] ARM64: zynqmp: Use zynqmp specific compatible string for gpio"
Previous message: Michal Suchanek: "Re: [PATCH 08/11] MTD: m25p80: Add option to limit SPI transfer size."
In reply to: Jassi Brar: "Re: [PATCH 1/2] mailbox: switch to hrtimer for tx_complete polling"
Next in thread: Jassi Brar: "Re: [PATCH 1/2] mailbox: switch to hrtimer for tx_complete polling"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On 27/07/15 04:26, Jassi Brar wrote:

On Fri, Jul 24, 2015 at 2:17 PM, Sudeep Holla <sudeep.holla@xxxxxxx> wrote:

On 24/07/15 06:02, Jassi Brar wrote:

On Wed, Jul 22, 2015 at 5:58 PM, Sudeep Holla <sudeep.holla@xxxxxxx>
wrote:

we might end-up waiting
for atleast a jiffy even though the response for that message from the
remote is received via interrupt and processed in relatively smaller
time granularity.

That is wrong.

No see below.

If the controller supports TX interrupt it should set txdone_irq,
which prevents polling i.e, controller driver calls mbox_chan_txdone.

If the controller doesn't support TX interrupt but the client
receives some ack packet, then the client should set knows_txdone and
call mbox_client_txdone. Again you don't have to wait on polling.

Sorry if I was not clear in the commit message, but I thought I did
mention TXDONE_BY_POLL. The case I am referring is definitely not
TXDONE_BY_IRQ or TXDONE_BY_ACK.

That statement is still wrong. The TXDONE_BY_POLL modifier does't make it right.

I am fine to modify/clarify that statement.

Anyways, I see you meant the 3rd case of neither IRQ nor ACK.

Yes the remote indicates by setting a flag in status register.

Few reasons I have in mind as why we need that:

1. The remote processor is capable of dealing with requests in orders of
milliseconds. E.g. on Juno it reports the DVFS transition latency is
1.2ms. Now if we report that to CPUFreq governor, we need to ensure
that. With current jiffy based timer we see latencies > 10ms.

Can I see the code somewhere?

Part of Juno SCPI support series[1].

It seems your remote doesn't send some protocol level 'ack' packet
replying if the command was successfully executed or not. That means
Linux can't differentiate successful execution of the command from a
silent failure (remote still has to set the TX_done flag to make way
for next messages).

Agreed and again I confirm the remote processor in question just sets
the flag always and correctly and doesn't use a protocol ACK. So I am
not sure the context of the above statement for this particular issue
under discussion.

From Linux' POV our best bet seems to be to submit
the request and simply assume it done. So probably do async rather
than blocking.

How if it's not Tx is not based on protocol ACK packet. The timer is
only solution IMO for the case where Tx done is indicated by a status flag.

2. Because of that, under stress testing with multiple clients active at
a time, I am seeing the mailbox buffer overflows quite easily just
because it's blocked on Tx polling(almost 10x slower) and doesn't
process new requests though the remote can handle.

Yeah this situation may arise. The following fix is needed regardless, I think.

IIUC it just triggers poll_txdone when chan->active_req is set, but that
will anyway happen through timer, no ?

Anyway it's not the main topic of discussion here and can be taken up
separately.

============================
diff --git a/drivers/mailbox/mailbox.c b/drivers/mailbox/mailbox.c
index 07fd507..a1dded9 100644
--- a/drivers/mailbox/mailbox.c
+++ b/drivers/mailbox/mailbox.c
@@ -64,7 +64,12 @@ static void msg_submit(struct mbox_chan *chan)

spin_lock_irqsave(&chan->lock, flags);

- if (!chan->msg_count || chan->active_req)
+ if (chan->active_req) {
+ err = 0;
+ goto exit;
+ }
+
+ if (!chan->msg_count)
goto exit;

count = chan->msg_count;
============================

3. Also there are efforts for scheduler-driven cpu frequency selection
where the desired latency should be as low as possible. Juri(cc-ed)
is working on that and reported this issue with mailbox core.

Sure, latency should be as low as possible. Sounds similar to (1). Was
it reported on some mailing list?

No, I am reporting it now in the form of this patch (with the solution).

Hope this clarifies the reasons for switching to hrtimer.

I am not against using hrtimer, just need to make sure we don't simply
suppress the symptoms of wrong implementation.

Agreed, and that's a valid concern. So far based on the testing and
benchmarking done so far, we don't think this patch is suppressing
anything incorrectly.

If you still have concerns with this solution, please explain them here
so that we can discuss and come to conclusion and the issue is fixed.

Regards,
Sudeep

[1] https://lkml.org/lkml/2015/7/23/270
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Next message: Michal Simek: "[PATCH 1/7] ARM64: zynqmp: Use zynqmp specific compatible string for gpio"
Previous message: Michal Suchanek: "Re: [PATCH 08/11] MTD: m25p80: Add option to limit SPI transfer size."
In reply to: Jassi Brar: "Re: [PATCH 1/2] mailbox: switch to hrtimer for tx_complete polling"
Next in thread: Jassi Brar: "Re: [PATCH 1/2] mailbox: switch to hrtimer for tx_complete polling"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]