Re: [PATCH v9 3/3] soc: qcom: rpmh: Invoke rpmh_flush() for dirty caches

From: Maulik Shah
Date: Thu Mar 05 2020 - 04:41:48 EST


Hi,

On 3/4/2020 6:10 AM, Doug Anderson wrote:
> Hi,
>
> On Mon, Mar 2, 2020 at 9:47 PM Maulik Shah <mkshah@xxxxxxxxxxxxxx> wrote:
>>
>> On 2/29/2020 5:15 AM, Doug Anderson wrote:
>>> Hi,
>>>
>>> On Fri, Feb 28, 2020 at 3:38 AM Maulik Shah <mkshah@xxxxxxxxxxxxxx> wrote:
>>>> Add changes to invoke rpmh_flush() from within the cache_lock when the
>>>> data in the cache is dirty.
>>>>
>>>> This is done only if OSI is not supported in PSCI. If OSI is supported,
>>>> rpmh_flush() can be invoked when the last CPU is entering power
>>>> collapse (the deepest low power mode).
>>>>
>>>> Also remove "depends on COMPILE_TEST" for Kconfig option QCOM_RPMH so the
>>>> driver is only compiled for arm64 which supports psci_has_osi_support()
>>>> API.
>>>>
>>>> Signed-off-by: Maulik Shah <mkshah@xxxxxxxxxxxxxx>
>>>> Reviewed-by: Srinivas Rao L <lsrao@xxxxxxxxxxxxxx>
>>>> ---
>>>> drivers/soc/qcom/Kconfig | 2 +-
>>>> drivers/soc/qcom/rpmh.c | 33 ++++++++++++++++++++++-----------
>>>> 2 files changed, 23 insertions(+), 12 deletions(-)
>>>>
>>>> diff --git a/drivers/soc/qcom/Kconfig b/drivers/soc/qcom/Kconfig
>>>> index d0a73e7..2e581bc 100644
>>>> --- a/drivers/soc/qcom/Kconfig
>>>> +++ b/drivers/soc/qcom/Kconfig
>>>> @@ -105,7 +105,7 @@ config QCOM_RMTFS_MEM
>>>>
>>>> config QCOM_RPMH
>>>> bool "Qualcomm RPM-Hardened (RPMH) Communication"
>>>> - depends on ARCH_QCOM && ARM64 || COMPILE_TEST
>>>> + depends on ARCH_QCOM && ARM64
>>>> help
>>>> Support for communication with the hardened-RPM blocks in
>>>> Qualcomm Technologies Inc (QTI) SoCs. RPMH communication uses an
>>>> diff --git a/drivers/soc/qcom/rpmh.c b/drivers/soc/qcom/rpmh.c
>>>> index f28afe4..6a5a60c 100644
>>>> --- a/drivers/soc/qcom/rpmh.c
>>>> +++ b/drivers/soc/qcom/rpmh.c
>>>> @@ -12,6 +12,7 @@
>>>> #include <linux/module.h>
>>>> #include <linux/of.h>
>>>> #include <linux/platform_device.h>
>>>> +#include <linux/psci.h>
>>>> #include <linux/slab.h>
>>>> #include <linux/spinlock.h>
>>>> #include <linux/types.h>
>>>> @@ -158,6 +159,13 @@ static struct cache_req *cache_rpm_request(struct rpmh_ctrlr *ctrlr,
>>>> }
>>>>
>>>> unlock:
>>>> + if (ctrlr->dirty && !psci_has_osi_support()) {
>>>> + if (rpmh_flush(ctrlr)) {
>>>> + spin_unlock_irqrestore(&ctrlr->cache_lock, flags);
>>>> + return ERR_PTR(-EINVAL);
>>>> + }
>>>> + }
>>> It's been a long time since I looked in depth at RPMH, but at first
>>> glance this seems like it's going to be terrible for performance.
>>> You're going to send every entry again and again, aren't you? In
>>> other words in pseudo-code:
>>>
>>> 1. rpmh_write(addr=0x10, data=0x99);
>>> ==> writes on the bus (0x10, 0x99)
>>>
>>> 2. rpmh_write(addr=0x11, data=0xaa);
>>> ==> writes on the bus (0x10, 0x99)
>>> ==> writes on the bus (0x11, 0xaa)
>>>
>>> 3. rpmh_write(addr=0x10, data=0xbb);
>>> ==> writes on the bus (0x10, 0xbb)
>>> ==> writes on the bus (0x11, 0xaa)
>>>
>>> 4. rpmh_write(addr=0x12, data=0xcc);
>>> ==> writes on the bus (0x10, 0xbb)
>>> ==> writes on the bus (0x11, 0xaa)
>>> ==> writes on the bus (0x12, 0xcc)
>>>
>>> That seems bad.
>> Hi Doug,
>>
>> No, this is NOT how data is sent to RPMH/AOSS.
>> rpmh_flush() fills up the DRV-2 (HLOS) TCSes and makes them ready. The
>> HW takes care of sending the data in the SLEEP TCSes for each EL/DRV
>> when the last CPU is going into the deepest low power mode, and the
>> data in the WAKE TCSes while the first CPU is waking up.
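>>
>> To make that concrete, here is roughly what rpmh_flush() does today (a
>> simplified paraphrase of the driver code, not a verbatim copy). Note
>> that nothing is triggered here; the TCS contents are only programmed:
>>
>> int rpmh_flush(struct rpmh_ctrlr *ctrlr)
>> {
>>         struct cache_req *p;
>>         int ret;
>>
>>         /* Flush the cached batch requests into the sleep/wake TCSes */
>>         ret = flush_batch(ctrlr);
>>         if (ret)
>>                 return ret;
>>
>>         /* Program each cached single request; the HW sends the TCS
>>          * contents by itself at sleep/wake time. */
>>         list_for_each_entry(p, &ctrlr->cache, list) {
>>                 ret = send_single(ctrlr, RPMH_SLEEP_STATE,
>>                                   p->addr, p->sleep_val);
>>                 if (ret)
>>                         return ret;
>>                 ret = send_single(ctrlr, RPMH_WAKE_ONLY_STATE,
>>                                   p->addr, p->wake_val);
>>                 if (ret)
>>                         return ret;
>>         }
>>
>>         ctrlr->dirty = false;
>>         return 0;
>> }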
> Ah, I see. So for sleep / wake commands we never directly wait for
> them to go out on the bus while the system is awake. We just program
> them all into the RPMH hardware and they'll sit there and all get sent
> automatically when the last CPU goes into the deepest low power mode.
>
> ...so actually the whole point of OSI mode (from an RPMH perspective)
> is not to avoid transactions on the bus. It's just avoiding
> programming RPMH over and over again. Is that correct?
>
> ...and the reason we have all these data structures in the kernel is
> to keep track of auxiliary information about the things in the
> sleep/wake TCSs and make it easier to update bits of them?
correct.
>
>
>>> Why can't you just send the new request itself and
>>> forget adding it to the cache? In other words don't even call
>>> cache_rpm_request() in the non-OSI case and then in __rpmh_write()
>>> just send right away...
>> This won't work out. Let me explain why...
>>
>> We have 3 SLEEP and 3 WAKE TCSes from the below config:
>> qcom,tcs-config = <ACTIVE_TCS 2>,
>> <SLEEP_TCS 3>,
>> <WAKE_TCS 3>,
>> Each TCS has 16 commands in total, so 48 commands (16*3) can be filled
>> for each of the SLEEP and WAKE sets.
>>
>> Now let's take an example in pseudo-code of what could happen if we
>> don't cache and instead immediately fill up the TCS commands. The
>> triggering part doesn't happen, as explained above; this only fills up
>> the TCSes and makes them ready.
>>
>> Time-t0 (client_x invokes rpmh_write_batch() for the SLEEP set, a
>> batch of 3 commands)
>>
>> rpmh_write_batch(
>> addr=0x10, data=0x99, -> fills up CMD0 in SLEEP TCS_0
>> addr=0x11, data=0xaa, -> fills up CMD1 in SLEEP TCS_0
>> addr=0x10, data=0xbb); -> fills up CMD2 in SLEEP TCS_0
>>
>> Time-t1 (client_y invokes rpmh_write(), a single command)
>>
>> rpmh_write(
>> addr=0x12, data=0xcc, -> fills up CMD3 in SLEEP TCS_0
>> );
>>
>> Time-t2 (client_x invokes rpmh_invalidate(), which invalidates all
>> previous *batch requests* only)
>>
>> At this point, only CMD3 should remain in the TCS while CMD 0,1,2 need
>> to be freed up, since we expect a new batch request now.
>>
>> Since the driver didn't cache anything in the first place, it doesn't
>> know the details of the previous batch request: how many commands it
>> had, which commands were filled into the TCSes, and so on (basically
>> all the data required to free up only CMD 0,1,2 and not disturb CMD3).
>>
>> What's more:
>>
>> The new batch request after invalidation could be, say, 5 commands
>> instead of the 3 commands in the previous batch. So it will not fit in
>> CMD-0,1,2 and we might want to allocate from CMD-4,5,6,7,8 now.
>>
>> This will leave a hole in the TCS CMDs (each TCS has 16 commands in
>> total) unless we re-arrange everything. Also, we may want to fill up
>> batch requests first and then single requests; by not caching anything,
>> the driver doesn't know which one is a batch and which one is a single
>> request. The batch cache keeps exactly the metadata needed here (see
>> the sketch below).
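>>
>> As an illustration, this is roughly the per-batch metadata the driver
>> caches today (paraphrasing the struct in rpmh.c):
>>
>> struct batch_cache_req {
>>         struct list_head list;          /* node in ctrlr->batch_cache */
>>         int count;                      /* number of commands in batch */
>>         struct rpmh_request rpm_msgs[]; /* the cached command payloads */
>> };
>>
>> With this list, rpmh_invalidate() can drop exactly the batch entries
>> and re-fill the TCSes without touching single requests like CMD3.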
> OK, I got it now. I'll try to spend some time tomorrow looking over
> everything / testing with my new understanding.
>
>
>> There are other cases like the one below which also get impacted if the
>> driver doesn't cache anything...
>>
>> For example, when we don't have a dedicated ACTIVE TCS (if we have the
>> below config with an ACTIVE TCS count of 0):
>> qcom,tcs-config = <ACTIVE_TCS 0>,
>> <SLEEP_TCS 3>,
>> <WAKE_TCS 3>,
>>
>> Now, to send active data, the driver may re-use/re-purpose a few of the
>> SLEEP or WAKE TCSes as ACTIVE TCSes, and once the work is done they are
>> returned to the SLEEP/WAKE TCS pool accordingly. If the driver doesn't
>> cache, all the SLEEP and WAKE data is lost when one of the TCSes is
>> repurposed as an ACTIVE TCS. A rough sketch of this flow follows.
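>>
>> For illustration only; all helper names here are hypothetical, and the
>> real logic lives in the rpmh-rsc driver's TCS handling:
>>
>> static int send_active_borrowing_tcs(struct rsc_drv *drv,
>>                                      const struct tcs_request *msg)
>> {
>>         /* No dedicated ACTIVE TCS: borrow a WAKE TCS slot */
>>         int tcs_id = borrow_wake_tcs(drv);        /* hypothetical */
>>
>>         write_and_trigger_tcs(drv, tcs_id, msg);  /* hypothetical */
>>
>>         /*
>>          * The borrowed TCS now holds active data. Without the
>>          * driver-side cache there is nothing to restore the original
>>          * WAKE contents from when the TCS goes back to the pool.
>>          */
>>         return restore_wake_tcs_from_cache(drv, tcs_id); /* hypothetical */
>> }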
> Ah, interesting. I'll read the code more, but are you expecting this
> type of situation to work today, or is it theoretical for the future?
Yes, we have targets which need to work with this type of situation.
>
>
>> Hope the above explanation clarifies why caching is important and gives
>> a clear view of caching vs. not caching.
>>
>> Thanks,
>> Maulik
>>
>>> I tried to test this and my printouts didn't show anything actually
>>> happening in rpmh_flush(). Maybe I just don't have the right patches
>>> to exercise this properly...
>> It may be due to the missing interconnect patch series:
>> https://patchwork.kernel.org/project/linux-arm-msm/list/?series=247175
> I ended up pulling those in but I was still not seeing things work as
> I expected. I'll debug more tomorrow to see if it was my expectations
> that were wrong or if there was a real issue.
>
>
>>>> spin_unlock_irqrestore(&ctrlr->cache_lock, flags);
>>>>
>>>> return req;
>>>> @@ -285,26 +293,35 @@ int rpmh_write(const struct device *dev, enum rpmh_state state,
>>>> }
>>>> EXPORT_SYMBOL(rpmh_write);
>>>>
>>>> -static void cache_batch(struct rpmh_ctrlr *ctrlr, struct batch_cache_req *req)
>>>> +static int cache_batch(struct rpmh_ctrlr *ctrlr, struct batch_cache_req *req)
>>>> {
>>>> unsigned long flags;
>>>>
>>>> spin_lock_irqsave(&ctrlr->cache_lock, flags);
>>>> +
>>>> list_add_tail(&req->list, &ctrlr->batch_cache);
>>>> ctrlr->dirty = true;
>>>> +
>>>> + if (!psci_has_osi_support()) {
>>>> + if (rpmh_flush(ctrlr)) {
>>>> + spin_unlock_irqrestore(&ctrlr->cache_lock, flags);
>>>> + return -EINVAL;
>>>> + }
>>>> + }
>>>> +
>>>> spin_unlock_irqrestore(&ctrlr->cache_lock, flags);
>>>> +
>>>> + return 0;
>>>> }
>>>>
>>>> static int flush_batch(struct rpmh_ctrlr *ctrlr)
>>>> {
>>>> struct batch_cache_req *req;
>>>> const struct rpmh_request *rpm_msg;
>>>> - unsigned long flags;
>>>> int ret = 0;
>>>> int i;
>>>>
>>>> /* Send Sleep/Wake requests to the controller, expect no response */
>>>> - spin_lock_irqsave(&ctrlr->cache_lock, flags);
>>>> list_for_each_entry(req, &ctrlr->batch_cache, list) {
>>>> for (i = 0; i < req->count; i++) {
>>>> rpm_msg = req->rpm_msgs + i;
>>>> @@ -314,7 +331,6 @@ static int flush_batch(struct rpmh_ctrlr *ctrlr)
>>>> break;
>>>> }
>>>> }
>>>> - spin_unlock_irqrestore(&ctrlr->cache_lock, flags);
>>>>
>>>> return ret;
>>>> }
>>>> @@ -386,10 +402,8 @@ int rpmh_write_batch(const struct device *dev, enum rpmh_state state,
>>>> cmd += n[i];
>>>> }
>>>>
>>>> - if (state != RPMH_ACTIVE_ONLY_STATE) {
>>>> - cache_batch(ctrlr, req);
>>>> - return 0;
>>>> - }
>>>> + if (state != RPMH_ACTIVE_ONLY_STATE)
>>>> + return cache_batch(ctrlr, req);
>>> I'm curious: why not just do:
>>>
>>> if (state != RPMH_ACTIVE_ONLY_STATE && psci_has_osi_support()) {
>>> cache_batch(ctrlr, req);
>>> return 0;
>>> }
>>>
>>> ...AKA don't even cache it up if we're not in OSI mode. IIUC this
>>> would be a huge deal because with your code you're doing the whole
>>> RPMH transfer under "spin_lock_irqsave", right? And presumably RPMH
>>> transfers are somewhat slow, otherwise why did anyone come up with
>>> this whole caching / last-man-down scheme to start with?
>>>
>>> OK, it turned out to be at least slightly more complex because it
>>> appears that we're supposed to use rpmh_rsc_write_ctrl_data() for
>>> sleep/wake stuff and that they never do completions, but it really
>>> wasn't too hard. I prototyped it at <http://crrev.com/c/2080916>.
>>> Feel free to hijack that change if it looks like a starting point and
>>> if it looks like I'm not too confused.
>> I looked at this change and had thought of it earlier, but it won't
>> work out for the reasons in the above example.
>> I have thought of a few optimizations in rpmh_flush() to reduce its
>> time, if we *really* see any performance impact...
>>
>> Below is the high-level idea...
>> When rpmh_write_batch() is invoked for the SLEEP set, rpmh_flush()
>> currently updates both the SLEEP and WAKE TCS contents. However, we
>> could change it to update only the SLEEP TCS, and when
>> rpmh_write_batch() is invoked for the WAKE set, update only the WAKE
>> TCS contents. This could reduce the time by roughly ~50%; a sketch
>> follows.
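>>
>> A minimal sketch of that idea, assuming a state parameter is threaded
>> into the flush path (the name and signature are assumptions, not the
>> current driver API):
>>
>> static int rpmh_flush_state(struct rpmh_ctrlr *ctrlr,
>>                             enum rpmh_state state)
>> {
>>         struct cache_req *p;
>>         int ret;
>>
>>         /* Program only the TCS type matching the state being written */
>>         list_for_each_entry(p, &ctrlr->cache, list) {
>>                 if (state == RPMH_SLEEP_STATE)
>>                         ret = send_single(ctrlr, RPMH_SLEEP_STATE,
>>                                           p->addr, p->sleep_val);
>>                 else
>>                         ret = send_single(ctrlr, RPMH_WAKE_ONLY_STATE,
>>                                           p->addr, p->wake_val);
>>                 if (ret)
>>                         return ret;
>>         }
>>
>>         return 0;
>> }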
> OK, that's something to keep in mind. Agree that it doesn't have to
> be part of the initial change.
>
> -Doug
Thanks,
Maulik

--
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation