RE: [GIT PULL 18/20] lightnvm: pblk: handle case when mw_cunits equals to 0

From: Dziegielewski, Marcin
Date: Tue Jun 05 2018 - 05:18:53 EST


> From: Javier Gonzalez [mailto:javier@xxxxxxxxxxxx]
> Sent: Tuesday, June 5, 2018 9:12 AM
> To: Dziegielewski, Marcin <marcin.dziegielewski@xxxxxxxxx>
> Cc: Matias BjÃrling <mb@xxxxxxxxxxx>; Jens Axboe <axboe@xxxxxx>; linux-
> block@xxxxxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx; Konopko, Igor J
> <igor.j.konopko@xxxxxxxxx>
> Subject: Re: [GIT PULL 18/20] lightnvm: pblk: handle case when mw_cunits
> equals to 0
>
> > On 4 Jun 2018, at 19.17, Dziegielewski, Marcin
> <marcin.dziegielewski@xxxxxxxxx> wrote:
> >
> >> From: Javier Gonzalez [mailto:javier@xxxxxxxxxxxx]
> >> Sent: Monday, June 4, 2018 1:16 PM
> >> To: Dziegielewski, Marcin <marcin.dziegielewski@xxxxxxxxx>
> >> Cc: Matias BjÃrling <mb@xxxxxxxxxxx>; Jens Axboe <axboe@xxxxxx>;
> >> linux- block@xxxxxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx; Konopko,
> >> Igor J <igor.j.konopko@xxxxxxxxx>
> >> Subject: Re: [GIT PULL 18/20] lightnvm: pblk: handle case when
> >> mw_cunits equals to 0
> >>
> >>
> >>> On 4 Jun 2018, at 13.11, Dziegielewski, Marcin
> >> <marcin.dziegielewski@xxxxxxxxx> wrote:
> >>>> -----Original Message-----
> >>>> From: Javier Gonzalez [mailto:javier@xxxxxxxxxxxx]
> >>>> Sent: Monday, June 4, 2018 12:22 PM
> >>>> To: Dziegielewski, Marcin <marcin.dziegielewski@xxxxxxxxx>
> >>>> Cc: Matias BjÃrling <mb@xxxxxxxxxxx>; Jens Axboe <axboe@xxxxxx>;
> >>>> linux- block@xxxxxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx;
> >>>> Konopko, Igor J <igor.j.konopko@xxxxxxxxx>
> >>>> Subject: Re: [GIT PULL 18/20] lightnvm: pblk: handle case when
> >>>> mw_cunits equals to 0
> >>>>
> >>>>> On 4 Jun 2018, at 12.09, Dziegielewski, Marcin
> >>>> <marcin.dziegielewski@xxxxxxxxx> wrote:
> >>>>> Frist of all I want to say sorry for late response - I was on holiday.
> >>>>>
> >>>>>> From: Javier Gonzalez [mailto:javier@xxxxxxxxxxxx]
> >>>>>> Sent: Monday, May 28, 2018 1:03 PM
> >>>>>> To: Matias BjÃrling <mb@xxxxxxxxxxx>
> >>>>>> Cc: Jens Axboe <axboe@xxxxxx>; linux-block@xxxxxxxxxxxxxxx;
> >>>>>> linux- kernel@xxxxxxxxxxxxxxx; Dziegielewski, Marcin
> >>>>>> <marcin.dziegielewski@xxxxxxxxx>; Konopko, Igor J
> >>>>>> <igor.j.konopko@xxxxxxxxx>
> >>>>>> Subject: Re: [GIT PULL 18/20] lightnvm: pblk: handle case when
> >>>>>> mw_cunits equals to 0
> >>>>>>
> >>>>>>> On 28 May 2018, at 10.58, Matias BjÃrling <mb@xxxxxxxxxxx> wrote:
> >>>>>>>
> >>>>>>> From: Marcin Dziegielewski <marcin.dziegielewski@xxxxxxxxx>
> >>>>>>>
> >>>>>>> Some devices can expose mw_cunits equal to 0, it can cause
> >>>>>>> creation of too small write buffer and cause performance to drop
> >>>>>>> on write workloads.
> >>>>>>>
> >>>>>>> To handle that, we use the default value for MLC and beacause it
> >>>>>>> covers both 1.2 and 2.0 OC specification, setting up mw_cunits
> >>>>>>> in
> >>>>>>> nvme_nvm_setup_12 function isn't longer necessary.
> >>>>>>>
> >>>>>>> Signed-off-by: Marcin Dziegielewski
> >>>>>>> <marcin.dziegielewski@xxxxxxxxx>
> >>>>>>> Signed-off-by: Igor Konopko <igor.j.konopko@xxxxxxxxx>
> >>>>>>> Signed-off-by: Matias BjÃrling <mb@xxxxxxxxxxx>
> >>>>>>> ---
> >>>>>>> drivers/lightnvm/pblk-init.c | 10 +++++++++-
> >>>>>>> drivers/nvme/host/lightnvm.c | 1 -
> >>>>>>> 2 files changed, 9 insertions(+), 2 deletions(-)
> >>>>>>>
> >>>>>>> diff --git a/drivers/lightnvm/pblk-init.c
> >>>>>>> b/drivers/lightnvm/pblk-init.c index d65d2f972ccf..0f277744266b
> >>>>>>> 100644
> >>>>>>> --- a/drivers/lightnvm/pblk-init.c
> >>>>>>> +++ b/drivers/lightnvm/pblk-init.c
> >>>>>>> @@ -356,7 +356,15 @@ static int pblk_core_init(struct pblk *pblk)
> >>>>>>> atomic64_set(&pblk->nr_flush, 0);
> >>>>>>> pblk->nr_flush_rst = 0;
> >>>>>>>
> >>>>>>> - pblk->pgs_in_buffer = geo->mw_cunits * geo->all_luns;
> >>>>>>> + if (geo->mw_cunits) {
> >>>>>>> + pblk->pgs_in_buffer = geo->mw_cunits * geo-
> >>> all_luns;
> >>>>>>> + } else {
> >>>>>>> + pblk->pgs_in_buffer = (geo->ws_opt << 3) * geo-
> >>> all_luns;
> >>>>>>> + /*
> >>>>>>> + * Some devices can expose mw_cunits equal to 0, so
> >> let's
> >>>>>> use
> >>>>>>> + * here default safe value for MLC.
> >>>>>>> + */
> >>>>>>> + }
> >>>>>>>
> >>>>>>> pblk->min_write_pgs = geo->ws_opt * (geo->csecs /
> PAGE_SIZE);
> >>>>>>> max_write_ppas = pblk->min_write_pgs * geo->all_luns; diff
> >>>>>>> --git a/drivers/nvme/host/lightnvm.c
> >>>>>>> b/drivers/nvme/host/lightnvm.c index
> >>>>>>> 41279da799ed..c747792da915 100644
> >>>>>>> --- a/drivers/nvme/host/lightnvm.c
> >>>>>>> +++ b/drivers/nvme/host/lightnvm.c
> >>>>>>> @@ -338,7 +338,6 @@ static int nvme_nvm_setup_12(struct
> >>>>>> nvme_nvm_id12
> >>>>>>> *id,
> >>>>>>>
> >>>>>>> geo->ws_min = sec_per_pg;
> >>>>>>> geo->ws_opt = sec_per_pg;
> >>>>>>> - geo->mw_cunits = geo->ws_opt << 3; /* default to MLC
> >> safe values
> >>>>>> */
> >>>>>>> /* Do not impose values for maximum number of open blocks as it
> is
> >>>>>>> * unspecified in 1.2. Users of 1.2 must be aware of this and
> >>>>>>> eventually
> >>>>>>> --
> >>>>>>> 2.11.0
> >>>>>>
> >>>>>> By doing this, 1.2 future users (beyond pblk), will fail to have
> >>>>>> a valid mw_cunits value. It's ok to deal with the 0 case in pblk,
> >>>>>> but I believe that we should have the default value for 1.2 either
> way.
> >>>>>
> >>>>> I'm not sure. From my understanding, setting of default value was
> >>>>> workaround for pblk case, am I right ?.
> >>>>
> >>>> The default value covers the MLC case directly at the lightnvm
> >>>> layer, as opposed to doing it directly in pblk. Since pblk is the
> >>>> only user now, you can argue that all changes in the lightnvm layer
> >>>> are to solve pblk issues, but the idea is that the geometry should be
> generic.
> >>>>
> >>>>> In my opinion any user of 1.2
> >>>>> spec should be aware that there is not mw_cunit value. From my
> >>>>> point of view, leaving here 0 (and decision what do with it to
> >>>>> lightnvm
> >>>>> user) is more safer way, but maybe I'm wrong. I believe that it is
> >>>>> topic to wider discussion with maintainers.
> >>>>
> >>>> 1.2 and 2.0 have different geometries, but when we designed the
> >>>> common nvm_geo structure, the idea was to abstract both specs and
> >>>> allow the upper layers to use the geometry transparently.
> >>>>
> >>>> Specifically in pblk, I would prefer to keep it in such a way that
> >>>> we don't need to media specific policies (e.g., set default values
> >>>> for MLC memories), as a general design principle. We already do
> >>>> some geometry version checks to avoid dereferencing unnecessary
> >>>> pointers on the fast path, which I would eventually like to remove.
> >>>
> >>> Ok, now I understand your point of view and agree with that, I will
> >>> prepare second version of this patch without this change.
> >>
> >> Sounds good.
> >>
> >>> Thanks for
> >>> the clarification.
> >>
> >> Sure :)
> >>
> >>>>>> A more generic way of doing this would be to have a default value
> >>>>>> for
> >>>>>> 2.0 too, in case mw_cunits is reported as 0.
> >>>>>
> >>>>> Since 0 is correct value and users can make different decisions
> >>>>> based on it, I think we shouldn't overwrite it by default value.
> >>>>> Is it make sense?
> >>>>
> >>>> Here I meant at a pblk level - I should have specified it. At the
> >>>> geometry level, we should not change it.
> >>>>
> >>>> The case I am thinking is if mw_cuints repoints 0, but ws_min > 0.
> >>>> In this case, we still need a host side buffer to serve < ws_min
> >>>> I/Os, even though the device does not require the buffer to guarantee
> reads.
> >>>
> >>> Oh, ok now we are on the same page. In this patch I was trying to
> >>> address such case. Do you have other idea how to do it or here are
> >>> you thinking only on value of default variable?
> >>
> >> If doing this, I guess that something in the line of what you did
> >> with increasing the size of the write buffer via a module parameter.
> >> For example, checking if the size of the write buffer based on
> >> mw_cuints is enough to cover ws_min, which normally would only be an
> >> issue when mw_cuints == 0 or when the number of PUs used for the pblk
> >> instance is very small and mw_cuints < nr_luns * ws_min.
> >
> >
> > I see here two cases:
> > - when mw_cunits > 0 buffer size should have number of entries at
> > least max(mw_cunits, ws_min) * nr_luns and here we are taking care of
> > both cases mw_cunits > ws_min and mw_cunits < ws_min.
> > - when mw_cunit == 0 buffer size should have number of entries at
> > least ws_min * nr _luns and we can use the same puseudocode as above.
> >
>
> Agree.
>
> > Do you see any other case? Could you clarify second case mentioned by
> > you or maybe did you mean opposite case? If yes, I believe that above
> > pseudo code will handle such case too.
> >
>
> Yes, it is the same case.
>
> One thing to consider is whether the buffer should at least be ws_opt *
> nr_luns for performance reasons. Since the write thread will always try to
> send ws_opt, in the case that ws_opt > ws_min, then a buffer size of
> ws_min * nr_luns will not make use of the whole parallelism exposed by the
> device.
>

Agree. After I had sent last email I also thought that should be ws_opt instead of ws_min.

> Therefore, I would probably go for ws_opt * nr_luns as the default value
> when mw_cuints * nr_luns < ws_opt * nr_luns (which covers mw_cuints ==
> 0), and then keep ws_min * nr_luns as the minimum requirement when
> setting the buffer size manually.

Sounds good.
>
> Does this cover your use case?
>

Yes, I think that cover all cases related to this topic. I will prepare patch in the afternoon.

Many thanks for perfect cooperation!
Marcin
> >>>>>> Javier
> >>>>>
> >>>>> Thanks,
> >>>>> Marcin
> >>>>
> >>>> Javier
> >>> Thanks,
> >>> Marcin
> > Thanks!,
> > Marcin