RE: [PATCH 14/26] cxl/region: Read existing extents on region creation

From: Dan Williams
Date: Mon May 06 2024 - 14:34:52 EST


ira.weiny@ wrote:
> From: Navneet Singh <navneet.singh@xxxxxxxxx>
>
> Dynamic capacity device extents may be left in an accepted state on a
> device due to an unexpected host crash. In this case creation of a new
> region on top of the DC partition (region) is expected to expose those
> extents for continued use.
>
> Once all endpoint decoders are part of a region and the region is being
> realized read the device extent list. For ease of review, this patch
> stops after reading the extent list and leaves realization of the region
> extents to a future patch.
>
> Signed-off-by: Navneet Singh <navneet.singh@xxxxxxxxx>
> Co-developed-by: Ira Weiny <ira.weiny@xxxxxxxxx>
> Signed-off-by: Ira Weiny <ira.weiny@xxxxxxxxx>
>
> ---
> Changes for v1:
> [iweiny: remove extent list xarray]
> [iweiny: Update spec references to 3.1]
> [iweiny: use struct range in extents]
> [iweiny: remove all reference tracking and let regions track extents
> through the extent devices.]
> [djbw/Jonathan/Fan: move extent tracking to endpoint decoders]
> ---
> drivers/cxl/core/core.h | 9 +++
> drivers/cxl/core/mbox.c | 192 ++++++++++++++++++++++++++++++++++++++++++++++
> drivers/cxl/core/region.c | 29 +++++++
> drivers/cxl/cxlmem.h | 49 ++++++++++++
> 4 files changed, 279 insertions(+)
>
> diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
> index 91abeffbe985..119b12362977 100644
> --- a/drivers/cxl/core/core.h
> +++ b/drivers/cxl/core/core.h
> @@ -4,6 +4,8 @@
> #ifndef __CXL_CORE_H__
> #define __CXL_CORE_H__
>
> +#include <cxlmem.h>
> +
> extern const struct device_type cxl_nvdimm_bridge_type;
> extern const struct device_type cxl_nvdimm_type;
> extern const struct device_type cxl_pmu_type;
> @@ -28,6 +30,8 @@ void cxl_decoder_kill_region(struct cxl_endpoint_decoder *cxled);
> int cxl_region_init(void);
> void cxl_region_exit(void);
> int cxl_get_poison_by_endpoint(struct cxl_port *port);
> +int cxl_ed_add_one_extent(struct cxl_endpoint_decoder *cxled,
> + struct cxl_dc_extent *dc_extent);

There are already functions called "cxled_", so lets not invent the
"cxl_ed_" prefix.

[..]
> diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
> index 58b31fa47b93..9e33a0976828 100644
> --- a/drivers/cxl/core/mbox.c
> +++ b/drivers/cxl/core/mbox.c
> @@ -870,6 +870,53 @@ int cxl_enumerate_cmds(struct cxl_memdev_state *mds)
> }
> EXPORT_SYMBOL_NS_GPL(cxl_enumerate_cmds, CXL);
>
> +static int cxl_validate_extent(struct cxl_memdev_state *mds,
> + struct cxl_dc_extent *dc_extent)
> +{
> + struct device *dev = mds->cxlds.dev;
> + uint64_t start, len;
> +
> + start = le64_to_cpu(dc_extent->start_dpa);
> + len = le64_to_cpu(dc_extent->length);
> +
> + /* Extents must not cross region boundary's */
> + for (int i = 0; i < mds->nr_dc_region; i++) {
> + struct cxl_dc_region_info *dcr = &mds->dc_region[i];
> +
> + if (dcr->base <= start &&
> + (start + len) <= (dcr->base + dcr->decode_len)) {
> + dev_dbg(dev, "DC extent DPA %#llx - %#llx (DCR:%d:%#llx)\n",
> + start, start + len - 1, i, start - dcr->base);
> + return 0;
> + }
> + }
> +
> + dev_err_ratelimited(dev,
> + "DC extent DPA %#llx - %#llx is not in any DC region\n",
> + start, start + len - 1);

If the goal is give the admin an answer to the question "hey what
happened to the capacity I was expecting?", then this should include the
tag. Also, this is a warning, not an error, right? I.e. the driver
continues with the validated extents.

> + return -EINVAL;

This value is not returned up the stack, however, I expect EINVAL on
user input errors. For misaligned device-internal addressing, ENXIO is
more appropriate.

> +}
> +
> +static bool cxl_dc_extent_in_ed(struct cxl_endpoint_decoder *cxled,
> + struct cxl_dc_extent *extent)

How about cxled_contains_extent()?

There's no other "extents" besides "dc_extents" in the driver, and once
a symbol name goes over 2 underscores it starts to be too many tokens.

> +{
> + uint64_t start = le64_to_cpu(extent->start_dpa);
> + uint64_t length = le64_to_cpu(extent->length);
> + struct range ext_range = (struct range){
> + .start = start,
> + .end = start + length - 1,
> + };
> + struct range ed_range = (struct range) {
> + .start = cxled->dpa_res->start,
> + .end = cxled->dpa_res->end,
> + };
> +
> + dev_dbg(&cxled->cxld.dev, "Checking ED (%pr) for extent DPA:%#llx LEN:%#llx\n",
> + cxled->dpa_res, start, length);

ED is not a standalone abbreviation anywhere else, it's either "cxled" or
"endpoint decoder". I am open to renames, but not mixed names.

For this one the decoder name is already in the printout, so no real
need to redundantly mention "ED".

Lastly, I think continued use of 'struct range' is begging for a new
enlightened format specifier. I am thinking "%par" since these things
are usually some kind of physical address, and I do not see an easy way
to extend the existing "%pr/%pR" to accommodate ranges.

> +
> + return range_contains(&ed_range, &ext_range);
> +}
> +
> void cxl_event_trace_record(const struct cxl_memdev *cxlmd,
> enum cxl_event_log_type type,
> enum cxl_event_type event_type,
> @@ -973,6 +1020,15 @@ static int cxl_clear_event_record(struct cxl_memdev_state *mds,
> return rc;
> }
>
> +static struct cxl_memdev_state *
> +cxled_to_mds(struct cxl_endpoint_decoder *cxled)
> +{
> + struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
> + struct cxl_dev_state *cxlds = cxlmd->cxlds;
> +
> + return container_of(cxlds, struct cxl_memdev_state, cxlds);
> +}

Looks good, makes me wonder if a cxled_to_devstate() would be a net positive
reduction in code. I think most of the current cxled_to_memdev(), just
do cxlmd->cxlds with the result.

> static void cxl_mem_get_records_log(struct cxl_memdev_state *mds,
> enum cxl_event_log_type type)
> {
> @@ -1406,6 +1462,142 @@ int cxl_dev_dynamic_capacity_identify(struct cxl_memdev_state *mds)
> }
> EXPORT_SYMBOL_NS_GPL(cxl_dev_dynamic_capacity_identify, CXL);
>
> +static int cxl_dev_get_dc_extent_cnt(struct cxl_memdev_state *mds,
> + unsigned int *extent_gen_num)

I know the spec has this behavior where asking for zero extents returns
the total pending, but that does not really justify having this extra
step before retrieving extents.

> +{
> + struct cxl_mbox_get_dc_extent_in get_dc_extent;
> + struct cxl_mbox_get_dc_extent_out dc_extents;

The more I look at these patches the more I think a s/dc_extent/extent/
change is warranted to cut down on the visual token parsing reading this
code.

> + struct cxl_mbox_cmd mbox_cmd;
> + unsigned int count;
> + int rc;
> +
> + get_dc_extent = (struct cxl_mbox_get_dc_extent_in) {
> + .extent_cnt = cpu_to_le32(0),
> + .start_extent_index = cpu_to_le32(0),
> + };
> +
> + mbox_cmd = (struct cxl_mbox_cmd) {
> + .opcode = CXL_MBOX_OP_GET_DC_EXTENT_LIST,
> + .payload_in = &get_dc_extent,
> + .size_in = sizeof(get_dc_extent),
> + .size_out = sizeof(dc_extents),
> + .payload_out = &dc_extents,
> + .min_out = 1,
> + };
> +
> + rc = cxl_internal_send_cmd(mds, &mbox_cmd);
> + if (rc < 0)
> + return rc;
> +
> + count = le32_to_cpu(dc_extents.total_extent_cnt);
> + *extent_gen_num = le32_to_cpu(dc_extents.extent_list_num);

Setting aside that this function likely serves no incremental purpose,
why is the number of extents stored in a variable called "gen_num"?

> +
> + return count;
> +}
> +
> +static int cxl_dev_get_dc_extents(struct cxl_endpoint_decoder *cxled,
> + unsigned int start_gen_num,
> + unsigned int exp_cnt)
> +{
> + struct cxl_memdev_state *mds = cxled_to_mds(cxled);
> + unsigned int start_index, total_read;
> + struct device *dev = mds->cxlds.dev;
> + struct cxl_mbox_cmd mbox_cmd;
> +
> + struct cxl_mbox_get_dc_extent_out *dc_extents __free(kfree) =
> + kvmalloc(mds->payload_size, GFP_KERNEL);
> + if (!dc_extents)
> + return -ENOMEM;
> +
> + total_read = 0;
> + start_index = 0;
> + do {
> + unsigned int nr_ext, total_extent_cnt, gen_num;
> + struct cxl_mbox_get_dc_extent_in get_dc_extent;
> + int rc;
> +
> + get_dc_extent = (struct cxl_mbox_get_dc_extent_in) {
> + .extent_cnt = cpu_to_le32(exp_cnt - start_index),

Shouldn't this be something like:

.extent_cnt = cpu_to_le32(start_index ? remaining : 1),

..where @remaining is initialized at the end of the first iteration?

> + .start_extent_index = cpu_to_le32(start_index),
> + };
> +
> + mbox_cmd = (struct cxl_mbox_cmd) {
> + .opcode = CXL_MBOX_OP_GET_DC_EXTENT_LIST,
> + .payload_in = &get_dc_extent,
> + .size_in = sizeof(get_dc_extent),
> + .size_out = mds->payload_size,
> + .payload_out = dc_extents,
> + .min_out = 1,
> + };
> +
> + rc = cxl_internal_send_cmd(mds, &mbox_cmd);
> + if (rc < 0)
> + return rc;
> +
> + nr_ext = le32_to_cpu(dc_extents->ret_extent_cnt);

It occurs to me that usage of "nr_" outnumbers "_cnt" in the driver, lets
stick to the predominate style and just use "nr_" for symbol names that
represent counts and just call this nr_returned, or similar.

> + total_read += nr_ext;
> + total_extent_cnt = le32_to_cpu(dc_extents->total_extent_cnt);
> + gen_num = le32_to_cpu(dc_extents->extent_list_num);
> +
> + dev_dbg(dev, "Get extent list count:%d generation Num:%d\n",
> + total_extent_cnt, gen_num);
> +
> + if (gen_num != start_gen_num || exp_cnt != total_extent_cnt) {
> + dev_err(dev, "Possible incomplete extent list; gen %u != %u : cnt %u != %u\n",
> + gen_num, start_gen_num, exp_cnt, total_extent_cnt);
> + return -EIO;

Why fail? When the generation number has changed I would only hope that
means that the number of extents in the list has gone up, not that
previously retrieved extents have been invalidated.

So a generation number change event likely just means to retry the
retrieval starting from the end of the last generation.

> + }
> +
> + for (int i = 0; i < nr_ext ; i++) {
> + dev_dbg(dev, "Processing extent %d/%d\n",
> + start_index + i, exp_cnt);
> + rc = cxl_validate_extent(mds, &dc_extents->extent[i]);
> + if (rc)
> + continue;
> + if (!cxl_dc_extent_in_ed(cxled, &dc_extents->extent[i]))
> + continue;
> + rc = cxl_ed_add_one_extent(cxled, &dc_extents->extent[i]);
> + if (rc)
> + return rc;

I would rather this patch just claim to only validate all present
extents rather than pretend to add it. I.e. defer
cxl_ed_add_one_extent() to be defined and called later. When it comes
back a name with less tokens like cxled_add_extent() would be nice.
"one" is already assumed by non-plural "extent".

> + }
> +
> + start_index += nr_ext;
> + } while (exp_cnt > total_read);
> +
> + return 0;
> +}
> +
> +/**
> + * cxl_read_dc_extents() - Read any existing extents
> + * @cxled: Endpoint decoder which is part of a region
> + *
> + * Issue the Get Dynamic Capacity Extent List command to the device
> + * and add any existing extents found which belong to this decoder.
> + *
> + * Return: 0 if command was executed successfully, -ERRNO on error.
> + */
> +int cxl_read_dc_extents(struct cxl_endpoint_decoder *cxled)
> +{
> + struct cxl_memdev_state *mds = cxled_to_mds(cxled);
> + struct device *dev = mds->cxlds.dev;
> + unsigned int extent_gen_num;
> + int rc;
> +
> + if (!cxl_dcd_supported(mds)) {

Why is "dcd_supported" being checked again so deep in the stack? How
does an upper layer get this far into the driver without something
already noticing that dcd support is not present?

[..]
> diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> index 0d7b09a49dcf..3e563ab29afe 100644
> --- a/drivers/cxl/core/region.c
> +++ b/drivers/cxl/core/region.c
> @@ -1450,6 +1450,13 @@ static int cxl_region_validate_position(struct cxl_region *cxlr,
> return 0;
> }
>
> +/* Callers are expected to ensure cxled has been attached to a region */
> +int cxl_ed_add_one_extent(struct cxl_endpoint_decoder *cxled,
> + struct cxl_dc_extent *dc_extent)
> +{
> + return 0;
> +}
> +
> static int cxl_region_attach_position(struct cxl_region *cxlr,
> struct cxl_root_decoder *cxlrd,
> struct cxl_endpoint_decoder *cxled,
> @@ -2773,6 +2780,22 @@ static int devm_cxl_add_pmem_region(struct cxl_region *cxlr)
> return rc;
> }
>
> +static int cxl_region_read_extents(struct cxl_region *cxlr)
> +{
> + struct cxl_region_params *p = &cxlr->params;
> + int i;
> +
> + for (i = 0; i < p->nr_targets; i++) {
> + int rc;
> +
> + rc = cxl_read_dc_extents(p->targets[i]);

Per comment above, the targets should have already been checked for dcd
support before being added to the region.


> + if (rc)
> + return rc;
> + }
> +
> + return 0;
> +}
> +
> static void cxlr_dax_unregister(void *_cxlr_dax)
> {
> struct cxl_dax_region *cxlr_dax = _cxlr_dax;
> @@ -2807,6 +2830,12 @@ static int devm_cxl_add_dax_region(struct cxl_region *cxlr)
> dev_dbg(&cxlr->dev, "%s: register %s\n", dev_name(dev->parent),
> dev_name(dev));
>
> + if (cxlr->mode == CXL_REGION_DC) {
> + rc = cxl_region_read_extents(cxlr);

devm_cxl_add_dax_region() happens way after the region parameters have
been validated. I would have expected that initial extent list
validation happens earlier during region attach. This reorganization
also more naturally fits the interleave case where there will need be
cross device-validation before cxl_region_probe() runs.

[..]