Re: [PATCH RFC v2 3/3] cxl/memdev: Register for and process CPER events

From: Dan Williams
Date: Mon Oct 30 2023 - 18:02:43 EST


Smita Koralahalli wrote:
[..]
> > +#define CXL_EVENT_HDR_FLAGS_REC_SEVERITY GENMASK(1, 0)
> > +int cxl_cper_event_call(struct notifier_block *nb, unsigned long action, void *data)
> > +{
> > + struct cxl_cper_notifier_data *nd = data;
> > + struct cxl_event_record_raw record = (struct cxl_event_record_raw) {
> > + .hdr.id = UUID_INIT(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
> > + };
> > + enum cxl_event_log_type log_type;
> > + struct cxl_memdev_state *mds;
> > + u32 hdr_flags;
> > +
> > + mds = container_of(nb, struct cxl_memdev_state, cxl_cper_nb);
> > +
> > + /* Need serial number for device identification */
> > + if (!(nd->rec->hdr.validation_bits & CPER_CXL_DEVICE_SN_VALID))
> > + return NOTIFY_DONE;
>
> For all the event records that I tested so far, this has never been
> true. That is CPER_CXL_DEVICE_SN_VALID is never set which might not log
> the records at all. Should we be bit more lenient here and include
> validating device_id (bdf) instead and check if cxlds exist?

Agree. While I do think those devices are out of spec given CXL mandates
a valid serial number, I think the robustness priciple applies and Linux
should rely on bdf information. I also expect that with MH-SLDs and
potentially other scenarios, a serial number may be duplicated so bdf is
more reliable in that dimension as well.

> > + /* FIXME endianess and bytes of serial number need verification */
> > + /* FIXME Should other values be checked? */
> > + if (memcmp(&mds->cxlds.serial, &nd->rec->hdr.dev_serial_num,
> > + sizeof(mds->cxlds.serial)))
> > + return NOTIFY_DONE;
> > +
> > + /* ensure record can always handle the full CPER provided data */
> > + BUILD_BUG_ON(sizeof(record) <
> > + (CPER_CXL_COMP_EVENT_LOG_SIZE + sizeof(record.hdr.id)));
> > +
> > + /*
> > + * UEFI v2.10 defines N.2.14 defines the CXL CPER record as not
> > + * including the uuid field.
> > + */
> > + memcpy(&record.hdr.length, &nd->rec->comp_event_log,
> > + CPER_CXL_REC_LEN(nd->rec));
>
> I'm doubtful this will do the job. I think we should copy into each
> field of struct cxl_event_record_hdr individually starting from length
> by pointer arithmetic (which is definitely bad, but I cannot think of a
> better way to do this) and then do memcpy for data field in struct
> cxl_event_record_raw..
>
> Any other suggestions would be helpful as well.
>
> I can make these changes and validate it on my end if that works..?

It sounds like you have a more readily available real world test
environment for this, so that sounds good to me.