Re: [RFC PATCH v8 01/10] ras: scrub: Add scrub subsystem
From: Borislav Petkov
Date: Fri May 10 2024 - 05:26:21 EST
On Thu, May 09, 2024 at 03:59:29PM -0700, Dan Williams wrote:
> No, at a minimum that's a layering violation. This is a generic library
> facility that should not care if it is being called for a CXL device or
> an ACPI device.
Really?
Because this looks like creating a subsystem for those two newfangled
scrubbing functionalities which will be present in CXL devices and
by that ACPI RAS2 thing.
Because we have a *lot* of hw scrubbing functionality already. Just do:
git grep "scrub"
Some of it controls hw scrubbing. If this is a generic facility, does
this mean that all those older scrubbers should be converted to it?
Or is this thing going to support only new stuff? I.e., we want to
disable our scrubbers when doing performance-sensitive workloads and
reenable them after that but we don't care about old systems?
And all that other bla about controlling scrubbers from userspace.
So which is it?
> I think it works for x86 drivers because the functionality in those
> modules is wholly contained within that one module. This scrub module is
> a service library for other modules.
Well, you have that thing in EDAC. edac_core.ko is that service module
and the chipset-specific drivers - at least on x86 - use a match_id to
load only on the systems they should load on.
If I had a BIOS table which had "EDAC" in it, I won't load edac_core.ko
either but there isn't one.
> It is functionally the wrong place to do the check. When module_init()
> fails it causes not only the current module to be unloaded but any
> dependent modules will also fail to load.
See above. I'm under the assumption that this is using two methods to
detect scrubbing functionality.
> Let's take an example of the CXL driver wanting to register with this
> scrub interface to support the capability that *might* be available on
> some CXL devices. The cxl_pci.ko module, that houses cxl_pci_driver,
> grows a call to scrub_device_register(). That scrub_device_register()
> call is statically present in cxl_pci.ko so that when cxl_pci.ko loads
> symbol resolution requires scrub.ko to load.
>
> Neither of those modules (cxl_pci.ko or scrub.ko) load automatically.
> Either udev loads cxl_pci.ko because it sees a device that matches
> cxl_mem_pci_tbl, or the user manually insmods those modules because they
> think they know better. No memory wasted unless the user explicitly asks
> for memory to be wasted.
The scrub.ko goes and asks the system: "Do you have a CXL device with
scrubbing functionality?" "Yes: load." "No: ok, won't."
> If no CXL devices in the system have scrub capabilities, great, then
> scrub_device_register() will never be called.
>
> Now, if memory_scrub_control_init() did its own awkward and redundant
> CXL scan, and fails with "no CXL scrub capable devices found" it would
> not only block scrub.ko from loading, but also cxl_pci.ko since
> cxl_pci.ko needs to resolve that symbol to load.
Why would it fail the scan?
Isn't this fancy GET_SUPPORTED_FEATURES command giving you all info you
need?
> I would not say "no" to a generic facility that patches out module
> dependencies until the first call, just not sure the return on
> investment would be worth it.
>From the discussion so far yeah, I think you're right, it is not worth
the effort.
> Lastly I think drivers based on ACPI tables are awkward. They really
> need to have an ACPI device to attach so that typical automatic Linux
> module loading machinery can be used. The fact this function is a
> subsys_initcall() is a red-flag since nothing should be depending on the
> load order of a little driver to control scrub parameters.
Yeah, it is becoming a mess before it has even started.
So I don't mind if such drivers get loaded as long as doing this is the
best we can do given the situation. What gets me up the palms, as they
say in .de, is "just because" and "look, the others do it too."
Thx.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette