Re: [PATCH] kobject: record and send PREV_SEQNUM with uevents

From: Peter Rajnoha
Date: Wed Oct 18 2017 - 03:22:03 EST


On 10/17/2017 03:26 PM, Greg KH wrote:
> On Tue, Oct 17, 2017 at 12:04:41PM +0200, Peter Rajnoha wrote:
>> On 10/16/2017 04:49 PM, Greg KH wrote:
>>> On Mon, Oct 16, 2017 at 04:00:26PM +0200, Peter Rajnoha wrote:
>>>> On 10/16/2017 03:55 PM, Greg KH wrote:
>>>>> On Mon, Oct 16, 2017 at 02:35:44PM +0200, Peter Rajnoha wrote:
>>>>>> Record last uevent seqnum that was used with kobject's uevent and send
>>>>>> it with next uevent as PREV_SEQNUM=<num> variable.
>>>>>>
>>>>>> This enhances the way we are able to track and handle uevents in userspace
>>>>>> if we are trying to catch up with all the uevents that were generated
>>>>>> while we were not listening or the uevents which were not delivered.
>>>>>> This may happen if the userspace listener is not running yet when the
>>>>>> uevent is triggered or there's a period of time when it's not listening
>>>>>> (e.g. because the userspace listener is being restarted while a new
>>>>>> uevent fires).
>>>>>>
>>>>>> A userspace listener can compare the last uevent seqnum it knows about
>>>>>> with the last uevent seqnum the kernel reports through uevents delivered
>>>>>> to userspace, both genuine and synthetic ones (the ones generated by
>>>>>> writing uevent action name to /sys/.../uevent file). Then, if needed,
>>>>>> userspace may execute appropriate actions based on that. We don't need
>>>>>> to reevaluate and recheck all items, instead, we can concentrate only
>>>>>> on items for which we really and actually need to reflect changed state
>>>>>> in kernel while the userspace listener was not listening for uevents.
>>>>>>
>>>>>> Signed-off-by: Peter Rajnoha <prajnoha@xxxxxxxxxx>
>>>>>> ---
>>>>>> include/linux/kobject.h | 1 +
>>>>>> lib/kobject.c | 1 +
>>>>>> lib/kobject_uevent.c | 15 +++++++++++++--
>>>>>> 3 files changed, 15 insertions(+), 2 deletions(-)
>>>>>>
>>>>>> diff --git a/include/linux/kobject.h b/include/linux/kobject.h
>>>>>> index e0a6205caa71..ee6db28b6186 100644
>>>>>> --- a/include/linux/kobject.h
>>>>>> +++ b/include/linux/kobject.h
>>>>>> @@ -73,6 +73,7 @@ struct kobject {
>>>>>> #ifdef CONFIG_DEBUG_KOBJECT_RELEASE
>>>>>> struct delayed_work release;
>>>>>> #endif
>>>>>> + u64 last_uevent_seqnum;
>>>>>> unsigned int state_initialized:1;
>>>>>> unsigned int state_in_sysfs:1;
>>>>>> unsigned int state_add_uevent_sent:1;
>>>>>> diff --git a/lib/kobject.c b/lib/kobject.c
>>>>>> index 763d70a18941..2cc809f10ae7 100644
>>>>>> --- a/lib/kobject.c
>>>>>> +++ b/lib/kobject.c
>>>>>> @@ -190,6 +190,7 @@ static void kobject_init_internal(struct kobject *kobj)
>>>>>> return;
>>>>>> kref_init(&kobj->kref);
>>>>>> INIT_LIST_HEAD(&kobj->entry);
>>>>>> + kobj->last_uevent_seqnum = 0;
>>>>>> kobj->state_in_sysfs = 0;
>>>>>> kobj->state_add_uevent_sent = 0;
>>>>>> kobj->state_remove_uevent_sent = 0;
>>>>>> diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
>>>>>> index f237a09a5862..f1cbed06a395 100644
>>>>>> --- a/lib/kobject_uevent.c
>>>>>> +++ b/lib/kobject_uevent.c
>>>>>> @@ -454,8 +454,19 @@ int kobject_uevent_env(struct kobject *kobj, enum kobject_action action,
>>>>>> }
>>>>>>
>>>>>> mutex_lock(&uevent_sock_mutex);
>>>>>> - /* we will send an event, so request a new sequence number */
>>>>>> - retval = add_uevent_var(env, "SEQNUM=%llu", (unsigned long long)++uevent_seqnum);
>>>>>> + /*
>>>>>> + * We will send an event, so request a new sequence number.
>>>>>> + * Also, record the number for next uevent's PREV_SEQNUM.
>>>>>> + */
>>>>>> + kobj->last_uevent_seqnum = ++uevent_seqnum;
>>>>>> + retval = add_uevent_var(env, "SEQNUM=%llu",
>>>>>> + (unsigned long long)uevent_seqnum);
>>>>>> + if (retval) {
>>>>>> + mutext_unlock(&uevent_sock_mutex);
>>>>>> + goto exit;
>>>>>> + }
>>>>>> + retval = add_uevent_var(env, "PREV_SEQNUM=%llu",
>>>>>> + (unsigned long long)kobj->last_uevent_seqnum);
>>>>>
>>>>> I'm confused, why are we storing "last_uevent_seqnum" here at all, if it
>>>>> is only just the current one-1? What else do you do with that value?
>>>> I'm sorry, I've sent incorrect patch by mistake. This is the correct one:
>>>
>>> Ok, this is a bit better (but it's not in a format I could apply it in,
>>> if I wanted to...)
>>>
>>>> From 91fa0d0f5c52b959a5ca4cc79d1a4f5970db75e2 Mon Sep 17 00:00:00 2001
>>>> From: Peter Rajnoha <prajnoha@xxxxxxxxxx>
>>>> Date: Thu, 12 Oct 2017 11:33:48 +0200
>>>> Subject: [PATCH] kobject: record and send PREV_SEQNUM with uevents
>>>>
>>>> Record last uevent seqnum that was used with kobject's uevent and send
>>>> it with next uevent as PREV_SEQNUM=<num> variable.
>>>>
>>>> This enhances the way we are able to track and handle uevents in userspace
>>>> if we are trying to catch up with all the uevents that were generated
>>>> while we were not listening or the uevents which were not delivered.
>>>> This may happen if the userspace listener is not running yet when the
>>>> uevent is triggered or there's a period of time when it's not listening
>>>> (e.g. because the userspace listener is being restarted while a new
>>>> uevent fires).
>>>>
>>>> A userspace listener can compare the last uevent seqnum it knows about
>>>> with the last uevent seqnum the kernel reports through uevents delivered
>>>> to userspace, both genuine and synthetic ones (the ones generated by
>>>> writing uevent action name to /sys/.../uevent file). Then, if needed,
>>>> userspace may execute appropriate actions based on that. We don't need
>>>> to reevaluate and recheck all items, instead, we can concentrate only
>>>> on items for which we really and actually need to reflect changed state
>>>> in kernel while the userspace listener was not listening for uevents.
>>>>
>>>> Signed-off-by: Peter Rajnoha <prajnoha@xxxxxxxxxx>
>>>
>>> I don't really understand what userspace can do with this. Why does it
>>> matter what the sequence number was for a specific kobject? What can
>>> happen because of this?
>>>
>>> Do you have any example libudev code changes to show this in action so I
>>> can maybe understand the need?
>>>
>>
>> Not a libudev code change for this, at least I don't have this in plan
>> at this moment.
>>
>> It's more related to the way the udev rules are processed and, in
>> general, the way any other uevent listener can react to uevents. So the
>> change is in how "end consumer" of the uevent does its processing.
>>
>> At the moment, the sequence number is used globally, we don't have
>> per-kobject sequence numbers. So we can't use current SEQNUM that is
>> attached to each uevent to tell whether we have missed previous
>> uevent(s) for concrete device or not, we just know that some uevents for
>> some devices were not processed. Therefore, when we're triggering
>> uevents from userspace (udevadm trigger), we need to reevaluate all the
>> rules for all devices, no matter if it's really needed or not.
>>
>> For example, when we're switching from initramfs to root fs, we stop the
>> first udevd instance which is in initramfs, we keep the database (at the
>> moment, only for selected devices where we know that it's safe to do so
>> - same rules used in initramfs as in root fs). Then we switch over to
>> root fs and start the udevd again and we trigger uevents to fully
>> refresh the state, taking also the existing database into account.
>>
>> Now, it would help if we knew whether there were any uevents issued
>> after we stopped udevd in initramfs and before we started a new one from
>> root fs. If there are no uevents in between, we don't need to reevaluate
>> all the rules, we can just keep the existing database content. This can
>> become prominent if we have more devices in the system. It helps to
>> minimize the number of execution during boot, the overall impact of
>> triggering uevents and related processing, hence also making it more
>> straightforward to watch the state of the device and possibly debug
>> related problems.
>>
>> The same principle applies to any uevent listener in userspace, not just
>> udevd:
>>
>> - recording last seen SEQNUM
>>
>> - when new instance of the listener is running
>>
>> - trigger synthetic uevents to refresh state
>>
>> - comparing the PREV_SEQNUM from uevent generated with stored SEQNUM:
>
> Why can't you just compare SEQNUM with the "last seen" SEQNUM, they
> should just be 1 apart, right? Isn't that what is done today? How will
> this differ from looking at PREV_SEQNUM?

The SEQNUM is global, not per device (per kobject). It's 1 apart if you
have only 1 device, but that doesn't apply if you have more devices with
possible uevents interleaved.

For example (just provoking uevents here for clear example):

echo add > /sys/block/sda/uevent
echo add > /sys/block/sda/uevent
echo add > /sys/block/sdb/uevent
echo add > /sys/block/sda/uevent

Generates these uevents (I've removed uninteresting parts):

KERNEL[45.739557] add
ACTION=add
DEVNAME=/dev/sda
PREV_SEQNUM=2096
SEQNUM=2112

KERNEL[55.131767] add
ACTION=add
DEVNAME=/dev/sda
PREV_SEQNUM=2112
SEQNUM=2113
SUBSYSTEM=block

KERNEL[59.236131] add
ACTION=add
DEVNAME=/dev/sdb
PREV_SEQNUM=2097
SEQNUM=2114

KERNEL[61.708551] add
ACTION=add
DEVNAME=/dev/sda
PREV_SEQNUM=2113
SEQNUM=2115

Here, the PREV_SEQNUM is 1 apart with SEQNUM for the 2nd uevent (sda),
because there was no other uevent (for a different device). But if you
check the 4th uevent (sda) where previous one was generated for another
device (sdb), the PREV_SEQNUM is not 1 apart. So in general, it may be
"n" apart from previous SEQNUM for a *particular device*. Globally, for
all devices, it is 1 apart, yes, but we don't care about that as that is
not quite useful.

So without PREV_SEQNUM, we can detect we have missed uevents, but we
don't know for which devices exactly, so we need to refresh everything,
not just the items we actually need. With PREV_SEQNUM we can identify
exact devices.

>
>> - If they differ, refresh the state fully, considering the state
>> we know of as obsolete as there were already other uevents before we
>> haven't processed. This may include deeper scans which may take longer,
>> executing more helper tools to get the state of the device.
>
> But that is what "last seen" also shows.

Not per device, but only globally. So without PREV_SEQNUM it's either
"refresh all or nothing".

>
>> - If the are same, reuse the state we already know about is
>> up-to-date (there were no uevents we haven't seen), don't reevaluate
>> everything.
>
> Testing for +1 should be the same.
>
>> With the new synthetic uevent interface (the extention to
>> /sys/.../uevent) that accepts arguments (UUID and key=value pairs), we
>> can make it possible to specify the nature of the triggered uevent, we
>> can mark it through the key=value pairs which the uevent listener in
>> userspace can check for, so if we can also:
>
> but that doesn't matter for PREV_SEQNUM.

I've mentioned that only for completeness and in advance, for the ones
who would argue that there can be a need to force full refresh for all
devices even though it may not be needed based on PREV_SEQNUM which will
cause selective refresh. So there's still a way to fallback to original
behavior, if that's needed.

>
>> - either refresh only the state for devices where we missed uevents
>> (that could be used during bootup, only to refresh the state for devices
>> where we have no current state yet), comparing current PREV_SEQNUM and
>> stored last SEQNUM in userspace
>>
>> - or to force full refresh if that's needed for a reason.
>>
>> Also, some devices may have more complex activation scheme than usual
>> "device addition" to the system. There can be a sequence of uevents
>> which are expected for a successful change of device state, so we have a
>> state machine that tracks uevents (as a confirmation from kernel for
>> userspace that we can hook on in udev rules and uevent listeners). And
>> here, it's very helpful to know that we have missed uevent(s) before or
>> not, either they were not delivered or the uevent listener itself has
>> been restarted and during this period, it may have missed uevents.
>>
>> In summary, it's adding more robustness to the uevent interface where we
>> can directly and quickly detect we missed the uevent or not for a
>> specific device and hence we can optimize udev rules/uevent listeners
>> based on that so we don't dully refresh all the state and reexecute all
>> rules and code for any triggered uevents. It's really not needed all the
>> time.
>
> I'd like to see real, working code that uses this new interface, showing
> how it is not possible to do what is needed without the PREV_SEQNUM,
> before considering adding this. Otherwise it's just stuff we have to
> drag around for forever.
>

OK... well, it's about changing concrete rules to skip various calls,
scans (e.g. blkid in case of block devices) and handling which is not
necessary if we detect that the uevent is synthetic and it's just used
to catch up with any missed uevents. In that case, we'll reuse existing
records. The same applies to any uevent listener besides udevd and its
records which are managed based on uevents.

--
Peter