Re: [PATCH] kobject: record and send PREV_SEQNUM with uevents

From: Peter Rajnoha
Date: Tue Oct 17 2017 - 06:04:49 EST


On 10/16/2017 04:49 PM, Greg KH wrote:
> On Mon, Oct 16, 2017 at 04:00:26PM +0200, Peter Rajnoha wrote:
>> On 10/16/2017 03:55 PM, Greg KH wrote:
>>> On Mon, Oct 16, 2017 at 02:35:44PM +0200, Peter Rajnoha wrote:
>>>> Record last uevent seqnum that was used with kobject's uevent and send
>>>> it with next uevent as PREV_SEQNUM=<num> variable.
>>>>
>>>> This enhances the way we are able to track and handle uevents in userspace
>>>> if we are trying to catch up with all the uevents that were generated
>>>> while we were not listening or the uevents which were not delivered.
>>>> This may happen if the userspace listener is not running yet when the
>>>> uevent is triggered or there's a period of time when it's not listening
>>>> (e.g. because the userspace listener is being restarted while a new
>>>> uevent fires).
>>>>
>>>> A userspace listener can compare the last uevent seqnum it knows about
>>>> with the last uevent seqnum the kernel reports through uevents delivered
>>>> to userspace, both genuine and synthetic ones (the ones generated by
>>>> writing uevent action name to /sys/.../uevent file). Then, if needed,
>>>> userspace may execute appropriate actions based on that. We don't need
>>>> to reevaluate and recheck all items, instead, we can concentrate only
>>>> on items for which we really and actually need to reflect changed state
>>>> in kernel while the userspace listener was not listening for uevents.
>>>>
>>>> Signed-off-by: Peter Rajnoha <prajnoha@xxxxxxxxxx>
>>>> ---
>>>> include/linux/kobject.h | 1 +
>>>> lib/kobject.c | 1 +
>>>> lib/kobject_uevent.c | 15 +++++++++++++--
>>>> 3 files changed, 15 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/include/linux/kobject.h b/include/linux/kobject.h
>>>> index e0a6205caa71..ee6db28b6186 100644
>>>> --- a/include/linux/kobject.h
>>>> +++ b/include/linux/kobject.h
>>>> @@ -73,6 +73,7 @@ struct kobject {
>>>> #ifdef CONFIG_DEBUG_KOBJECT_RELEASE
>>>> struct delayed_work release;
>>>> #endif
>>>> + u64 last_uevent_seqnum;
>>>> unsigned int state_initialized:1;
>>>> unsigned int state_in_sysfs:1;
>>>> unsigned int state_add_uevent_sent:1;
>>>> diff --git a/lib/kobject.c b/lib/kobject.c
>>>> index 763d70a18941..2cc809f10ae7 100644
>>>> --- a/lib/kobject.c
>>>> +++ b/lib/kobject.c
>>>> @@ -190,6 +190,7 @@ static void kobject_init_internal(struct kobject *kobj)
>>>> return;
>>>> kref_init(&kobj->kref);
>>>> INIT_LIST_HEAD(&kobj->entry);
>>>> + kobj->last_uevent_seqnum = 0;
>>>> kobj->state_in_sysfs = 0;
>>>> kobj->state_add_uevent_sent = 0;
>>>> kobj->state_remove_uevent_sent = 0;
>>>> diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
>>>> index f237a09a5862..f1cbed06a395 100644
>>>> --- a/lib/kobject_uevent.c
>>>> +++ b/lib/kobject_uevent.c
>>>> @@ -454,8 +454,19 @@ int kobject_uevent_env(struct kobject *kobj, enum kobject_action action,
>>>> }
>>>>
>>>> mutex_lock(&uevent_sock_mutex);
>>>> - /* we will send an event, so request a new sequence number */
>>>> - retval = add_uevent_var(env, "SEQNUM=%llu", (unsigned long long)++uevent_seqnum);
>>>> + /*
>>>> + * We will send an event, so request a new sequence number.
>>>> + * Also, record the number for next uevent's PREV_SEQNUM.
>>>> + */
>>>> + kobj->last_uevent_seqnum = ++uevent_seqnum;
>>>> + retval = add_uevent_var(env, "SEQNUM=%llu",
>>>> + (unsigned long long)uevent_seqnum);
>>>> + if (retval) {
>>>> + mutext_unlock(&uevent_sock_mutex);
>>>> + goto exit;
>>>> + }
>>>> + retval = add_uevent_var(env, "PREV_SEQNUM=%llu",
>>>> + (unsigned long long)kobj->last_uevent_seqnum);
>>>
>>> I'm confused, why are we storing "last_uevent_seqnum" here at all, if it
>>> is only just the current one-1? What else do you do with that value?
>> I'm sorry, I've sent incorrect patch by mistake. This is the correct one:
>
> Ok, this is a bit better (but it's not in a format I could apply it in,
> if I wanted to...)
>
>> From 91fa0d0f5c52b959a5ca4cc79d1a4f5970db75e2 Mon Sep 17 00:00:00 2001
>> From: Peter Rajnoha <prajnoha@xxxxxxxxxx>
>> Date: Thu, 12 Oct 2017 11:33:48 +0200
>> Subject: [PATCH] kobject: record and send PREV_SEQNUM with uevents
>>
>> Record last uevent seqnum that was used with kobject's uevent and send
>> it with next uevent as PREV_SEQNUM=<num> variable.
>>
>> This enhances the way we are able to track and handle uevents in userspace
>> if we are trying to catch up with all the uevents that were generated
>> while we were not listening or the uevents which were not delivered.
>> This may happen if the userspace listener is not running yet when the
>> uevent is triggered or there's a period of time when it's not listening
>> (e.g. because the userspace listener is being restarted while a new
>> uevent fires).
>>
>> A userspace listener can compare the last uevent seqnum it knows about
>> with the last uevent seqnum the kernel reports through uevents delivered
>> to userspace, both genuine and synthetic ones (the ones generated by
>> writing uevent action name to /sys/.../uevent file). Then, if needed,
>> userspace may execute appropriate actions based on that. We don't need
>> to reevaluate and recheck all items, instead, we can concentrate only
>> on items for which we really and actually need to reflect changed state
>> in kernel while the userspace listener was not listening for uevents.
>>
>> Signed-off-by: Peter Rajnoha <prajnoha@xxxxxxxxxx>
>
> I don't really understand what userspace can do with this. Why does it
> matter what the sequence number was for a specific kobject? What can
> happen because of this?
>
> Do you have any example libudev code changes to show this in action so I
> can maybe understand the need?
>

Not a libudev code change for this, at least I don't have this in plan
at this moment.

It's more related to the way the udev rules are processed and, in
general, the way any other uevent listener can react to uevents. So the
change is in how "end consumer" of the uevent does its processing.

At the moment, the sequence number is used globally, we don't have
per-kobject sequence numbers. So we can't use current SEQNUM that is
attached to each uevent to tell whether we have missed previous
uevent(s) for concrete device or not, we just know that some uevents for
some devices were not processed. Therefore, when we're triggering
uevents from userspace (udevadm trigger), we need to reevaluate all the
rules for all devices, no matter if it's really needed or not.

For example, when we're switching from initramfs to root fs, we stop the
first udevd instance which is in initramfs, we keep the database (at the
moment, only for selected devices where we know that it's safe to do so
- same rules used in initramfs as in root fs). Then we switch over to
root fs and start the udevd again and we trigger uevents to fully
refresh the state, taking also the existing database into account.

Now, it would help if we knew whether there were any uevents issued
after we stopped udevd in initramfs and before we started a new one from
root fs. If there are no uevents in between, we don't need to reevaluate
all the rules, we can just keep the existing database content. This can
become prominent if we have more devices in the system. It helps to
minimize the number of execution during boot, the overall impact of
triggering uevents and related processing, hence also making it more
straightforward to watch the state of the device and possibly debug
related problems.

The same principle applies to any uevent listener in userspace, not just
udevd:

- recording last seen SEQNUM

- when new instance of the listener is running

- trigger synthetic uevents to refresh state

- comparing the PREV_SEQNUM from uevent generated with stored SEQNUM:

- If they differ, refresh the state fully, considering the state
we know of as obsolete as there were already other uevents before we
haven't processed. This may include deeper scans which may take longer,
executing more helper tools to get the state of the device.

- If the are same, reuse the state we already know about is
up-to-date (there were no uevents we haven't seen), don't reevaluate
everything.

With the new synthetic uevent interface (the extention to
/sys/.../uevent) that accepts arguments (UUID and key=value pairs), we
can make it possible to specify the nature of the triggered uevent, we
can mark it through the key=value pairs which the uevent listener in
userspace can check for, so if we can also:

- either refresh only the state for devices where we missed uevents
(that could be used during bootup, only to refresh the state for devices
where we have no current state yet), comparing current PREV_SEQNUM and
stored last SEQNUM in userspace

- or to force full refresh if that's needed for a reason.

Also, some devices may have more complex activation scheme than usual
"device addition" to the system. There can be a sequence of uevents
which are expected for a successful change of device state, so we have a
state machine that tracks uevents (as a confirmation from kernel for
userspace that we can hook on in udev rules and uevent listeners). And
here, it's very helpful to know that we have missed uevent(s) before or
not, either they were not delivered or the uevent listener itself has
been restarted and during this period, it may have missed uevents.

In summary, it's adding more robustness to the uevent interface where we
can directly and quickly detect we missed the uevent or not for a
specific device and hence we can optimize udev rules/uevent listeners
based on that so we don't dully refresh all the state and reexecute all
rules and code for any triggered uevents. It's really not needed all the
time.

--
Peter