Re: [PATCH v7 02/13] user_events: Add minimal support for trace_event into ftrace

From: Masami Hiramatsu
Date: Sun Dec 12 2021 - 23:24:48 EST

Next message: Dae R. Jeong: "WARNING in schedule_bh"
Previous message: Matthew Wilcox: "Re: [PATCH 1/2] Remove inode_congested()"
In reply to: Beau Belgrave: "Re: [PATCH v7 02/13] user_events: Add minimal support for trace_event into ftrace"
Next in thread: Beau Belgrave: "Re: [PATCH v7 02/13] user_events: Add minimal support for trace_event into ftrace"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

Hi Beau,

On Fri, 10 Dec 2021 10:03:54 -0800
Beau Belgrave <beaub@xxxxxxxxxxxxxxxxxxx> wrote:

> On Fri, Dec 10, 2021 at 07:43:58PM +0900, Masami Hiramatsu wrote:
> > Hi Beau,
> >
> > Thanks for updating the patch! I have some comments below.
> >
> > On Thu, 9 Dec 2021 14:31:59 -0800
> > Beau Belgrave <beaub@xxxxxxxxxxxxxxxxxxx> wrote:
> >
> > [..]
> > > +#define USER_EVENTS_PREFIX_LEN (sizeof(USER_EVENTS_PREFIX)-1)
> > > +
> > > +#define FIELD_DEPTH_TYPE 0
> > > +#define FIELD_DEPTH_NAME 1
> > > +#define FIELD_DEPTH_SIZE 2
> > > +
> > > +/*
> > > + * Limits how many trace_event calls user processes can create:
> > > + * Must be multiple of PAGE_SIZE.
> > > + */
> > > +#define MAX_PAGES 1
> > > +#define MAX_EVENTS (MAX_PAGES * PAGE_SIZE)
> > > +
> > > +/* Limit how long of an event name plus args within the subsystem. */
> > > +#define MAX_EVENT_DESC 512
> > > +#define EVENT_NAME(user_event) ((user_event)->tracepoint.name)
> > > +#define MAX_FIELD_ARRAY_SIZE (2 * PAGE_SIZE)
> >
> > I don't recommend to record the event which size is more than a page size...
> > Maybe 256 entries?
> > It is also better to limit the total size of the event and the number
> > of fields (arguments).
> >
> > Steve, can we write such a big event data on the trace buffer?
> >
>
> This moved to 1024 in part 12 when validation was added.

OK, then it should be done in this patch.

BTW, real maximum limitation is defined in the kernel/trace/ring_buffer.c
(I'm not sure why this is not defined in the header...)

-----
#define BUF_PAGE_SIZE (PAGE_SIZE - BUF_PAGE_HDR_SIZE)

/* Max payload is BUF_PAGE_SIZE - header (8bytes) */
#define BUF_MAX_DATA_SIZE (BUF_PAGE_SIZE - (sizeof(u32) * 2))
-----

>
> > [..]
> > > +
> > > +static int user_field_array_size(const char *type)
> > > +{
> > > + const char *start = strchr(type, '[');
> > > + char val[8];
> > > + int size = 0;
> > > +
> > > + if (start == NULL)
> > > + return -EINVAL;
> > > +
> > > + start++;
> > > +
> > > + while (*start != ']' && size < (sizeof(val) - 1))
> > > + val[size++] = *start++;
> > > +
> > > + if (*start != ']')
> > > + return -EINVAL;
> > > +
> > > + val[size] = 0;
> >
> > It's '\0', not 0.
>
> Both evaluate to 0, is this a style thing?
>
> For example, argv_split does this same thing ;)

Oops, OK. That is the style thing for clarify what you are doing.
(not initializing the element, but terminating the string)

> >
> > If I were you, I just use strlcpy(val, start, sizeof(val)), and
> > strchr(val, ']'). Sometimes using standard libc function will
> > be easer to understand what it does. :)
> >
>
> Sure good idea.
>
> [..]
>
> > > +static int user_event_add_field(struct user_event *user, const char *type,
> > > + const char *name, int offset, int size,
> > > + int is_signed, int filter_type)
> > > +{
> > > + struct ftrace_event_field *field;
> > > +
> > > + field = kmalloc(sizeof(*field), GFP_KERNEL);
> > > +
> > > + if (!field)
> > > + return -ENOMEM;
> > > +
> > > + field->type = type;
> > > + field->name = name;
> > > + field->offset = offset;
> > > + field->size = size;
> > > + field->is_signed = is_signed;
> > > + field->filter_type = filter_type;
> > > +
> > > + list_add(&field->link, &user->fields);
> >
> > I recommend to use list_add_tail() here so that when accessing the
> > list of field without reverse order. (I found this in [4/13])
> >
>
> If I did that, wouldn't that mean the format file in tracefs now has the
> arguments printed in reverse order they were added?

Ah, sorry. It was my misunderstanding. I found that the trace_event
expects the fields are chained in the reverse order.
(e.g. trace_event_get_offsets())

BTW, I think the current implementation is confusing. For example,
trace_event_get_offsets() needs a redundant explanation;
---
/*
* head->next points to the last field with the largest offset,
* since it was added last by trace_define_field()
*/
tail = list_first_entry(head, struct ftrace_event_field, link);
---
If the list is sorted in normal order, it doesn't need
such explanation, just do "tail = list_last_entry(...)"

>
> > > +
> > > + return 0;
> > > +}
> > > +
> > > +/*
> > > + * Parses the values of a field within the description
> > > + * Format: type name [size]
> >
> > Hmm, don't you accept redundant spaces and tabs?
> > If this accepts the redundant spaces/tabs, I recommend you to use
> > argv_split() instead of strpbrk() etc. e.g.
> >
> > int argc, name_idx = 0, size;
> > int ret = -EINVAL;
> > char **argv;
> >
> > argv = argv_split(GFP_KERNEL, field, &argc);
> > if (!argv)
> > return -ENOMEM;
> >
> > if (!strcmp(argv[pos], "__data_loc") ||
> > !strcmp(argv[pos], "__rel_loc")) {
> > if (++pos >= argc)
> > goto error;
> > }
> > if (!strcmp(argv[pos], "unsigned")) {
> > if (++pos >= argc)
> > goto error;
> > } else if (!strcmp(argv[pos], "struct")) {
> > is_struct = true;
> > if (++pos >= argc)
> > goto error;
> > }
> > if (++pos >= argc)
> > goto error;
> > name_idx = pos++;
> > if (pos < argc) { // size
> > if (!is_struct)
> > goto error;
> > if (kstrtou32(argv[pos++], 10, &size))
> > goto error;
> > } else
> > size = user_field_size(argv[name_idx - 1]);
> >
> > if (pos != argc)
> > goto error;
> >
> > // note that type index is always 0 and size must be converted.
> > user_event_add_field(user, argv, name_idx, saved_offset, size,
> > type[0] != 'u', FILTER_OTHER);
> >
> > ret = 0;
> > error:
> > argv_free(argv);
> > return ret;
> >
> > (This also requires to simplify user_field_size() and remove FIELD_DEPTH_*)
> > What would you think?
> >
>
> The code currently does not support duplicate spaces after the first
> non-whitespace.
>
> We do copy the string before this, so how this is written would do a
> double allocation. If the argv_split was moved higher in the callchain
> then I could move to this.

If it works and simplifies, I'm OK. But I thought the syntax required to
split a user string by ';' at first, and split each field by spaces. So I
put the argv_split() here. And anyway, this is not a hot path. I think
avoiding allocation is not such a big matter.

>
> If you feel strongly about this, I don't have a problem moving to this
> pattern. Let me know if you feel strongly about it.

I just hope to support duplicate spaces/tabs, since I guess that
users may want to write the field definition with indentation.

(Recently I hit a similar issue on another software. No one duplicates
visible separators, but spaces/tabs. :( )

> > > +
> > > +/*
> > > + * Register callback for our events from tracing sub-systems.
> > > + */
> > > +static int user_event_reg(struct trace_event_call *call,
> > > + enum trace_reg type,
> > > + void *data)
> > > +{
> > > + struct user_event *user = (struct user_event *)call->data;
> > > + int ret = 0;
> > > +
> > > + if (!user)
> > > + return -ENOENT;
> > > +
> > > + switch (type) {
> > > + case TRACE_REG_REGISTER:
> > > + ret = tracepoint_probe_register(call->tp,
> > > + call->class->probe,
> > > + data);
> > > + if (!ret)
> > > + goto inc;
> > > + break;
> > > +
> > > + case TRACE_REG_UNREGISTER:
> > > + tracepoint_probe_unregister(call->tp,
> > > + call->class->probe,
> > > + data);
> > > + goto dec;
> > > +
> > > +#ifdef CONFIG_PERF_EVENTS
> > > + case TRACE_REG_PERF_REGISTER:
> > > + case TRACE_REG_PERF_UNREGISTER:
> > > + case TRACE_REG_PERF_OPEN:
> > > + case TRACE_REG_PERF_CLOSE:
> > > + case TRACE_REG_PERF_ADD:
> > > + case TRACE_REG_PERF_DEL:
> > > + break;
> > > +#endif
> >
> > At this moment (in this patch), you can just add a default case,
> > or just ignore it, because it does nothing.
> >
>
> Yeah, I was trying to avoid the warning that resulted if I just ignored
> them.

Ah, then that's OK.

>
> > > + }
> > > +
> > > + return ret;
> > > +inc:
> > > + atomic_inc(&user->refcnt);
> > > + update_reg_page_for(user);
> > > + return 0;
> > > +dec:
> > > + update_reg_page_for(user);
> > > + atomic_dec(&user->refcnt);
> > > + return 0;
> > > +}
> > > +
> >
> > [..]
> > > +/*
> > > + * Validates the user payload and writes via iterator.
> > > + */
> > > +static ssize_t user_events_write_core(struct file *file, struct iov_iter *i)
> > > +{
> > > + struct user_event_refs *refs;
> > > + struct user_event *user = NULL;
> > > + struct tracepoint *tp;
> > > + ssize_t ret = i->count;
> > > + int idx;
> > > +
> > > + if (unlikely(copy_from_iter(&idx, sizeof(idx), i) != sizeof(idx)))
> > > + return -EFAULT;
> > > +
> > > + rcu_read_lock_sched();
> > > +
> > > + refs = rcu_dereference_sched(file->private_data);
> > > +
> > > + /*
> > > + * The refs->events array is protected by RCU, and new items may be
> > > + * added. But the user retrieved from indexing into the events array
> > > + * shall be immutable while the file is opened.
> > > + */
> > > + if (likely(refs && idx < refs->count))
> > > + user = refs->events[idx];
> > > +
> > > + rcu_read_unlock_sched();
> > > +
> > > + if (unlikely(user == NULL))
> > > + return -ENOENT;
> > > +
> > > + tp = &user->tracepoint;
> > > +
> > > + /*
> > > + * It's possible key.enabled disables after this check, however
> > > + * we don't mind if a few events are included in this condition.
> > > + */
> > > + if (likely(atomic_read(&tp->key.enabled) > 0)) {
> > > + struct tracepoint_func *probe_func_ptr;
> > > + user_event_func_t probe_func;
> > > + void *tpdata;
> > > + void *kdata;
> > > + u32 datalen;
> > > +
> > > + kdata = kmalloc(i->count, GFP_KERNEL);
> > > +
> > > + if (unlikely(!kdata))
> > > + return -ENOMEM;
> > > +
> > > + datalen = copy_from_iter(kdata, i->count, i);
> >
> > Don't we need to add this datalen to ret?
> >
>
> ret is set to the bytes that were given by the user to avoid multiple
> writes from occuring for the same data if the data was paged out (or if
> the event isn't enabled at that time for whatever reason).
>
> Since seek/partial writes are not supported, I don't believe we want to
> do that.

OK, got it.

Thank you,

>
> > > +
> > > + rcu_read_lock_sched();
> > > +
> > > + probe_func_ptr = rcu_dereference_sched(tp->funcs);
> > > +
> > > + if (probe_func_ptr) {
> > > + do {
> > > + probe_func = probe_func_ptr->func;
> > > + tpdata = probe_func_ptr->data;
> > > + probe_func(user, kdata, datalen, tpdata);
> > > + } while ((++probe_func_ptr)->func);
> > > + }
> > > +
> > > + rcu_read_unlock_sched();
> > > +
> > > + kfree(kdata);
> > > + }
> > > +
> > > + return ret;
> > > +}
> >
> > Thank you,
> >
> >
> >
> > --
> > Masami Hiramatsu <mhiramat@xxxxxxxxxx>

--
Masami Hiramatsu <mhiramat@xxxxxxxxxx>

Next message: Dae R. Jeong: "WARNING in schedule_bh"
Previous message: Matthew Wilcox: "Re: [PATCH 1/2] Remove inode_congested()"
In reply to: Beau Belgrave: "Re: [PATCH v7 02/13] user_events: Add minimal support for trace_event into ftrace"
Next in thread: Beau Belgrave: "Re: [PATCH v7 02/13] user_events: Add minimal support for trace_event into ftrace"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]