Re: [PATCH v6 1/3] perf: add PERF_RECORD_NAMESPACES to include namespaces related info

From: Hari Bathini
Date: Sun Feb 19 2017 - 23:12:20 EST


Hi Peter,


On Thursday 16 February 2017 03:58 PM, Peter Zijlstra wrote:
On Wed, Feb 08, 2017 at 02:01:24PM +0530, Hari Bathini wrote:
With the advert of container technologies like docker, that depend
on namespaces for isolation, there is a need for tracing support for
namespaces. This patch introduces new PERF_RECORD_NAMESPACES event
for tracing based on namespaces related info. This event records
the device and inode numbers for every namespace of all processes.

While device number is same for all namespaces currently, that may
'While device numbers are the same ..." ?

change in future, to avoid the need for a namespace of namespaces.
Unfinished sentence

Unnecessary comma spoiled the sentence, I guess.

Recording device number along with inode number will take care of such
scenario.
More words on why this is so? Because I've no clue.

This should have read:

"Each namespace has a combination of device and inode numbers. Though
every namespace has the same device number currently, that may change
in future to avoid the need for a namespace of namespaces. Considering
such possibility, record both device and inode numbers separately for
each namespace."

Will update changelog accordingly..

@@ -6584,6 +6590,135 @@ void perf_event_comm(struct task_struct *task, bool exec)
}
/*
+ * namespaces tracking
+ */
+
+struct namespaces_event_id {
+ struct perf_event_header header;
+
+ u32 pid;
+ u32 tid;
+ u32 nr_namespaces;
+ struct perf_ns_link_info link_info[NAMESPACES_MAX];
+};
Contains the same hole and makes the record depend on architecture
alignment requirements.

Ugh! Let me change nr_namespaces type to u64 and also drop name field
in perf_ns_link_info struct.

+static void perf_fill_ns_link_info(struct perf_ns_link_info *ns_link_info,
+ struct task_struct *task,
+ const struct proc_ns_operations *ns_ops)
+{
+ struct path ns_path;
+ struct inode *ns_inode;
+ void *error;
+
+ error = ns_get_path(&ns_path, task, ns_ops);
+ if (!error) {
+ snprintf(ns_link_info->name, NS_NAME_SIZE,
+ "%s", ns_path.dentry->d_iname);
+
+ ns_inode = ns_path.dentry->d_inode;
+ ns_link_info->dev = new_encode_dev(ns_inode->i_sb->s_dev);
+ ns_link_info->ino = ns_inode->i_ino;
+ }
+}
+
+static void perf_event_namespaces_output(struct perf_event *event,
+ void *data)
+{
+ struct perf_namespaces_event *namespaces_event = data;
+ struct perf_output_handle handle;
+ struct perf_sample_data sample;
+ struct namespaces_event_id *ei;
+ struct task_struct *task = namespaces_event->task;
+ int ret;
+
+ if (!perf_event_namespaces_match(event))
+ return;
+
+ ei = &namespaces_event->event_id;
+ perf_event_header__init_id(&ei->header, &sample, event);
+ ret = perf_output_begin(&handle, event, ei->header.size);
+ if (ret)
+ return;
+
+ ei->pid = perf_event_pid(event, task);
+ ei->tid = perf_event_tid(event, task);
+
+ ei->nr_namespaces = NAMESPACES_MAX;
+
+ perf_fill_ns_link_info(&ei->link_info[MNT_NS_INDEX],
+ task, &mntns_operations);
+
+#ifdef CONFIG_USER_NS
+ perf_fill_ns_link_info(&ei->link_info[USER_NS_INDEX],
+ task, &userns_operations);
+#endif
+#ifdef CONFIG_NET_NS
+ perf_fill_ns_link_info(&ei->link_info[NET_NS_INDEX],
+ task, &netns_operations);
+#endif
+#ifdef CONFIG_UTS_NS
+ perf_fill_ns_link_info(&ei->link_info[UTS_NS_INDEX],
+ task, &utsns_operations);
+#endif
+#ifdef CONFIG_IPC_NS
+ perf_fill_ns_link_info(&ei->link_info[IPC_NS_INDEX],
+ task, &ipcns_operations);
+#endif
+#ifdef CONFIG_PID_NS
+ perf_fill_ns_link_info(&ei->link_info[PID_NS_INDEX],
+ task, &pidns_operations);
+#endif
+#ifdef CONFIG_CGROUPS
+ perf_fill_ns_link_info(&ei->link_info[CGROUP_NS_INDEX],
+ task, &cgroupns_operations);
+#endif
Two points here, since you've given these thingies a string identifier,
do you still need to rely on their positional index? If not, you could
generate smaller events if we lack some of these CONFIG knobs.

Dropping name field. So, I don't think this applies now..

Secondly, does anything in here depend on @event? Afaict you're
generating the exact same information for each event.

The reason we set {pid,tid} for each event is because of PID_NS, we
report the pid number as per the namespace associated with the event.

But from what I can tell, there is no such namespace of namespaces
dependency here and this should live in the function below.

Right. I will move it to perf_event_namespaces() function.

@@ -9667,6 +9804,11 @@ SYSCALL_DEFINE5(perf_event_open,
return -EACCES;
}
+ if (attr.namespaces) {
+ if (!capable(CAP_SYS_ADMIN))
+ return -EACCES;
+ }
Many times we allow users access to such information; should this be
qualified with something like perf_paranoid_kernel() or somesuch?

Hmmm.. Need access to namespace info of every process. Synthesized events
try and access /proc/1/ns/ and the like needing CAP_SYS_ADMIN. So, the
question is, should running perf tool be allowed with perf_paranoid_kernel()
check and complain later when access to /proc/1/ns and the like fails (on running
perf tool without CAP_SYS_ADMIN)?

+
if (attr.freq) {
if (attr.sample_freq > sysctl_perf_event_sample_rate)
return -EINVAL;
diff --git a/kernel/fork.c b/kernel/fork.c
index 11c5c8a..fd77e67 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2289,6 +2289,9 @@ SYSCALL_DEFINE1(unshare, unsigned long, unshare_flags)
free_fs_struct(new_fs);
bad_unshare_out:
+ if (!err)
+ perf_event_namespaces(current);
+
return err;
}
That seems like a very odd place, why not put it right before the
bad_unshare_cleanup_cred: label?


@@ -264,6 +265,10 @@ SYSCALL_DEFINE2(setns, int, fd, int, nstype)
switch_task_namespaces(tsk, new_nsproxy);
out:
fput(file);
+
+ if (!err)
+ perf_event_namespaces(tsk);
+
return err;
}
Same again..

My bad! Will change that..

Thanks
Hari