Re: [RFC PATCH 0/5] Introduce /proc/all/ to gather stats from all processes

From: Eugene Lubarsky
Date: Tue Aug 25 2020 - 05:59:25 EST


On Mon, 10 Aug 2020 17:41:32 +0200
Greg KH <gregkh@xxxxxxxxxxxxxxxxxxx> wrote:

> On Tue, Aug 11, 2020 at 01:27:00AM +1000, Eugene Lubarsky wrote:
> > On Mon, 10 Aug 2020 17:04:53 +0200
> > Greg KH <gregkh@xxxxxxxxxxxxxxxxxxx> wrote:
> And have you benchmarked any of this? Try working with the common
> tools that want this information and see if it actually is noticeable
> (hint, I have been doing that with the readfile work and it's
> surprising what the results are in places...)

Apologies for the delay. Here are some benchmarks with atop.

Patch to atop at: https://github.com/eug48/atop/commits/proc-all
Patch to add /proc/all/schedstat & cpuset below.
atop is not collecting threads & cmdline, as /proc/all/ doesn't support them.
10,000 processes, kernel 5.8, nested KVM, 2 cores of i7-6700HQ @ 2.60GHz

# USE_PROC_ALL=0 ./atop -w test 1 &
# pidstat -p $(pidof atop) 1

01:33:05    %usr %system  %guest   %wait    %CPU   CPU  Command
01:33:06   33.66   33.66    0.00    0.99   67.33     1  atop
01:33:07   33.00   32.00    0.00    2.00   65.00     0  atop
01:33:08   34.00   31.00    0.00    1.00   65.00     0  atop
...
Average:   33.15   32.79    0.00    1.09   65.94     -  atop


# USE_PROC_ALL=1 ./atop -w test 1 &
# pidstat -p $(pidof atop) 1

01:33:33    %usr %system  %guest   %wait    %CPU   CPU  Command
01:33:34   28.00   14.00    0.00    1.00   42.00     1  atop
01:33:35   28.00   14.00    0.00    0.00   42.00     1  atop
01:33:36   26.00   13.00    0.00    0.00   39.00     1  atop
...
Average:   27.08   12.86    0.00    0.35   39.94     -  atop

So average CPU usage goes down from ~66% to ~40%, with most of the
saving in %system (~33% to ~13%).

Data collection times in milliseconds are:

# xsv cat columns proc.csv procall.csv \
> | xsv stats \
> | xsv select field,min,max,mean,stddev \
> | xsv table
field           min  max  mean    stddev
/proc time      558  625  586.59  18.29
/proc/all time  231  262  243.56  8.02

Much performance optimisation can still be done, e.g. the modified atop
uses fgets, which reads 1KB at a time, and seq_file seems to return only
4KB pages per read. task_diag should be much faster still.
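
To make the buffer-size point concrete, here is a minimal userspace
sketch (not taken from the atop patch) that counts newline-terminated
records on a file descriptor using read() with a caller-chosen buffer
size, and reports how many read() calls were needed. The function name
count_records() is hypothetical. With a seq_file-backed file the kernel
may still return less than the buffer size per call, so a bigger buffer
only helps up to that limit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: count '\n'-terminated records on fd, reading
 * bufsize bytes at a time. *nreads receives the number of read() calls
 * that returned data. Returns the record count, or -1 on error.
 */
long count_records(int fd, size_t bufsize, long *nreads)
{
	char *buf;
	long lines = 0;
	ssize_t n;

	assert(bufsize > 0 && nreads != NULL);

	buf = malloc(bufsize);
	if (!buf)
		return -1;

	*nreads = 0;
	while ((n = read(fd, buf, bufsize)) > 0) {
		const char *p = buf;
		const char *end = buf + n;

		(*nreads)++;
		/* Each newline found in this chunk ends one record. */
		while ((p = memchr(p, '\n', (size_t)(end - p))) != NULL) {
			lines++;
			p++;
		}
	}

	free(buf);
	return n < 0 ? -1 : lines;
}
```

Calling this on the same file with bufsize 1024 and then with something
like 65536 gives the same record count, while the read() counts show how
much of the syscall overhead comes from the buffer size alone.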

I'd imagine this sort of thing would be useful for daemons monitoring
large numbers of processes. I don't run such systems myself; my initial
motivation was frustration with the Kubernetes kubelet having ~2-4% CPU
usage even with only a couple of containers. Basic profiling suggests
syscalls have a lot to do with it. The kubelet is actually reading loads
of tiny cgroup files and enumerating many directories every 10 seconds,
but /proc has similar issues and seemed easier to start with.

Anyway, I've read that io_uring could also help here in the near future,
which would be really cool especially if there was a way to enumerate
directories and read many files regex-style in a single operation,
e.g. /proc/[0-9].*/(stat|statm|io)
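
For comparison, this is roughly the per-process pattern that tools use
today and that a batched interface would replace. It is only a sketch:
the names is_pid_name() and read_each_stat() are made up for
illustration, and real tools such as atop read several files per PID
rather than just stat. Every process costs at least an open(), a read()
that returns data, a final read() returning 0, and a close(), on top of
the directory enumeration.

```c
#include <assert.h>
#include <ctype.h>
#include <dirent.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Return 1 if name consists only of decimal digits, i.e. looks like a PID. */
int is_pid_name(const char *name)
{
	if (*name == '\0')
		return 0;
	for (; *name; name++)
		if (!isdigit((unsigned char)*name))
			return 0;
	return 1;
}

/*
 * Hypothetical helper: read <procdir>/<pid>/stat for every PID directory.
 * Returns the total number of bytes read, or -1 if procdir cannot be
 * opened. *nfiles receives the number of stat files opened successfully.
 */
long read_each_stat(const char *procdir, long *nfiles)
{
	char path[512], buf[4096];
	struct dirent *de;
	long total = 0;
	DIR *dir;

	assert(procdir != NULL && nfiles != NULL);

	*nfiles = 0;
	dir = opendir(procdir);
	if (!dir)
		return -1;

	while ((de = readdir(dir)) != NULL) {
		ssize_t n;
		int fd;

		if (!is_pid_name(de->d_name))
			continue;

		snprintf(path, sizeof(path), "%s/%s/stat", procdir, de->d_name);
		fd = open(path, O_RDONLY);
		if (fd < 0)
			continue;	/* the process may already have exited */

		(*nfiles)++;
		while ((n = read(fd, buf, sizeof(buf))) > 0)
			total += n;
		close(fd);
	}

	closedir(dir);
	return total;
}
```

With 10,000 processes, read_each_stat("/proc", &n) issues tens of
thousands of syscalls per sample for a single file type, which is the
cost the numbers above are measuring against.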

> > Currently I'm trying to re-use the existing code in fs/proc that
> > controls which PIDs are visible, but may well be missing
> > something..
>
> Try it out and see if it works correctly. And pid namespaces are not
> the only thing these days from what I recall :)
>
I've tried `unshare --fork --pid --mount-proc cat /proc/all/stat`
which seems to behave correctly. ptrace flags are handled by the
existing code.


Best Wishes,
Eugene


From 2ffc2e388f7ce4e3f182c2442823e5f13bae03dd Mon Sep 17 00:00:00 2001
From: Eugene Lubarsky <elubarsky.linux@xxxxxxxxx>
Date: Tue, 25 Aug 2020 12:36:41 +1000
Subject: [RFC PATCH] fs/proc: /proc/all: add schedstat and cpuset

Signed-off-by: Eugene Lubarsky <elubarsky.linux@xxxxxxxxx>
---
fs/proc/base.c | 42 ++++++++++++++++++++++++++++++++++++++++++
1 file changed, 42 insertions(+)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 0bba4b3a985e..44d73f1ade4a 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -3944,6 +3944,36 @@ static int proc_all_io(struct seq_file *m, void *v)
}
#endif

+#ifdef CONFIG_PROC_PID_CPUSET
+static int proc_all_cpuset(struct seq_file *m, void *v)
+{
+ struct all_iter *iter = (struct all_iter *) v;
+ struct pid_namespace *ns = iter->ns;
+ struct task_struct *task = iter->tgid_iter.task;
+ struct pid *pid = task->thread_pid;
+
+ seq_put_decimal_ull(m, "", pid_nr_ns(pid, ns));
+ seq_puts(m, " ");
+
+ return proc_cpuset_show(m, ns, pid, task);
+}
+#endif
+
+#ifdef CONFIG_SCHED_INFO
+static int proc_all_schedstat(struct seq_file *m, void *v)
+{
+ struct all_iter *iter = (struct all_iter *) v;
+ struct pid_namespace *ns = iter->ns;
+ struct task_struct *task = iter->tgid_iter.task;
+ struct pid *pid = task->thread_pid;
+
+ seq_put_decimal_ull(m, "", pid_nr_ns(pid, ns));
+ seq_puts(m, " ");
+
+ return proc_pid_schedstat(m, ns, pid, task);
+}
+#endif
+
static int proc_all_statx(struct seq_file *m, void *v)
{
struct all_iter *iter = (struct all_iter *) v;
@@ -3990,6 +4020,12 @@ PROC_ALL_OPS(status);
#ifdef CONFIG_TASK_IO_ACCOUNTING
PROC_ALL_OPS(io);
#endif
+#ifdef CONFIG_SCHED_INFO
+ PROC_ALL_OPS(schedstat);
+#endif
+#ifdef CONFIG_PROC_PID_CPUSET
+ PROC_ALL_OPS(cpuset);
+#endif

#define PROC_ALL_CREATE(NAME) \
do { \
@@ -4011,4 +4047,10 @@ void __init proc_all_init(void)
#ifdef CONFIG_TASK_IO_ACCOUNTING
PROC_ALL_CREATE(io);
#endif
+#ifdef CONFIG_SCHED_INFO
+ PROC_ALL_CREATE(schedstat);
+#endif
+#ifdef CONFIG_PROC_PID_CPUSET
+ PROC_ALL_CREATE(cpuset);
+#endif
}
--
2.25.1
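
For anyone wanting to consume the new file from userspace, here is a
small parsing sketch. It assumes each line of /proc/all/schedstat is the
PID followed by the three fields that /proc/<pid>/schedstat prints
(time on CPU and time waiting on a runqueue, both assumed to be in
nanoseconds, then the number of timeslices run). The struct and function
names are hypothetical and not part of the patch.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical record for one line of /proc/all/schedstat. */
struct schedstat_rec {
	unsigned long long pid;
	unsigned long long run_ns;	/* time spent on the CPU */
	unsigned long long wait_ns;	/* time spent waiting on a runqueue */
	unsigned long timeslices;	/* number of timeslices run */
};

/*
 * Parse "<pid> <run> <wait> <timeslices>". Returns 0 on success, or -1 if
 * the line does not contain four numeric fields.
 */
int parse_schedstat_line(const char *line, struct schedstat_rec *out)
{
	assert(line != NULL && out != NULL);

	if (sscanf(line, "%llu %llu %llu %lu", &out->pid, &out->run_ns,
		   &out->wait_ns, &out->timeslices) != 4)
		return -1;
	return 0;
}
```

A reader would call this once per line returned by fgets() or by a
larger buffered read split on newlines.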