[RFC][PATCH 00/10] taskstats: Enhancements for precise accounting

From: Michael Holzheu
Date: Thu Sep 23 2010 - 09:48:14 EST


Currently tools like "top" gather the task information by reading procfs
files. This has several disadvantages:

* It is very CPU intensive, because a lot of system calls (readdir, open,
read, close) are necessary.
* No real task snapshot can be provided, because the system continues
running while the procfs files are read.
* The time granularity of procfs is restricted to jiffies.
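
To make the cost of the procfs approach concrete, here is a minimal sketch of
such a scan in Python. The function name scan_proc and the proc_root parameter
are made up for this illustration and are not part of the patch set; the only
thing taken from the kernel is the layout of /proc/<pid>/stat, where utime and
stime are fields 14 and 15, reported in clock ticks.

```python
import os


def scan_proc(proc_root="/proc"):
    """Return {pid: (utime_ticks, stime_ticks)} by walking procfs.

    Illustrative only: every task costs an open/read/close on top of the
    directory scan, and the tasks are sampled at different points in time.
    """
    times = {}
    for entry in os.listdir(proc_root):              # readdir
        if not entry.isdigit():
            continue                                 # not a PID directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:  # open
                data = f.read()                      # read, close on exit
        except OSError:
            continue                                 # task exited meanwhile
        # comm may contain spaces or ')', so split after the last ')'.
        fields = data.rsplit(")", 1)[1].split()
        # fields[0] is field 3 (state); utime and stime are fields 14 and 15.
        times[int(entry)] = (int(fields[11]), int(fields[12]))
    return times
```

Dividing the returned values by os.sysconf("SC_CLK_TCK") gives seconds. With
1000 tasks this loop issues on the order of 3000 system calls per refresh, and
the first and the last task are read at different times.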

In parallel to procfs there is the taskstats binary interface, which uses
netlink sockets as the transport mechanism to deliver task information to
user space. The existing taskstats command "TASKSTATS_CMD_ATTR_PID" returns
task information for a given PID. This command can already be used by tools
like top, but it also has several disadvantages:

* You first have to find out which PIDs are available in the system. Currently
we have to use procfs again to do this.
* For each task two system calls have to be issued (first send the command,
then receive the reply).
* No snapshot mechanism is available.
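
The system call counts mentioned above can be put side by side with a rough
model. This is a sketch under simplifying assumptions, not a measurement: the
function names are hypothetical, the directory scan is counted as a single
call, and one procfs file per task is assumed. The figure of two calls for a
full snapshot is the one stated for patch 05 in the overview below.

```python
def procfs_syscalls(ntasks, files_per_task=1):
    """One directory scan plus open/read/close for every file read."""
    return 1 + 3 * files_per_task * ntasks


def netlink_pid_syscalls(ntasks):
    """TASKSTATS_CMD_ATTR_PID: one send and one receive per task.

    The PID discovery through procfs is not included here.
    """
    return 2 * ntasks


def snapshot_syscalls():
    """Procfs taskstats interface of patch 05: one ioctl plus one read."""
    return 2


if __name__ == "__main__":
    for n in (10, 1000):
        print(n, procfs_syscalls(n), netlink_pid_syscalls(n),
              snapshot_syscalls())
```

The point of the model is only the scaling: the first two grow linearly with
the number of tasks, the snapshot interface does not.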

GOALS OF THIS PATCH SET
-----------------------
The intention of this patch set is to provide better support for tools like
top. The goals are to:

* provide a task snapshot mechanism where we can get a consistent view of
all running tasks.
* provide a transport mechanism that does not require a lot of system calls
and that allows implementing low CPU overhead task monitoring.
* provide microsecond CPU time granularity.
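
Why the granularity matters can be shown with a small worked example. The
sketch below is a simplified model of what a jiffies-granular interface can
represent: it truncates a CPU time to whole jiffies and converts it back. The
function names are made up, and HZ=100 is an assumption for illustration; the
real value depends on the kernel configuration.

```python
def us_to_jiffies(cpu_time_us, hz=100):
    """Truncate a CPU time in microseconds to whole jiffies."""
    jiffy_us = 1_000_000 // hz       # length of one jiffy in microseconds
    return cpu_time_us // jiffy_us


def jiffies_to_us(jiffies, hz=100):
    """Convert whole jiffies back to microseconds."""
    return jiffies * (1_000_000 // hz)


def reported_us(cpu_time_us, hz=100):
    """CPU time that survives a round trip through jiffies."""
    return jiffies_to_us(us_to_jiffies(cpu_time_us, hz), hz)
```

In this model a task that consumed 3700 microseconds in an interval is
reported with 0, and one that consumed 13700 microseconds is reported with
10000. With microsecond counters both values are visible as they are.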

FIRST RESULTS
-------------
Together with this kernel patch set, user space code for a new top utility
(ptop) is provided that exploits the new kernel infrastructure. See patch 10
for more details.

TEST1: System with many sleeping tasks

for ((i=0; i < 1000; i++))
do
  sleep 1000000 &
done

# ptop_new_proc

            VVVV
  pid  user  sys  ste  total Name
  (#)   (%)  (%)  (%)    (%) (str)
  541  0.37 2.39 0.10   2.87 top
 3743  0.03 0.05 0.00   0.07 ptop_new_proc
            ^^^^

Compared to the old top command, which has to scan more than 1000 proc
directories, the new ptop consumes much less CPU time (0.05% system time
versus 2.39% for top on my s390 system).

TEST2: Show snapshot consistency on a system that is 100% busy

System with 3 CPUs:

for ((i=0; i < $(grep -c "^processor" /proc/cpuinfo); i++))
do
  ./loop &
done

# ptop_snap_proc

         VVVV  VVV  VVV                        VVVVV
  pid    user  sys  ste cuser csys cste delay  total Elap+ Name
  (#)     (%)  (%)  (%)   (%)  (%)  (%)   (%)    (%)  (hm) (str)
23891   99.84 0.06 0.09  0.00 0.00 0.00  0.01  99.99  0:00 loop
23881   99.66 0.06 0.09  0.00 0.00 0.00  0.20  99.81  0:00 loop
23886   99.65 0.06 0.09  0.00 0.00 0.00  0.20  99.80  0:00 loop
 2413    0.00 0.00 0.00  0.00 0.00 0.00  0.00   0.01  4:17 sshd
...
V:V:S  299.36 0.36 0.27  0.00 0.00 0.00  0.40 300.00  4:22
                                              ^^^^^^

With the snapshot mechanism, the sum of all tasks' CPU times (user + system +
steal) is exactly 300.00% in this test case. With ptop_snap_proc (see
patch 10) this works fine on s390.
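
The consistency check behind this test can be sketched as follows. This is an
illustration only, not the ptop code: the function names and the data layout
(a mapping from PID to cumulative user, system and steal times in
microseconds) are assumptions, and tasks that exit between the two snapshots
are ignored.

```python
def cpu_percent(prev, curr, interval_us):
    """Per-task CPU usage in percent between two snapshots.

    prev and curr map pid -> (user_us, sys_us, steal_us), all cumulative.
    Tasks missing from prev are treated as newly created.
    """
    result = {}
    for pid, now in curr.items():
        before = prev.get(pid, (0, 0, 0))
        result[pid] = tuple(100.0 * (n - b) / interval_us
                            for n, b in zip(now, before))
    return result


def total_percent(per_task):
    """Sum of user + system + steal over all tasks."""
    return sum(sum(values) for values in per_task.values())
```

If both snapshots are consistent and the system is fully busy, total_percent()
comes out at 100 times the number of CPUs, up to floating point rounding. If
the tasks are sampled one after another, the sum can be above or below that
value.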

PATCHSET OVERVIEW
-----------------
The code is not final and still has a few TODOs. But it is good enough for a
first round of review. The following kernel patches are provided:

[01] Prepare-0: Use real microsecond granularity for taskstats CPU times.
[02] Prepare-1: Restructure taskstats.c in order to be able to add new commands
more easily.
[03] Prepare-2: Separate the finding of a task_struct by PID or TGID from
filling the taskstats.
[04] Add new command "TASKSTATS_CMD_ATTR_PIDS" to get a snapshot of multiple
tasks.
[05] Add procfs interface for taskstats commands. This allows getting a
complete and consistent snapshot of all tasks using two system calls (ioctl
and read). Transferring a snapshot of all running tasks is not possible
with the existing netlink interface, because there the socket buffer size
is the limiting factor.
[06] Add TGID to taskstats.
[07] Add steal time per task accounting.
[08] Add cumulative CPU time (user, system and steal) to taskstats.
[09] Fix exit CPU time accounting.

[10] In addition to the kernel patches, user space code is provided that
exploits the new kernel infrastructure. The user space code provides the
following:
1. A proposal for a taskstats user space library:
1.1 Based on netlink (requires libnl-devel-1.1-5)
1.2 Based on the new /proc/taskstats interface (see [05])
2. A proposal for a task snapshot library based on taskstats library (1.1)
3. A new tool "ptop" (precise top) that uses the libraries

