Re: [PATCH v2] docs: add system-state document to admin-guide

From: Jonathan Corbet
Date: Thu Mar 23 2023 - 13:55:42 EST


Shuah Khan <skhan@xxxxxxxxxxxxxxxxxxx> writes:

> Add a new system state document to the admin-guide. This document is
> intended to be used as a guide on how to gather higher level information
> about a system and its run-time activity.
>
> Signed-off-by: Shuah Khan <skhan@xxxxxxxxxxxxxxxxxxx>
> ---
> Changes since v1:
> -- Addressed review comments
>
> Documentation/admin-guide/index.rst | 1 +
> Documentation/admin-guide/system-state.rst | 350 +++++++++++++++++++++
> 2 files changed, 351 insertions(+)
> create mode 100644 Documentation/admin-guide/system-state.rst
>
> diff --git a/Documentation/admin-guide/index.rst b/Documentation/admin-guide/index.rst
> index f475554382e2..541372672c55 100644
> --- a/Documentation/admin-guide/index.rst
> +++ b/Documentation/admin-guide/index.rst
> @@ -66,6 +66,7 @@ subsystems expectations will be found here.
> :maxdepth: 1
>
> workload-tracing
> + system-state
>
> The rest of this manual consists of various unordered guides on how to
> configure specific aspects of kernel behavior to your liking.
> diff --git a/Documentation/admin-guide/system-state.rst b/Documentation/admin-guide/system-state.rst
> new file mode 100644
> index 000000000000..2a6fdf85c35c
> --- /dev/null
> +++ b/Documentation/admin-guide/system-state.rst
> @@ -0,0 +1,350 @@
> +.. SPDX-License-Identifier: (GPL-2.0+ OR CC-BY-4.0)
> +
> +===========================================================
> +Discovering system calls and features supported on a system
> +===========================================================
> +
> +:Author: Shuah Khan <skhan@xxxxxxxxxxxxxxxxxxx>
> +:maintained-by: Shuah Khan <skhan@xxxxxxxxxxxxxxxxxxx>

Rather than adding lines like this, I think everybody would be better
served with a MAINTAINERS file entry. get_maintainer.pl doesn't know
about these lines.

> +Key Points
> +==========
> +
> + * System state includes system calls, features, static and dynamic
> + modules enabled in the kernel configuration.
> + * Supported system calls and Kernel features are architecture dependent.
> + * auditd, checksyscalls.sh, and get_feat.pl tools can be used to discover
> + static system state.
> + * Understanding Linux kernel hardening configurations options and making
> + sure they are enabled will make a system more secure.
> + * Employing run-time tracing can shed light on the dynamic system state.
> + * Workloads could change the system state by loading and unloading dynamic
> + modules and tuning system parameters.

What I'm missing, before even this list, is a paragraph saying what this
document is actually for. Who is the intended audience, and why might
they want to read it?

> +System State Visualization
> +==========================
> +
> +The kernel system state can be viewed as a combination of static and
> +dynamic features and modules. Let’s first define what static and dynamic
> +system states are and then explore how we can visualize the static and
> +dynamic system parts of the kernel.
> +
> +Static System View comprises system calls, features, static and dynamic
> +modules enabled in the kernel configuration. Supported system calls

So the "static system view" includes *dynamic* modules? Fine if that's
what you intended, but it reads a bit strangely.

> +and Kernel features are architecture dependent. System call numbering is
> +different on different architectures. We can get the supported system call
> +information using auditd utilities.
> +
> +ausyscall –dump prints out the supported system calls on a system and allows

Some clever software turned your "--" into an em-dash here.

> +mapping syscall names and numbers. You can install the auditd package on
> +Debian based systems::
> +
> + sudo apt-get install auditd
> +
> +scripts/checksyscalls.sh can be used to check if current architecture is
> +missing any system calls compared to i386.
> +
> +scripts/get_feat.pl can be used to list the Kernel feature support matrix
> +for an architecture.
> +
> +Dynamic System View comprises system calls, ioctls invoked, and subsystems
> +used during the runtime. A workload could load and unload modules and also
> +change the dynamic system configuration to suit its needs by tuning system
> +parameters.
> +
> +What is the methodology?
> +========================
> +
> +The first step is gathering the default system state such as the dynamic
> +and static modules loaded on the system. lsmod command prints out the

*The* lsmod command

> +dynamically loaded modules on a system. Statically configured modules can
> +be found in the kernel configuration file.
> +
> +The next step is discovering system activity during run-time. You can do so
> +by enabling event tracing and then running your favorite application. After
> +a period of time, gather the event logs, and kernel messages.

Might your intended readers need a hint on enabling tracing? A cross
reference to the appropriate docs if nothing else.

[Later I see you get to this; adding an "as described below" would help
here.]

> +Once you have the necessary information, you can extract the system call
> +numbers from the event trace log and map them to the supported system calls.
> +
> +Finding supported system calls
> +==============================
> +
> +As mentioned earlier, ausyscall prints out supported system calls
> +on a system and allows mapping syscalls names and numbers::
> +
> + ausyscall --dump
> +
> +You can look for specific system calls as shown in the below::
> +
> + ausyscall open
> + open 2
> + mq_open 240
> + openat 257
> + perf_event_open 298
> + open_by_handle_at 304
> + open_tree 428
> + fsopen 430
> + pidfd_open 434
> + openat2 437
> +
> + ausyscall time
> +
> + getitimer 36
> + setitimer 38
> + gettimeofday 96
> + times 100
> + rt_sigtimedwait 128
> + utime 132
> + adjtimex 159
> + settimeofday 164
> + time 201
> + semtimedop 220
> + timer_create 222
> + timer_settime 223
> + timer_gettime 224
> + timer_getoverrun 225
> + timer_delete 226
> + clock_settime 227
> + clock_gettime 228
> + utimes 235
> + mq_timedsend 242
> + mq_timedreceive 243
> + futimesat 261
> + utimensat 280
> + timerfd_create 283
> + timerfd_settime 286
> + timerfd_gettime 287
> + clock_adjtime 305
> +
> +Finding unsupported system calls
> +================================
> +
> +As mentioned earlier, scripts/checksyscalls.sh checks missing system calls
> +on current architecture compared to i386. Example run::
> +
> + checksyscalls.sh gcc
> + warning: #warning syscall mmap2 not implemented [-Wcpp]
> + warning: #warning syscall truncate64 not implemented [-Wcpp]
> + warning: #warning syscall ftruncate64 not implemented [-Wcpp]
> + warning: #warning syscall fcntl64 not implemented [-Wcpp]
> + warning: #warning syscall sendfile64 not implemented [-Wcpp]
> + warning: #warning syscall statfs64 not implemented [-Wcpp]
> + warning: #warning syscall fstatfs64 not implemented [-Wcpp]
> + warning: #warning syscall fadvise64_64 not implemented [-Wcpp]
> +
> +Let's check this against ausyscall now::
> +
> + ausyscall map
> + mmap 9
> + munmap 11
> + mremap 25
> + remap_file_pages 216
> +
> + ausyscall trunc
> + truncate 76
> + ftruncate 77
> +
> +As you can see, ausyscall shows mmap2, truncate64, and ftruncate64 aren't
> +implemented on this system. This matches what checksyscalls.sh shows.
> +
> +Finding supported features
> +==========================
> +
> +scripts/get_feat.pl can be used to list the Kernel feature support matrix
> +for an architecture::
> +
> + get_feat.pl list
> + get_feat.pl list –arch=arm64 lists

The "--" got lost again here.

> +This scripts parses Documentation/features to find the support status

script (singular)

> +information. It can be used to validate the contents of the files under
> +Documentation/features or simply list them::
> +
> + --arch Outputs features for an specific architecture, optionally filtering
> + for a single specific feature.
> + --feat or --feature Output features for a single specific feature.
> +
> +Here is how you can find if stackprotector and hread-info-in-task features

and *thread*-info-in-task

> +are supported::
> +
> + scripts/get_feat.pl --arch=arm64 --feat=stackprotector list
> + #
> + # Kernel feature support matrix of the 'arm64' architecture:
> + #
> + debug/ stackprotector : ok | HAVE_STACKPROTECTOR #
> + arch supports compiler driven stack overflow protection
> +
> + scripts/get_feat.pl --feat=thread-info-in-task list
> + #
> + # Kernel feature support matrix of the 'x86' architecture:
> + #
> + core/ thread-info-in-task : ok | THREAD_INFO_IN_TASK #
> + arch makes use of the core kernel facility to embed thread_info in
> + task_struct
> +
> +Finding kernel module status
> +============================
> +
> +lsmod command shows the kernel modules that are currently loaded. This
> +program displays the contents of /proc/modules. Let's pick uvcvideo

*The* lsmod
*the* uvcvideo

> +module which is found on most laptops::
> +
> + lsmod | grep uvc
> + uvcvideo 126976 0
> + videobuf2_vmalloc 20480 1 uvcvideo
> + uvc 16384 1 uvcvideo
> + videobuf2_v4l2 36864 1 uvcvideo
> + videodev 315392 2 videobuf2_v4l2,uvcvideo
> + videobuf2_common 65536 4 videobuf2_vmalloc,videobuf2_v4l2,uvcvideo,videobuf2_memops
> + mc 77824 4 videodev,videobuf2_v4l2,uvcvideo,videobuf2_common
> +
> +You can see that lsmod shows uvcvideo and the modules it depends on and how
> +many modules are using them. videobuf2_common is in use by 4 other modules.
> +In other words, this is the reference count for this module and rmmod will
> +refuse to unload it as long as the reference count is > 0.
> +
> +You can get the same information from /proc.modules::
> +
> + less /proc/modules | grep uvc

Why not just "grep uvc /proc/modules"?
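
If the point is the reference counts, something like the following might
be easier on readers than raw grep output. It is only a sketch: the
function name mod_refs is made up for illustration, and it assumes the
usual /proc/modules field order (name, size, reference count, users,
state, address).

```shell
# Print "name refcount" for /proc/modules lines matching a pattern.
# mod_refs is a hypothetical helper name, not an existing tool.
# The optional second argument lets it run on saved output instead
# of the live /proc/modules file.
mod_refs() {
    pattern="$1"
    file="${2:-/proc/modules}"
    # Field 1 is the module name, field 3 its reference count.
    awk -v pat="$pattern" '$0 ~ pat { print $1, $3 }' "$file"
}

# Intended use (not run here):
#   mod_refs uvc
```

Like the grep, it matches anywhere on the line, so modules that merely
list uvcvideo as a user show up too.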

> + uvcvideo 126976 0 - Live 0x0000000000000000
> + videobuf2_vmalloc 20480 1 uvcvideo, Live 0x0000000000000000
> + uvc 16384 1 uvcvideo, Live 0x0000000000000000
> + videobuf2_v4l2 36864 1 uvcvideo, Live 0x0000000000000000
> + videodev 315392 2 uvcvideo,videobuf2_v4l2, Live 0x0000000000000000
> + videobuf2_common 65536 4 uvcvideo,videobuf2_vmalloc,videobuf2_memops,videobuf2_v4l2, Live 0x0000000000000000
> + mc 77824 4 uvcvideo,videobuf2_v4l2,videodev,videobuf2_common, Live 0x0000000000000000
> +
> +The information is similar with a few more extra fields. The address is the
> +base address for the module in kernel virtual memory space. When run as a
> +normal user, the address is all zeros. The same command when run as root will
> +be as follows::
> +
> + sudo less /proc/modules | grep uvc
> + uvcvideo 126976 0 - Live 0xffffffffc1c8b000
> + videobuf2_vmalloc 20480 1 uvcvideo, Live 0xffffffffc167f000
> + uvc 16384 1 uvcvideo, Live 0xffffffffc0ab0000
> + videobuf2_v4l2 36864 1 uvcvideo, Live 0xffffffffc0a28000
> + videodev 315392 2 uvcvideo,videobuf2_v4l2, Live 0xffffffffc16e9000
> + videobuf2_common 65536 4 uvcvideo,videobuf2_vmalloc,videobuf2_memops,videobuf2_v4l2, Live 0xffffffffc094d000
> + mc 77824 4 uvcvideo,videobuf2_v4l2,videodev,videobuf2_common, Live 0xffffffffc15eb000
> +
> +Let's check what modinfo shows that is important for us::
> +
> + /sbin/modinfo uvcvideo
> + filename: /lib/modules/6.3.0-rc2/kernel/drivers/media/usb/uvc/uvcvideo.ko
> + license: GPL
> + description: USB Video Class driver
> + depends: videobuf2-v4l2,videodev,mc,uvc,videobuf2-common,videobuf2-vmalloc
> + retpoline: Y
> + intree: Y
> + name: uvcvideo
> + vermagic: 6.3.0-rc2 SMP preempt mod_unload modversions
> + sig_id: PKCS#7
> + signer: Build time autogenerated kernel key
> +
> +This tells us that this module is built intree and the signed with a build
> +time autogenerated key.
> +
> +Let's do one last sanity check on the system to see if the following two
> +command outputs match::
> +
> + ps ax | wc -l
> + ls -d /proc/* | grep [0-9]|wc -l
> +
> +If they don't match, examine your system closely. kernel rootkits install
> +their own ps, find, etc. utilities to mask their activity. The outputs
> +match on my system. Do they on yours?

This would assume that there is no other activity on the system, of
course. Worth saying to avoid unnecessary panic.
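
The two commands also don't count the same thing even on a quiet system:
"ps ax" prints a header line, and the unquoted, unanchored "grep [0-9]"
matches any name containing a digit (and is open to glob expansion by
the shell). A stricter count of the /proc side could look like the
sketch below; count_pid_dirs is a name I made up, and the directory
argument exists only so it can be tried somewhere other than /proc.

```shell
# Count directories whose names consist only of digits, which is how
# process entries appear under /proc.
# count_pid_dirs is a hypothetical helper name.
count_pid_dirs() {
    dir="${1:-/proc}"
    n=0
    for d in "$dir"/[0-9]*; do
        case "${d##*/}" in
            *[!0-9]*) ;;                          # name has a non-digit: skip
            *) [ -d "$d" ] && n=$((n + 1)) ;;     # all digits and a directory
        esac
    done
    printf '%s\n' "$n"
}

# Intended use (not run here); --no-headers drops the ps header line:
#   ps ax --no-headers | wc -l
#   count_pid_dirs
```

Processes come and go between the two commands, so a small difference
is still expected.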

> +Is my system as secure as it could be?
> +======================================
> +
> +Linux kernel supports several hardening options to make system secure.

*The* Linux kernel ... to make *the* system secure

The whole document could use a pass for article use.

> +kconfig-hardened-check tool sanity checks kernel configuration for
> +security. You can clone the latest kconfig-hardened-check repository::
> +
> + git clone https://github.com/a13xp0p0v/kconfig-hardened-check.git
> + cd kconfig-hardened-check
> + bin/kconfig-hardened-check --config <config file> --cmdline /proc/cmdline

Should you say what <config file> is?
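
For example, many distributions install the running kernel's
configuration as /boot/config-$(uname -r), and a built tree has it as
.config. A rough sketch of how a reader might locate one follows; the
helper name find_kconfig and its two arguments are hypothetical, and
both paths are conventions rather than guarantees.

```shell
# Print the first readable kernel config file among common locations.
# find_kconfig is a hypothetical helper name.
#   $1: optional root prefix (empty for the live system)
#   $2: optional kernel release (defaults to the running kernel)
find_kconfig() {
    root="${1:-}"
    rel="${2:-$(uname -r)}"
    for f in "$root/boot/config-$rel" \
             "$root/lib/modules/$rel/build/.config"; do
        if [ -r "$f" ]; then
            printf '%s\n' "$f"
            return 0
        fi
    done
    return 1    # nothing found in the assumed locations
}

# Intended use (not run here):
#   bin/kconfig-hardened-check --config "$(find_kconfig)" --cmdline /proc/cmdline
```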

> +This will generate detailed report of kernel security configuration and
> +command line options that are enabled (OK) and the ones that aren't (FAIL)
> +and a summary line at the end::
> +
> + [+] Config check is finished: 'OK' - 100 / 'FAIL' - 100
> +
> +You will have to analyze the information to determine which options make
> +sense to enable on your system.
> +
> +Understanding system run-time activity
> +======================================
> +
> +Enabling event tracing gives insight into system run-time activity. This is
> +a good way to identify which parts of the kernel are used at a higher level
> +while system is in and/or while a specific workload/process is running.
> +
> +Event tracing depends on the CONFIG_EVENT_TRACING option enabled. You can
> +enable event tracing before starting workload/process. Event tracing allows
> +you to dynamically enable and disable tracing on supported/available events.
> +You can find available events, tracers, and filter functions in the following
> +files::
> +
> + /sys/kernel/debug/tracing/available_events
> + /sys/kernel/debug/tracing/available_filter_functions
> + /sys/kernel/debug/tracing/available_tracers
> +
> +Now this is how you can enable tracing::
> +
> + sudo echo 1 > /sys/kernel/debug/tracing/events/enable
> +
> +Once the workload/process stops or when you decide you have the status you
> +need, you can disable event tracing::
> +
> + sudo echo 0 > /sys/kernel/debug/tracing/events/enable
> +
> +You can find the tracing information in the file::
> +
> + /sys/kernel/debug/tracing
> +
> +Here is the information shown in this file::
> +
> + cat trace
> + # tracer: nop
> + #
> + # entries-in-buffer/entries-written: 0/0 #P:16
> + #
> + # _-----=> irqs-off/BH-disabled
> + # / _----=> need-resched
> + # | / _---=> hardirq/softirq
> + # || / _--=> preempt-depth
> + # ||| / _-=> migrate-disable
> + # |||| / delay
> + # TASK-PID CPU# ||||| TIMESTAMP FUNCTION
> + # | | | ||||| | |
> +

That looks like the header, certainly not "the information" found in the
file. Including some actual output would make the following discussion
more comprehensible.
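
Separately, the "sudo echo 1 > ..." commands above won't work for an
unprivileged user: the redirection is opened by the calling shell
before sudo ever runs, so only echo gets the privileges. Piping into
"sudo tee" is the usual way around that. A minimal sketch, where
write_knob and the SUDO override are names I invented so the function
can be tried on an ordinary file:

```shell
# Write a value to a control file with elevated privileges.
# write_knob is a hypothetical helper name. SUDO is a hypothetical
# override: leave it unset to use sudo, set it empty to run tee directly.
write_knob() {
    value="$1"
    path="$2"
    # tee performs the write, so the privilege applies to the file open.
    printf '%s\n' "$value" | ${SUDO-sudo} tee "$path" > /dev/null
}

# Intended use (not run here):
#   write_knob 1 /sys/kernel/debug/tracing/events/enable
#   ... run the workload ...
#   write_knob 0 /sys/kernel/debug/tracing/events/enable
```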

> +Analyzing traces
> +================
> +
> +You will be able map the functions to system calls and other kernel features
> +to get insight into the overall system activity while a workload/process is
> +running.
> +
> +Map the NR (syscal) numbers from the trace to syscalls from the syscalls dump.

(syscall)

> +Categorize system calls and map them to Linux subsystems.

Not sure what that sentence is trying to tell readers. Again, who is
the audience; will a readership that needs to be told how to install
auditd be able to make sense of this and act on it?
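
A concrete example of the mapping step would go a long way. Something
in this spirit, perhaps. It is a sketch under assumptions: it expects
trace lines carrying "NR <number>" (as the raw_syscalls sys_enter event
prints them) and a saved table with one "number name" pair per line,
which is how I understand "ausyscall --dump" output to look.
count_syscalls is a made-up name.

```shell
# Count syscalls by name from trace text containing "NR <number>".
# count_syscalls is a hypothetical helper name.
#   $1: table file with "number name" per line (assumed format)
#   $2: trace file (reads stdin if omitted)
count_syscalls() {
    table="$1"
    trace="${2:--}"
    awk '
        # First file: build number -> name map, skipping non-numeric headers.
        NR == FNR { if ($1 ~ /^[0-9]+$/) name[$1] = $2; next }
        # Second file: pull the number that follows "NR ".
        match($0, /NR [0-9]+/) {
            n = substr($0, RSTART + 3, RLENGTH - 3)
            count[(n in name) ? name[n] : "unknown_" n]++
        }
        END { for (s in count) print count[s], s }
    ' "$table" "$trace" | LC_ALL=C sort -k1,1nr -k2,2
}

# Intended use (not run here):
#   ausyscall --dump > syscalls.txt
#   count_syscalls syscalls.txt /sys/kernel/debug/tracing/trace
```

Numbers missing from the table are reported as unknown_<number> rather
than dropped, so a mismatched architecture table is easy to spot.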

> +Conclusion
> +==========
> +
> +This document is intended to be used as a guide on how to gather higher level
> +information about a system and its run-time activity. The approach described
> +in this document helps us get insight into supported system calls, features,
> +assess how secure a system is, and its run-time activity.
> +
> +References
> +==========
> +
> + * https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/scripts/checksyscalls.sh
> + * https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/scripts/get_feat.pl
> + * https://github.com/a13xp0p0v/kconfig-hardened-check
> + * https://docs.kernel.org/trace/index.html

Thanks,

jon