Re: [PATCH man-pages] bpf.2: new page documenting bpf(2)

From: Michael Kerrisk (man-pages)
Date: Tue Mar 10 2015 - 01:50:29 EST


Hi Alexei,

The page needs a license. See
https://www.kernel.org/doc/man-pages/licenses.html
for some possible choices.

Thanks,

Michael

On 03/09/2015 11:10 PM, Alexei Starovoitov wrote:
> Signed-off-by: Alexei Starovoitov <ast@xxxxxxxxxxxx>
> ---
> man2/bpf.2 | 593 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 593 insertions(+)
> create mode 100644 man2/bpf.2
>
> diff --git a/man2/bpf.2 b/man2/bpf.2
> new file mode 100644
> index 0000000..21b42b4
> --- /dev/null
> +++ b/man2/bpf.2
> @@ -0,0 +1,593 @@
> +.TH BPF 2 2015-03-09 "Linux" "Linux Programmer's Manual"
> +.SH NAME
> +bpf - perform a command on extended BPF map or program
> +.SH SYNOPSIS
> +.nf
> +.B #include <linux/bpf.h>
> +.sp
> +.BI "int bpf(int cmd, union bpf_attr *attr, unsigned int size);"
> +
> +.SH DESCRIPTION
> +.BR bpf ()
> +is a multiplexor system call for a range of operations on extended BPF,
> +which can be characterized as a "universal in-kernel virtual machine".
> +Extended BPF (or eBPF) is similar to the original Berkeley Packet Filter
> +(or "classic BPF") used to filter network packets. Both statically analyze
> +programs before loading them into the kernel to ensure that they cannot
> +harm the running system.
> +.P
> +eBPF extends classic BPF in multiple ways, including the ability to call
> +in-kernel helper functions and access shared data structures such as BPF maps.
> +Programs can be written in a restricted C that is compiled into
> +eBPF bytecode and executed on the in-kernel virtual machine or JITed into the
> +native instruction set.
> +.SS Extended BPF Design/Architecture
> +.P
> +BPF maps are a generic data store of different types.
> +A user process can create multiple maps (with key/value being
> +opaque bytes of data) and access them via file descriptors. In parallel, BPF
> +programs can access maps from inside the kernel.
> +It is up to the user process and the BPF program to decide what they store in maps.
> +.P
> +BPF programs are similar to kernel modules. They are loaded by a user
> +process and automatically unloaded when the process exits. Each BPF program
> +is a safe run-to-completion set of instructions. The BPF verifier statically
> +determines that the program terminates and is safe to execute. During
> +verification, the program takes a hold of the maps that it intends to use,
> +so the selected maps cannot be removed until the program is unloaded. The
> +program can be attached to different events. These events can be packets,
> +tracing events, and other types in the future. A new event triggers execution
> +of the program, which may store information about the event in the maps.
> +Beyond storing data, programs may call into in-kernel helper functions.
> +The same program can be attached to multiple events. Different programs can
> +access the same map:
> +.nf
> + tracing     tracing     tracing     packet     packet
> + event A     event B     event C     on eth0    on eth1
> +  |             |         |            |          |
> +  |             |         |            |          |
> +  --> tracing <--      tracing       socket     socket
> +      prog_1           prog_2        prog_3     prog_4
> +       |  |               |            |
> +    |---  -----|  |-------|          map_3
> +  map_1       map_2
> +.fi
> +.SS Syscall Arguments
> +.B bpf()
> +syscall operation is determined by
> +.IR cmd
> +which can be one of the following:
> +.TP
> +.B BPF_MAP_CREATE
> +Create a map with given type and attributes and return map FD
> +.TP
> +.B BPF_MAP_LOOKUP_ELEM
> +Look up an element by key in a given map and return its value
> +.TP
> +.B BPF_MAP_UPDATE_ELEM
> +Create or update an element (key/value pair) in a given map
> +.TP
> +.B BPF_MAP_DELETE_ELEM
> +Look up and delete an element by key in a given map
> +.TP
> +.B BPF_MAP_GET_NEXT_KEY
> +Look up an element by key in a given map and return the key of the next element
> +.TP
> +.B BPF_PROG_LOAD
> +Verify and load BPF program
> +.TP
> +.B attr
> +is a pointer to a union of type bpf_attr as defined below.
> +.TP
> +.B size
> +is the size of the union.
> +.P
> +.nf
> +union bpf_attr {
> + struct { /* anonymous struct used by BPF_MAP_CREATE command */
> + __u32 map_type;
> + __u32 key_size; /* size of key in bytes */
> + __u32 value_size; /* size of value in bytes */
> + __u32 max_entries; /* max number of entries in a map */
> + };
> +
> + struct { /* anonymous struct used by BPF_MAP_*_ELEM commands */
> + __u32 map_fd;
> + __aligned_u64 key;
> + union {
> + __aligned_u64 value;
> + __aligned_u64 next_key;
> + };
> + __u64 flags;
> + };
> +
> + struct { /* anonymous struct used by BPF_PROG_LOAD command */
> + __u32 prog_type;
> + __u32 insn_cnt;
> + __aligned_u64 insns; /* 'const struct bpf_insn *' */
> + __aligned_u64 license; /* 'const char *' */
> + __u32 log_level; /* verbosity level of verifier */
> + __u32 log_size; /* size of user buffer */
> + __aligned_u64 log_buf; /* user supplied 'char *' buffer */
> + };
> +} __attribute__((aligned(8)));
> +.fi
> +.SS BPF maps
> +Maps are a generic data store of different types for sharing data between the
> +kernel and user space.
> +
> +Any map type has the following attributes:
> + . type
> + . max number of elements
> + . key size in bytes
> + . value size in bytes
> +
> +The following wrapper functions demonstrate how this syscall can be used to
> +access the maps. The functions use the
> +.IR cmd
> +argument to invoke different operations.
> +.TP
> +.B BPF_MAP_CREATE
> +.nf
> +int bpf_create_map(enum bpf_map_type map_type, int key_size,
> + int value_size, int max_entries)
> +{
> + union bpf_attr attr = {
> + .map_type = map_type,
> + .key_size = key_size,
> + .value_size = value_size,
> + .max_entries = max_entries
> + };
> +
> + return bpf(BPF_MAP_CREATE, &attr, sizeof(attr));
> +}
> +.fi
> +The bpf() call creates a map of
> +.I map_type
> +type with the given attributes
> +.I key_size, value_size, max_entries.
> +On success, it returns a process-local file descriptor. On error, \-1 is returned and
> +.I errno
> +is set to EINVAL, EPERM, or ENOMEM.
> +
> +The attributes
> +.I key_size
> +and
> +.I value_size
> +are used by the verifier during program loading to check that the program calls
> +bpf_map_*_elem() helper functions with a correctly initialized
> +.I key
> +and that the program does not access the map element
> +.I value
> +beyond the specified
> +.I value_size.
> +For example, when a map is created with key_size = 8 and the program does:
> +.nf
> +bpf_map_lookup_elem(map_fd, fp - 4)
> +.fi
> +the program will be rejected,
> +since the in-kernel helper function bpf_map_lookup_elem(map_fd, void *key) expects
> +to read 8 bytes from the 'key' pointer, but the 'fp - 4' starting address would
> +cause an out-of-bounds stack access.
> +
> +Similarly, when a map is created with value_size = 1 and the program does:
> +.nf
> +value = bpf_map_lookup_elem(...);
> +*(u32 *)value = 1;
> +.fi
> +the program will be rejected, since it accesses the
> +.I value
> +pointer beyond the specified 1-byte value_size limit.
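
The two rejections described above can be pictured with a toy range check. This is only a model for readers, not the verifier's implementation: the 512-byte stack size (MODEL_STACK_SIZE) and every function name below are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumption: the eBPF stack is 512 bytes, addressed as fp-512 .. fp-1. */
#define MODEL_STACK_SIZE 512

/* True if the byte range [off, off + size) lies entirely inside [lo, hi). */
static bool range_ok(long off, long size, long lo, long hi)
{
    return size > 0 && off >= lo && off + size <= hi;
}

/* A helper reads key_size bytes starting at fp + off. */
static bool stack_key_ok(long off, long key_size)
{
    return range_ok(off, key_size, -MODEL_STACK_SIZE, 0);
}

/* A program touches access_size bytes starting at value + off. */
static bool value_access_ok(long off, long access_size, long value_size)
{
    return range_ok(off, access_size, 0, value_size);
}

/* The two cases from the text, plus their accepted counterparts. */
static void self_check(void)
{
    assert(!stack_key_ok(-4, 8));     /* fp - 4 with key_size = 8: rejected */
    assert(stack_key_ok(-8, 8));      /* fp - 8 with key_size = 8: fits */
    assert(!value_access_ok(0, 4, 1)); /* u32 store into a 1-byte value */
    assert(value_access_ok(0, 1, 1));  /* u8 store into a 1-byte value */
}
```

In this model, reading 8 bytes from fp - 4 would run 4 bytes past the top of the stack, which is the out-of-bounds access the text refers to.
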
> +
> +Currently, two
> +.I map_type
> +values are supported:
> +.nf
> +enum bpf_map_type {
> + BPF_MAP_TYPE_UNSPEC,
> + BPF_MAP_TYPE_HASH,
> + BPF_MAP_TYPE_ARRAY,
> +};
> +.fi
> +.I map_type
> +selects one of the available map implementations in the kernel. For all map
> +types, programs access maps with the same bpf_map_lookup_elem()/bpf_map_update_elem()
> +helper functions.
> +.TP
> +.B BPF_MAP_LOOKUP_ELEM
> +.nf
> +int bpf_lookup_elem(int fd, void *key, void *value)
> +{
> + union bpf_attr attr = {
> + .map_fd = fd,
> + .key = ptr_to_u64(key),
> + .value = ptr_to_u64(value),
> + };
> +
> + return bpf(BPF_MAP_LOOKUP_ELEM, &attr, sizeof(attr));
> +}
> +.fi
> +The bpf() call looks up an element with the given
> +.I key
> +in the map
> +.I fd.
> +If the element is found, it returns zero and stores the element's value in
> +.I value.
> +If the element is not found, it returns \-1 and sets
> +.I errno
> +to ENOENT.
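
The wrappers in the page call bpf() and ptr_to_u64() without showing them. Below is a minimal sketch of what they could look like, assuming there is no libc wrapper for the system call and that <linux/bpf.h> and __NR_bpf are available from the installed kernel headers. The memset() is a precaution chosen here (rather than a designated initializer) so that every unused byte of the union is zero.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <linux/bpf.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Assumption: no libc wrapper exists, so invoke the raw system call. */
static int bpf(int cmd, union bpf_attr *attr, unsigned int size)
{
    return (int)syscall(__NR_bpf, cmd, attr, size);
}

/* Widen a user-space pointer to the 64-bit field type used in bpf_attr. */
static uint64_t ptr_to_u64(const void *ptr)
{
    static_assert(sizeof(void *) <= sizeof(uint64_t),
                  "pointer must fit in 64 bits");
    return (uint64_t)(uintptr_t)ptr;
}

/* Same shape as the page's lookup wrapper, with explicit zeroing. */
int bpf_lookup_elem(int fd, const void *key, void *value)
{
    union bpf_attr attr;

    memset(&attr, 0, sizeof(attr));   /* unused fields must be zero */
    attr.map_fd = fd;
    attr.key = ptr_to_u64(key);
    attr.value = ptr_to_u64(value);

    return bpf(BPF_MAP_LOOKUP_ELEM, &attr, sizeof(attr));
}
```

Going through uintptr_t keeps the cast well defined on 32-bit builds, where a direct pointer-to-u64 cast draws a compiler warning.
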
> +.TP
> +.B BPF_MAP_UPDATE_ELEM
> +.nf
> +int bpf_update_elem(int fd, void *key, void *value, __u64 flags)
> +{
> + union bpf_attr attr = {
> + .map_fd = fd,
> + .key = ptr_to_u64(key),
> + .value = ptr_to_u64(value),
> + .flags = flags,
> + };
> +
> + return bpf(BPF_MAP_UPDATE_ELEM, &attr, sizeof(attr));
> +}
> +.fi
> +The call creates or updates an element with the given
> +.I key/value
> +in the map
> +.I fd
> +according to
> +.I flags,
> +which can have one of three values:
> +.nf
> +#define BPF_ANY 0 /* create new element or update existing */
> +#define BPF_NOEXIST 1 /* create new element if it didn't exist */
> +#define BPF_EXIST 2 /* update existing element */
> +.fi
> +On success, it returns zero.
> +On error, \-1 is returned and
> +.I errno
> +is set to EINVAL, EPERM, ENOMEM, E2BIG, EEXIST, or ENOENT.
> +.B E2BIG
> +indicates that the number of elements in the map has reached the
> +.I max_entries
> +limit specified at map creation time.
> +.B EEXIST
> +is returned by bpf_update_elem(fd, key, value, BPF_NOEXIST) if an element
> +with 'key' already exists in the map.
> +.B ENOENT
> +is returned by bpf_update_elem(fd, key, value, BPF_EXIST) if no element
> +with 'key' exists in the map.
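
The flag semantics above may be easier to follow as a small decision function. This is a user-space model for a hash-like map, not kernel code: update_outcome() and its parameters are hypothetical, and the precedence between E2BIG and ENOENT when both could apply is an assumption of the model that the page does not specify.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Flag values as listed in the page. */
#define BPF_ANY     0
#define BPF_NOEXIST 1
#define BPF_EXIST   2

/*
 * Model of an update: returns 0 on success, otherwise the errno value
 * the page documents. 'exists' says whether the key is already present,
 * 'full' whether the map already holds max_entries elements.
 */
static int update_outcome(bool exists, bool full, unsigned long long flags)
{
    if (flags > BPF_EXIST)
        return EINVAL;            /* unknown flag value */
    if (!exists && full)
        return E2BIG;             /* a new slot is needed, none is left */
    if (exists && flags == BPF_NOEXIST)
        return EEXIST;            /* caller asked for create-only */
    if (!exists && flags == BPF_EXIST)
        return ENOENT;            /* caller asked for update-only */
    return 0;
}

/* The cases spelled out in the text. */
static void self_check(void)
{
    assert(update_outcome(false, false, BPF_ANY) == 0);
    assert(update_outcome(true, false, BPF_NOEXIST) == EEXIST);
    assert(update_outcome(false, false, BPF_EXIST) == ENOENT);
    assert(update_outcome(false, true, BPF_ANY) == E2BIG);
}
```

Note that in this model, updating an existing key in a full map still succeeds, since no new element is needed.
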
> +.TP
> +.B BPF_MAP_DELETE_ELEM
> +.nf
> +int bpf_delete_elem(int fd, void *key)
> +{
> + union bpf_attr attr = {
> + .map_fd = fd,
> + .key = ptr_to_u64(key),
> + };
> +
> + return bpf(BPF_MAP_DELETE_ELEM, &attr, sizeof(attr));
> +}
> +.fi
> +The call deletes an element in the map
> +.I fd
> +with the given
> +.I key.
> +It returns zero on success. If the element is not found, it returns \-1 and sets
> +.I errno
> +to ENOENT.
> +.TP
> +.B BPF_MAP_GET_NEXT_KEY
> +.nf
> +int bpf_get_next_key(int fd, void *key, void *next_key)
> +{
> + union bpf_attr attr = {
> + .map_fd = fd,
> + .key = ptr_to_u64(key),
> + .next_key = ptr_to_u64(next_key),
> + };
> +
> + return bpf(BPF_MAP_GET_NEXT_KEY, &attr, sizeof(attr));
> +}
> +.fi
> +The call looks up an element by
> +.I key
> +in a given map
> +.I fd
> +and returns the key of the next element into the
> +.I next_key
> +pointer. If
> +.I key
> +is not found, it returns zero and stores the key of the first element in
> +.IR next_key .
> +If
> +.I key
> +is the last element, it returns \-1 and sets
> +.I errno
> +to ENOENT. Other possible
> +.I errno
> +values are ENOMEM, EFAULT, EPERM, and EINVAL. This method can be used to iterate over all elements of the map.
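
A possible iteration loop built on BPF_MAP_GET_NEXT_KEY, as a sketch under stated assumptions: the map has 4-byte keys and 8-byte values, the starting key UINT32_MAX is not present in the map (so the first call yields the first key, per the description above), and the raw syscall(__NR_bpf, ...) stands in for a libc wrapper. dump_map() is a hypothetical name.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <linux/bpf.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

static int bpf(int cmd, union bpf_attr *attr, unsigned int size)
{
    return (int)syscall(__NR_bpf, cmd, attr, size);
}

static uint64_t ptr_to_u64(const void *ptr)
{
    static_assert(sizeof(void *) <= sizeof(uint64_t),
                  "pointer must fit in 64 bits");
    return (uint64_t)(uintptr_t)ptr;
}

static int bpf_get_next_key(int fd, const void *key, void *next_key)
{
    union bpf_attr attr;

    memset(&attr, 0, sizeof(attr));   /* unused fields must be zero */
    attr.map_fd = fd;
    attr.key = ptr_to_u64(key);
    attr.next_key = ptr_to_u64(next_key);
    return bpf(BPF_MAP_GET_NEXT_KEY, &attr, sizeof(attr));
}

static int bpf_lookup_elem(int fd, const void *key, void *value)
{
    union bpf_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.map_fd = fd;
    attr.key = ptr_to_u64(key);
    attr.value = ptr_to_u64(value);
    return bpf(BPF_MAP_LOOKUP_ELEM, &attr, sizeof(attr));
}

/* Print every entry of the map; return the number of entries printed. */
int dump_map(int fd)
{
    uint32_t key = UINT32_MAX;   /* assumed absent: first call gives first key */
    uint32_t next_key;
    uint64_t value;
    int count = 0;

    /* The loop ends when the call fails, e.g. with ENOENT after the last key. */
    while (bpf_get_next_key(fd, &key, &next_key) == 0) {
        if (bpf_lookup_elem(fd, &next_key, &value) == 0) {
            printf("key %u value %llu\n", (unsigned)next_key,
                   (unsigned long long)value);
            count++;
        }
        key = next_key;
    }
    return count;
}
```

If another process deletes the current key between two calls, the description above implies that the walk restarts from the first element, so callers that need a consistent snapshot have to handle repeats themselves.
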
> +.TP
> +.B close(map_fd)
> +will delete the map
> +.I map_fd.
> +When a process exits, all of its maps are deleted automatically.
> +.P
> +.SS BPF programs
> +
> +.TP
> +.B BPF_PROG_LOAD
> +This
> +.IR cmd
> +is used to load an extended BPF program into the kernel.
> +
> +.nf
> +char bpf_log_buf[LOG_BUF_SIZE];
> +
> +int bpf_prog_load(enum bpf_prog_type prog_type,
> + const struct bpf_insn *insns, int insn_cnt,
> + const char *license)
> +{
> + union bpf_attr attr = {
> + .prog_type = prog_type,
> + .insns = ptr_to_u64(insns),
> + .insn_cnt = insn_cnt,
> + .license = ptr_to_u64(license),
> + .log_buf = ptr_to_u64(bpf_log_buf),
> + .log_size = LOG_BUF_SIZE,
> + .log_level = 1,
> + };
> +
> + return bpf(BPF_PROG_LOAD, &attr, sizeof(attr));
> +}
> +.fi
> +.B prog_type
> +is one of the available program types:
> +.nf
> +enum bpf_prog_type {
> + BPF_PROG_TYPE_UNSPEC,
> + BPF_PROG_TYPE_SOCKET_FILTER,
> + BPF_PROG_TYPE_SCHED_CLS,
> +};
> +.fi
> +By picking
> +.I prog_type
> +the program author selects a set of helper functions callable from
> +the program and the corresponding format of
> +.I struct bpf_context
> +(which is the data blob passed into the program as the first argument).
> +For example, programs loaded with
> +.I prog_type
> += BPF_PROG_TYPE_SOCKET_FILTER may call the bpf_map_lookup_elem() helper,
> +whereas programs of some future types may not.
> +The set of functions available to programs of a given type may increase
> +in the future.
> +
> +Currently the set of functions for
> +.B BPF_PROG_TYPE_SOCKET_FILTER
> +is:
> +.nf
> +bpf_map_lookup_elem(map_fd, void *key) // lookup key in a map_fd
> +bpf_map_update_elem(map_fd, void *key, void *value) // update key/value
> +bpf_map_delete_elem(map_fd, void *key) // delete key in a map_fd
> +.fi
> +
> +and bpf_context is a pointer to 'struct sk_buff'. Programs cannot
> +access fields of 'sk_buff' directly.
> +
> +More program types may be added in the future, such as
> +.BR BPF_PROG_TYPE_KPROBE ,
> +for which bpf_context may be defined as a pointer to 'struct pt_regs'.
> +
> +.B insns
> +array of "struct bpf_insn" instructions
> +
> +.B insn_cnt
> +number of instructions in the program
> +
> +.B license
> +license string, which must be GPL compatible to call helper functions
> +marked gpl_only
> +
> +.B log_buf
> +user-supplied buffer in which the in-kernel verifier stores the verification
> +log. The log is a multi-line string that the program author can use to
> +understand how the verifier concluded that the program is unsafe. The format
> +of the output can change at any time as the verifier evolves.
> +
> +.B log_size
> +size of the user buffer. If the buffer is not large enough to store all
> +verifier messages, \-1 is returned and
> +.I errno
> +is set to ENOSPC.
> +
> +.B log_level
> +verbosity level of verifier, where zero means no logs provided
> +.TP
> +.B close(prog_fd)
> +will unload BPF program
> +.P
> +The maps are accessible from programs and are used to exchange data between
> +programs and between a program and user space.
> +Programs process various events (such as kprobes and packets) and
> +store data in maps. User space fetches data from maps.
> +Either the same or a different map may be used by user space as configuration
> +space to alter program behavior on the fly.
> +.SS Events
> +.P
> +Once the program is loaded, it can be attached to an event. Various kernel
> +subsystems have different ways to do so. For example:
> +
> +.nf
> +setsockopt(sock, SOL_SOCKET, SO_ATTACH_BPF, &prog_fd, sizeof(prog_fd));
> +.fi
> +will attach the program
> +.I prog_fd
> +to socket
> +.I sock
> +which was returned by a prior call to socket().
> +
> +In the future
> +.nf
> +ioctl(event_fd, PERF_EVENT_IOC_SET_BPF, prog_fd);
> +.fi
> +may attach the program
> +.I prog_fd
> +to perf event
> +.I event_fd
> +which was returned by a prior call to perf_event_open().
> +
> +.SH EXAMPLES
> +.nf
> +/* bpf+sockets example:
> + * 1. create array map of 256 elements
> + * 2. load program that counts number of packets received
> + * r0 = skb->data[ETH_HLEN + offsetof(struct iphdr, protocol)]
> + * map[r0]++
> + * 3. attach prog_fd to raw socket via setsockopt()
> + * 4. print number of received TCP/UDP packets every second
> + */
> +int main(int ac, char **av)
> +{
> + int sock, map_fd, prog_fd, key;
> + long long value = 0, tcp_cnt, udp_cnt;
> +
> + map_fd = bpf_create_map(BPF_MAP_TYPE_ARRAY, sizeof(key), sizeof(value), 256);
> + if (map_fd < 0) {
> + printf("failed to create map '%s'\\n", strerror(errno));
> + /* likely not run as root */
> + return 1;
> + }
> +
> + struct bpf_insn prog[] = {
> + BPF_MOV64_REG(BPF_REG_6, BPF_REG_1), /* r6 = r1 */
> + BPF_LD_ABS(BPF_B, ETH_HLEN + offsetof(struct iphdr, protocol)), /* r0 = ip->proto */
> + BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_0, -4), /* *(u32 *)(fp - 4) = r0 */
> + BPF_MOV64_REG(BPF_REG_2, BPF_REG_10), /* r2 = fp */
> + BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4), /* r2 = r2 - 4 */
> + BPF_LD_MAP_FD(BPF_REG_1, map_fd), /* r1 = map_fd */
> + BPF_CALL_FUNC(BPF_FUNC_map_lookup_elem), /* r0 = map_lookup(r1, r2) */
> + BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 2), /* if (r0 == 0) goto pc+2 */
> + BPF_MOV64_IMM(BPF_REG_1, 1), /* r1 = 1 */
> + BPF_XADD(BPF_DW, BPF_REG_0, BPF_REG_1, 0, 0), /* lock *(u64 *)r0 += r1 */
> + BPF_MOV64_IMM(BPF_REG_0, 0), /* r0 = 0 */
> + BPF_EXIT_INSN(), /* return r0 */
> + };
> +
> + prog_fd = bpf_prog_load(BPF_PROG_TYPE_SOCKET_FILTER, prog, sizeof(prog) / sizeof(prog[0]), "GPL");
> +
> + sock = open_raw_sock("lo");
> +
> + assert(setsockopt(sock, SOL_SOCKET, SO_ATTACH_BPF, &prog_fd, sizeof(prog_fd)) == 0);
> +
> + for (;;) {
> + key = IPPROTO_TCP;
> + assert(bpf_lookup_elem(map_fd, &key, &tcp_cnt) == 0);
> + key = IPPROTO_UDP;
> + assert(bpf_lookup_elem(map_fd, &key, &udp_cnt) == 0);
> + printf("TCP %lld UDP %lld packets\\n", tcp_cnt, udp_cnt);
> + sleep(1);
> + }
> +
> + return 0;
> +}
> +.fi
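
The example calls open_raw_sock() without defining it. One way such a helper might be written is sketched below, assuming it is meant to return an AF_PACKET raw socket bound to the named interface. The body is a guess at the intent, not the author's code, and creating the socket typically requires CAP_NET_RAW.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <arpa/inet.h>
#include <assert.h>
#include <linux/if_ether.h>
#include <linux/if_packet.h>
#include <net/if.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: returns a bound raw socket, or -1 on any failure. */
int open_raw_sock(const char *ifname)
{
    struct sockaddr_ll sll;
    unsigned int ifindex;
    int sock;

    assert(ifname != NULL);

    ifindex = if_nametoindex(ifname);
    if (ifindex == 0)
        return -1;                      /* no such interface */

    /* Receive every protocol seen on the link layer. */
    sock = socket(AF_PACKET, SOCK_RAW | SOCK_CLOEXEC, htons(ETH_P_ALL));
    if (sock < 0)
        return -1;                      /* typically needs CAP_NET_RAW */

    memset(&sll, 0, sizeof(sll));
    sll.sll_family = AF_PACKET;
    sll.sll_ifindex = (int)ifindex;
    sll.sll_protocol = htons(ETH_P_ALL);
    if (bind(sock, (struct sockaddr *)&sll, sizeof(sll)) < 0) {
        close(sock);
        return -1;
    }

    return sock;
}
```

Since the example passes the result straight to setsockopt(), a caller would probably want to check for \-1 first.
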
> +.SH RETURN VALUE
> +For a successful call, the return value depends on the operation:
> +.TP
> +.B BPF_MAP_CREATE
> +The new file descriptor associated with BPF map.
> +.TP
> +.B BPF_PROG_LOAD
> +The new file descriptor associated with BPF program.
> +.TP
> +All other commands
> +Zero.
> +.PP
> +On error, \-1 is returned, and
> +.I errno
> +is set appropriately.
> +.SH ERRORS
> +.TP
> +.B EPERM
> +bpf() syscall was made without sufficient privilege
> +(without the
> +.B CAP_SYS_ADMIN
> +capability).
> +.TP
> +.B ENOMEM
> +Cannot allocate sufficient memory.
> +.TP
> +.B EBADF
> +.I fd
> +is not an open file descriptor
> +.TP
> +.B EFAULT
> +One of the pointers (
> +.I key
> +or
> +.I value
> +or
> +.I log_buf
> +or
> +.I insns
> +) is outside the accessible address space.
> +.TP
> +.B EINVAL
> +The value specified in
> +.I cmd
> +is not recognized by this kernel.
> +.TP
> +.B EINVAL
> +For
> +.BR BPF_MAP_CREATE ,
> +either
> +.I map_type
> +or attributes are invalid.
> +.TP
> +.B EINVAL
> +For
> +.BR BPF_MAP_*_ELEM
> +commands,
> +some of the fields of "union bpf_attr" unused by this command are not set
> +to zero.
> +.TP
> +.B EINVAL
> +For
> +.BR BPF_PROG_LOAD ,
> +an attempt was made to load an invalid program (unrecognized instruction, use
> +of reserved fields, jump out of range, loop detected, or call to an unknown function).
> +.TP
> +.BR EACCES
> +For
> +.BR BPF_PROG_LOAD ,
> +though the program has valid instructions, it was rejected because it was deemed
> +unsafe (may access a disallowed memory region or uninitialized stack/register,
> +or function constraints don't match actual types, or misaligned access). In
> +such a case, it is recommended to call bpf() again with
> +.I log_level = 1
> +and examine
> +.I log_buf
> +for the specific reason provided by the verifier.
> +.TP
> +.BR ENOENT
> +For
> +.B BPF_MAP_LOOKUP_ELEM
> +or
> +.BR BPF_MAP_DELETE_ELEM ,
> +indicates that the element with the given
> +.I key
> +was not found.
> +.TP
> +.BR E2BIG
> +The program is too large, or
> +a map has reached its
> +.I max_entries
> +limit (maximum number of elements).
> +.SH NOTES
> +These commands may be used only by a privileged process (one having the
> +.B CAP_SYS_ADMIN
> +capability).
> +.SH SEE ALSO
> +Both classic and extended BPF are explained in Documentation/networking/filter.txt
>


--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/