Re: [PATCH net-next 0/4] bpf: bpf_htab: Add BPF_MAP_TYPE_PERCPU_HASH

From: Martin KaFai Lau
Date: Mon Jan 11 2016 - 17:35:42 EST


Hi Ming,

On Mon, Jan 11, 2016 at 10:20:55AM +0800, Ming Lei wrote:
> >> - percpu hash can cause huge memory consumption compared with the way
~~~~~~~~~~~~~~~~~~
> >> of key plus cpu_id without obvious performance advantage
> > If the bpf prog uses the percpu map type, the expectation
> > is the bpf is tapping some kprobes that most cpus (if not all) will access
> > the same key. I would argue that having a bigger key (due to cpu_id),
>
> I think you mean the value mapped from the key can be accessed
> concurrently from most cpus.
>
> > and 40 more keys (and linking to 40 buckets) is less efficient.
>
> Suppose the key is choosed in this way and may be accessed concurrently
> from most cpus, the only extra loading for (real_key, cpu_id) is in memcmp(),
> because key size is increased by sizeof(cpu_id), is it really a big deal?
>
> Also in your percpu patch, you add one extra alloc_percpu_gfp(), do you
> think that it is cheaper than the extra compasion on cpu_id in memcmp()?
I was referring to your _memory_ consumption comment. The key is bigger,
more keys and buckets are needed.

and even for CPU concern, Yes also. If the key stays long enough for
aggregation (and atomic operations), is it better to do one alloc_percpu_gfp()
at the beginning instead of extra memcmp() over and over again during lookup?
Having said that, I don't think either of the alloc or memcmp cost is
significant enough to distract the percpu discussion.

>
> >
> > If the expectation is to have a small numbers of cpu accessing a key (like
> > biolatency), the bpf prog can stay with the regular bpf map. The new
> > percpu map has almost the same interface (except the syscall added in
> > Patch 3).
>
> > Switching between the two map types is easy.
>
> No, it isn't easy with your patchset, since we can't look up
> one key in one specific CPU from eBPF prog, not mention you
> introduce extra syscall.
Why would a eBPF prog (bpf_kern.o) want to access value of another CPU?
Does it defeat the purpose of using percpu value in the first place?

> >
> >> >> For HASH map, it is easy to make cpu id as part of key, then the map
> >> >> can be thought as percpu too, and atomic op isn't needed in eBPF program.
> >> > Putting the cpu id as part of the key was indeed the first hack I did
> >> > to get a sense of potential benefit.
> >> >
> >> > However, by extending the real-key with cpu-id, it is not intuitive to
> >> > use and it is prone to error. For example, how to delete a real-key for
> >> > all cpus? Iterating a particular real-key for all cpu is also tricky. What
> >> > does it mean if a real-key exists for cpu#0 but not cpu#1? The real-key
> >>
> >> lookup returns NULL if the key(real key plus cpu id) doesn't exist.
> >>
> >> > got deleted from all cpu while iterating? or something else? I believe
> >> > there are ways to get around but it is better to provide a clean
> >> > implementation instead.
> >>
> >> It might be true, but I hope there is one real example, in which we can
> >> see the obvious disadvantage of real key plus cpu_id. In the other way,
> >> we still can improve current hash map to make this case easier to use.
> > By extend(real_key, cpu_id), the uniqueness property of a real_key (and
> > it should be as a key for a hashmap) is getting blurry and it is not
> > ideal.
> >
> > The typical use case is for doing a lot of aggregation (which bpf+kprobe
> > is strong at) on a real_key which will exist for a reasonable long period
> > of time. For example, I would like to do some packet statistics on each
> > IPv6 /64 prefix. The userspace will iterate the /64 prefix (real_key)
^^^^^^^^^^^^^^^^^^^^^^
> > and sum up all values (of each cpu) before reporting for a particular
> > prefix.
>
> Is there any script or ebpf sample code for this case? I appreciate much
> we can talk with code.
Sure, a simplified version is at the end.

> >
> > How to delete a real_key from all cpu? A loop to iterate all
> > extend(real_key, cpu_id)?
>
> Yes.
That is not ideal then.

Let alone the max_entries of the map also needs to be changed
whenever it is running in another machine with different number
of cores.

> >> That sounds not a good result, 40X memory consumption only save %3 CPU,
> >> and the approach(real_key, cpu_id) still can save %3 without extra memory
> >> consumption(or much less than percpu MAP)
> > Another way to look at it, the bpf prog costs 6% CPU and saving 3%
> > with percpu implemenation.
>
> Sorry, I am a bit confused why you say 6% saving, and could explain a bit?
> Let's see your previous post:
I meant:
Without percpu optimization, the bpf_prog costs a total 6% CPU.
With the percpu optimization, it saves 3% CPU.

Thanks,
-- Martin

Here is a much stripped down version of the bpf_kern and bpf_user:

struct tcp_trace_flow6 {
__be32 dst0;
__be32 dst1;
};

struct tcp_stats {
u64 segs_in;
/* A few more counters... */
};

/* bpf_kern.c */
struct bpf_map_def SEC("maps") dst_rack_map6 = {
.type = BPF_MAP_TYPE_PERCPU_HASH,
.key_size = sizeof(struct tcp_trace_flow6),
.value_size = sizeof(struct tcp_stats),
.max_entries = 10000,
};

SEC("kprobe/tcp_rcv_established")
int trace_rcv_established(struct pt_regs *ctx)
{
struct tcp_stats *tpes;
struct sk_buff *skb;
struct sock *sk;

sk = (struct sock *) PT_REGS_PARM1(ctx);
skb = (struct sk_buff *) PT_REGS_PARM2(ctx);

/* tcp_stats_get: A simple map lookup based on the IPv6 /64 prefix.
* If it does not exist, insert
*/
tpes = tcp_stats_get(sk);
if (!tpes)
return 0;

tpes->segs_in++;

/* A free more counter++ operations here */

return 0;
}

/* bpf_user.c */
total = 0;
while (bpf_get_next_key(map_fd, &ttf6, &next_ttf6) == 0) {
for (cpu = 0; cpu < nr_cpus; cpu++) {
if (!bpf_lookup_percpu_elem(map_fd,
&next_ttf6,
&tpes, cpu))
total += tpes.segs_in;
else if (errno != ENXIO)
break;
}
if (cpu == nr_cpus)
tcp_stats_log6(&next_ttf6, total);
ttf6 = next_ttf6;
}