Re: [PATCH 3/5] bpf: Add bpf_user_ringbuf_drain() helper

From: David Vernet
Date: Wed Aug 17 2022 - 16:25:23 EST


On Tue, Aug 16, 2022 at 03:59:54PM -0700, Andrii Nakryiko wrote:

[...]

> > > > > > +	/* Synchronizes with smp_store_release() in
> > > > > > +	 * __bpf_user_ringbuf_sample_release().
> > > > > > +	 */
> > > > > > +	cons_pos = smp_load_acquire(&rb->consumer_pos);
> > > > > > +	if (cons_pos >= prod_pos) {
> > > > > > +		atomic_set(&rb->busy, 0);
> > > > > > +		return -ENODATA;
> > > > > > +	}
> > > > > > +
> > > > > > +	hdr = (u32 *)((uintptr_t)rb->data + (cons_pos & rb->mask));
> > > > > > +	sample_len = *hdr;
> > > > >
> > > > > do we need smp_load_acquire() here? libbpf's ring_buffer
> > > > > implementation uses load_acquire here
> > > >
> > > > I thought about this when I was first adding the logic, but I can't
> > > > convince myself that it's necessary and wasn't able to figure out why we
> > > > did a load acquire on the len in libbpf. The kernel doesn't do a store
> > > > release on the header, so I'm not sure what the load acquire in libbpf
> > > > actually accomplishes. I could certainly be missing something, but I
> > > > _think_ the important thing is that we have load-acquire / store-release
> > > > pairs for the consumer and producer positions.
> > >
> > > kernel does xchg on len on the kernel side, which is stronger than
> > > smp_store_release (I think it was Paul's suggestion instead of doing
> > > explicit memory barrier, but my memories are hazy for exact reasons).
> >
> > Hmmm, yeah, for the kernel-producer ringbuffer, I believe the explicit
> > memory barrier is unnecessary because:
> >
> > o The smp_store_release() on producer_pos provides ordering w.r.t.
> > producer_pos, and the write to hdr->len which includes the busy-bit
> > should therefore be visible in libbpf, which does an
> > smp_load_acquire().
> > o The xchg() when the sample is committed provides full barriers before
> > and after, so the consumer is guaranteed to read the written contents of
> > the sample IFF it also sees the BPF_RINGBUF_BUSY_BIT as unset.
> >
> > I can't see any scenario in which a barrier would add synchronization not
> > already provided, though this stuff is tricky so I may have missed
> > something.
> >
> > > Right now this might not be necessary, but if we add support for busy
> > > bit in a sample header, it will be closer to what BPF ringbuf is doing
> > > right now, with producer position being a reservation pointer, but
> > > sample itself won't be "readable" until sample header is updated and
> > > its busy bit is unset.
> >
> > Regarding this patch, after thinking about this more I _think_ I do need
> > an xchg() (or equivalent operation with full barrier semantics) in
> > userspace when updating the producer_pos when committing the sample.
> > Which, after applying your suggestion (which I agree with) of supporting
> > BPF_RINGBUF_BUSY_BIT and BPF_RINGBUF_DISCARD_BIT from the get-go, would
> > similarly be an xchg() when setting the header, paired with an
> > smp_load_acquire() when reading the header in the kernel.
>
>
> right, I think __atomic_store_n() can be used in libbpf for this with
> seq_cst ordering

__atomic_store_n(__ATOMIC_SEQ_CST) will do the correct thing on x86, but it
is not guaranteed to provide a full acq/rel barrier according to the C
standard. __atomic_store_n(__ATOMIC_SEQ_CST) means "store-release, and also
participates in the sequentially-consistent global ordering".

I believe we actually need full acq/rel (i.e. full-barrier) semantics here.
Note that a plain __atomic_store_n() can't take __ATOMIC_ACQ_REL at all
(stores only accept relaxed, release, or seq_cst ordering), so this would
have to be e.g. an __atomic_exchange_n() with __ATOMIC_ACQ_REL. I don't
_think_ __ATOMIC_SEQ_CST is necessary because we're not reliant on any type
of global, SC ordering. We just need to ensure that the sample is only
visible after the busy bit has been removed from (or the discard bit added
to) the header.

Thanks,
David