Re: Perf and ftrace [was Re: PyTimechart]

From: Mathieu Desnoyers
Date: Thu May 13 2010 - 12:31:58 EST


* Steven Rostedt (rostedt@xxxxxxxxxxx) wrote:
> On Thu, 2010-05-13 at 09:20 -0400, Mathieu Desnoyers wrote:
[...]
> > > >
> > > > ...
> > > >
> > > > /**
> > > >  * ring_buffer_clear_noref_flag - Clear the noref subbuffer flag, for writer.
> > > >  */
> > > > static __inline__
> > > > void ring_buffer_clear_noref_flag(struct ring_buffer_backend *bufb,
> > > >                                   unsigned long idx)
> > > > {
> > > >         struct ring_buffer_backend_page *sb_pages, *new_sb_pages;
> > > >
> > > >         sb_pages = bufb->buf_wsb[idx].pages;
> > > >         for (;;) {
> > > >                 if (!RCHAN_SB_IS_NOREF(sb_pages))
> > > >                         return; /* Already writing to this buffer */
> > > >                 new_sb_pages = sb_pages;
> > > >                 RCHAN_SB_CLEAR_NOREF(new_sb_pages);
> > > >                 new_sb_pages = cmpxchg(&bufb->buf_wsb[idx].pages,
> > > >                                        sb_pages, new_sb_pages);
> > > >                 if (likely(new_sb_pages == sb_pages))
> > > >                         break;
> > > >                 sb_pages = new_sb_pages;
> > >
> > > The writer calls this??
> >
> > Yes. But the common case (for each event) is simply a
> > "if (!RCHAN_SB_IS_NOREF(sb_pages))" test that returns. The cmpxchg() is only
> > performed at subbuffer boundary.
>
> Is the cmpxchg only contending with other writers?

No. Had this been the case, I would have used cmpxchg_local(). This cmpxchg,
which deals with the subbuffer swap, touches the subbuffer "pages" pointer,
which can be updated concurrently by other writers as well as by readers.

The writer clears the noref flag when it starts writing into a subbuffer, and
sets it when delivering the subbuffer (when it is fully committed).

The reader can only swap a subbuffer with the one it owns if that subbuffer's
noref flag is set. The reader uses a cmpxchg() too to perform the swap.
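
To make the protocol above concrete, here is a minimal user-space sketch. It
is not the actual patch: it assumes the noref flag lives in the low bit of a
pointer-sized word, uses C11 atomics in place of the kernel cmpxchg(), and all
names (sb_slot, SB_NOREF, writer_clear_noref, writer_set_noref,
reader_try_swap) are made up for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: the low bit of the subbuffer word encodes "noref". */
#define SB_NOREF ((uintptr_t)1)

/* Hypothetical stand-in for one entry of the writer subbuffer table. */
struct sb_slot {
	_Atomic uintptr_t pages;	/* subbuffer identifier | SB_NOREF */
};

/* Writer: clear noref before starting to write into the subbuffer. */
static void writer_clear_noref(struct sb_slot *slot)
{
	uintptr_t old = atomic_load(&slot->pages);

	/* Common case: flag already clear, no atomic update needed. */
	while (old & SB_NOREF) {
		/* On failure, 'old' is refreshed with the current value. */
		if (atomic_compare_exchange_weak(&slot->pages, &old,
						 old & ~SB_NOREF))
			break;
	}
}

/* Writer: set noref when delivering a fully committed subbuffer. */
static void writer_set_noref(struct sb_slot *slot)
{
	atomic_fetch_or(&slot->pages, SB_NOREF);
}

/*
 * Reader: exchange the reader-owned subbuffer with the one in the slot,
 * but only while the slot is flagged noref. Returns false if a writer
 * currently references the subbuffer.
 */
static bool reader_try_swap(struct sb_slot *slot, uintptr_t *reader_pages)
{
	uintptr_t old = atomic_load(&slot->pages);
	uintptr_t repl = *reader_pages | SB_NOREF;

	assert(!(*reader_pages & SB_NOREF));
	do {
		if (!(old & SB_NOREF))
			return false;	/* writer is using this subbuffer */
	} while (!atomic_compare_exchange_weak(&slot->pages, &old, repl));

	*reader_pages = old & ~SB_NOREF;	/* reader now owns the old one */
	return true;
}
```

In this sketch the reader either wins the exchange while the flag is set, or
sees the flag cleared by a writer and backs off, which is why a plain
cmpxchg_local() would not be enough.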

[...]

> > >
> > > This looks just like the swap with reader_page that I do, except you use
> > > a table and I use the list. How do you replenish the buf_rsb.pages if
> > > the splice keeps the page you just received active?
> >
> > I don't allow other reads to proceed as long as splice is holding pages that
> > belong to the reader-owned subbuffer. The read semantic is basically:
> >
> > ring_buffer_open_read() /* only one reader at a time can open a ring buffer */
> > get_subbuf_size()
> > while (buffer is not finalized and empty) {
> >         poll()
> >         ret = ring_buffer_get_subbuf()
> >         if (!ret)
> >                 continue;
> >         /* The splice ops below can be performed in multiple calls, e.g. first splice
> >          * only a portion of a subbuffer to a pipe, then splice to the disk/network,
> >          * and move to the next subbuffer portion until all the subbuffer is sent.
> >          */
> >         splice one subbuffer worth of data to a pipe
> >         splice the data from pipe to disk/network
> >         ring_buffer_put_subbuf()
> > }
> > ring_buffer_close_read()
> >
> > The reader code above works both with flight recorder and non-overwrite mode.
> >
> > The code above assumes that upon return from the splice() to disk/network,
> > splice() is not using the pages anymore (I assume that splice() performs the
> > transfer synchronously with the call).
> >
> > The VFS interface I use for get_subbuf_size(), ring_buffer_get_subbuf() and
> > ring_buffer_put_subbuf() are new ioctls. Note that these can be used for both
> > splice() and mmap() types of backend access, as they only call into the
> > frontend.
>
> Hmm, so basically you lose pages until they are returned. I guess I can
> trivially add the same thing now to the current ring buffer.

Yep. Having the ability to keep an array of pages (rather than just a single
page at a time) allows splice() to move many pages at once efficiently, while
permitting this "pages owned by the reader, lent to splice() until it returns"
simplification. I also never have to allocate pages while tracing: all the pages
I need are allocated when the buffer is created (and in the special case of CPU
hotplug, but this is expected for per-cpu buffers).

This would also play well with mmap(): we can simply add a
ring_buffer_get_mmap_offset() method to the backend (exported through another
ioctl) that would let user-space know the start of the mmap'd buffer range
currently owned by the reader. So we can inform user-space of the currently
owned page range without even changing the underlying memory map.
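
A rough sketch of what such an offset lookup could look like, assuming (this
is not stated above) that all subbuffers have the same size and are laid out
back to back in a single mmap'd range, with one extra subbuffer for the
reader. The struct and function names here are hypothetical:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical backend description, for illustration only. */
struct rb_backend {
	size_t subbuf_size;	/* bytes per subbuffer */
	size_t nr_subbuf;	/* writer-visible subbuffers (+1 reader-owned) */
	size_t reader_sb;	/* physical index of the reader-owned subbuffer */
};

/*
 * Return the byte offset, within the mmap'd range, of the subbuffer
 * currently owned by the reader. The memory map itself never changes;
 * only the index recorded at swap time does.
 */
static size_t rb_get_mmap_offset(const struct rb_backend *b)
{
	assert(b->reader_sb <= b->nr_subbuf);
	return b->reader_sb * b->subbuf_size;
}
```

User-space would then read subbuf_size bytes starting at that offset, between
the get_subbuf and put_subbuf calls.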

Thanks,

Mathieu


--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com