On Thu, May 23, 2013 at 05:34:23PM -0400, Waiman Long wrote:
> On 05/23/2013 05:42 AM, Dave Chinner wrote:
> > What was it I said about this patchset when you posted it to speed
> > up an Oracle benchmark back in February? I'll repeat:
> >
> > "Nobody should be doing reverse dentry-to-name lookups in a quantity
> > sufficient for it to become a performance limiting factor."
>
> Thanks for the comment, but my point is that the d_lock
> contention is skewing the data about how much spin lock contention
> actually happened in the workload, and it makes it harder to
> pinpoint problem areas to look at. This is not about performance, it
> is about accurate representation of performance data. Ideally, we
> want the overhead of turning on perf instrumentation to be as low as
> possible.

Right. But d_path will never be "low overhead", and as such it
shouldn't be used by perf.
And that makes whatever that tracepoint is doing utterly stupid.
Dumping out full paths in a tracepoint is hardly "low overhead", and
that tracepoint should be stomped on from a great height. Sure,
print the filename straight out of the dentry into a tracepoint,
but repeatedly calculating the full path (probably for the same set
of dentries) is just a dumb thing to do.

> > Anyway, your numbers show that a single AIM7 microbenchmark goes
> > better under tracing the specific mmap event that uses d_path(), but
> > the others are on average a little bit slower. That doesn't convince
> > me that this is worth doing. Indeed, what happens to performance
> > when you aren't tracing?
> >
> > Indeed, have you analysed what makes that
> > microbenchmark contend so much on the dentry lock while reverse
> > lookups are occurring? Dentry lock contention does not necessarily
> > indicate a problem with the dentry locks, and without knowing why
> > there is contention occurring (esp. compared to the other benchmarks)
> > we can't really determine if this is a good solution or not...
>
> What made it contend so much was the large number of CPUs available
> in my test system, which is an 8-socket Westmere-EX machine with 80
> cores. As perf was collecting data from every core, the threads
> would unavoidably bump into each other to translate dentries back to
> full paths. The current code only allows one CPU in the d_path()
> critical path. My patch will allow all of them to be in the critical
> path concurrently.

Yes, I know *how* contention occurs and what your patch does. I'm
asking you to explain *why* we need to fix this specific workload,
and why the tracepoint that calls d_path() actually needs to do
that.
Your numbers indicate that your patchset decreases performance in the
common, non-d_path-intensive workloads, so why should we optimise a
slow path at the expense of significant complexity and lower
performance in the fast path?
> The upcoming Ivy Bridge-EX can have up to 15 cores per socket, so
> even a 4-socket machine will have up to 60 cores, or 120 virtual CPUs
> if hyperthreading is turned on.

I don't see why that matters. I've been dealing with systems much
larger than that since early 2002, and the process for dealing with
lock contention hasn't changed much in the last 10 years. First we
need to determine that what you are doing is sane, then whether
there's a simpler fix than changing the locking, and whether it's
actually any faster in the common case we care about....