Re: newidle balancing in NUMA domain?

From: Jason Garrett-Glaser
Date: Tue Dec 01 2009 - 03:18:52 EST


On Mon, Nov 30, 2009 at 12:19 AM, Nick Piggin <npiggin@xxxxxxx> wrote:
> On Tue, Nov 24, 2009 at 09:24:26AM -0800, Jason Garrett-Glaser wrote:
>> > Quite a few being one test case, and on a program with a horrible
>> > parallelism design (rapid heavy weight forks to distribute small
>> > units of work).
>>
>> > If x264 is declared dainbramaged, that's fine with me too.
>>
>> We did multiple benchmarks using a thread pool and it did not help.
>> If you want to declare our app "braindamaged", feel free, but pooling
>> threads to avoid re-creation gave no benefit whatsoever.  If you think
>> the parallelism methodology is wrong as a whole, you're basically
>> saying that Linux shouldn't be used for video compression, because
>> this is the exact same threading model used by almost every single
>> video encoder ever made.  There are actually a few that use
>> slice-based threading, but those are actually even worse from your
>> perspective, because slice-based threading spawns multiple threads PER
>> FRAME instead of one per frame.
>>
>> Because of the inter-frame dependencies in video coding it is
>> impossible to efficiently get a granularity of more than one thread
>> per frame.  Pooling threads doesn't change the fact that you are
>> conceptually creating a thread for each frame--it just eliminates the
>> pthread_create call.  In theory you could do one thread per group of
>> frames, but that is completely unrealistic for real-time encoding
>> (e.g. streaming), requires a catastrophically large amount of memory,
>> makes it impossible to track the bit buffer, and all other sorts of
>> bad stuff.
>
> If you can scale to N threads by having 1 frame per thread, then
> you can scale to N/2 threads and have 2 frames per thread. Can't
> you?
>
> Is your problem in scaling to a large N?
>
>

x264's threading is described here:
http://akuvian.org/src/x264/sliceless_threads.txt

By example (3 threads), simplified:

Step 0:
Frame 0: 0% done

Step 1:
Frame 0: 33% done
Frame 1: 0% done

Step 2:
Frame 0: 66% done
Frame 1: 33% done
Frame 2: 0% done

Step 3:
Frame 0: 100% done
Frame 1: 66% done
Frame 2: 33% done
Frame 3: 0% done

Step 4:
Frame 1: 100% done
Frame 2: 66% done
Frame 3: 33% done
Frame 4: 0% done

(etc)
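
As a rough sketch (not x264 code), the staggered schedule above can be
reproduced with a few lines of Python. The function name pipeline_state
is made up for illustration, and it assumes each frame advances by
exactly 1/n_threads of its height per step, which is the same
simplification used in the example.

```python
def pipeline_state(step, n_threads):
    """Return [(frame, percent_done)] for the frames in flight at `step`.

    Hypothetical helper: frame f starts at step f and advances by
    1/n_threads of its height on every following step.
    """
    first = max(0, step - n_threads)  # oldest frame still listed
    return [(f, (step - f) * 100 // n_threads) for f in range(first, step + 1)]


if __name__ == "__main__":
    for step in range(5):
        print(f"Step {step}:")
        for frame, pct in pipeline_state(step, 3):
            print(f"Frame {frame}: {pct}% done")
        print()
```

With n_threads=3 this prints the same table as Steps 0 through 4 above.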

The motion search is restricted so that, for example, in Step 3, frame
2 doesn't look beyond the completed 66% of frame 1.

Some vertical slack (extra rows) is reserved for synchronization, so
that each thread doesn't have to be exactly in lock-step with the
others. This avoids most unnecessary waiting.
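
The row-level dependency between consecutive frames can be sketched
like this. This is only an illustration under assumptions, not how
x264 implements it: FrameProgress, encode_frame, run and mv_range are
hypothetical names, the "encoding" is a no-op, and the vertical search
range is assumed to be a fixed number of rows. Each frame's thread
waits until the previous frame has finished every row its motion
search could touch, and otherwise runs freely.

```python
import threading


class FrameProgress:
    """Tracks how many rows of one frame have been fully encoded."""

    def __init__(self, height):
        self.height = height
        self.done_rows = 0
        self.cond = threading.Condition()

    def mark_row_done(self):
        with self.cond:
            self.done_rows += 1
            self.cond.notify_all()

    def wait_for_rows(self, rows):
        """Block until at least `rows` rows (clamped to height) are done."""
        rows = min(rows, self.height)
        with self.cond:
            self.cond.wait_for(lambda: self.done_rows >= rows)
            return self.done_rows


def encode_frame(cur, ref, mv_range, log):
    for row in range(cur.height):
        if ref is not None:
            # Motion search for this row may reach mv_range rows further
            # down in the reference frame, so wait for those rows only.
            seen = ref.wait_for_rows(row + mv_range + 1)
            log.append((row, seen))
        # (the actual encoding work for this row would go here)
        cur.mark_row_done()


def run(num_frames=4, height=12, mv_range=3):
    frames = [FrameProgress(height) for _ in range(num_frames)]
    logs = [[] for _ in range(num_frames)]
    threads = []
    for i, cur in enumerate(frames):
        ref = frames[i - 1] if i > 0 else None
        threads.append(threading.Thread(
            target=encode_frame, args=(cur, ref, mv_range, logs[i])))
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return frames, logs
```

Because a thread only blocks when the reference frame is less than
mv_range + 1 rows ahead, a thread that falls slightly behind does not
immediately stall the one that depends on it.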

The problem is that each frame is inherently one "work unit". Its
dependencies are all on the previous frame (Frame 1 depends on
Frame 0). It doesn't make any sense to try to lump multiple frames
together into a work unit when the dependencies don't work that way.
Just dumping two frames arbitrarily in one thread turns this into a
thread pool, which as mentioned previously probably wouldn't help
significantly. If you meant working on two frames simultaneously in
the same thread, that's even worse--it's going to be a cache thrashing
disaster, since the scheduler can no longer move two threads to
separate cores, and you now have two totally separate sets of
processing trying to dump themselves into the same cache.
Furthermore, that doesn't reduce the main limitation on threading: the
vertical height of the frame.

Also, another thing to note is that "fast thread creation" isn't the
only problem here: the changes to the scheduler gave x264 enormous
speed boosts even at *slower* encoding modes. One user reported a
gain from 25fps -> 39fps, for example; that's dozens of milliseconds
per thread, far longer than I would think would cause problems due to
threads being too short-lived. You should probably consider doing
some testing with slower encoding as well: both fast settings with
high-resolution inputs and slow settings with low-resolution inputs,
where the bottleneck is purely computational.
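
For reference, the arithmetic behind "dozens of milliseconds" is
sketched below. The helper name frame_time_ms is made up, and the
number of frames in flight is an assumption (the report quoted above
did not include a thread count). In steady state, a pipeline that
finishes one frame every 1/fps seconds with N frames in flight spends
about N/fps seconds on each frame.

```python
def frame_time_ms(fps, frames_in_flight=1):
    """Approximate wall-clock time per frame thread, in milliseconds.

    Hypothetical helper: assumes a steady-state pipeline in which
    `frames_in_flight` frames are being encoded concurrently.
    """
    return 1000.0 * frames_in_flight / fps


if __name__ == "__main__":
    for fps in (25, 39):
        print(fps, "fps:", round(frame_time_ms(fps), 1), "ms per frame,",
              round(frame_time_ms(fps, 3), 1), "ms with 3 frames in flight")
```

Even with a single frame in flight, 25 fps and 39 fps correspond to
roughly 40 ms and 26 ms per frame.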

Some resources for such testing:

1. http://media.xiph.org/video/derf/ has a lot of free test clips (HD
ones at the bottom).
2. x264 --help lists a set of presets from "ultrafast" to "placebo"
which can be used for testing purposes. "veryslow" and "placebo" are
probably not very suitable, as they tend to be horrifically
lookahead-bottlenecked.

Jason