Re: [patch] reduce IPI noise due to /dev/cdrom open/close

From: Milton Miller
Date: Tue Jul 04 2006 - 05:29:16 EST



On Jul 4, 2006, at 3:59 AM, Jes Sorensen wrote:

> Milton Miller wrote:
>> On Jul 4, 2006, at 2:47 AM, Jes Sorensen wrote:
>>> The idea I wrote the code under was that we are really only concerned
>>> with making sure all bh's related to the device causing the invalidate
>>> are hit. It doesn't matter that there are bh's in the lru from other
>>> devices.

>> And that is fine as far as it goes. The problem is that an unrelated
>> device might be hit by this operation.
>> ... [ fs POOF scenario ]
> Hrmpf, maybe it's back to putting the mask into the bdev then.

>>> 8 pointers * NR_CPUS - that can be a lot of cachelines you have to
>>> pull in from remote :(

>> 8 pointers in one array, hmm... 8*8 = 64 consecutive bytes, that is
>> half a cache line on my arch. (And on 32-bit arches, 8*4 = 32 bytes,
>> typically one full cache line.) Add an alignment constraint if you
>> like. I was comparing to fetching a per-cpu variable saying we had
>> any entries; I'd say the difference is the time to do 8 loads once
>> you have the line vs the chance that the other cpu is actively
>> storing and stealing it back before you finish. I'd be really
>> surprised if it was stolen twice.
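
For reference, the per-cpu LRU being argued over is, more or less, this
bit of fs/buffer.c:

	#define BH_LRU_SIZE	8

	struct bh_lru {
		struct buffer_head *bhs[BH_LRU_SIZE];
	};

	static DEFINE_PER_CPU(struct bh_lru, bh_lrus) = {{ NULL }};

8 pointers * 8 bytes is the 64 consecutive bytes above - half a line
with 128-byte cache lines, one full line with 64-byte ones.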

> The problem is that if you throw the IPI onto multiple CPUs they will
> all try to hit this array at the same time. Besides, on some platforms
> this would be 64 * 8 * 8 (or even bigger) - scalability quickly goes
> out the window, especially as it's to be fetched from another node :(
> I know these are extreme cases in today's world, but it's becoming
> more of an issue with all the multi-core stuff we're going to see.


Huh? The cpus have their own local array; you scan the remote ones to
see if you need to wake that guy up, sort of like the need_resched /
polling flags. Ok, they all have to pull the line back from shared
(with the caller) to exclusive/modified at once - but they had to do
that with whatever method you used to tell them what to do, too.

The trade-off is bouncing one line holding the mask of who might have
something relevant in a bit vector (two bounces if you always clear it,
shared if you do it lazily) vs pulling a line per cpu when you need to
check - a line that will be actively pulled back on every bh lookup.
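
To make the two flavours concrete, a rough sketch - the names are made
up, this is not the patch:

	/*
	 * Flavour 1: one shared bit vector saying who might have
	 * something.  Each cpu sets its own bit when it caches a bh;
	 * the invalidator reads the whole mask in a single (bouncing)
	 * cache line and IPIs only the cpus whose bit is set.
	 */
	static cpumask_t bh_lru_present_mask;		/* hypothetical */

	static void note_bh_cached(void)
	{
		/* callers run with preemption off while touching their lru */
		int cpu = smp_processor_id();

		/* test first so the common case stays a shared read */
		if (!cpu_isset(cpu, bh_lru_present_mask))
			cpu_set(cpu, bh_lru_present_mask);
	}

	/*
	 * Flavour 2: a per-cpu flag that stays local to its owner on
	 * every store.  The invalidator pulls one remote line per cpu
	 * at check time.
	 */
	static DEFINE_PER_CPU(int, bh_lru_nonempty);	/* hypothetical */

	static void build_ipi_mask(cpumask_t *mask)
	{
		int cpu;

		cpus_clear(*mask);
		for_each_online_cpu(cpu)
			if (per_cpu(bh_lru_nonempty, cpu))
				cpu_set(cpu, *mask);
	}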

Maybe you shouldn't push the work out until some time has expired -
just wait a bit and let keventd kick in first. After all, you would
only be slowing down the thread trying to invalidate a dead device.
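
Something like this, with the workqueue API of the day (a sketch only -
the sweep body, the helper, and the delay are invented):

	/* one piece of deferred work that batches up the invalidates */
	static void bh_lru_sweep(void *unused)
	{
		/* walk whatever list of dead bdevs accumulated, IPI once */
	}
	static DECLARE_WORK(bh_lru_sweep_work, bh_lru_sweep, NULL);

	static void invalidate_bdev_lazy(struct block_device *bdev)
	{
		/* note the bdev somewhere, then let keventd get to it */
		schedule_delayed_work(&bh_lru_sweep_work, HZ / 10);
	}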

>> If you want to cut down on the cache line passing, then putting
>> the cpu mask in the bdev (for "I have cached a bh on this bdev
>> sometime") might be useful. You could even do a bdev-specific
>> lru kill, but then we get into the next topic.

> That could be an interesting way out.

>> Another approach is to age the cache: only clear the bit during the
>> IPI call if you had nothing there on the last N calls (N being some
>> combination of absolute time and number of invalidate calls).
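
(Spelled out, the aging would be something like this in the IPI
handler - N is pulled out of thin air, lru_was_empty stands in for the
real check, and bh_lru_present_mask is the hypothetical shared mask
from the sketch further up:)

	#define BH_LRU_IDLE_LIMIT	4	/* "N" */
	static DEFINE_PER_CPU(unsigned int, bh_lru_idle_calls);

	/* in the per-cpu invalidate handler: */
	if (lru_was_empty) {
		if (++__get_cpu_var(bh_lru_idle_calls) >= BH_LRU_IDLE_LIMIT) {
			cpu_clear(smp_processor_id(), bh_lru_present_mask);
			__get_cpu_var(bh_lru_idle_calls) = 0;
		}
	} else {
		__get_cpu_var(bh_lru_idle_calls) = 0;
	}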

> I guess if there's nothing in the LRU you don't get the call, and the
> number of cross-node clears is limited, but it seems to get worse
> exponentially with the size of the machine, especially if it's busy
> doing IO, which is what worries me :(


If you are doing that much bh-based IO then you will have to issue the
IPI even with the cpu mask - unless you go with the per-bdev cpu_mask.
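
The per-bdev variant would look something like this (field name and
helper invented for illustration):

	struct block_device {
		/* ... existing fields ... */
		cpumask_t	bd_bh_cpus;	/* cpus that ever lru'd one of our bh's */
	};

	/* on the cpu installing a bh into its local lru: */
	static void bdev_note_bh_cached(struct buffer_head *bh)
	{
		cpu_set(smp_processor_id(), bh->b_bdev->bd_bh_cpus);
	}

	/* invalidate_bdev() then only IPIs the cpus in bd_bh_cpus, and
	 * untouched devices never drag in anyone's lru. */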

Next, I'll look for my asbestos suit and take a peek at the hald source.

* miltonm stands well back

later,
milton
