RE: The cause of the "VM" performance problem with 2.4.X

From: Van Maren, Kevin (kevin.vanmaren@unisys.com)
Date: Wed Aug 22 2001 - 17:23:46 EST


Thanks Andrew! I'll try out your patch shortly, and I'll let you
know how it goes.

> Note how fsync_dev() passes the target device to sync_buffers(). But
> the dirty buffer list is global. So to write out the dirty buffers
> for a particular device, write_locked_buffers() has to do a linear
> walk of the dirty buffers for *other* devices to find the target
> device.

I had just figured that out. I hadn't changed the code yet, though.

The other thing that exacerbates the start-from-0 approach is that we
only flush 32 blocks at a time. If we already take a long time to find
the first matching buffer because of the linear search, doing so little
work once we get there makes it even worse.
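
For reference, the loop in question looks roughly like this (paraphrased
from memory from 2.4 fs/buffer.c, so the details may be off):

/* Called with lru_list_lock held.  The BUF_DIRTY LRU is global,
 * holding dirty buffers for *all* devices. */
static int write_some_buffers(kdev_t dev)
{
	struct buffer_head *array[NRSYNC];	/* NRSYNC == 32 */
	struct buffer_head *next = lru_list[BUF_DIRTY];
	unsigned int count = 0;

	while (next) {
		struct buffer_head *bh = next;
		next = bh->b_next_free;

		if (dev && bh->b_dev != dev)
			continue;		/* the linear skip: wrong device */
		if (test_and_set_bit(BH_Lock, &bh->b_state))
			continue;
		if (atomic_set_buffer_clean(bh)) {
			__refile_buffer(bh);	/* moves bh onto BUF_LOCKED */
			get_bh(bh);
			array[count++] = bh;
			if (count < NRSYNC)
				continue;
			/* batch full: drop the lock and start the I/O ... */
			spin_unlock(&lru_list_lock);
			write_locked_buffers(array, count);
			/* ... and the caller restarts the scan from the
			 * head of the list, re-skipping everything */
			return -EAGAIN;
		}
		unlock_buffer(bh);
		__refile_buffer(bh);
	}
	spin_unlock(&lru_list_lock);
	if (count)
		write_locked_buffers(array, count);
	return 0;
}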

> And write_unlocked_buffers() uses a quite common construct - it
> scans a list but when it drops the lock, it restarts the scan
> from the start of the list. (We do this all over the kernel, and
> it keeps on biting us).

It sounds like we want a per-device list, not a global (linear) one.
I'm sure a per-device list would introduce problems elsewhere in the
kernel, but it would make things like sync() better, since we could
flush all the devices more in parallel. There are probably also
problems there when scaling to hundreds or thousands of disks. A rough
sketch of what I mean follows.
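
Purely hypothetical, with made-up names (struct dev_dirty,
write_some_buffers_dev), and with the locking and the refile_buffer()
interaction hand-waved, but something like:

/* One dirty list per device instead of the global lru_list[BUF_DIRTY],
 * so the flush path never has to skip other devices' buffers. */
struct dev_dirty {
	spinlock_t		lock;
	struct buffer_head	*head;	/* this device's dirty buffers */
};

static int write_some_buffers_dev(struct dev_dirty *d)
{
	struct buffer_head *array[NRSYNC];
	struct buffer_head *bh, *next;
	unsigned int count = 0;

	spin_lock(&d->lock);
	for (bh = d->head; bh && count < NRSYNC; bh = next) {
		next = bh->b_next_free;
		/* every buffer on this list is ours: nothing to skip */
		if (test_and_set_bit(BH_Lock, &bh->b_state))
			continue;
		if (atomic_set_buffer_clean(bh)) {
			get_bh(bh);
			array[count++] = bh;	/* unlinking from d->head elided */
		} else
			unlock_buffer(bh);
	}
	spin_unlock(&d->lock);

	if (count)
		write_locked_buffers(array, count);
	return count == NRSYNC;		/* nonzero: probably more to do */
}

sync() would then just walk the devices in turn (or kick them all off
and wait), and fsync_dev(B) would never touch device A's buffers at all.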

> So if the dirty buffer list has 10,000 buffers for device A and
> then 10,000 buffers for device B, and you call fsync_dev(B),
> we end up traversing the 10,000 buffers of device A 10,000/32 times,
> which is a lot.
>
> In fact, write_unlocked_buffers(A) shoots itself in the foot by
> moving buffers for device A onto BUF_LOCKED, and then restarting the
> scan. So of *course* we end up with zillions of non-A buffers at the
> head of the list.
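
(For concreteness: flushing B's 10,000 buffers 32 at a time means about
10,000/32 = 313 restarts, and each restart first skips A's 10,000
buffers, so that's over 3 million buffers skipped just to sync B.)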

Is it safe to restart from the middle of the list after dropping
the lock? It looks like you try to solve that problem by only using
the hint if the block is still on the dirty list.

I was going to cache that per-device info globally as a quick hack;
I was thinking about hanging it off the kdev_t, since I didn't want
a fixed-size array in write_some_buffers(), but your approach of just
having the caller track it works too.
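
In code, the hint validation would be something like this (hypothetical;
"hint" is wherever the caller stashed the last position, checked against
the b_list field from 2.4's struct buffer_head):

	struct buffer_head *bh;

	/* Resume from the saved position only if that buffer is still
	 * on the dirty list; otherwise it may have been refiled while
	 * the lock was dropped, so fall back to a scan from the head. */
	spin_lock(&lru_list_lock);
	if (hint && hint->b_list == BUF_DIRTY)
		bh = hint;			/* safe to resume mid-list */
	else
		bh = lru_list[BUF_DIRTY];	/* hint went stale: start over */
	/* ... continue the scan from bh as before ... */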

I'm off to try the patch...

But before I go, here is another profile run, this time without your
patch, to validate your hypothesis (I seem to have lost several
interrupts):

[ia64_spinlock_contention is time spent spinning on a spinlock (you know
which one); myskipbuffer1 is called each time write_some_buffers() "skips"
a buffer in the LRU list (no dev match).
Note that in this case 69,385,716 skips / 2,658 calls = roughly 26,000
elements skipped on the "average" call to write_some_buffers().]

Each sample counts as 0.000976562 seconds.
  %   cumulative   self               self     total
 time   seconds   seconds     calls  ms/call  ms/call  name
 50.36     40.43    40.43                              ia64_spinlock_contention
 23.98     59.68    19.25      2658     7.24     8.17  write_some_buffers
 17.10     73.40    13.72                              cg_record_arc
  4.48     77.00     3.59                              mcount
  1.36     78.09     1.10     85039     0.01     0.01  __make_request
  0.82     78.75     0.66  69385716     0.00     0.00  myskipbuffer1
  0.77     79.37     0.62     85039     0.01     0.01  blk_get_queue

index % time    self  children    called     name
                                                 <spontaneous>
[1]     52.7   40.43    0.00                 ia64_spinlock_contention [1]
-----------------------------------------------
                2.68    0.34      370/2658        sync_old_buffers [11]
               16.57    2.12     2288/2658        write_unlocked_buffers [6]
[2]     28.3   19.25    2.47     2658          write_some_buffers [2]
                0.00    1.77     2658/2658        write_locked_buffers [12]
                0.66    0.00 69385716/69385716    myskipbuffer1 [16]
                0.01    0.01    85056/123329      __refile_buffer [54]
                0.01    0.00    85056/130255      __insert_into_lru_list [70]
                0.01    0.00    85056/85056       myatomic_set_buffer_clean [93]
                0.00    0.00    85056/85056       my__refile_buffer [104]
                0.00    0.00    85056/85056       mytest_and_set_bit [113]
                0.00    0.00     2658/418442      spin_unlock_ [97]
-----------------------------------------------

Kevin Van Maren