Re: Large File support and blocks.

From: Matti Aarnio (matti.aarnio@zmailer.org)
Date: Fri Sep 01 2000 - 05:30:26 EST


  Stephen, could you have a moment to look at the struct buffer_head {}
  alignment matters ? And possible configure time change to make the
  block number possibly a 'long long' variable ?

  Changeing field order might be doable now, while I definitely think that
  changeing blocknumber variable type is 2.5 matter.

On Fri, Sep 01, 2000 at 12:09:23AM -0700, Linda Walsh wrote:
> Perhaps an "easy" way to go would be to convert block numbers to
> type "block_nr_t", then one could measure the difference that 10's of
> nanoseconds make against seeks and reads of disk data.
>
> Obviously the effect would be greater on ramdisks, but on computers with
> equivalent generation hard disks we might be talking something non-measurable
> or at least down in the noise level. Even transfers of 512 bytes of
> data from a buffer to user space is going to take significantly more cycles.
>
> If the change became as simple as changing a typedef in a file, then you
> could even allow a user to choose to allow "very large files" -- compiling
> with long long when turned on, or simple ints when turned off -- if that
> was seen as a necessity -- but I'd still lean toward it not being noticable.
>
> -l

        Going forward on this thought: As we have special kernels for
        various processor versions, we just might go ahead and bite the
        bullet by allowing this selectable kernel ABI breakage..
        (As much as we hate it, but we have heaps of different almoast
         alike processors already in selections -- you are in deep
         trouble if you try to use P-III code at i386 ...)

        Having 64-bit block_nr_t won't hurt that much at high clock-
        rate P-II/III, while power starved embedded 386 definitely won't
        like of it. However the cache issues...

        Looking at the struct buffer_head, of which growth has been
        opposed by Linus in the past (see <linux/fs.h> for it), putting
        64-bit integer in place of 32-bit one grows the head by one
        extra padding (32 bits) at some architectures in addition to
        the variable size growth. A "whoops!" I think.

        I think that for example SPARC32 has stricter alignment rules than
        what i386 and 68k series processors have.

        Better (denser) alignments can be achieved by small rearrangement
        of fields, though.

        Reading the comment block above this structure definition gives
        me an impression, that it was the old dream back in 1.x, and now
        things have rotten seriously.. Structure field usage profiling,
        and then field reordering is definitely in order.

        Cache line sizes do vary in between processor models and families.
        Some i386 family models have 16 byte cache lines, P-II/III has 32.
        Alphas have 32 and 64 byte ones. (Most low-end embedded processors
        have none.)

        Current 2.4.0-test8 at i386:

byteoffsets

struct buffer_head {
        /* First cache line: */
0 struct buffer_head *b_next; /* Hash queue list */
4 unsigned long b_blocknr; /* block number */
8 unsigned short b_size; /* block size */
10 unsigned short b_list; /* List that this buffer appears */
12 kdev_t b_dev; /* device (B_FREE = free) */
14 (padding)

        /* Second cache line: */
16 atomic_t b_count; /* users using this block */
20 kdev_t b_rdev; /* Real device */
22 (padding)
24 unsigned long b_state; /* buffer state bitmap (see above) */
28 unsigned long b_flushtime; /* Time when (dirty) buffer should be written */

        /* Third cache line */
32 ... (LRU links, etc.)

        Same with 64-bit block_nr_t at 32-bit i386:

byteoffsets

struct buffer_head {
        /* First cache line: */
0 struct buffer_head *b_next; /* Hash queue list */
4 block_nr_t b_blocknr; /* block number */
12 unsigned short b_size; /* block size */

        /* Second cache line: */
16 unsigned short b_list; /* List that this buffer appears */
18 kdev_t b_dev; /* device (B_FREE = free) */
20 atomic_t b_count; /* users using this block */
24 kdev_t b_rdev; /* Real device */
26 (padding)
28 unsigned long b_state; /* buffer state bitmap (see above) */

        /* Third cache line: */
32 unsigned long b_flushtime; /* Time when (dirty) buffer should be written */
36 ... (LRU links, etc.)

        Same with 64-bit block_nr_t at 32-bit SPARC

byteoffsets

struct buffer_head {
        /* First cache line: */

0 struct buffer_head *b_next; /* Hash queue list */
4 (padding)
8 block_nr_t b_blocknr; /* block number */
16 unsigned short b_size; /* block size */
18 unsigned short b_list; /* List that this buffer appears */
20 kdev_t b_dev; /* device (B_FREE = free) */
22 (padding)
24 atomic_t b_count; /* users using this block */
28 kdev_t b_rdev; /* Real device */
30 (padding)

        /* Second cache line */
32 unsigned long b_state; /* buffer state bitmap (see above) */
36 unsigned long b_flushtime; /* Time when (dirty) buffer should be written */
40 ... (LRU links, etc.)

        REORDERING things with 64-bit block_nr_t at 32-bit SPARC

byteoffsets

struct buffer_head {
        /* First cache line: */
0 block_nr_t b_blocknr; /* block number */
8 struct buffer_head *b_next; /* Hash queue list */
12 unsigned short b_size; /* block size */
14 unsigned short b_list; /* List that this buffer appears */
16 kdev_t b_dev; /* device (B_FREE = free) */
18 kdev_t b_rdev; /* Real device */
20 atomic_t b_count; /* users using this block */
24 unsigned long b_state; /* buffer state bitmap (see above) */
28 unsigned long b_flushtime; /* Time when (dirty) buffer should be written */

        /* Second cache line: */
32 ... (LRU links, etc..)

        Same reordering with 64-bit block_nr_t at 64-bit Alpha:
        (also SPARC64)

byteoffsets

struct buffer_head {
        /* First cache line: */
0 block_nr_t b_blocknr; /* block number */
8 struct buffer_head *b_next; /* Hash queue list */
16 unsigned short b_size; /* block size */
18 unsigned short b_list; /* List that this buffer appears */
20 kdev_t b_dev; /* device (B_FREE = free) */
22 kdev_t b_rdev; /* Real device */
24 atomic_t b_count; /* users using this block */
28 (padding)
        /* maybe second cache line */
32 unsigned long b_state; /* buffer state bitmap (see above) */
40 unsigned long b_flushtime; /* Time when (dirty) buffer should be written */
48 ... (LRU links, etc..)

        Oh yes, 'short' fields at Alphas (earlier models) are sub-optimal
        for performance, and the processor must do magic shifting and
        masking at loading (quick thing), and convoluted read-modify-write
        sequence at storing (definitely not a quick thing..)
        Thus changeing 'short' type variables to 'int' type variables
        in as many frequently used places as possible gives better
        performance at Alpha models without additional BWX instructions.

        How e.g. SPARC and IA64 handle the situation, I don't know for sure.

> --
> Linda A Walsh | Trust Technology, Core Linux, SGI
> law@sgi.com | Voice: (650) 933-5338

/Matti Aarnio
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/



This archive was generated by hypermail 2b29 : Thu Sep 07 2000 - 21:00:10 EST