Large files, repartitioning, a cleaner algorithm, Virtual memory stuff, and Reiserfs

Hans Reiser (reiser@ricochet.net)
Thu, 05 Mar 1998 14:20:52 -0800


This is a multi-part message in MIME format.
--------------705C2C4C1A3F6A5BD4FDDAA9
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit

Vladimir and Anatoly, the two pieces of feedback I am getting from
potential users are:

They would like us to handle large files and large file systems. Now
that our industry is getting into storing DVD movies and the like on
disk, the current file size limit is a pain. They think we should start
out handling large files from the beginning, to save conversion costs
later. After we get the current bugs out we should do it, before we
have users who care about the pain of running mkfs and restoring from
tape. This will bloat our key size though. Oh well. Make it a single
#define please.

if we could modify fsck to shift used blocks all the way to the left or
right of the partition, and call fdisk to resize the partition, and
adjust the bitmap, people like VA Research (they sell linux hardware,
they are also pretty cool....) would be tempted to use our file system
for that reason alone. Note that it is my hope that this would not be
hard to code into reiserfsck. It is a prime example of something
utterly uninteresting that users really like.:-) It can be slow if that
makes the code simple. It won't repack nodes at first, to keep it
simple at first. Think about it this month, maybe you can do it next
month after the current bugs are gone, ok?

Then 6 months later we can take maybe 6 weeks to more fully implement
this algorithm which follows. If we implement the algorithm which
follows, it will do more for optimizing our file system than we can
possibly accomplish dynamically. Note that it does not accomodate high
availability concerns, as it requires unmounting the file system. Doing
it on a live FS should be done as a later refinement, maybe next year.

(Please store this algorithm in cleaner.alg in our utils directory.)

Note how it clusters nodes in clusters of size CLUSTER, and then puts
space between the clusters. This reduces fragmentation, but does not
require new insertions or preserve list shifted blocks to end up a long
ways away from their neighbors. I imagine CLUSTER_SIZE to be about 1
track in size, and CLUSTER_SPACING to be some even distribution of the
free blocks. Note that users who want excessive perfection will need to
run it twice, once to find out how small the file system can be
repacked, and once to specify the CLUSTER_SPACING precisely knowing how
many free blocks there will be.

Algorithm:

mark the super block as needing cleaning on next fsck.
user unmounts file system (rebooting if it is the root fs), and then
runs reiserfsck on it.
reiserfsck checks to see if super_block is flagged as the user wanting
Reiserfs cleaned and possibly repartitioned.

cleaner() procedure in reiserfsck takes as parameters the starting
cylinder block number,
the ending cylinder block number, the "CLUSTER" in blocks, and the
"CLUSTER_SPACING" in blocks as parameters;

check parameters to make sure they compute sensibly;

/* cycle through the whole semantic tree */
for (current_item = leftmost_item_in_tree, disk_position =
start_of_new_partition; current_item < end_of_semantic_tree; ) {

walk the semantic tree, checking the location of each node in the tree,
move the blocks between disk_position and disk_position + SCRATCH
to the last blocks that are currently free in the bitmap.
Note that fragmentation and the cost of tree walking will make this the
slow part of the algorithm.

/* Fill up the empty space just created with perfectly packed and layed
out tree */
while (blocks_remaining_in_current_scratch_area > 0)
{
/* laydown CLUSTER_SIZE blocks consisting of fully packed and
perfectly ordered nodes,
and then put down a gap of CLUSTER_SPACING blocks */
while (blocks_remaining_in_current_cluster > 0) {
while (current_node is not full)
{
if current_item is not in memory, {
read SCRATCH number of nodes into memory,
starting from current_item and moving right in the semantic tree
}
place as much of current item as can fit into the current_node;
split item if necessary;
if current_item last item in tree then we are done;
increment current_item;
}
blocks_remaining_in_current_cluster--;
blocks_remaining_in_current_scratch_area--;
disk_position++;
}
write all nodes in cluster to disk;
disk_position = disk_position + cluster_spacing;
blocks_remaining_in_current_cluster = CLUSTER_SIZE;
}

}

Rik van Riel wrote:
>
> On Thu, 5 Mar 1998, Hans Reiser wrote:
>
> > There is a difference between a 4G file limit and a 4G mmap limit.
> >
> > The 4G mmap limit is an inherent mm issue, and nothing I do can fix it.
> > Or am I wrong?
>
> In the struct_page there's a off_t page->offset member
> to indicate the file offset _from the start of the file_
> when mmap()ing a file.
> This means that everything over the 2g mark can't
> be mmap()ed...
> If you want to support larger files, you'd also want a
> trick to mmap them (maybe two offset members, so you can
> shift the 4g 'window' over the file?)
>
> > What are you doing that isn't already size increased in 2.1?
>
> We recently shrunk the struct_page. Kswapd is getting
> more efficient. Euhm...
>
> Rik.
> +-----------------------------+------------------------------+
> | For Linux mm-patches, go to | "I'm busy managing memory.." |
> | my homepage (via LinuxHQ). | H.H.vanRiel@fys.ruu.nl |
> | ...submissions welcome... | http://www.fys.ruu.nl/~riel/ |
> +-----------------------------+------------------------------+

Rik, wouldn't it be better for me to let you change the
page->offset member to a larger size, and then trivially adapt Reiserfs
to work with that?
Forgive me, I haven't looked at the vm code in 6 months.

Hans
--------------705C2C4C1A3F6A5BD4FDDAA9
Content-Type: text/x-vcard; charset=us-ascii; name="vcard.vcf"
Content-Transfer-Encoding: 7bit
Content-Description: Card for Hans Reiser
Content-Disposition: attachment; filename="vcard.vcf"

begin: vcard
fn: Hans Reiser
n: Reiser;Hans
org: The Naming System Venture
adr: 6979 Exeter Dr.;;;Oakland;CA;94611;USA
email;internet: reiser@ricochet.net
title: Owner
tel;work: 510-482-2483
tel;home: 510-482-5071
note: Phone: +1 (510) 459-4681 (cell phone)
x-mozilla-cpt: ;0
x-mozilla-html: TRUE
version: 2.1
end: vcard

--------------705C2C4C1A3F6A5BD4FDDAA9--

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu