Re: [PATCH 1/2] ocfs2: Zero the tail cluster when extending pasti_size v2

From: Tao Ma
Date: Sun Jul 04 2010 - 23:53:16 EST


Hi Joel,

On 07/04/2010 05:32 AM, Joel Becker wrote:
Here's the second version of my corruption fix. It fixes two
bugs:

1) i_size can obviously be at a place that is a hole, so don't BUG on
that.
2) Fix an off-by-one when checking whether the write position is within
the tail allocation.

This version passes my tail corruption test as well as the kernel
compile that exposed the two bugs above.

Joel

---------------------------------------------------------------

ocfs2's allocation unit is the cluster. This can be larger than a block
or even a memory page. This means that a file may have many blocks in
its last extent that are beyond the block containing i_size.

When ocfs2 grows a file, it zeros the entire cluster in order to ensure
future i_size growth will see cleared blocks. Unfortunately,
block_write_full_page() drops the pages past i_size. This means that
ocfs2 is actually leaking garbage data into the tail end of that last
cluster.

We adjust ocfs2_write_begin_nolock() and ocfs2_extend_file() to detect
when a write or truncate is past i_size. If there is any existing
allocation between the block containing the current i_size and the
location of the write or truncate, zeros will be written to that
allocation.

This is only for sparse filesystems. Non-sparse filesystems already get
this via ocfs2_extend_no_holes().

Signed-off-by: Joel Becker<joel.becker@xxxxxxxxxx>
---
fs/ocfs2/aops.c | 22 ++++----
fs/ocfs2/file.c | 154 +++++++++++++++++++++++++++++++++++++++++++++++++------
fs/ocfs2/file.h | 2 +
3 files changed, 150 insertions(+), 28 deletions(-)

<snip>
diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
index 6a13ea6..7fca78d 100644
--- a/fs/ocfs2/file.c
+++ b/fs/ocfs2/file.c
@@ -848,6 +848,137 @@ out:
return ret;
}

+/*
+ * This function is a helper for ocfs2_zero_tail(). It calculates
+ * what blocks need zeroing and does any CoW necessary.
+ */
+static int ocfs2_zero_tail_prepare(struct inode *inode,
+ struct buffer_head *di_bh,
+ loff_t pos, u64 *start_blkno,
+ u64 *blocks)
+{
+ int rc = 0;
+ struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
+ u32 tail_cpos, pos_cpos, p_cpos;
+ u64 tail_blkno, pos_blkno, blocks_to_zero;
+ unsigned int num_clusters = 0;
+ unsigned int ext_flags = 0;
+
+ /*
+ * The block containing i_size has already been zeroed, so our tail
+ * block is the first block after i_size. The block containing
+ * pos will be zeroed. So we only need to do anything if
+ * tail_blkno is before pos_blkno.
+ */
+ tail_blkno = (i_size_read(inode)>> inode->i_sb->s_blocksize_bits) + 1;
+ pos_blkno = pos>> inode->i_sb->s_blocksize_bits;
+ mlog(0, "tail_blkno = %llu, pos_blkno = %llu\n",
+ (unsigned long long)tail_blkno, (unsigned long long)pos_blkno);
+ if (pos_blkno<= tail_blkno)
+ goto out;
+ blocks_to_zero = pos_blkno - tail_blkno;
+
+ /*
+ * If tail_blkno is in the cluster past i_size, we don't need
+ * to touch the cluster containing i_size at all.
+ */
+ tail_cpos = i_size_read(inode)>> osb->s_clustersize_bits;
+ if (ocfs2_blocks_to_clusters(inode->i_sb, tail_blkno)> tail_cpos)
+ tail_cpos = ocfs2_blocks_to_clusters(inode->i_sb,
+ tail_blkno);
Can we always set tail_cpos in one line?
tail_cpos = ocfs2_blocks_to_clusters(inode->i_sb, tail_blkno)?
tail_cpos is either the same cluster as i_size or the next cluster and both works for tail_blkno I guess?
+
+ rc = ocfs2_get_clusters(inode, tail_cpos,&p_cpos,&num_clusters,
+ &ext_flags);
+ if (rc) {
+ mlog_errno(rc);
+ goto out;
+ }
+
+ /* Is there a cluster to zero? */
+ if (!p_cpos)
+ goto out;
For unwritten extent, we also need to clear the pages? If yes, the solution doesn't complete if we have 2 unwritten extent, one contains i_size while one passes i_size. Here we only clear the pages for the 1st unwritten extent and leave the 2nd one untouched.
+
+ pos_cpos = pos>> osb->s_clustersize_bits;
+ mlog(0, "tail_cpos = %u, num_clusters = %u, pos_cpos = %u, tail_blkno = %llu, pos_blkno = %llu\n",
+ (unsigned int)tail_cpos, (unsigned int)num_clusters,
+ (unsigned int)pos_cpos, (unsigned long long)tail_blkno,
+ (unsigned long long)pos_blkno);
From here to the call of CoW is a bit hard to understand. In 'if', num_clusters is set for CoW and in 'else', blocks_to_zero is set. So it isn't easy for the reader to tell why these 2 clauses are setting different values. So how about my code below? It looks more straightforward I think.
+ if ((tail_cpos + num_clusters)> pos_cpos) {
+ num_clusters = pos_cpos - tail_cpos;
+ if (pos_blkno>
+ ocfs2_clusters_to_blocks(inode->i_sb, pos_cpos))
+ num_clusters += 1;
+ } else {
+ blocks_to_zero =
+ ocfs2_clusters_to_blocks(inode->i_sb,
+ tail_cpos + num_clusters);
+ blocks_to_zero -= tail_blkno;
+ }
+
+ /* Now CoW the clusters we're about to zero */
+ if (ext_flags& OCFS2_EXT_REFCOUNTED) {
+ rc = ocfs2_refcount_cow(inode, di_bh, tail_cpos,
+ num_clusters, UINT_MAX);
+ if (rc) {
+ mlog_errno(rc);
+ goto out;
+ }
+ }
/* Decrease blocks_to_zero if there is some hole after extent */
if (tail_cpos + num_clusters <= pos_cpos) {
blocks_to_zero =
ocfs2_clusters_to_blocks(inode->i_sb,
tail_cpos + num_clusters);
blocks_to_zero -= tail_blkno;
}

/* Now CoW if we have some refcounted clusters. */
if (ext_flags & OCFS2_EXT_REFCOUNTED) {
/*
* We add one more cluster here since it will be
* written shortly and if the pos_blkno isn't aligned
* to the cluster size, we have to zero the blocks
* before it.
*/
if (tail_cpos + num_clusters > pos_cpos)
num_clusters = pos_cpos - tail_cpos + 1;

rc = ocfs2_refcount_cow(inode, di_bh, tail_cpos,
num_clusters, UINT_MAX);
if (rc) {
mlog_errno(rc);
goto out;
}

}
+
+ *start_blkno = tail_blkno;
+ *blocks = blocks_to_zero;
+ mlog(0, "start_blkno = %llu, blocks = %llu\n",
+ (unsigned long long)(*start_blkno),
+ (unsigned long long)(*blocks));
+
+out:
+ return rc;
+}

Regards,
Tao
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/