Re: [PATCH 2/2] ceph: quota: fix quota subdir mounts

From: Luis Henriques
Date: Mon Mar 11 2019 - 07:43:13 EST


"Yan, Zheng" <ukernel@xxxxxxxxx> writes:

> On Sat, Mar 9, 2019 at 12:30 AM Luis Henriques <lhenriques@xxxxxxxx> wrote:
>>
>> The CephFS kernel client does not enforce quotas set in a directory that isn't
>> visible from the mount point. For example, given the path '/dir1/dir2', if quotas
>> are set in 'dir1' and the filesystem is mounted with
>>
>> mount -t ceph <server>:<port>:/dir1/ /mnt
>>
>> then the client won't be able to access 'dir1' inode, even if 'dir2' belongs to
>> a quota realm that points to it.
>>
>> This patch fixes this issue by simply doing an MDS LOOKUPINO operation for
>> unknown inodes. Any inode reference obtained this way will be added to a list
>> in ceph_mds_client, and will only be released when the filesystem is umounted.
>>
>> Link: https://tracker.ceph.com/issues/38482
>> Reported-by: Hendrik Peyerl <hpeyerl@xxxxxxxxxxxx>
>> Signed-off-by: Luis Henriques <lhenriques@xxxxxxxx>
>> ---
>> fs/ceph/mds_client.c | 14 +++++++
>> fs/ceph/mds_client.h | 2 +
>> fs/ceph/quota.c | 91 +++++++++++++++++++++++++++++++++++++++-----
>> fs/ceph/super.h | 2 +
>> 4 files changed, 99 insertions(+), 10 deletions(-)
>>
>> diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
>> index 163fc74bf221..72c5ce5e4209 100644
>> --- a/fs/ceph/mds_client.c
>> +++ b/fs/ceph/mds_client.c
>> @@ -3656,6 +3656,8 @@ int ceph_mdsc_init(struct ceph_fs_client *fsc)
>> mdsc->max_sessions = 0;
>> mdsc->stopping = 0;
>> atomic64_set(&mdsc->quotarealms_count, 0);
>> + INIT_LIST_HEAD(&mdsc->quotarealms_inodes_list);
>> + spin_lock_init(&mdsc->quotarealms_inodes_lock);
>> mdsc->last_snap_seq = 0;
>> init_rwsem(&mdsc->snap_rwsem);
>> mdsc->snap_realms = RB_ROOT;
>> @@ -3726,9 +3728,21 @@ static void wait_requests(struct ceph_mds_client *mdsc)
>> */
>> void ceph_mdsc_pre_umount(struct ceph_mds_client *mdsc)
>> {
>> + struct ceph_inode_info *ci;
>> +
>> dout("pre_umount\n");
>> mdsc->stopping = 1;
>>
>> + spin_lock(&mdsc->quotarealms_inodes_lock);
>> + while(!list_empty(&mdsc->quotarealms_inodes_list)) {
>> + ci = list_first_entry(&mdsc->quotarealms_inodes_list,
>> + struct ceph_inode_info,
>> + i_quotarealms_inode_item);
>> + list_del(&ci->i_quotarealms_inode_item);
>> + iput(&ci->vfs_inode);
>
> iput while holding spinlock is not good

Yep, that's definitely not a good idea. And maybe this loop could even
race with a lookup currently in progress and a new inode could be added
to the list after the cleanup. I didn't verified this race could really
occur, because an easy fix is to simply move this loop into
ceph_mdsc_destroy() where we don't really need the spinlock anymore.

<snip>

>> +restart:
>> down_read(&mdsc->snap_rwsem);
>> - old_realm = get_quota_realm(mdsc, old);
>> - new_realm = get_quota_realm(mdsc, new);
>> + old_realm = get_quota_realm(mdsc, old, true);
>> + /* This needs to be atomic, so we need to hold snap_rwsem */
>
> I don't understand this comment. get_quota_realm() unlock snap_rwsem
> no matter the 'retry' parameter is true or not/

Right, maybe the parameter name and the comment are a bit misleading.

get_quota_realm may need to do an inode lookup and then retry to get the
quota realm. If this lookup is required, it has to drop the rwsem.

However, in ceph_quota_is_same_realm we need to lookup 2 quota realms
"atomically", i.e. with the rwsem held. If get_quota_realm needs to
drop it, it will do the MDS inode lookup anyway but instead of retrying
to get the quota realm it will return -EAGAIN (because 'retry' will be
'false'). This allows for ceph_quota_is_same_realm to restart the
operation itself, and retry to get both quota realms without
get_quota_realm dropping the rwsem.

Does it make sense? I agree the design isn't great :-/
I tried to describe this behavior in get_quota_realm comment, but it's
probably not good enough.

>> + new_realm = get_quota_realm(mdsc, new, false);
>> + if (PTR_ERR(new_realm) == -EAGAIN) {
>> + up_read(&mdsc->snap_rwsem);
>> + if (old_realm)
>> + ceph_put_snap_realm(mdsc, old_realm);
>> + goto restart;
>> + }
>> is_same = (old_realm == new_realm);
>> up_read(&mdsc->snap_rwsem);
>>
>> @@ -166,6 +230,7 @@ static bool check_quota_exceeded(struct inode *inode, enum quota_check_op op,
>> return false;
>>
>> down_read(&mdsc->snap_rwsem);
>> +restart:
>> realm = ceph_inode(inode)->i_snap_realm;
>> if (realm)
>> ceph_get_snap_realm(mdsc, realm);
>> @@ -176,9 +241,15 @@ static bool check_quota_exceeded(struct inode *inode, enum quota_check_op op,
>> spin_lock(&realm->inodes_with_caps_lock);
>> in = realm->inode ? igrab(realm->inode) : NULL;
>> spin_unlock(&realm->inodes_with_caps_lock);
>> - if (!in)
>> - break;
>> -
>> + if (!in) {
>
> maybe we should distinguish ârealm->inode is null' from 'igrab fails'

Sure, that makes sense.

Cheers,
--
Luis


>
>> + up_read(&mdsc->snap_rwsem);
>> + in = lookup_quotarealm_inode(mdsc, inode->i_sb, realm);
>> + down_read(&mdsc->snap_rwsem);
>> + if (IS_ERR(in))
>> + break;
>> + ceph_put_snap_realm(mdsc, realm);
>> + goto restart;
>> + }
>> ci = ceph_inode(in);
>> spin_lock(&ci->i_ceph_lock);
>> if (op == QUOTA_CHECK_MAX_FILES_OP) {
>> @@ -314,7 +385,7 @@ bool ceph_quota_update_statfs(struct ceph_fs_client *fsc, struct kstatfs *buf)
>> bool is_updated = false;
>>
>> down_read(&mdsc->snap_rwsem);
>> - realm = get_quota_realm(mdsc, d_inode(fsc->sb->s_root));
>> + realm = get_quota_realm(mdsc, d_inode(fsc->sb->s_root), true);
>> up_read(&mdsc->snap_rwsem);
>> if (!realm)
>> return false;
>> diff --git a/fs/ceph/super.h b/fs/ceph/super.h
>> index ce51e98b08ec..cc7766aeb73b 100644
>> --- a/fs/ceph/super.h
>> +++ b/fs/ceph/super.h
>> @@ -375,6 +375,8 @@ struct ceph_inode_info {
>> struct list_head i_snap_realm_item;
>> struct list_head i_snap_flush_item;
>>
>> + struct list_head i_quotarealms_inode_item;
>> +
>> struct work_struct i_wb_work; /* writeback work */
>> struct work_struct i_pg_inv_work; /* page invalidation work */
>>
>