Re: 5.4.188 and later: massive performance regression with nfsd

From: Chuck Lever III
Date: Sat May 21 2022 - 16:43:36 EST

> On May 21, 2022, at 3:49 PM, Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> wrote:
>
> On Sat, 2022-05-21 at 19:11 +0000, Chuck Lever III wrote:
>>
>>
>>> On May 21, 2022, at 2:10 PM, Trond Myklebust
>>> <trondmy@xxxxxxxxxxxxxxx> wrote:
>>>
>>> On Sat, 2022-05-21 at 17:22 +0000, Chuck Lever III wrote:
>>>>
>>>>
>>>>> On May 20, 2022, at 7:43 PM, Chuck Lever III
>>>>> <chuck.lever@xxxxxxxxxx> wrote:
>>>>>
>>>>>
>>>>>
>>>>>> On May 20, 2022, at 6:24 PM, Trond Myklebust
>>>>>> <trondmy@xxxxxxxxxxxxxxx> wrote:
>>>>>>
>>>>>> On Fri, 2022-05-20 at 21:52 +0000, Chuck Lever III wrote:
>>>>>>>
>>>>>>>
>>>>>>>> On May 20, 2022, at 12:40 PM, Trond Myklebust
>>>>>>>> <trondmy@xxxxxxxxxxxxxxx> wrote:
>>>>>>>>
>>>>>>>> On Fri, 2022-05-20 at 15:36 +0000, Chuck Lever III wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> On May 11, 2022, at 10:36 AM, Chuck Lever III
>>>>>>>>>> <chuck.lever@xxxxxxxxxx> wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> On May 11, 2022, at 10:23 AM, Greg KH
>>>>>>>>>>> <gregkh@xxxxxxxxxxxxxxxxxxx> wrote:
>>>>>>>>>>>
>>>>>>>>>>> On Wed, May 11, 2022 at 02:16:19PM +0000, Chuck
>>>>>>>>>>> Lever
>>>>>>>>>>> III
>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> On May 11, 2022, at 8:38 AM, Greg KH
>>>>>>>>>>>>> <gregkh@xxxxxxxxxxxxxxxxxxx> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, May 11, 2022 at 12:03:13PM +0200,
>>>>>>>>>>>>> Wolfgang
>>>>>>>>>>>>> Walter
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> starting with 5.4.188 we see a massive
>>>>>>>>>>>>>> performance
>>>>>>>>>>>>>> regression on our
>>>>>>>>>>>>>> nfs-server. It basically is serving requests
>>>>>>>>>>>>>> very
>>>>>>>>>>>>>> very
>>>>>>>>>>>>>> slowly with cpu
>>>>>>>>>>>>>> utilization of 100% (with 5.4.187 and earlier
>>>>>>>>>>>>>> it
>>>>>>>>>>>>>> is
>>>>>>>>>>>>>> 10%) so
>>>>>>>>>>>>>> that it is
>>>>>>>>>>>>>> unusable as a fileserver.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The culprit are commits (or one of it):
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> c32f1041382a88b17da5736886da4a492353a1bb
>>>>>>>>>>>>>> "nfsd:
>>>>>>>>>>>>>> cleanup
>>>>>>>>>>>>>> nfsd_file_lru_dispose()"
>>>>>>>>>>>>>> 628adfa21815f74c04724abc85847f24b5dd1645
>>>>>>>>>>>>>> "nfsd:
>>>>>>>>>>>>>> Containerise filecache
>>>>>>>>>>>>>> laundrette"
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> (upstream
>>>>>>>>>>>>>> 36ebbdb96b694dd9c6b25ad98f2bbd263d022b63 and
>>>>>>>>>>>>>> 9542e6a643fc69d528dfb3303f145719c61d3050)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> If I revert them in v5.4.192 the kernel works
>>>>>>>>>>>>>> as
>>>>>>>>>>>>>> before
>>>>>>>>>>>>>> and
>>>>>>>>>>>>>> performance is
>>>>>>>>>>>>>> ok again.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I did not try to revert them one by one as
>>>>>>>>>>>>>> any
>>>>>>>>>>>>>> disruption
>>>>>>>>>>>>>> of our nfs-server
>>>>>>>>>>>>>> is a severe problem for us and I'm not sure
>>>>>>>>>>>>>> if
>>>>>>>>>>>>>> they are
>>>>>>>>>>>>>> related.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 5.10 and 5.15 both always performed very
>>>>>>>>>>>>>> badly on
>>>>>>>>>>>>>> our
>>>>>>>>>>>>>> nfs-
>>>>>>>>>>>>>> server in a
>>>>>>>>>>>>>> similar way so we were stuck with 5.4.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I now think this is because of
>>>>>>>>>>>>>> 36ebbdb96b694dd9c6b25ad98f2bbd263d022b63
>>>>>>>>>>>>>> and/or
>>>>>>>>>>>>>> 9542e6a643fc69d528dfb3303f145719c61d3050
>>>>>>>>>>>>>> though
>>>>>>>>>>>>>> I
>>>>>>>>>>>>>> didn't try to
>>>>>>>>>>>>>> revert them in 5.15 yet.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Odds are 5.18-rc6 is also a problem?
>>>>>>>>>>>>
>>>>>>>>>>>> We believe that
>>>>>>>>>>>>
>>>>>>>>>>>> 6b8a94332ee4 ("nfsd: Fix a write performance
>>>>>>>>>>>> regression")
>>>>>>>>>>>>
>>>>>>>>>>>> addresses the performance regression. It was
>>>>>>>>>>>> merged
>>>>>>>>>>>> into
>>>>>>>>>>>> 5.18-
>>>>>>>>>>>> rc.
>>>>>>>>>>>
>>>>>>>>>>> And into 5.17.4 if someone wants to try that
>>>>>>>>>>> release.
>>>>>>>>>>
>>>>>>>>>> I don't have a lot of time to backport this one
>>>>>>>>>> myself,
>>>>>>>>>> so
>>>>>>>>>> I welcome anyone who wants to apply that commit to
>>>>>>>>>> their
>>>>>>>>>> favorite LTS kernel and test it for us.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>>> If so, I'll just wait for the fix to get into
>>>>>>>>>>>>> Linus's
>>>>>>>>>>>>> tree as
>>>>>>>>>>>>> this does
>>>>>>>>>>>>> not seem to be a stable-tree-only issue.
>>>>>>>>>>>>
>>>>>>>>>>>> Unfortunately I've received a recent report that
>>>>>>>>>>>> the
>>>>>>>>>>>> fix
>>>>>>>>>>>> introduces
>>>>>>>>>>>> a "sleep while spinlock is held" for NFSv4.0 in
>>>>>>>>>>>> rare
>>>>>>>>>>>> cases.
>>>>>>>>>>>
>>>>>>>>>>> Ick, not good, any potential fixes for that?
>>>>>>>>>>
>>>>>>>>>> Not yet. I was at LSF last week, so I've just started
>>>>>>>>>> digging
>>>>>>>>>> into this one. I've confirmed that the report is a
>>>>>>>>>> real
>>>>>>>>>> bug,
>>>>>>>>>> but we still don't know how hard it is to hit it with
>>>>>>>>>> real
>>>>>>>>>> workloads.
>>>>>>>>>
>>>>>>>>> We believe the following, which should be part of the
>>>>>>>>> first
>>>>>>>>> NFSD pull request for 5.19, will properly address the
>>>>>>>>> splat.
>>>>>>>>>
>>>>>>>>> https://git.kernel.org/pub/scm/linux/kernel/git/cel/linux.git/commit/?h=for-next&id=556082f5e5d7ecfd0ee45c3641e2b364bff9ee44
>>>>>>>>>
>>>>>>>>>
>>>>>>>> Uh... What happens if you have 2 simultaneous calls to
>>>>>>>> nfsd4_release_lockowner() for the same file? i.e. 2
>>>>>>>> separate
>>>>>>>> processes
>>>>>>>> owned by the same user, both locking the same file.
>>>>>>>>
>>>>>>>> Can't that cause the 'putlist' to get corrupted when both
>>>>>>>> callers
>>>>>>>> add
>>>>>>>> the same nf->nf_putfile to two separate lists?
>>>>>>>
>>>>>>> IIUC, cl_lock serializes the two RELEASE_LOCKOWNER calls.
>>>>>>>
>>>>>>> The first call finds the lockowner in cl_ownerstr_hashtbl
>>>>>>> and
>>>>>>> unhashes it before releasing cl_lock.
>>>>>>>
>>>>>>> Then the second cannot find that lockowner, thus it can't
>>>>>>> requeue it for bulk_put.
>>>>>>>
>>>>>>> Am I missing something?
>>>>>>
>>>>>> In the example I quoted, there are 2 separate processes
>>>>>> running
>>>>>> on the
>>>>>> client. Those processes could share the same open owner +
>>>>>> open
>>>>>> stateid,
>>>>>> and hence the same struct nfs4_file, since that depends only
>>>>>> on
>>>>>> the
>>>>>> process credentials matching. However they will not normally
>>>>>> share a
>>>>>> lock owner, since POSIX does not expect different processes
>>>>>> to
>>>>>> share
>>>>>> locks.
>>>>>>
>>>>>> IOW: The point is that one can relatively easily create 2
>>>>>> different
>>>>>> lock owners with different lock stateids that share the same
>>>>>> underlying
>>>>>> struct nfs4_file.
>>>>>
>>>>> Is there a similar exposure if two different clients are
>>>>> locking
>>>>> the same file? If so, then we can't use a per-nfs4_client
>>>>> semaphore
>>>>> to serialize access to the nf_putfile field.
>>>>
>>>> I had a thought about an alternate approach.
>>>>
>>>> Create a second nfsd_file_put API that is not allowed to sleep.
>>>> Let's call it "nfsd_file_put_async()". Teach check_for_locks()
>>>> to use that instead of nfsd_file_put().
>>>>
>>>> Here's where I'm a little fuzzy: nfsd_file_put_async() could do
>>>> something like:
>>>>
>>>> void nfsd_file_put_async(struct nfsd_file *nf)
>>>> {
>>>>         if (refcount_dec_and_test(&nf->nf_ref))
>>>>                 nfsd_file_close_inode(nf->nf_inode);
>>>> }
>>>>
>>>>
>>>
>>> That approach moves the sync to the garbage collector, which was
>>> exactly what we're trying to avoid in the first place.
>>
>> Totally understood.
>>
>> My thought was that "put" for RELEASE_LOCKOWNER/FREE_STATEID
>> would be unlikely to have any data to sync -- callers that
>> actually have data to flush are elsewhere, and those would
>> continue to use the synchronous nfsd_file_put() API.
>>
>> Do you have a workload where we can test this assumption?
>>
>>
>>> Why not just do this "check_for_locks()" thing differently?
>>>
>>> It really shouldn't be too hard to add something to
>>> nfsd4_lm_get_owner()/nfsd4_lm_put_owner() that bumps a counter in
>>> the
>>> lockowner in order to tell you whether or not locks are still held
>>> instead of doing this bone headed walk through the list of locks.
>>
>> I thought of that a couple weeks ago. That doesn't work
>> because you can lock or unlock by range. That means the
>> symmetry of LOCK and LOCKU is not guaranteed, and I don't
>> believe these calls are used that way anyway. So I
>> abandoned the idea of using get_owner / put_owner.
>>
>
> Then you're misunderstanding how it works. lm_get_owner() is called
> when a lock is initialised from another one. The whole point is to
> ensure that each and every object representing a range lock on the
> inode's list maintains its own private reference to the knfsd lockowner
> (i.e. the fl->fl_owner).
>
> For instance when a LOCK call calls posix_lock_inode(), then that
> function uses locks_copy_conflock() (which calls lm_get_owner) to
> initialise the range lock object that is being put on the inode list.
> If the new lock causes multiple existing locks to be replaced, they all
> call lm_put_owner to release their references to fl->fl_owner as part
> of the process of being freed.
>
> Conversely, when LOCKU causes a range to get split, the two locks that
> replace the old one are both initialised using locks_copy_conflock(),
> so they both call lm_get_owner. The lock that represents the range
> being replaced is then made to call lm_put_owner() when it is freed.
>
> etc, etc...

That is definitely not what it looked like when I traced it. The
reference count managed by get_owner / put_owner did not seem
usable for tracking whether a lockowner still held locks: the
reference taken by lm_get_owner appeared to be released again
shortly afterwards.

But I'm not interested in an argument. I'll go back and look at
get_owner / put_owner again, because I agree that not having to
traverse the inode's flc_posix list during check_for_locks() would
be awesome sauce.


--
Chuck Lever