Re: [PATCH] uio: Replace mutex info_lock with percpu_ref

From: Greg KH
Date: Tue May 17 2022 - 03:32:51 EST


On Tue, May 10, 2022 at 01:50:31PM +0800, Guixin Liu wrote:
> If the underlying driver works in parallel, the mutex info_lock in uio
> forces the driver to work sequentially and becomes a performance
> bottleneck. Let's replace it with percpu_ref for better performance.
>
> I used tcm_loop and tcmu (backstore is a file, and I did some work to make
> tcmu work in parallel in the uio_write() path) to evaluate performance,
> fio job: fio -filename=/dev/sdb -direct=1 -size=2G -name=1 -thread
> -runtime=60 -time_based -rw=randread -numjobs=16 -iodepth=16 -bs=128k
>
> Without this patch:
> READ: bw=2828MiB/s (2965MB/s), 176MiB/s-177MiB/s (185MB/s-186MB/s),
> io=166GiB (178GB), run=60000-60001msec
>
> With this patch:
> READ: bw=3382MiB/s (3546MB/s), 211MiB/s-212MiB/s (221MB/s-222MB/s),
> io=198GiB (213GB), run=60001-60001msec
>
> Reviewed-by: Xiaoguang Wang <xiaoguang.wang@xxxxxxxxxxxxxxxxx>
> Signed-off-by: Guixin Liu <kanie@xxxxxxxxxxxxxxxxx>

Why is UIO being used for a block device? Why not use a real block
driver instead that can properly handle the locking issues involved
here?



> ---
> drivers/uio/uio.c | 95 ++++++++++++++++++++++++++++++++++------------
> include/linux/uio_driver.h | 5 ++-
> 2 files changed, 75 insertions(+), 25 deletions(-)
>
> diff --git a/drivers/uio/uio.c b/drivers/uio/uio.c
> index 43afbb7..72c16ba 100644
> --- a/drivers/uio/uio.c
> +++ b/drivers/uio/uio.c
> @@ -24,6 +24,8 @@
> #include <linux/kobject.h>
> #include <linux/cdev.h>
> #include <linux/uio_driver.h>
> +#include <linux/completion.h>
> +#include <linux/percpu-refcount.h>
>
> #define UIO_MAX_DEVICES (1U << MINORBITS)
>
> @@ -218,7 +220,9 @@ static ssize_t name_show(struct device *dev,
> struct uio_device *idev = dev_get_drvdata(dev);
> int ret;
>
> - mutex_lock(&idev->info_lock);
> + if (!percpu_ref_tryget_live(&idev->info_ref))
> + return -EINVAL;
> +

You are now just moving the contention to a per-cpu lock, so any
single-cpu load will have the same issue, right? And your example above
is a single-cpu load, so how is this any faster? Is the mutex really
being bounced across all cpus to synchronize such a load, so much so
that moving it to a percpu reference is that much better?
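
For reference, the usual shape of such a mutex-to-percpu_ref conversion
looks roughly like the sketch below. Apart from percpu_ref_tryget_live()
returning -EINVAL in name_show(), which is visible in the quoted hunk,
the field and helper names here are illustrative, not taken from the
patch:

#include <linux/kernel.h>
#include <linux/completion.h>
#include <linux/percpu-refcount.h>
#include <linux/uio_driver.h>

struct uio_device_sketch {
        struct uio_info *info;
        struct percpu_ref info_ref;     /* replaces mutex info_lock */
        struct completion free_done;    /* signalled when the last user drops out */
};

/* Runs once the refcount hits zero after percpu_ref_kill(). */
static void uio_info_ref_release(struct percpu_ref *ref)
{
        struct uio_device_sketch *idev =
                container_of(ref, struct uio_device_sketch, info_ref);

        complete(&idev->free_done);
}

static int uio_info_ref_init(struct uio_device_sketch *idev)
{
        init_completion(&idev->free_done);
        return percpu_ref_init(&idev->info_ref, uio_info_ref_release,
                               0, GFP_KERNEL);
}

/* Hot path: a per-cpu counter increment instead of taking a shared mutex. */
static int uio_access_info(struct uio_device_sketch *idev)
{
        if (!percpu_ref_tryget_live(&idev->info_ref))
                return -EINVAL;         /* device is being torn down */
        /* ... use idev->info ... */
        percpu_ref_put(&idev->info_ref);
        return 0;
}

/* Teardown: make further tryget_live() calls fail, then drain in-flight users. */
static void uio_info_ref_kill(struct uio_device_sketch *idev)
{
        percpu_ref_kill(&idev->info_ref);
        wait_for_completion(&idev->free_done);
        percpu_ref_exit(&idev->info_ref);
}

The fast path only touches a per-cpu counter rather than a shared
cacheline, which is where any win over the mutex would have to come from.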

And as you have now split this into one-lock-per-cpu instead of
one-lock-per-device, you have just broken the case where multiple
threads access the same device at the same time, right?

You have also changed the driver's behaviour: userspace is now forced to
handle the case where the reference cannot be taken, whereas previously
the call would always succeed and simply block until the lock became
available. Which workflows that assumed these code paths always succeed
does that break?
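
To make that behavioural change concrete, here is the difference in
isolation (a sketch; only the tryget_live()/-EINVAL part is visible in
the quoted hunk):

        /* Before: the attribute read sleeps until the mutex is free and,
         * barring a vanished idev->info, succeeds.
         */
        mutex_lock(&idev->info_lock);
        /* ... read idev->info->name ... */
        mutex_unlock(&idev->info_lock);

        /* After: if the ref has already been killed, e.g. while the device
         * is being unregistered, userspace gets -EINVAL instead of waiting.
         */
        if (!percpu_ref_tryget_live(&idev->info_ref))
                return -EINVAL;
        /* ... read idev->info->name ... */
        percpu_ref_put(&idev->info_ref);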

Also the kernel test bot found problems with the patch :(

thanks,

greg k-h