Re: [PATCH v3 0/4] Implement FUTEX_WAIT_MULTIPLE operation

From: André Almeida
Date: Thu Mar 05 2020 - 09:50:48 EST


On 2/29/20 7:51 AM, Hillf Danton wrote:
>
> On Thu, 13 Feb 2020 18:45:21 -0300 Andre Almeida wrote:
>>
>> When a write or a read operation on an eventfd file succeeds, it will try
>> to wake up all threads that are waiting to perform some operation on
>> the file. The lock (ctx->wqh.lock) that guards access to the file value
>> (ctx->count) is the same lock used to control access to the waitqueue. When
>> all those threads wake, they will compete to get this lock. Along with
>> that, poll() also manipulates the waitqueue and needs to hold this same
>> lock. This lock is especially hard to acquire when poll() calls
>> poll_freewait(), where it tries to free all waitqueues associated with
>> the poll. While doing that, it will compete with the many read and
>> write operations that have just been woken.
>
> Want to know if an rwsem is likely to help mitigate the tension between
> the readers and writers in your workloads.
>

Thanks for the suggestion, Hillf. However, keep in mind that the lock
causing the tension (ctx->wqh.lock) is shared between the eventfd
read/write paths and the poll() syscall, so a solution to mitigate the
tension would need to cover both code paths. And since wqh.lock isn't a
read/write lock, even if a lot of readers manage to take the rwsem in
parallel, they will still compete for wqh.lock.
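
To illustrate, here is a hypothetical userspace reproducer of that
scenario (the thread count and the 1ms poll timeout are made up, not
taken from our actual workload; build with -lpthread). Every successful
write() wakes all blocked readers, and all of them, plus poll_freewait()
on the poll() side, then fight for ctx->wqh.lock:

/* Hypothetical reproducer: N readers block on one eventfd, one thread
 * writes in a loop, one thread poll()s the same fd. */
#include <poll.h>
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/eventfd.h>
#include <unistd.h>

#define NREADERS 64

static int efd;

static void *reader(void *arg)
{
	uint64_t val;

	for (;;) {
		/* Blocks until ctx->count is nonzero; after each wakeup,
		 * all woken readers race for wqh.lock and only one of
		 * them consumes the counter, the rest go back to sleep. */
		if (read(efd, &val, sizeof(val)) != sizeof(val))
			break;
	}
	return NULL;
}

static void *poller(void *arg)
{
	struct pollfd pfd = { .fd = efd, .events = POLLIN };

	/* Repeatedly adds and removes a waitqueue entry; poll_freewait()
	 * takes the same wqh.lock the readers and the writer are using. */
	for (;;)
		poll(&pfd, 1, 1);
	return NULL;
}

int main(void)
{
	pthread_t tid[NREADERS + 1];
	uint64_t one = 1;
	int i;

	efd = eventfd(0, 0);
	if (efd < 0) {
		perror("eventfd");
		return 1;
	}

	for (i = 0; i < NREADERS; i++)
		pthread_create(&tid[i], NULL, reader, NULL);
	pthread_create(&tid[NREADERS], NULL, poller, NULL);

	for (;;) {
		/* Each successful write wakes all blocked readers at once. */
		if (write(efd, &one, sizeof(one)) != sizeof(one))
			break;
	}
	return 0;
}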

> --- a/fs/eventfd.c
> +++ b/fs/eventfd.c
> @@ -40,6 +40,7 @@ struct eventfd_ctx {
>  	__u64 count;
>  	unsigned int flags;
>  	int id;
> +	struct rw_semaphore rwsem;
>  };
> 
>  /**
> @@ -212,6 +213,8 @@ static ssize_t eventfd_read(struct file
>  	if (count < sizeof(ucnt))
>  		return -EINVAL;
> 
> +	/* take sem in write mode for event read */
> +	down_write(&ctx->rwsem);
>  	spin_lock_irq(&ctx->wqh.lock);
>  	res = -EAGAIN;
>  	if (ctx->count > 0)
> @@ -229,7 +232,9 @@ static ssize_t eventfd_read(struct file
>  				break;
>  			}
>  			spin_unlock_irq(&ctx->wqh.lock);
> +			up_write(&ctx->rwsem);
>  			schedule();
> +			down_write(&ctx->rwsem);
>  			spin_lock_irq(&ctx->wqh.lock);
>  		}
>  		__remove_wait_queue(&ctx->wqh, &wait);
> @@ -241,6 +246,7 @@ static ssize_t eventfd_read(struct file
>  			wake_up_locked_poll(&ctx->wqh, EPOLLOUT);
>  	}
>  	spin_unlock_irq(&ctx->wqh.lock);
> +	up_write(&ctx->rwsem);
> 
>  	if (res > 0 && put_user(ucnt, (__u64 __user *)buf))
>  		return -EFAULT;
> @@ -262,6 +268,8 @@ static ssize_t eventfd_write(struct file
>  		return -EFAULT;
>  	if (ucnt == ULLONG_MAX)
>  		return -EINVAL;
> +	/* take sem in read mode for event write */
> +	down_read(&ctx->rwsem);
>  	spin_lock_irq(&ctx->wqh.lock);
>  	res = -EAGAIN;
>  	if (ULLONG_MAX - ctx->count > ucnt)
> @@ -279,7 +287,9 @@ static ssize_t eventfd_write(struct file
>  				break;
>  			}
>  			spin_unlock_irq(&ctx->wqh.lock);
> +			up_read(&ctx->rwsem);
>  			schedule();
> +			down_read(&ctx->rwsem);
>  			spin_lock_irq(&ctx->wqh.lock);
>  		}
>  		__remove_wait_queue(&ctx->wqh, &wait);
> @@ -291,6 +301,7 @@ static ssize_t eventfd_write(struct file
>  			wake_up_locked_poll(&ctx->wqh, EPOLLIN);
>  	}
>  	spin_unlock_irq(&ctx->wqh.lock);
> +	up_read(&ctx->rwsem);
> 
>  	return res;
>  }
> @@ -408,6 +419,7 @@ static int do_eventfd(unsigned int count
>  	init_waitqueue_head(&ctx->wqh);
>  	ctx->count = count;
>  	ctx->flags = flags;
> +	init_rwsem(&ctx->rwsem);
>  	ctx->id = ida_simple_get(&eventfd_ida, 0, 0, GFP_KERNEL);
> 
>  	fd = anon_inode_getfd("[eventfd]", &eventfd_fops, ctx,