Re: [RFC] [PATCH] Pre-emption control for userspace

From: Khalid Aziz
Date: Tue Mar 04 2014 - 12:46:09 EST


Thanks for the review. Please see my comments inline below.

On 03/04/2014 06:56 AM, Oleg Nesterov wrote:
On 03/03, Khalid Aziz wrote:
kernel/sched/preempt_delay.c | 39 ++++++

Why? This can go into proc/ as well.


Sure. No strong reason to keep these functions in separate file. These functions can go into proc/fs/base.c.

+static void
+close_preempt_delay_vmops(struct vm_area_struct *vma)
+{
+ struct preemp_delay_mmap_state *state;
+
+ state = (struct preemp_delay_mmap_state *) vma->vm_private_data;
+ BUG_ON(!state || !state->task);
+
+ state->page->mapping = NULL;
+ /* point delay request flag pointer back to old flag in task_struct */
+ state->task->sched_preempt_delay.delay_req =
+ &state->task->sched_preempt_delay.delay_flag;
+ state->task->sched_preempt_delay.mmap_state = NULL;
+ vfree(state->kaddr);
+ kfree(state);
+ vma->vm_private_data = NULL;
+}

Suppose that state->task != current. Then this can race with do_exit()
which cleanups ->mmap_state too. OTOH do_exit() unmaps this region, it
is not clear why it can't rely in vm_ops->close().

Hmm. In fact I think do_exit() should crash after munmap? ->mmap_state
should be NULL ?? Perhaps I misread this patch completely...

do_exit() unmaps mmap_state->uaddr, and frees up mmap_state->kaddr and mmap_state. mmap_state should not be NULL after unmap. vfree() and kfree() are tolerant of pointers that have already been freed. On the other hand mmap_state can be NULL in do_exit() if do_exit() and close_preempt_delay_vmops() were to race since close_preempt_delay_vmops() sets mmap_state to NULL just before it frees it up. Could they indeed race, because the thread happens to be killed just as it had called munmap()? I can protect against that with a refcount in mmap_state. Do you feel this is necessary/helpful to do?


+static int
+tid_preempt_delay_mmap(struct file *file, struct vm_area_struct *vma)
+{
+ int retval = 0;
+ void *kaddr = NULL;
+ struct preemp_delay_mmap_state *state = NULL;
+ struct inode *inode = file_inode(file);
+ struct task_struct *task;
+ struct page *page;
+
+ /*
+ * Validate args:
+ * - Only offset 0 support for now
+ * - size should be PAGE_SIZE
+ */
+ if (vma->vm_pgoff != 0 || (vma->vm_end - vma->vm_start) != PAGE_SIZE) {
+ retval = -EINVAL;
+ goto error;
+ }
+
+ /*
+ * Only one mmap allowed at a time
+ */
+ if (current->sched_preempt_delay.mmap_state != NULL) {
+ retval = -EEXIST;
+ goto error;

This assumes that we are going to setup current->sched_preempt_delay.mmap_state,
but what if the task opens /proc/random_tid/sched_preempt_delay ?

Good point. A thread should not be allowed to request preemption delay for another thread. I would recommend leaving this code alone and adding following code before this:

if (get_proc_task(inode) != current) {
retval = -EPERM;
goto error;
}

Sounds reasonable?


+ state = kzalloc(sizeof(struct preemp_delay_mmap_state), GFP_KERNEL);
+ kaddr = vmalloc_user(PAGE_SIZE);

Why vmalloc() ? We only need a single page?


Makes sense. I will switch to get_zeroed_page().

+ task = get_proc_task(inode);

And it seems that nobody does put_task_struct(state->task);

Good catch. I had caught the other two instances of get_proc_task() but missed this one.


+ state->page = page;
+ state->kaddr = kaddr;
+ state->uaddr = (void *)vma->vm_start;

This is used by do_exit(). But ->vm_start can be changed by mremap() ?


Hmm. And mremap() can do vm_ops->close() too. But the new vma will
have the same vm_ops/vm_private_data, so exit_mmap() will try to do
this again... Perhaps I missed something, but I bet this all can't be
right.

Would you say sys_munmap() of mmap_state->uaddr is not even needed since exit_mm() will do this any way further down in do_exit()? If I were to remove this sys_munmap(), that could simplify the race issues as well.


+ state->task = task;
+
+ /* Clear the current delay request flag */
+ task->sched_preempt_delay.delay_flag = 0;
+
+ /* Point delay request flag pointer to the newly allocated memory */
+ task->sched_preempt_delay.delay_req = (unsigned char *)kaddr;
+
+ task->sched_preempt_delay.mmap_state = state;
+ vma->vm_private_data = state;
+ vma->vm_ops = &preempt_delay_vmops;
+ vma->vm_flags |= VM_DONTCOPY | VM_DONTEXPAND | VM_SHARED | VM_WRITE;

This probably also needs VM_IO, to protect from madvise(MADV_DOFORK).

Yes, you are right. I will add that.

VM_SHARED/VM_WRITE doesn't look right.

VM_SHARED is wrong but VM_WRITE is needed I think since the thread will write to the mmap'd page to signal to request preemption delay.


Oleg.


I appreciate your taking the time to review this code. Thank you very much.

--
Khalid
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/