Re: NUMA API for Linux

From: Andi Kleen
Date: Fri Apr 09 2004 - 00:41:45 EST


Matthew Dobson <colpatch@xxxxxxxxxx> writes:
>
> Instead of looking up a page's node number by
> page_zone(p)->zone_pgdat->node_id, you can get the same information much
> more efficiently by doing some bit-twidling on page->flags. Use
> page_nodenum(struct page *) from include/linux/mm.h.

That wasn't there when I wrote the code. I will do that change, thanks.

> So it looks like you *are* sharing policies, at least for VMA's in the
> range of a single mbind() call? This is a good start! ;) Looking
> further ahead, I'm a bit confused. It seems despite *sharing* VMA's
> belonging to a single mbind() call, you *copy* VMA's during dup_mmap(),
> copy_process(), split_vma(), and move_vma(). So in the majority of
> cases you duplicate policies instead of sharing them, but you *do* share
> them in some instances? Why the inconsistency?

Locking. policies get locked by their VM.

(the sharing you're advocating would require adding a new lock to
mempolicy)

>> +/* Change policy for a memory range */
>> +asmlinkage long sys_mbind(unsigned long start, unsigned long len,
>> + unsigned long mode,
>> + unsigned long *nmask, unsigned long maxnode,
>> + unsigned flags)
>
> What would you think about giving sys_mbind() a pid argument, like
> sys_sched_setaffinity()? It would make it much easier for sysadmins to
> use the API if they didn't have to rewrite applications to make these
> calls on their own. There's already a plethora of arguments, so one
> more might be overkill.... Just a thought.

It already has 6 arguments - 7 are not allowed. Also playing with a different
process' VM remotely is just a recipe for disaster IMHO (remember
all the security holes that used to be in /proc/pid/mem)

> Why not condense both the sys_mbind() & sys_set_mempolicy() into a
> single call? The functionality of these calls (and even their code) is
> very similar. The main difference is there is no need for looking up
> VMA's and doing locking in the sys_set_mempolicy() case. You overload
> the output side (sys_get_mempolicy()) to handle both whole process and
> memory range options, but you don't do the same on the input
> (sys_mbind() and sys_set_mempolicy()). Saving one syscall and having
> them behave more symmetrically would be a nice addition...

I think it's cleaner to have them separated.

>> +/* Retrieve NUMA policy */
>> +asmlinkage long sys_get_mempolicy(int *policy,
>> + unsigned long *nmask, unsigned long maxnode,
>> + unsigned long addr, unsigned long flags)
>
> I had a thought... Shouldn't all your user pointers be marked as such
> with __user? Ie:

I figure that the people who use the tools who need such ugly annotations
will add them themselves.

>
>> +{
>> + int err, pval;
>> + struct mm_struct *mm = current->mm;
>> + struct vm_area_struct *vma = NULL;
>> + struct mempolicy *pol = current->mempolicy;
>> +
>> + if (flags & ~(unsigned long)(MPOL_F_NODE|MPOL_F_ADDR))
>> + return -EINVAL;
>> + if (nmask != NULL && maxnode < numnodes)
>> + return -EINVAL;
>
> Did you mean: if (nmask == NULL || maxnode < numnodes) ?

No, the original test is correct.

> I think that BUG_ON should be BUG_ON(nid < 0), since it *shouldn't* be
> possible to get through that loop with nid >= MAX_NUMNODES. The only
> line that could set nid >= MAX_NUMNODES is nid = find_next_bit(...); and
> the very next line ensures that if nid is in fact >= MAX_NUMNODES it
> will be set to -1. It actually looks pretty redundant altogether,
> but...

Yes, it could be removed.

-Andi

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/