Re: [PATCH] [PATCH -mmotm] cpuset,mm: use seqlock to protect task->mempolicy and mems_allowed (v2) (was: Re: [PATCH V2 4/4] cpuset,mm: update task's mems_allowed lazily)

From: Bob Liu
Date: Sun Mar 28 2010 - 01:30:23 EST


On Thu, Mar 25, 2010 at 9:33 PM, Miao Xie <miaox@xxxxxxxxxxxxxx> wrote:
> on 2010-3-11 19:03, Nick Piggin wrote:
>>> Ok, I try to make a new patch by using seqlock.
>>
>> Well... I do think seqlocks would be a bit simpler because they don't
>> require this checking and synchronizing of this patch.
> Hi, Nick Piggin
>
> I have made a new patch which uses seqlock to protect mems_allowed and mempolicy.
> please review it.
>
> Subject: [PATCH] [PATCH -mmotm] cpuset,mm: use seqlock to protect task->mempolicy and mems_allowed (v2)
>
> Before applying this patch, cpuset updates task->mems_allowed by setting all
> new bits in the nodemask first, and clearing all old disallowed bits later.
> But even then, the allocator can see an empty nodemask, though only infrequently.
>
> The problem is the following:
> when the size of nodemask_t is greater than the size of a long integer, loads
> and stores of nodemask_t are not atomic operations. Suppose task->mems_allowed
> does not intersect with new_mask, e.g. the first word of the old mask is empty
> and only the first word of new_mask is non-empty. If the allocator
> loads one word of the mask before
>
>         current->mems_allowed |= new_mask;
>
> and then loads another word of the mask after
>
>         current->mems_allowed = new_mask;
>
> the allocator gets an empty nodemask.
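
To make the interleaving concrete, here is a minimal stand-alone user-space
sketch of the torn read on a two-word mask. The plain array stands in for a
multi-word nodemask_t, and the writer's two steps are placed exactly where they
would race with the reader; this is an illustration only, not kernel code.

        #include <stdio.h>

        int main(void)
        {
                /* old mems_allowed: a node in the second word only */
                unsigned long mask[2]    = { 0x0UL, 0x1UL };
                /* new mems_allowed: a node in the first word only */
                unsigned long newmask[2] = { 0x1UL, 0x0UL };
                unsigned long seen[2];

                /* allocator: loads word 0 of the old mask ... */
                seen[0] = mask[0];                      /* 0 */

                /* cpuset: current->mems_allowed |= new_mask; */
                mask[0] |= newmask[0];
                mask[1] |= newmask[1];
                /* cpuset: current->mems_allowed = new_mask; */
                mask[0] = newmask[0];
                mask[1] = newmask[1];

                /* allocator: ... and loads word 1 of the new mask */
                seen[1] = mask[1];                      /* 0 */

                printf("allocator saw { %#lx, %#lx }: an empty mask\n",
                       seen[0], seen[1]);
                return 0;
        }
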
>
> Besides that, even if the size of nodemask_t is not greater than the size of a
> long integer, there is another problem. When the kernel allocator invokes the
> following function,
>
>         struct zoneref *next_zones_zonelist(struct zoneref *z,
>                                         enum zone_type highest_zoneidx,
>                                         nodemask_t *nodes,
>                                         struct zone **zone)
>         {
>                 /*
>                  * Find the next suitable zone to use for the allocation.
>                  * Only filter based on nodemask if it's set
>                  */
>                 if (likely(nodes == NULL))
>                         ......
>                 else
>                         while (zonelist_zone_idx(z) > highest_zoneidx ||
>                                (z->zone && !zref_in_nodemask(z, nodes)))
>                                 z++;
>
>                 *zone = zonelist_zone(z);
>                 return z;
>         }
>
> and we change the nodemask between two calls of zref_in_nodemask(), such as
>
>         Task1                                           Task2
>         zref_in_nodemask(z = node0's z, nodes = 1-2)
>         zref_in_nodemask() returns 0
>                                                         nodes = 0
>         zref_in_nodemask(z = node1's z, nodes = 0)
>         zref_in_nodemask() returns 0
>
> then z will overflow (walk past the end of the zonelist).
>
> The same problem exists when the kernel allocator accesses task->mempolicy.
>
> The following method is used to fix these two problems:
> a seqlock is used to protect the task's mempolicy and mems_allowed on configs
> where MAX_NUMNODES > BITS_PER_LONG. When the kernel allocator accesses the
> nodemask, it takes a copy of the nodemask under the seqlock read side and then
> passes that copy to the memory-allocating function.
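
For reference while reading the hunks below: the read side in the allocator
fastpaths reduces to a retry loop of the following shape. This is only a sketch;
snapshot_mems_allowed() is a made-up name, while the mems_fastpath_* helpers are
the ones this patch introduces.

        static nodemask_t snapshot_mems_allowed(struct task_struct *p)
        {
                nodemask_t copy;
                unsigned long flags, seq;

                /* retry until no writer updated p->mems_allowed under us */
                do {
                        seq = mems_fastpath_lock_irqsave(p, flags);
                        copy = p->mems_allowed;
                } while (mems_fastpath_unlock_irqrestore(p, seq, flags));

                return copy;
        }

The writers (cpuset_change_task_nodemask() and the mempolicy update paths) take
the mems_slowpath_* side, so a reader either sees the old mask or the new one,
never a half-updated mask.
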
>
> Signed-off-by: Miao Xie <miaox@xxxxxxxxxxxxxx>
> ---
>  include/linux/cpuset.h    |   79 +++++++++++++++++++++++-
>  include/linux/init_task.h |    8 +++
>  include/linux/sched.h     |   17 ++++-
>  kernel/cpuset.c           |   97 +++++++++++++++++++++-------
>  kernel/exit.c             |    4 +
>  kernel/fork.c             |    4 +
>  mm/hugetlb.c              |   22 ++++++-
>  mm/mempolicy.c            |  144 ++++++++++++++++++++++++++++++++++-----------
>  mm/slab.c                 |   26 +++++++-
>  mm/slub.c                 |   12 ++++-
>  10 files changed, 341 insertions(+), 72 deletions(-)
>
> diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
> index a5740fc..e307f89 100644
> --- a/include/linux/cpuset.h
> +++ b/include/linux/cpuset.h
> @@ -53,8 +53,8 @@ static inline int cpuset_zone_allowed_hardwall(struct zone *z, gfp_t gfp_mask)
>         return cpuset_node_allowed_hardwall(zone_to_nid(z), gfp_mask);
>  }
>
> -extern int cpuset_mems_allowed_intersects(const struct task_struct *tsk1,
> -                                          const struct task_struct *tsk2);
> +extern int cpuset_mems_allowed_intersects(struct task_struct *tsk1,
> +                                          struct task_struct *tsk2);
>
>  #define cpuset_memory_pressure_bump()                          \
>         do {                                                    \
> @@ -90,9 +90,68 @@ extern void rebuild_sched_domains(void);
>
>  extern void cpuset_print_task_mems_allowed(struct task_struct *p);
>
> +# if MAX_NUMNODES > BITS_PER_LONG
> +/*
> + * Used to protect task->mempolicy and mems_allowed when reading them for
> + * page allocation.
> + *
> + * We don't care if the kernel page allocator allocates a page on a node in
> + * the old mems_allowed; that isn't a big deal, especially since it was
> + * previously allowed.
> + *
> + * We just worry about whether the kernel page allocator gets an empty
> + * mems_allowed or not. But:
> + *   if MAX_NUMNODES <= BITS_PER_LONG, loading/storing of task->mems_allowed
> + *   are atomic operations, so we needn't do anything to protect the loading
> + *   of task->mems_allowed in the fastpaths.
> + *
> + *   if MAX_NUMNODES > BITS_PER_LONG, loading/storing of task->mems_allowed
> + *   are not atomic operations, so we use a seqlock to protect the loading
> + *   of task->mems_allowed in the fastpaths.
> + */
> +#define mems_fastpath_lock_irqsave(p, flags)                            \
> +       ({                                                              \
> +               read_seqbegin_irqsave(&p->mems_seqlock, flags);         \
> +       })
> +
> +#define mems_fastpath_unlock_irqrestore(p, seq, flags)                  \
> +       ({                                                              \
> +               read_seqretry_irqrestore(&p->mems_seqlock, seq, flags); \
> +       })
> +
> +#define mems_slowpath_lock_irqsave(p, flags)                            \
> +       do {                                                            \
> +               write_seqlock_irqsave(&p->mems_seqlock, flags);         \
> +       } while (0)
> +
> +#define mems_slowpath_unlock_irqrestore(p, flags)                       \
> +       do {                                                            \
> +               write_sequnlock_irqrestore(&p->mems_seqlock, flags);    \
> +       } while (0)
> +# else
> +#define mems_fastpath_lock_irqsave(p, flags)           ({ (void)(flags); 0; })
> +
> +#define mems_fastpath_unlock_irqrestore(p, seq, flags) ({ (void)(flags); 0; })
> +
> +#define mems_slowpath_lock_irqsave(p, flags)                    \
> +       do {                                                    \
> +               task_lock(p);                                   \
> +               (void)(flags);                                  \
> +       } while (0)
> +
> +#define mems_slowpath_unlock_irqrestore(p, flags)               \
> +       do {                                                    \
> +               task_unlock(p);                                 \
> +               (void)(flags);                                  \
> +       } while (0)
> +# endif
> +
>  static inline void set_mems_allowed(nodemask_t nodemask)
>  {
> +       unsigned long flags;
> +       mems_slowpath_lock_irqsave(current, flags);
>         current->mems_allowed = nodemask;
> +       mems_slowpath_unlock_irqrestore(current, flags);
>  }
>
>  #else /* !CONFIG_CPUSETS */
> @@ -144,8 +203,8 @@ static inline int cpuset_zone_allowed_hardwall(struct zone *z, gfp_t gfp_mask)
>         return 1;
>  }
>
> -static inline int cpuset_mems_allowed_intersects(const struct task_struct *tsk1,
> -                                                 const struct task_struct *tsk2)
> +static inline int cpuset_mems_allowed_intersects(struct task_struct *tsk1,
> +                                                 struct task_struct *tsk2)
>  {
>         return 1;
>  }
> @@ -193,6 +252,18 @@ static inline void set_mems_allowed(nodemask_t nodemask)
>  {
>  }
>
> +#define mems_fastpath_lock_irqsave(p, flags)                            \
> +       ({ (void)(flags); 0; })
> +
> +#define mems_fastpath_unlock_irqrestore(p, seq, flags)                  \
> +       ({ (void)(flags); 0; })
> +
> +#define mems_slowpath_lock_irqsave(p, flags)                            \
> +       do { (void)(flags); } while (0)
> +
> +#define mems_slowpath_unlock_irqrestore(p, flags)                       \
> +       do { (void)(flags); } while (0)
> +
>  #endif /* !CONFIG_CPUSETS */
>
>  #endif /* _LINUX_CPUSET_H */
> diff --git a/include/linux/init_task.h b/include/linux/init_task.h
> index 1ed6797..0394e20 100644
> --- a/include/linux/init_task.h
> +++ b/include/linux/init_task.h
> @@ -102,6 +102,13 @@ extern struct cred init_cred;
> Â# define INIT_PERF_EVENTS(tsk)
> Â#endif
>
> +#if defined(CONFIG_CPUSETS) && MAX_NUMNODES > BITS_PER_LONG
> +# define INIT_MEM_SEQLOCK(tsk)                                          \
> +       .mems_seqlock   = __SEQLOCK_UNLOCKED(tsk.mems_seqlock),
> +#else
> +# define INIT_MEM_SEQLOCK(tsk)
> +#endif
> +
>  /*
>   *  INIT_TASK is used to set up the first task table, touch at
>   * your own risk!. Base=0, limit=0x1fffff (=2MB)
> @@ -171,6 +178,7 @@ extern struct cred init_cred;
>         INIT_FTRACE_GRAPH                                       \
>         INIT_TRACE_RECURSION                                    \
>         INIT_TASK_RCU_PREEMPT(tsk)                              \
> +       INIT_MEM_SEQLOCK(tsk)                                   \
>  }
>
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 84b8c22..1cf5fd3 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1356,8 +1356,9 @@ struct task_struct {
>  /* Thread group tracking */
>         u32 parent_exec_id;
>         u32 self_exec_id;
> -/* Protection of (de-)allocation: mm, files, fs, tty, keyrings, mems_allowed,
> - * mempolicy */
> +/* Protection of (de-)allocation: mm, files, fs, tty, keyrings.
> + * If MAX_NUMNODES <= BITS_PER_LONG, it also protects mems_allowed and mempolicy;
> + * otherwise a separate seqlock, mems_seqlock, protects them. */
>         spinlock_t alloc_lock;
>
> Â#ifdef CONFIG_GENERIC_HARDIRQS
> @@ -1425,7 +1426,13 @@ struct task_struct {
>         cputime_t acct_timexpd; /* stime + utime since last update */
>  #endif
>  #ifdef CONFIG_CPUSETS
> -       nodemask_t mems_allowed;        /* Protected by alloc_lock */
> +# if MAX_NUMNODES > BITS_PER_LONG
> +       /* Protection of mems_allowed and mempolicy */
> +       seqlock_t mems_seqlock;
> +# endif
> +       /* If MAX_NUMNODES <= BITS_PER_LONG, protected by alloc_lock;
> +        * otherwise protected by mems_seqlock */
> +       nodemask_t mems_allowed;
>         int cpuset_mem_spread_rotor;
>  #endif
>  #ifdef CONFIG_CGROUPS
> @@ -1448,7 +1455,9 @@ struct task_struct {
>         struct list_head perf_event_list;
>  #endif
>  #ifdef CONFIG_NUMA
> -       struct mempolicy *mempolicy;    /* Protected by alloc_lock */
> +       /* If MAX_NUMNODES <= BITS_PER_LONG, protected by alloc_lock;
> +        * otherwise protected by mems_seqlock */
> +       struct mempolicy *mempolicy;
>         short il_next;
>  #endif
>         atomic_t fs_excl;       /* holding fs exclusive resources */
> diff --git a/kernel/cpuset.c b/kernel/cpuset.c
> index d109467..52e6f51 100644
> --- a/kernel/cpuset.c
> +++ b/kernel/cpuset.c
> @@ -198,12 +198,13 @@ static struct cpuset top_cpuset = {
>  * from one of the callbacks into the cpuset code from within
>  * __alloc_pages().
>  *
> - * If a task is only holding callback_mutex, then it has read-only
> - * access to cpusets.
> + * If a task is only holding callback_mutex or cgroup_mutex, then it has
> + * read-only access to cpusets.
>  *
>  * Now, the task_struct fields mems_allowed and mempolicy may be changed
> - * by other task, we use alloc_lock in the task_struct fields to protect
> - * them.
> + * by another task; we use alloc_lock (if MAX_NUMNODES <= BITS_PER_LONG) or
> + * mems_seqlock (if MAX_NUMNODES > BITS_PER_LONG) in the task_struct fields
> + * to protect them.
>  *
>  * The cpuset_common_file_read() handlers only hold callback_mutex across
>  * small pieces of code, such as when reading out possibly multi-word
> @@ -920,6 +921,10 @@ static int update_cpumask(struct cpuset *cs, struct cpuset *trialcs,
>  *    call to guarantee_online_mems(), as we know no one is changing
>  *    our task's cpuset.
>  *
> + *    As the above comment says, no one can change the current task's
> + *    mems_allowed except the task itself, so we needn't hold a lock to
> + *    protect the task's mems_allowed during this call.
> + *
>  *    While the mm_struct we are migrating is typically from some
>  *    other task, the task_struct mems_allowed that we are hacking
>  *    is for our current task, which must allocate new pages for that
> @@ -947,15 +952,13 @@ static void cpuset_migrate_mm(struct mm_struct *mm, const nodemask_t *from,
>  * we structure updates as setting all new allowed nodes, then clearing newly
>  * disallowed ones.
>  *
> - * Called with task's alloc_lock held
> + * Called with mems_slowpath_lock held
>  */
>  static void cpuset_change_task_nodemask(struct task_struct *tsk,
>                                         nodemask_t *newmems)
>  {
> -       nodes_or(tsk->mems_allowed, tsk->mems_allowed, *newmems);
> -       mpol_rebind_task(tsk, &tsk->mems_allowed);
> -       mpol_rebind_task(tsk, newmems);
>         tsk->mems_allowed = *newmems;
> +       mpol_rebind_task(tsk, newmems);
>  }
>
> Â/*
> @@ -970,6 +973,7 @@ static void cpuset_change_nodemask(struct task_struct *p,
> Â Â Â Âstruct cpuset *cs;
> Â Â Â Âint migrate;
> Â Â Â Âconst nodemask_t *oldmem = scan->data;
> + Â Â Â unsigned long flags;
> Â Â Â ÂNODEMASK_ALLOC(nodemask_t, newmems, GFP_KERNEL);
>
> Â Â Â Âif (!newmems)
> @@ -978,9 +982,9 @@ static void cpuset_change_nodemask(struct task_struct *p,
> Â Â Â Âcs = cgroup_cs(scan->cg);
> Â Â Â Âguarantee_online_mems(cs, newmems);
>
> - Â Â Â task_lock(p);
> + Â Â Â mems_slowpath_lock_irqsave(p, flags);
> Â Â Â Âcpuset_change_task_nodemask(p, newmems);
> - Â Â Â task_unlock(p);
> + Â Â Â mems_slowpath_unlock_irqrestore(p, flags);
>
> Â Â Â ÂNODEMASK_FREE(newmems);
>
> @@ -1375,6 +1379,7 @@ static int cpuset_can_attach(struct cgroup_subsys *ss, struct cgroup *cont,
> Âstatic void cpuset_attach_task(struct task_struct *tsk, nodemask_t *to,
> Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â struct cpuset *cs)
> Â{
> + Â Â Â unsigned long flags;
> Â Â Â Âint err;
> Â Â Â Â/*
> Â Â Â Â * can_attach beforehand should guarantee that this doesn't fail.
> @@ -1383,9 +1388,10 @@ static void cpuset_attach_task(struct task_struct *tsk, nodemask_t *to,
> Â Â Â Âerr = set_cpus_allowed_ptr(tsk, cpus_attach);
> Â Â Â ÂWARN_ON_ONCE(err);
>
> - Â Â Â task_lock(tsk);
> + Â Â Â mems_slowpath_lock_irqsave(tsk, flags);
> Â Â Â Âcpuset_change_task_nodemask(tsk, to);
> - Â Â Â task_unlock(tsk);
> + Â Â Â mems_slowpath_unlock_irqrestore(tsk, flags);
> +
> Â Â Â Âcpuset_update_task_spread_flag(cs, tsk);
>
> Â}
> @@ -2233,7 +2239,15 @@ nodemask_t cpuset_mems_allowed(struct task_struct *tsk)
> Â*/
> Âint cpuset_nodemask_valid_mems_allowed(nodemask_t *nodemask)
> Â{
> - Â Â Â return nodes_intersects(*nodemask, current->mems_allowed);
> + Â Â Â unsigned long flags, seq;
> + Â Â Â int retval;
> +
> + Â Â Â do {
> + Â Â Â Â Â Â Â seq = mems_fastpath_lock_irqsave(current, flags);
> + Â Â Â Â Â Â Â retval = nodes_intersects(*nodemask, current->mems_allowed);
> + Â Â Â } while (mems_fastpath_unlock_irqrestore(current, seq, flags));
> +
> + Â Â Â return retval;
> Â}
>
> Â/*
> @@ -2314,11 +2328,18 @@ int __cpuset_node_allowed_softwall(int node, gfp_t gfp_mask)
> Â{
> Â Â Â Âconst struct cpuset *cs; Â Â Â Â/* current cpuset ancestors */
> Â Â Â Âint allowed; Â Â Â Â Â Â Â Â Â Â/* is allocation in zone z allowed? */
> + Â Â Â unsigned long flags, seq;
>
> Â Â Â Âif (in_interrupt() || (gfp_mask & __GFP_THISNODE))
> Â Â Â Â Â Â Â Âreturn 1;
> Â Â Â Âmight_sleep_if(!(gfp_mask & __GFP_HARDWALL));
> - Â Â Â if (node_isset(node, current->mems_allowed))
> +
> + Â Â Â do {
> + Â Â Â Â Â Â Â seq = mems_fastpath_lock_irqsave(current, flags);
> + Â Â Â Â Â Â Â allowed = node_isset(node, current->mems_allowed);
> + Â Â Â } while (mems_fastpath_unlock_irqrestore(current, seq, flags));
> +
> + Â Â Â if (allowed)
> Â Â Â Â Â Â Â Âreturn 1;
> Â Â Â Â/*
> Â Â Â Â * Allow tasks that have access to memory reserves because they have
> @@ -2369,9 +2390,18 @@ int __cpuset_node_allowed_softwall(int node, gfp_t gfp_mask)
> Â*/
> Âint __cpuset_node_allowed_hardwall(int node, gfp_t gfp_mask)
> Â{
> + Â Â Â int allowed;
> + Â Â Â unsigned long flags, seq;
> +
> Â Â Â Âif (in_interrupt() || (gfp_mask & __GFP_THISNODE))
> Â Â Â Â Â Â Â Âreturn 1;
> - Â Â Â if (node_isset(node, current->mems_allowed))
> +
> + Â Â Â do {
> + Â Â Â Â Â Â Â seq = mems_fastpath_lock_irqsave(current, flags);
> + Â Â Â Â Â Â Â allowed = node_isset(node, current->mems_allowed);
> + Â Â Â } while (mems_fastpath_unlock_irqrestore(current, seq, flags));
> +
> + Â Â Â if (allowed)
> Â Â Â Â Â Â Â Âreturn 1;
> Â Â Â Â/*
> Â Â Â Â * Allow tasks that have access to memory reserves because they have
> @@ -2438,11 +2468,16 @@ void cpuset_unlock(void)
> Âint cpuset_mem_spread_node(void)
> Â{
> Â Â Â Âint node;
> -
> - Â Â Â node = next_node(current->cpuset_mem_spread_rotor, current->mems_allowed);
> - Â Â Â if (node == MAX_NUMNODES)
> - Â Â Â Â Â Â Â node = first_node(current->mems_allowed);
> - Â Â Â current->cpuset_mem_spread_rotor = node;
> + Â Â Â unsigned long flags, seq;
> +
> + Â Â Â do {
> + Â Â Â Â Â Â Â seq = mems_fastpath_lock_irqsave(current, flags);
> + Â Â Â Â Â Â Â node = next_node(current->cpuset_mem_spread_rotor,
> + Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â current->mems_allowed);
> + Â Â Â Â Â Â Â if (node == MAX_NUMNODES)
> + Â Â Â Â Â Â Â Â Â Â Â node = first_node(current->mems_allowed);
> + Â Â Â Â Â Â Â current->cpuset_mem_spread_rotor = node;
> + Â Â Â } while (mems_fastpath_unlock_irqrestore(current, seq, flags));
> Â Â Â Âreturn node;
> Â}
> ÂEXPORT_SYMBOL_GPL(cpuset_mem_spread_node);
> @@ -2458,10 +2493,26 @@ EXPORT_SYMBOL_GPL(cpuset_mem_spread_node);
> Â* to the other.
> Â**/
>
> -int cpuset_mems_allowed_intersects(const struct task_struct *tsk1,
> - Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Âconst struct task_struct *tsk2)
> +int cpuset_mems_allowed_intersects(struct task_struct *tsk1,
> + Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Âstruct task_struct *tsk2)
> Â{
> - Â Â Â return nodes_intersects(tsk1->mems_allowed, tsk2->mems_allowed);
> + Â Â Â unsigned long flags1, flags2;
> + Â Â Â int retval;
> + Â Â Â struct task_struct *tsk;
> +
> + Â Â Â if (tsk1 > tsk2) {
> + Â Â Â Â Â Â Â tsk = tsk1;
> + Â Â Â Â Â Â Â tsk1 = tsk2;
> + Â Â Â Â Â Â Â tsk2 = tsk;
> + Â Â Â }
> +
> + Â Â Â mems_slowpath_lock_irqsave(tsk1, flags1);
> + Â Â Â mems_slowpath_lock_irqsave(tsk2, flags2);
> + Â Â Â retval = nodes_intersects(tsk1->mems_allowed, tsk2->mems_allowed);
> + Â Â Â mems_slowpath_unlock_irqrestore(tsk2, flags2);
> + Â Â Â mems_slowpath_unlock_irqrestore(tsk1, flags1);
> +
> + Â Â Â return retval;
> Â}
>
> Â/**
> diff --git a/kernel/exit.c b/kernel/exit.c
> index 7b012a0..cbf045d 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -16,6 +16,7 @@
> Â#include <linux/key.h>
> Â#include <linux/security.h>
> Â#include <linux/cpu.h>
> +#include <linux/cpuset.h>
> Â#include <linux/acct.h>
> Â#include <linux/tsacct_kern.h>
> Â#include <linux/file.h>
> @@ -649,6 +650,7 @@ static void exit_mm(struct task_struct * tsk)
> Â{
> Â Â Â Âstruct mm_struct *mm = tsk->mm;
> Â Â Â Âstruct core_state *core_state;
> + Â Â Â unsigned long flags;
>
> Â Â Â Âmm_release(tsk, mm);
> Â Â Â Âif (!mm)
> @@ -694,8 +696,10 @@ static void exit_mm(struct task_struct * tsk)
> Â Â Â Â/* We don't want this task to be frozen prematurely */
> Â Â Â Âclear_freeze_flag(tsk);
> Â#ifdef CONFIG_NUMA
> + Â Â Â mems_slowpath_lock_irqsave(tsk, flags);
> Â Â Â Âmpol_put(tsk->mempolicy);
> Â Â Â Âtsk->mempolicy = NULL;
> + Â Â Â mems_slowpath_unlock_irqrestore(tsk, flags);
> Â#endif
> Â Â Â Âtask_unlock(tsk);
> Â Â Â Âmm_update_next_owner(mm);
> diff --git a/kernel/fork.c b/kernel/fork.c
> index fe73f8d..591346a 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -32,6 +32,7 @@
> Â#include <linux/capability.h>
> Â#include <linux/cpu.h>
> Â#include <linux/cgroup.h>
> +#include <linux/cpuset.h>
> Â#include <linux/security.h>
> Â#include <linux/hugetlb.h>
> Â#include <linux/swap.h>
> @@ -1075,6 +1076,9 @@ static struct task_struct *copy_process(unsigned long clone_flags,
> Â Â Â Âp->io_context = NULL;
> Â Â Â Âp->audit_context = NULL;
> Â Â Â Âcgroup_fork(p);
> +#if defined(CONFIG_CPUSETS) && MAX_NUMNODES > BITS_PER_LONG
> + Â Â Â seqlock_init(&p->mems_seqlock);
> +#endif
> Â#ifdef CONFIG_NUMA
> Â Â Â Âp->mempolicy = mpol_dup(p->mempolicy);
> Â Â Â Âif (IS_ERR(p->mempolicy)) {
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 3a5aeb3..b40cc52 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -465,6 +465,8 @@ static struct page *dequeue_huge_page_vma(struct hstate *h,
> Â Â Â Âstruct page *page = NULL;
> Â Â Â Âstruct mempolicy *mpol;
> Â Â Â Ânodemask_t *nodemask;
> + Â Â Â nodemask_t tmp_mask;
> + Â Â Â unsigned long seq, irqflag;
> Â Â Â Âstruct zonelist *zonelist = huge_zonelist(vma, address,
> Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Âhtlb_alloc_mask, &mpol, &nodemask);
> Â Â Â Âstruct zone *zone;
> @@ -483,6 +485,15 @@ static struct page *dequeue_huge_page_vma(struct hstate *h,
> Â Â Â Âif (avoid_reserve && h->free_huge_pages - h->resv_huge_pages == 0)
> Â Â Â Â Â Â Â Âreturn NULL;
>
> +       if (mpol == current->mempolicy && nodemask) {
> +               do {
> +                       seq = mems_fastpath_lock_irqsave(current, irqflag);
> +                       tmp_mask = *nodemask;
> +               } while (mems_fastpath_unlock_irqrestore(current,
> +                                                        seq, irqflag));
> +               nodemask = &tmp_mask;
> +       }
> +

Maybe you can wrap these in a macro or an inline function; I saw
several similar places :-)
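
Something along these lines, for example. Just a sketch: the name
mpol_copy_current() is made up, and it uses only the mems_fastpath_* macros
added by this patch.

        /*
         * Snapshot current->mempolicy under the fastpath seqlock.
         * The caller must already have checked current->mempolicy != NULL,
         * as the call sites in this patch do.
         */
        static inline void mpol_copy_current(struct mempolicy *copy)
        {
                unsigned long flags, seq;

                do {
                        seq = mems_fastpath_lock_irqsave(current, flags);
                        *copy = *current->mempolicy;
                } while (mems_fastpath_unlock_irqrestore(current, seq, flags));
        }

Then e.g. alternate_node_alloc() would reduce to:

        struct mempolicy mpol;
        ...
        else if (current->mempolicy) {
                mpol_copy_current(&mpol);
                nid_alloc = slab_node(&mpol);
        }
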

> Â Â Â Âfor_each_zone_zonelist_nodemask(zone, z, zonelist,
> Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â ÂMAX_NR_ZONES - 1, nodemask) {
> Â Â Â Â Â Â Â Ânid = zone_to_nid(zone);
> @@ -1835,10 +1846,15 @@ __setup("default_hugepagesz=", hugetlb_default_setup);
> Âstatic unsigned int cpuset_mems_nr(unsigned int *array)
> Â{
> Â Â Â Âint node;
> - Â Â Â unsigned int nr = 0;
> + Â Â Â unsigned int nr;
> + Â Â Â unsigned long flags, seq;
>
> - Â Â Â for_each_node_mask(node, cpuset_current_mems_allowed)
> - Â Â Â Â Â Â Â nr += array[node];
> + Â Â Â do {
> + Â Â Â Â Â Â Â nr = 0;
> + Â Â Â Â Â Â Â seq = mems_fastpath_lock_irqsave(current, flags);
> + Â Â Â Â Â Â Â for_each_node_mask(node, cpuset_current_mems_allowed)
> + Â Â Â Â Â Â Â Â Â Â Â nr += array[node];
> + Â Â Â } while (mems_fastpath_unlock_irqrestore(current, seq, flags));
>
> Â Â Â Âreturn nr;
> Â}
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index dd3f5c5..49abf11 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -187,8 +187,10 @@ static int mpol_new_bind(struct mempolicy *pol, const nodemask_t *nodes)
> Â* parameter with respect to the policy mode and flags. ÂBut, we need to
> Â* handle an empty nodemask with MPOL_PREFERRED here.
> Â*
> - * Must be called holding task's alloc_lock to protect task's mems_allowed
> - * and mempolicy. ÂMay also be called holding the mmap_semaphore for write.
> + * Must be called using
> + * Â Â mems_slowpath_lock_irqsave()/mems_slowpath_unlock_irqrestore()
> + * to protect task's mems_allowed and mempolicy. ÂMay also be called holding
> + * the mmap_semaphore for write.
> Â*/
> Âstatic int mpol_set_nodemask(struct mempolicy *pol,
> Â Â Â Â Â Â Â Â Â Â const nodemask_t *nodes, struct nodemask_scratch *nsc)
> @@ -344,9 +346,10 @@ static void mpol_rebind_policy(struct mempolicy *pol,
> Â* Wrapper for mpol_rebind_policy() that just requires task
> Â* pointer, and updates task mempolicy.
> Â*
> - * Called with task's alloc_lock held.
> + * Using
> + * Â Â mems_slowpath_lock_irqsave()/mems_slowpath_unlock_irqrestore()
> + * to protect it.
> Â*/
> -
> Âvoid mpol_rebind_task(struct task_struct *tsk, const nodemask_t *new)
> Â{
> Â Â Â Âmpol_rebind_policy(tsk->mempolicy, new);
> @@ -644,6 +647,7 @@ static long do_set_mempolicy(unsigned short mode, unsigned short flags,
> Â Â Â Âstruct mempolicy *new, *old;
> Â Â Â Âstruct mm_struct *mm = current->mm;
> Â Â Â ÂNODEMASK_SCRATCH(scratch);
> + Â Â Â unsigned long irqflags;
> Â Â Â Âint ret;
>
> Â Â Â Âif (!scratch)
> @@ -662,10 +666,10 @@ static long do_set_mempolicy(unsigned short mode, unsigned short flags,
> Â Â Â Â */
> Â Â Â Âif (mm)
> Â Â Â Â Â Â Â Âdown_write(&mm->mmap_sem);
> - Â Â Â task_lock(current);
> + Â Â Â mems_slowpath_lock_irqsave(current, irqflags);
> Â Â Â Âret = mpol_set_nodemask(new, nodes, scratch);
> Â Â Â Âif (ret) {
> - Â Â Â Â Â Â Â task_unlock(current);
> + Â Â Â Â Â Â Â mems_slowpath_unlock_irqrestore(current, irqflags);
> Â Â Â Â Â Â Â Âif (mm)
> Â Â Â Â Â Â Â Â Â Â Â Âup_write(&mm->mmap_sem);
> Â Â Â Â Â Â Â Âmpol_put(new);
> @@ -677,7 +681,7 @@ static long do_set_mempolicy(unsigned short mode, unsigned short flags,
> Â Â Â Âif (new && new->mode == MPOL_INTERLEAVE &&
> Â Â Â Â Â Ânodes_weight(new->v.nodes))
> Â Â Â Â Â Â Â Âcurrent->il_next = first_node(new->v.nodes);
> - Â Â Â task_unlock(current);
> + Â Â Â mems_slowpath_unlock_irqrestore(current, irqflags);
> Â Â Â Âif (mm)
> Â Â Â Â Â Â Â Âup_write(&mm->mmap_sem);
>
> @@ -691,7 +695,9 @@ out:
> Â/*
> Â* Return nodemask for policy for get_mempolicy() query
> Â*
> - * Called with task's alloc_lock held
> + * Must be called using mems_slowpath_lock_irqsave()/
> + * mems_slowpath_unlock_irqrestore() to
> + * protect it.
> Â*/
> Âstatic void get_policy_nodemask(struct mempolicy *p, nodemask_t *nodes)
> Â{
> @@ -736,6 +742,7 @@ static long do_get_mempolicy(int *policy, nodemask_t *nmask,
> Â Â Â Âstruct mm_struct *mm = current->mm;
> Â Â Â Âstruct vm_area_struct *vma = NULL;
> Â Â Â Âstruct mempolicy *pol = current->mempolicy;
> + Â Â Â unsigned long irqflags;
>
> Â Â Â Âif (flags &
> Â Â Â Â Â Â Â Â~(unsigned long)(MPOL_F_NODE|MPOL_F_ADDR|MPOL_F_MEMS_ALLOWED))
> @@ -745,9 +752,10 @@ static long do_get_mempolicy(int *policy, nodemask_t *nmask,
> Â Â Â Â Â Â Â Âif (flags & (MPOL_F_NODE|MPOL_F_ADDR))
> Â Â Â Â Â Â Â Â Â Â Â Âreturn -EINVAL;
> Â Â Â Â Â Â Â Â*policy = 0; Â Â/* just so it's initialized */
> - Â Â Â Â Â Â Â task_lock(current);
> +
> + Â Â Â Â Â Â Â mems_slowpath_lock_irqsave(current, irqflags);
> Â Â Â Â Â Â Â Â*nmask Â= cpuset_current_mems_allowed;
> - Â Â Â Â Â Â Â task_unlock(current);
> + Â Â Â Â Â Â Â mems_slowpath_unlock_irqrestore(current, irqflags);
> Â Â Â Â Â Â Â Âreturn 0;
> Â Â Â Â}
>
> @@ -803,13 +811,13 @@ static long do_get_mempolicy(int *policy, nodemask_t *nmask,
>
> Â Â Â Âerr = 0;
> Â Â Â Âif (nmask) {
> + Â Â Â Â Â Â Â mems_slowpath_lock_irqsave(current, irqflags);
> Â Â Â Â Â Â Â Âif (mpol_store_user_nodemask(pol)) {
> Â Â Â Â Â Â Â Â Â Â Â Â*nmask = pol->w.user_nodemask;
> Â Â Â Â Â Â Â Â} else {
> - Â Â Â Â Â Â Â Â Â Â Â task_lock(current);
> Â Â Â Â Â Â Â Â Â Â Â Âget_policy_nodemask(pol, nmask);
> - Â Â Â Â Â Â Â Â Â Â Â task_unlock(current);
> Â Â Â Â Â Â Â Â}
> + Â Â Â Â Â Â Â mems_slowpath_unlock_irqrestore(current, irqflags);
> Â Â Â Â}
>
> Âout:
> @@ -1008,6 +1016,7 @@ static long do_mbind(unsigned long start, unsigned long len,
> Â Â Â Âstruct mempolicy *new;
> Â Â Â Âunsigned long end;
> Â Â Â Âint err;
> + Â Â Â unsigned long irqflags;
> Â Â Â ÂLIST_HEAD(pagelist);
>
> Â Â Â Âif (flags & ~(unsigned long)(MPOL_MF_STRICT |
> @@ -1055,9 +1064,9 @@ static long do_mbind(unsigned long start, unsigned long len,
> Â Â Â Â Â Â Â ÂNODEMASK_SCRATCH(scratch);
> Â Â Â Â Â Â Â Âif (scratch) {
> Â Â Â Â Â Â Â Â Â Â Â Âdown_write(&mm->mmap_sem);
> - Â Â Â Â Â Â Â Â Â Â Â task_lock(current);
> + Â Â Â Â Â Â Â Â Â Â Â mems_slowpath_lock_irqsave(current, irqflags);
> Â Â Â Â Â Â Â Â Â Â Â Âerr = mpol_set_nodemask(new, nmask, scratch);
> - Â Â Â Â Â Â Â Â Â Â Â task_unlock(current);
> + Â Â Â Â Â Â Â Â Â Â Â mems_slowpath_unlock_irqrestore(current, irqflags);
> Â Â Â Â Â Â Â Â Â Â Â Âif (err)
> Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Âup_write(&mm->mmap_sem);
> Â Â Â Â Â Â Â Â} else
> @@ -1408,8 +1417,10 @@ static struct mempolicy *get_vma_policy(struct task_struct *task,
> Â Â Â Â Â Â Â Â} else if (vma->vm_policy)
> Â Â Â Â Â Â Â Â Â Â Â Âpol = vma->vm_policy;
> Â Â Â Â}
> +
> Â Â Â Âif (!pol)
> Â Â Â Â Â Â Â Âpol = &default_policy;
> +
> Â Â Â Âreturn pol;
> Â}
>
> @@ -1475,7 +1486,7 @@ static unsigned interleave_nodes(struct mempolicy *policy)
> Â* next slab entry.
> Â* @policy must be protected by freeing by the caller. ÂIf @policy is
> Â* the current task's mempolicy, this protection is implicit, as only the
> - * task can change it's policy. ÂThe system default policy requires no
> + * task can free it's policy. ÂThe system default policy requires no
> Â* such protection.
> Â*/
> Âunsigned slab_node(struct mempolicy *policy)
> @@ -1574,16 +1585,33 @@ struct zonelist *huge_zonelist(struct vm_area_struct *vma, unsigned long addr,
> Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Ânodemask_t **nodemask)
> Â{
> Â Â Â Âstruct zonelist *zl;
> + Â Â Â struct mempolicy policy;
> + Â Â Â struct mempolicy *pol;
> + Â Â Â unsigned long seq, irqflag;
>
> Â Â Â Â*mpol = get_vma_policy(current, vma, addr);
> Â Â Â Â*nodemask = NULL; Â Â Â /* assume !MPOL_BIND */
>
> - Â Â Â if (unlikely((*mpol)->mode == MPOL_INTERLEAVE)) {
> - Â Â Â Â Â Â Â zl = node_zonelist(interleave_nid(*mpol, vma, addr,
> + Â Â Â pol = *mpol;
> + Â Â Â if (pol == current->mempolicy) {
> + Â Â Â Â Â Â Â /*
> + Â Â Â Â Â Â Â Â* get_vma_policy() doesn't return NULL, so we needn't worry
> + Â Â Â Â Â Â Â Â* whether pol is NULL or not.
> + Â Â Â Â Â Â Â Â*/
> + Â Â Â Â Â Â Â do {
> + Â Â Â Â Â Â Â Â Â Â Â seq = mems_fastpath_lock_irqsave(current, irqflag);
> + Â Â Â Â Â Â Â Â Â Â Â policy = *pol;
> + Â Â Â Â Â Â Â } while (mems_fastpath_unlock_irqrestore(current,
> + Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â seq, irqflag));
> + Â Â Â Â Â Â Â pol = &policy;
> + Â Â Â }
> +
> + Â Â Â if (unlikely(pol->mode == MPOL_INTERLEAVE)) {
> + Â Â Â Â Â Â Â zl = node_zonelist(interleave_nid(pol, vma, addr,
> Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Âhuge_page_shift(hstate_vma(vma))), gfp_flags);
> Â Â Â Â} else {
> - Â Â Â Â Â Â Â zl = policy_zonelist(gfp_flags, *mpol);
> - Â Â Â Â Â Â Â if ((*mpol)->mode == MPOL_BIND)
> + Â Â Â Â Â Â Â zl = policy_zonelist(gfp_flags, pol);
> + Â Â Â Â Â Â Â if (pol->mode == MPOL_BIND)
> Â Â Â Â Â Â Â Â Â Â Â Â*nodemask = &(*mpol)->v.nodes;
> Â Â Â Â}
> Â Â Â Âreturn zl;
> @@ -1609,11 +1637,14 @@ bool init_nodemask_of_mempolicy(nodemask_t *mask)
> Â{
> Â Â Â Âstruct mempolicy *mempolicy;
> Â Â Â Âint nid;
> + Â Â Â unsigned long irqflags;
>
> Â Â Â Âif (!(mask && current->mempolicy))
> Â Â Â Â Â Â Â Âreturn false;
>
> + Â Â Â mems_slowpath_lock_irqsave(current, irqflags);
> Â Â Â Âmempolicy = current->mempolicy;
> +
> Â Â Â Âswitch (mempolicy->mode) {
> Â Â Â Âcase MPOL_PREFERRED:
> Â Â Â Â Â Â Â Âif (mempolicy->flags & MPOL_F_LOCAL)
> @@ -1633,6 +1664,8 @@ bool init_nodemask_of_mempolicy(nodemask_t *mask)
> Â Â Â Â Â Â Â ÂBUG();
> Â Â Â Â}
>
> + Â Â Â mems_slowpath_unlock_irqrestore(current, irqflags);
> +
> Â Â Â Âreturn true;
> Â}
> Â#endif
> @@ -1722,7 +1755,22 @@ struct page *
> Âalloc_page_vma(gfp_t gfp, struct vm_area_struct *vma, unsigned long addr)
> Â{
> Â Â Â Âstruct mempolicy *pol = get_vma_policy(current, vma, addr);
> + Â Â Â struct mempolicy policy;
> Â Â Â Âstruct zonelist *zl;
> + Â Â Â struct page *page;
> + Â Â Â unsigned long seq, iflags;
> +
> + Â Â Â if (pol == current->mempolicy) {
> + Â Â Â Â Â Â Â /*
> + Â Â Â Â Â Â Â Â* get_vma_policy() doesn't return NULL, so we needn't worry
> + Â Â Â Â Â Â Â Â* whether pol is NULL or not.
> + Â Â Â Â Â Â Â Â*/
> + Â Â Â Â Â Â Â do {
> + Â Â Â Â Â Â Â Â Â Â Â seq = mems_fastpath_lock_irqsave(current, iflags);
> + Â Â Â Â Â Â Â Â Â Â Â policy = *pol;
> + Â Â Â Â Â Â Â } while (mems_fastpath_unlock_irqrestore(current, seq, iflags));
> + Â Â Â Â Â Â Â pol = &policy;
> + Â Â Â }
>
> Â Â Â Âif (unlikely(pol->mode == MPOL_INTERLEAVE)) {
> Â Â Â Â Â Â Â Âunsigned nid;
> @@ -1736,15 +1784,16 @@ alloc_page_vma(gfp_t gfp, struct vm_area_struct *vma, unsigned long addr)
> Â Â Â Â Â Â Â Â/*
> Â Â Â Â Â Â Â Â * slow path: ref counted shared policy
> Â Â Â Â Â Â Â Â */
> - Â Â Â Â Â Â Â struct page *page = Â__alloc_pages_nodemask(gfp, 0,
> - Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â zl, policy_nodemask(gfp, pol));
> + Â Â Â Â Â Â Â page = Â__alloc_pages_nodemask(gfp, 0, zl,
> + Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â policy_nodemask(gfp, pol));
> Â Â Â Â Â Â Â Â__mpol_put(pol);
> Â Â Â Â Â Â Â Âreturn page;
> Â Â Â Â}
> Â Â Â Â/*
> Â Â Â Â * fast path: Âdefault or task policy
> Â Â Â Â */
> - Â Â Â return __alloc_pages_nodemask(gfp, 0, zl, policy_nodemask(gfp, pol));
> + Â Â Â page = __alloc_pages_nodemask(gfp, 0, zl, policy_nodemask(gfp, pol));
> + Â Â Â return page;
> Â}
>
> Â/**
> @@ -1761,26 +1810,37 @@ alloc_page_vma(gfp_t gfp, struct vm_area_struct *vma, unsigned long addr)
> Â* Â Â Allocate a page from the kernel page pool. ÂWhen not in
> Â* Â Â interrupt context and apply the current process NUMA policy.
> Â* Â Â Returns NULL when no page can be allocated.
> - *
> - * Â Â Don't call cpuset_update_task_memory_state() unless
> - * Â Â 1) it's ok to take cpuset_sem (can WAIT), and
> - * Â Â 2) allocating for current task (not interrupt).
> Â*/
> Âstruct page *alloc_pages_current(gfp_t gfp, unsigned order)
> Â{
> Â Â Â Âstruct mempolicy *pol = current->mempolicy;
> + Â Â Â struct mempolicy policy;
> + Â Â Â struct page *page;
> + Â Â Â unsigned long seq, irqflags;
> +
>
> Â Â Â Âif (!pol || in_interrupt() || (gfp & __GFP_THISNODE))
> Â Â Â Â Â Â Â Âpol = &default_policy;
> -
> + Â Â Â else {
> + Â Â Â Â Â Â Â do {
> + Â Â Â Â Â Â Â Â Â Â Â seq = mems_fastpath_lock_irqsave(current, irqflags);
> + Â Â Â Â Â Â Â Â Â Â Â policy = *pol;
> + Â Â Â Â Â Â Â } while (mems_fastpath_unlock_irqrestore(current,
> + Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â seq, irqflags));
> + Â Â Â Â Â Â Â pol = &policy;
> + Â Â Â }
> Â Â Â Â/*
> Â Â Â Â * No reference counting needed for current->mempolicy
> Â Â Â Â * nor system default_policy
> Â Â Â Â */
> Â Â Â Âif (pol->mode == MPOL_INTERLEAVE)
> - Â Â Â Â Â Â Â return alloc_page_interleave(gfp, order, interleave_nodes(pol));
> - Â Â Â return __alloc_pages_nodemask(gfp, order,
> - Â Â Â Â Â Â Â Â Â Â Â policy_zonelist(gfp, pol), policy_nodemask(gfp, pol));
> + Â Â Â Â Â Â Â page = alloc_page_interleave(gfp, order, interleave_nodes(pol));
> + Â Â Â else
> + Â Â Â Â Â Â Â page = Â__alloc_pages_nodemask(gfp, order,
> + Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â policy_zonelist(gfp, pol),
> + Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â policy_nodemask(gfp, pol));
> +
> + Â Â Â return page;
> Â}
> ÂEXPORT_SYMBOL(alloc_pages_current);
>
> @@ -2026,6 +2086,7 @@ restart:
> Â*/
> Âvoid mpol_shared_policy_init(struct shared_policy *sp, struct mempolicy *mpol)
> Â{
> + Â Â Â unsigned long irqflags;
> Â Â Â Âint ret;
>
> Â Â Â Âsp->root = RB_ROOT; Â Â Â Â Â Â /* empty tree == default mempolicy */
> @@ -2043,9 +2104,9 @@ void mpol_shared_policy_init(struct shared_policy *sp, struct mempolicy *mpol)
> Â Â Â Â Â Â Â Âif (IS_ERR(new))
> Â Â Â Â Â Â Â Â Â Â Â Âgoto put_free; /* no valid nodemask intersection */
>
> - Â Â Â Â Â Â Â task_lock(current);
> + Â Â Â Â Â Â Â mems_slowpath_lock_irqsave(current, irqflags);
> Â Â Â Â Â Â Â Âret = mpol_set_nodemask(new, &mpol->w.user_nodemask, scratch);
> - Â Â Â Â Â Â Â task_unlock(current);
> + Â Â Â Â Â Â Â mems_slowpath_unlock_irqrestore(current, irqflags);
> Â Â Â Â Â Â Â Âmpol_put(mpol); /* drop our ref on sb mpol */
> Â Â Â Â Â Â Â Âif (ret)
> Â Â Â Â Â Â Â Â Â Â Â Âgoto put_free;
> @@ -2200,6 +2261,7 @@ int mpol_parse_str(char *str, struct mempolicy **mpol, int no_context)
> Â Â Â Ânodemask_t nodes;
> Â Â Â Âchar *nodelist = strchr(str, ':');
> Â Â Â Âchar *flags = strchr(str, '=');
> + Â Â Â unsigned long irqflags;
> Â Â Â Âint err = 1;
>
> Â Â Â Âif (nodelist) {
> @@ -2291,9 +2353,9 @@ int mpol_parse_str(char *str, struct mempolicy **mpol, int no_context)
> Â Â Â Â Â Â Â Âint ret;
> Â Â Â Â Â Â Â ÂNODEMASK_SCRATCH(scratch);
> Â Â Â Â Â Â Â Âif (scratch) {
> - Â Â Â Â Â Â Â Â Â Â Â task_lock(current);
> + Â Â Â Â Â Â Â Â Â Â Â mems_slowpath_lock_irqsave(current, irqflags);
> Â Â Â Â Â Â Â Â Â Â Â Âret = mpol_set_nodemask(new, &nodes, scratch);
> - Â Â Â Â Â Â Â Â Â Â Â task_unlock(current);
> + Â Â Â Â Â Â Â Â Â Â Â mems_slowpath_unlock_irqrestore(current, irqflags);
> Â Â Â Â Â Â Â Â} else
> Â Â Â Â Â Â Â Â Â Â Â Âret = -ENOMEM;
> Â Â Â Â Â Â Â ÂNODEMASK_SCRATCH_FREE(scratch);
> @@ -2487,8 +2549,10 @@ int show_numa_map(struct seq_file *m, void *v)
> Â Â Â Âstruct file *file = vma->vm_file;
> Â Â Â Âstruct mm_struct *mm = vma->vm_mm;
> Â Â Â Âstruct mempolicy *pol;
> + Â Â Â struct mempolicy policy;
> Â Â Â Âint n;
> Â Â Â Âchar buffer[50];
> + Â Â Â unsigned long iflags, seq;
>
> Â Â Â Âif (!mm)
> Â Â Â Â Â Â Â Âreturn 0;
> @@ -2498,6 +2562,18 @@ int show_numa_map(struct seq_file *m, void *v)
> Â Â Â Â Â Â Â Âreturn 0;
>
> Â Â Â Âpol = get_vma_policy(priv->task, vma, vma->vm_start);
> + Â Â Â if (pol == current->mempolicy) {
> + Â Â Â Â Â Â Â /*
> + Â Â Â Â Â Â Â Â* get_vma_policy() doesn't return NULL, so we needn't worry
> + Â Â Â Â Â Â Â Â* whether pol is NULL or not.
> + Â Â Â Â Â Â Â Â*/
> + Â Â Â Â Â Â Â do {
> + Â Â Â Â Â Â Â Â Â Â Â seq = mems_fastpath_lock_irqsave(current, iflags);
> + Â Â Â Â Â Â Â Â Â Â Â policy = *pol;
> + Â Â Â Â Â Â Â } while (mems_fastpath_unlock_irqrestore(current, seq, iflags));
> + Â Â Â Â Â Â Â pol = &policy;
> + Â Â Â }
> +
> Â Â Â Âmpol_to_str(buffer, sizeof(buffer), pol, 0);
> Â Â Â Âmpol_cond_put(pol);
>
> diff --git a/mm/slab.c b/mm/slab.c
> index 09f1572..b8f5acb 100644
> --- a/mm/slab.c
> +++ b/mm/slab.c
> @@ -3282,14 +3282,24 @@ static inline void *____cache_alloc(struct kmem_cache *cachep, gfp_t flags)
> Âstatic void *alternate_node_alloc(struct kmem_cache *cachep, gfp_t flags)
> Â{
> Â Â Â Âint nid_alloc, nid_here;
> + Â Â Â unsigned long lflags, seq;
> + Â Â Â struct mempolicy mpol;
>
> Â Â Â Âif (in_interrupt() || (flags & __GFP_THISNODE))
> Â Â Â Â Â Â Â Âreturn NULL;
> +
> Â Â Â Ânid_alloc = nid_here = numa_node_id();
> Â Â Â Âif (cpuset_do_slab_mem_spread() && (cachep->flags & SLAB_MEM_SPREAD))
> Â Â Â Â Â Â Â Ânid_alloc = cpuset_mem_spread_node();
> - Â Â Â else if (current->mempolicy)
> - Â Â Â Â Â Â Â nid_alloc = slab_node(current->mempolicy);
> + Â Â Â else if (current->mempolicy) {
> + Â Â Â Â Â Â Â do {
> + Â Â Â Â Â Â Â Â Â Â Â seq = mems_fastpath_lock_irqsave(current, lflags);
> + Â Â Â Â Â Â Â Â Â Â Â mpol = *(current->mempolicy);
> + Â Â Â Â Â Â Â } while (mems_fastpath_unlock_irqrestore(current, seq, lflags));
> +
> + Â Â Â Â Â Â Â nid_alloc = slab_node(&mpol);
> + Â Â Â }
> +
> Â Â Â Âif (nid_alloc != nid_here)
> Â Â Â Â Â Â Â Âreturn ____cache_alloc_node(cachep, flags, nid_alloc);
> Â Â Â Âreturn NULL;
> @@ -3312,11 +3322,21 @@ static void *fallback_alloc(struct kmem_cache *cache, gfp_t flags)
> Â Â Â Âenum zone_type high_zoneidx = gfp_zone(flags);
> Â Â Â Âvoid *obj = NULL;
> Â Â Â Âint nid;
> + Â Â Â unsigned long lflags, seq;
> + Â Â Â struct mempolicy mpol;
>
> Â Â Â Âif (flags & __GFP_THISNODE)
> Â Â Â Â Â Â Â Âreturn NULL;
>
> - Â Â Â zonelist = node_zonelist(slab_node(current->mempolicy), flags);
> + Â Â Â if (current->mempolicy) {
> + Â Â Â Â Â Â Â do {
> + Â Â Â Â Â Â Â Â Â Â Â seq = mems_fastpath_lock_irqsave(current, lflags);
> + Â Â Â Â Â Â Â Â Â Â Â mpol = *(current->mempolicy);
> + Â Â Â Â Â Â Â } while (mems_fastpath_unlock_irqrestore(current, seq, lflags));
> + Â Â Â Â Â Â Â zonelist = node_zonelist(slab_node(&mpol), flags);
> + Â Â Â } else
> + Â Â Â Â Â Â Â zonelist = node_zonelist(slab_node(NULL), flags);
> +
> Â Â Â Âlocal_flags = flags & (GFP_CONSTRAINT_MASK|GFP_RECLAIM_MASK);
>
> Âretry:
> diff --git a/mm/slub.c b/mm/slub.c
> index b364844..436c521 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -1345,6 +1345,8 @@ static struct page *get_any_partial(struct kmem_cache *s, gfp_t flags)
> Â Â Â Âstruct zone *zone;
> Â Â Â Âenum zone_type high_zoneidx = gfp_zone(flags);
> Â Â Â Âstruct page *page;
> + Â Â Â unsigned long lflags, seq;
> + Â Â Â struct mempolicy mpol;
>
> Â Â Â Â/*
> Â Â Â Â * The defrag ratio allows a configuration of the tradeoffs between
> @@ -1368,7 +1370,15 @@ static struct page *get_any_partial(struct kmem_cache *s, gfp_t flags)
> Â Â Â Â Â Â Â Â Â Â Â Âget_cycles() % 1024 > s->remote_node_defrag_ratio)
> Â Â Â Â Â Â Â Âreturn NULL;
>
> - Â Â Â zonelist = node_zonelist(slab_node(current->mempolicy), flags);
> + Â Â Â if (current->mempolicy) {
> + Â Â Â Â Â Â Â do {
> + Â Â Â Â Â Â Â Â Â Â Â seq = mems_fastpath_lock_irqsave(current, lflags);
> + Â Â Â Â Â Â Â Â Â Â Â mpol = *(current->mempolicy);
> + Â Â Â Â Â Â Â } while (mems_fastpath_unlock_irqrestore(current, seq, lflags));
> + Â Â Â Â Â Â Â zonelist = node_zonelist(slab_node(&mpol), flags);
> + Â Â Â } else
> + Â Â Â Â Â Â Â zonelist = node_zonelist(slab_node(NULL), flags);
> +
> Â Â Â Âfor_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
> Â Â Â Â Â Â Â Âstruct kmem_cache_node *n;
>
> --
> 1.6.5.2
>
>
>



--
Regards,
-Bob Liu
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/