Re: page fault scalability patch V12 [0/7]: Overview and performance tests

From: Hugh Dickins
Date: Thu Dec 09 2004 - 13:39:38 EST


On Wed, 1 Dec 2004, Christoph Lameter wrote:
>
> Changes from V11->V12 of this patch:
> - dump sloppy_rss in favor of list_rss (Linus' proposal)
> - keep up against current Linus tree (patch is based on 2.6.10-rc2-bk14)
>
> This is a series of patches that increases the scalability of
> the page fault handler for SMP. Here are some performance results
> on a machine with 512 processors allocating 32 GB with an increasing
> number of threads (that are assigned a processor each).

Your V12 patches would apply well to 2.6.10-rc3, except that (as noted
before) your mailer or whatever is eating trailing whitespace: trivial
patch attached to apply before yours, removing that whitespace so yours
apply. But what your patches need to apply to would be 2.6.10-mm.

Your i386 HIGHMEM64G 3level ptep_cmpxchg forgets to use cmpxchg8b, would
have tested out okay up to 4GB but not above: trivial patch attached.

Your scalability figures show a superb improvement. But they are (I
presume) for the best case: intense initial faulting of distinct areas
of anonymous memory by parallel cpus running a multithreaded process.
This is not a common case: how much do real-world apps benefit?

Since you also avoid taking the page_table_lock in handle_pte_fault,
there should be some scalability benefit to all kinds of page fault:
do you have any results to show how much (perhaps hard to quantify,
since even tmpfs file faults introduce other scalability issues)?

How do the scalability figures compare if you omit patch 7/7 i.e. revert
the per-task rss complications you added in for Linus? I remain a fan
of sloppy rss, which you earlier showed to be accurate enough (I'd say),
though I guess it should be checked on architectures other than your ia64.
I can't see the point of all that added ugliness for numbers which don't
need to be precise - but perhaps there's no way of rearranging fields,
and the point at which mm->(anon_)rss is updated (near up of mmap_sem?),
to avoid destructive cacheline bounce. What I'm asking is, do you have
numbers to support 7/7? Perhaps it's the fact you showed up to 512 cpus
this time, but only up to 32 with sloppy rss? The ratios do look better
with the latest, but the numbers are altogether lower so we don't know.

The split rss patch, if it stays, needs some work. For example,
task_statm uses "get_shared" to total up rss-anon_rss from the tasks,
but assumes mm->rss is already accurate. Scrap the separate get_rss,
get_anon_rss, get_shared functions: just one get_rss to make a single
pass through the tasks adding up both rss and anon_rss at the same time.
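
Something like this is what I have in mind - a sketch only, the list
head, link and locking are guesses at your 7/7 naming, so adjust to suit:

	static void get_rss(struct mm_struct *mm, unsigned long *rss,
			    unsigned long *anon_rss)
	{
		struct task_struct *tsk;

		/* start from the mm's own counters... */
		*rss = mm->rss;
		*anon_rss = mm->anon_rss;

		/* take whatever lock 7/7 uses to protect the task list */
		list_for_each_entry(tsk, &mm->task_list, mm_tasks) {
			/* ...and add each task's contribution in one walk */
			*rss += tsk->rss;
			*anon_rss += tsk->anon_rss;
		}
		/* drop that lock again here */
	}

task_statm then derives shared as rss - anon_rss from that single pass.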

I am bothered that every read of /proc/<pid>/status or /proc/<pid>/statm
is going to walk all of that task_list again each time; yet in that
massively parallel case that concerns you, there should be little change
to rss after startup. Perhaps a later optimization would be to avoid
task_list completely for singly threaded processes. I'd like get_rss to
update mm->rss and mm->anon_rss and flag it uptodate to avoid subsequent
task_list iterations, but the locking might defeat your whole purpose.

Updating current->rss in do_anonymous_page, current->anon_rss in
page_add_anon_rmap, is not always correct: ptrace's access_process_vm
uses get_user_pages on another task. You need to check that current->mm
== mm (or vma->vm_mm) before incrementing current->rss or
current->anon_rss, and fall back to mm (or vma->vm_mm) in the rare case
that it isn't (taking page_table_lock for that). You'll also need to
check !(current->flags & PF_BORROWED_MM),
to guard against use_mm. Or... just go back to sloppy rss.
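
The guard I mean is roughly this (a sketch, not the exact code: field
names follow your patch, mm is vma->vm_mm in the rmap case, and I'm
assuming page_table_lock is not already held on the paths in question):

	if (current->mm == mm && !(current->flags & PF_BORROWED_MM)) {
		current->anon_rss++;	/* or current->rss++ in do_anonymous_page */
	} else {
		/* rare: faulting or mapping into another task's mm */
		spin_lock(&mm->page_table_lock);
		mm->anon_rss++;
		spin_unlock(&mm->page_table_lock);
	}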

Moving to the main patch, 1/7, the major issue I see there is the way
do_anonymous_page does update_mmu_cache after setting the pte, without
any page_table_lock to bracket them together. Obviously no problem on
architectures where update_mmu_cache is a no-op! But although there's
been plenty of discussion, particularly with Ben and Nick, I've not
noticed anything to guarantee that as safe on all architectures. I do
think it's fine for you to post your patches before completing hooks in
all the arches, but isn't this a significant issue which needs to be
sorted before your patches go into -mm? You hazily refer to such issues
in 0/7, but now you need to work with arch maintainers to settle them
and show the patches.

A lesser issue with the reordering in do_anonymous_page: don't you need
to move the lru_cache_add_active after the page_add_anon_rmap, to avoid
the very slight chance that vmscan will pick the page off the LRU and
unmap it before you've counted it in, hitting page_remove_rmap's
BUG_ON(page_mapcount(page) < 0)?
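
That is, something like this order in the reordered path:

	page_add_anon_rmap(page, vma, addr);	/* count the mapping first */
	lru_cache_add_active(page);		/* only then show it to vmscan */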

(I do wonder why do_anonymous_page calls mark_page_accessed as well as
lru_cache_add_active. The other instances of lru_cache_add_active for
an anonymous page don't mark_page_accessed i.e. SetPageReferenced too,
why here? But that's nothing new with your patch, and although you've
reordered the calls, the final page state is the same as before.)

Where handle_pte_fault does "entry = *pte" without page_table_lock:
you're quite right to pass down precisely that entry to the fault
handlers below, but there's still a problem on the 32bit architectures
supporting 64bit ptes (i386, mips, ppc), that the upper and lower ints
of entry may be out of synch. Not a problem for do_anonymous_page, or
anything else relying on ptep_cmpxchg to check; but a problem for
do_wp_page (which could find !pfn_valid and kill the process) and
probably others (harder to think through). Your 4/7 patch for i386 has
an unused atomic get_64bit function from Nick; I think you'll have to
define a get_pte_atomic macro and use get_64bit in its 64-on-32 cases.
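
i.e. something along these lines in the PAE header (a sketch: I'm
assuming Nick's get_64bit returns the 64-bit value read atomically;
pte_low and pte_high as in the existing 3level pte_t):

	static inline pte_t get_pte_atomic(pte_t *ptep)
	{
		unsigned long long val = get_64bit((unsigned long long *)ptep);
		pte_t pte;

		pte.pte_low = val;
		pte.pte_high = val >> 32;
		return pte;
	}

with the 2level and other-arch versions just doing
"#define get_pte_atomic(ptep) (*(ptep))", and handle_pte_fault doing
"entry = get_pte_atomic(pte)" instead of "entry = *pte".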

Hmm, that will only work if you're using atomic set_64bit rather than
relying on page_table_lock in the complementary places which matter.
Which I believe you are indeed doing in your 3level set_pte. Shouldn't
__set_64bit be using LOCK_PREFIX like __get_64bit, instead of lock?

But by making every set_pte use set_64bit, you are significantly slowing
down many operations which do not need that atomicity. This is quite
visible in the fork/exec/shell results from lmbench on i386 PAE (and is
the only interesting difference, for good or bad, that I noticed with
your patches in lmbench on 2*HT*P4), which run 5-20% slower. There are
no faults on dst mm (nor on src mm) while copy_page_range is copying,
so its set_ptes don't need to be atomic; likewise during zap_pte_range
(either mmap_sem is held exclusively, or it's in the final exit_mmap).
Probably revert set_pte and set_pte_atomic to what they were, and use
set_pte_atomic where it's needed.
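
For reference, the pre-patch 3level definitions were along these lines,
with set_pte_atomic already there for the places which do need the
64-bit store to be atomic:

	static inline void set_pte(pte_t *ptep, pte_t pte)
	{
		ptep->pte_high = pte.pte_high;
		smp_wmb();
		ptep->pte_low = pte.pte_low;
	}

	#define set_pte_atomic(pteptr, pteval) \
		set_64bit((unsigned long long *)(pteptr), pte_val(pteval))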

Hugh
--- 2.6.10-rc3/include/asm-i386/system.h 2004-11-15 16:21:12.000000000 +0000
+++ linux/include/asm-i386/system.h 2004-11-22 14:44:30.761904592 +0000
@@ -273,9 +273,9 @@ static inline unsigned long __cmpxchg(vo
#define cmpxchg(ptr,o,n)\
((__typeof__(*(ptr)))__cmpxchg((ptr),(unsigned long)(o),\
(unsigned long)(n),sizeof(*(ptr))))
-
+
#ifdef __KERNEL__
-struct alt_instr {
+struct alt_instr {
__u8 *instr; /* original instruction */
__u8 *replacement;
__u8 cpuid; /* cpuid bit set for replacement */
--- 2.6.10-rc3/include/asm-s390/pgalloc.h 2004-05-10 03:33:39.000000000 +0100
+++ linux/include/asm-s390/pgalloc.h 2004-11-22 14:54:43.704723120 +0000
@@ -99,7 +99,7 @@ static inline void pgd_populate(struct m

#endif /* __s390x__ */

-static inline void
+static inline void
pmd_populate_kernel(struct mm_struct *mm, pmd_t *pmd, pte_t *pte)
{
#ifndef __s390x__
--- 2.6.10-rc3/mm/memory.c 2004-11-18 17:56:11.000000000 +0000
+++ linux/mm/memory.c 2004-11-22 14:39:33.924030808 +0000
@@ -1424,7 +1424,7 @@ out:
/*
* We are called with the MM semaphore and page_table_lock
* spinlock held to protect against concurrent faults in
- * multithreaded programs.
+ * multithreaded programs.
*/
static int
do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
@@ -1615,7 +1615,7 @@ static int do_file_page(struct mm_struct
* Fall back to the linear mapping if the fs does not support
* ->populate:
*/
- if (!vma->vm_ops || !vma->vm_ops->populate ||
+ if (!vma->vm_ops || !vma->vm_ops->populate ||
(write_access && !(vma->vm_flags & VM_SHARED))) {
pte_clear(pte);
return do_no_page(mm, vma, address, write_access, pte, pmd);

--- 2.6.10-rc3-cl/include/asm-i386/pgtable-3level.h 2004-12-05 14:01:11.000000000 +0000
+++ linux/include/asm-i386/pgtable-3level.h 2004-12-09 13:17:44.000000000 +0000
@@ -147,7 +147,7 @@ static inline pmd_t pfn_pmd(unsigned lon

static inline int ptep_cmpxchg(struct vm_area_struct *vma, unsigned long address, pte_t *ptep, pte_t oldval, pte_t newval)
{
- return cmpxchg((unsigned int *)ptep, pte_val(oldval), pte_val(newval)) == pte_val(oldval);
+ return cmpxchg8b((unsigned long long *)ptep, pte_val(oldval), pte_val(newval)) == pte_val(oldval);
}