Subject: [PATCH] mm/mempolicy.c linux-2.6.12.3

From: John Blackwood
Date: Fri Jul 22 2005 - 11:02:15 EST


Hello Andrea,

I believe that we are seeing a problem with one change from the patch

2005/01/03 20:15:21-08:00 andrea@xxxxxxxxxx
[PATCH] mempolicy optimisation

in the mpol_free_shared_policy() routine in mm/mempolicy.c, where a
rb_erase() line was removed.

The corresponding portion of the original patch is shown below:

@@ -1086,11 +1084,11 @@
while (next) {
n = rb_entry(next, struct sp_node, nd);
next = rb_next(&n->nd);
- rb_erase(&n->nd, &p->root);
mpol_free(n->policy);
kmem_cache_free(sn_cache, n);
}
spin_unlock(&p->lock);
+ p->root = RB_ROOT;
}


When we build a 2.6.11.4 debug kernel on a 2 cpu NUMA-enabled opteron
system, and run the following set of commands:

echo "1" > /tmp/numatest
numactl --length=0x4000 --shm /tmp/numatest --localalloc
numactl --length=0x2000 --offset=0 --shm /tmp/numatest --membind=0
numactl --length=0x2000 --offset=0x2000 --shm /tmp/numatest --membind=1
ipcs
ipcrm -M "the_key_value_of_this_shm_area"


On the ipcrm call above, the system will oops:

general protection fault: 0000 [1] PREEMPT SMP

Entering kdb (current=0xffff81008161a0f0, pid 3102) on processor 1 Oops: <NULL>
due to oops @ 0xffffffff802e51f3
r15 = 0x0000000000000100 r14 = 0xffff81007d7a5470
r13 = 0xffff810001eea460 r12 = 0xffff81007c4324f8
rbp = 0xffff81007c4324f8 rbx = 0xffff81007c4324f8
r11 = 0x0000000000000202 r10 = 0x00002aaaaabc3db0
r9 = 0xffff81007c4324a8 r8 = 0x0000000000000000
rax = 0x000000000000000f rcx = 0x0000000000000000
rdx = 0x0000000000000000 rsi = 0x000000000000006b
rdi = 0x6b6b6b6b6b6b6b6b orig_rax = 0xffffffffffffffff
rip = 0xffffffff802e51f3 cs = 0x0000000000000010
eflags = 0x0000000000010202 rsp = 0xffff81007d06bc40
ss = 0xffff81007d06a000 &regs = 0xffff81007d06bba8
[1]kdb> bt
Stack traceback for pid 3102
0xffff81008161a0f0 3102 2761 1 1 R 0xffff81008161a480 *ipcrm
RSP RIP Function (args)
0xffff81007d06bc40 0xffffffff802e51f3 rb_next+0xb (0xffff810001eea7d8, 0xffff810001eea518, 0xffff810001eea518, 0x0, 0xffff810001eea518)
0xffff81007d06bc58 0xffffffff80189c74 mpol_free_shared_policy+0x3a
0xffff81007d06bc88 0xffffffff8018d306 shmem_destroy_inode+0x1f
0xffff81007d06bc98 0xffffffff801ab0e8 destroy_inode+0x31 (0xffff81007d7a5484)
0xffff81007d06bca8 0xffffffff801ac5a6 generic_delete_inode+0x10c
0xffff81007d06bcc8 0xffffffff801ac796 iput+0x77 (0xffff8100815673c8)
0xffff81007d06bcd8 0xffffffff801a915c dput+0x1c0 (0xffff81007d3e8408, 0x10000, 0x0, 0x0, 0xffff81007d3e8408)
0xffff81007d06bcf8 0xffffffff80190a9a __fput+0x128 (0xffff81007d3e8408)
0xffff81007d06bd28 0xffffffff8019096d fput+0x14
0xffff81007d06bd38 0xffffffff802d1a8c shm_destroy+0x4d
0xffff81007d06bd48 0xffffffff802d26d9 sys_shmctl+0x689


The rdi = 0x6b6b6b6b6b6b6b6b above (and additional debugging that I did)
shows that we called rb_next(&n->nd) in the while loop and we got back
a rb_node that we already just deallocated via kmem_cache_free() in that
same while loop execution.

As a result, on the debug kernel, the already deallocated rb_node has
0x6b6b6b6b6b6b6b6b in its memory locations and we oops when we try to
use this value as a pointer.

When I put back the rb_erase() line, the above example test, and lots
of others like it, start working again w/out oops-ing.

I'm no rb tree expert, but it seems that we still need to have the
rb_erase() line in the while loop:

diff -u linux-2.6.12.3/mm/mempolicy.c new/mm/mempolicy.c
--- linux-2.6.12.3/mm/mempolicy.c 2005-07-15 17:18:57.000000000 -0400
+++ new/mm/mempolicy.c 2005-07-22 11:12:56.000000000 -0400
@@ -1104,6 +1104,7 @@
while (next) {
n = rb_entry(next, struct sp_node, nd);
next = rb_next(&n->nd);
+ rb_erase(&n->nd, &p->root);
mpol_free(n->policy);
kmem_cache_free(sn_cache, n);
}



Thank you for your time and considerations.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/