Re: [bug report & help] arm64: ltp testcase "migrate_pages01" failed

From: Yisheng Xie
Date: Tue Oct 17 2017 - 09:24:08 EST


Hi Will,

On 2017/10/17 17:23, Will Deacon wrote:
> On Tue, Oct 17, 2017 at 10:58:53AM +0800, Tan Xiaojun wrote:
>> I'm not sure whether this is a problem with arm64 NUMA. What do you think?
>> By the way, this testcase passes in every case on x86.
>
> To be honest, this isn't a particularly helpful bug report. I appreciate
> that a test is reporting failure, but it doesn't look like you've spent
> very much effort to understand what the test is trying to do and why it
> thinks it's failed to do it. All I can sensibly do with your bug report
> is run the test myself, and it passes on the systems I have available.
>
> So, you need to:
>
> 1. Understand what the test is doing.
> 2. Figure out which bit isn't doing what it's supposed to.
> 3. See if that part can be isolated to trigger the problem.
>
> At that point, it should be possible to describe the unexpected behaviour
> at a level which we can actually investigate if necessary.
This test case checks that migration fails when a user calls
SYSC_migrate_pages with an invalid node. E.g., on a system with 4 nodes
(0-3), it tries to migrate to node 4, which should return -EINVAL.

However, the kernel migrates the memory to node 0 and returns success (i.e. 0).
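To isolate the failing step outside LTP, a minimal user-space sketch along these lines could be used. It is not the LTP code: the helper names (node_bit, try_migrate, invalid_node_rejected), migrating from node 0, and passing one unsigned long's worth of bits as maxnode are all assumptions for illustration. It calls migrate_pages(2) on the calling process (pid 0) and reports whether an out-of-range node is rejected with EINVAL.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: a node mask with only `node` set.
 * Only valid for node < number of bits in an unsigned long. */
unsigned long node_bit(int node)
{
	return 1UL << node;
}

/* Hypothetical helper: ask the kernel to move the calling process's
 * pages (pid 0) from from_node to to_node. Returns the raw syscall
 * result: 0 on success, -1 with errno set on failure. */
long try_migrate(int from_node, int to_node)
{
	unsigned long old_nodes = node_bit(from_node);
	unsigned long new_nodes = node_bit(to_node);
	/* Assumption: one unsigned long's worth of bits is enough for
	 * the node numbers used here. */
	unsigned long maxnode = 8 * sizeof(unsigned long);

	return syscall(SYS_migrate_pages, 0, maxnode, &old_nodes, &new_nodes);
}

/* Returns 1 if migrating to bad_node fails with EINVAL (the expected
 * behaviour), 0 otherwise, printing what actually happened. */
int invalid_node_rejected(int bad_node)
{
	long ret = try_migrate(0, bad_node);
	int err = errno;

	if (ret == -1 && err == EINVAL)
		return 1;
	printf("migrate_pages to node %d: ret=%ld (%s)\n", bad_node, ret,
	       ret == -1 ? strerror(err) : "no error");
	return 0;
}
```

On a 4-node machine like the one described here, invalid_node_rejected(4) is expected to return 1; on an affected kernel it would instead print ret=0.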
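The root cause is that

nodes_subset(*new, node_states[N_MEMORY])

returns true when new = 0x10, node_states[N_MEMORY] = 0xf and
MAX_NUMNODES = 4: bitmap_subset() masks off every bit at or above
nbits (here bit 4) before comparing, so the out-of-range node is
simply ignored.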
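The corner case can be checked in user space with a copy of the single-word path of bitmap_subset(). BITMAP_LAST_WORD_MASK() below is written the same way as the kernel macro, while subset_single_word() is a hypothetical name for this sketch. The second case also suggests why Xiaojun's run with a memoryless node 3 passed: there the bad node (bit 3) lies inside the first MAX_NUMNODES bits, so it is not masked off.

```c
#include <limits.h>
#include <stdbool.h>

#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)
/* Mask covering the valid bits of the last word of an nbits-long bitmap,
 * written the same way as the kernel's BITMAP_LAST_WORD_MASK(). */
#define BITMAP_LAST_WORD_MASK(nbits) \
	(~0UL >> (-(nbits) & (BITS_PER_LONG - 1)))

/* Single-word bitmap_subset() as it was before the patch: every bit of
 * src1 below nbits must also be set in src2; bits >= nbits are ignored. */
bool subset_single_word(unsigned long src1, unsigned long src2,
			unsigned int nbits)
{
	return !((src1 & ~src2) & BITMAP_LAST_WORD_MASK(nbits));
}

/*
 * subset_single_word(0x10, 0xf, 4)   -> true:  node 4 masked off (the bug)
 * subset_single_word(0x8, 0x7, 4)    -> false: memoryless node 3 rejected
 * subset_single_word(0x100, 0xff, 8) -> true:  same bug with 8 nodes
 */
```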

This is a common issue; I can also reproduce it on x86-64 with certain
configs, e.g. CONFIG_NODES_SHIFT=3 with 8 nodes in the system.

IMO, if nbits=4, then 0x0, 0x10 and 0xFF..F0 should not be a subset of
anything, so the following patch may fix this problem:

From: Yisheng Xie <xieyisheng1@xxxxxxxxxx>
Date: Tue, 17 Oct 2017 20:53:55 +0800
Subject: [PATCH] bitmap: fix corner case of bitmap_subset

As Xiaojun reported, the LTP test migrate_pages01 fails on a system
that has 4 nodes with CONFIG_NODES_SHIFT=2:

migrate_pages01 0 TINFO : test_invalid_nodes
migrate_pages01 14 TFAIL : migrate_pages_common.c:45: unexpected failure - returned value = 0, expected: -1
migrate_pages01 15 TFAIL : migrate_pages_common.c:55: call succeeded unexpectedly

and the root cause is that

nodes_subset(*new, node_states[N_MEMORY])

returns true in cases like new = 0x10, node_states[N_MEMORY] = 0xf and
MAX_NUMNODES = 4.

Fix it by correcting the corner case of bitmap_subset, so that 0x0,
0x10 and 0xFF..F0 are not treated as a subset of any bitmap when the
bitmap length is 4.

Reported-by: Tan Xiaojun <tanxiaojun@xxxxxxxxxx>
Signed-off-by: Yisheng Xie <xieyisheng1@xxxxxxxxxx>
---
include/linux/bitmap.h | 2 ++
1 file changed, 2 insertions(+)

diff --git a/include/linux/bitmap.h b/include/linux/bitmap.h
index 700cf5f..bc66978 100644
--- a/include/linux/bitmap.h
+++ b/include/linux/bitmap.h
@@ -283,6 +283,8 @@ static inline int bitmap_intersects(const unsigned long *src1,
 static inline int bitmap_subset(const unsigned long *src1,
 			const unsigned long *src2, unsigned int nbits)
 {
+	if (!(*src1 & BITMAP_LAST_WORD_MASK(nbits)))
+		return false;
 	if (small_const_nbits(nbits))
 		return ! ((*src1 & ~(*src2)) & BITMAP_LAST_WORD_MASK(nbits));
 	else
--
1.7.12.4
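The effect of the extra check in the patch can be modelled in user space for the single-word case (nbits no larger than the number of bits in an unsigned long), which is the MAX_NUMNODES = 4 case discussed here. subset_patched() is a hypothetical name, and BITMAP_LAST_WORD_MASK() is re-declared to mirror the kernel macro; this is a sketch, not the kernel function itself.

```c
#include <limits.h>
#include <stdbool.h>

#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)
/* Written the same way as the kernel's BITMAP_LAST_WORD_MASK(). */
#define BITMAP_LAST_WORD_MASK(nbits) \
	(~0UL >> (-(nbits) & (BITS_PER_LONG - 1)))

/* Single-word bitmap_subset() with the check added by the patch. */
bool subset_patched(unsigned long src1, unsigned long src2,
		    unsigned int nbits)
{
	/* New check: no bits set below nbits (e.g. 0x0, 0x10, 0xFF..F0
	 * for nbits = 4) means "not a subset". */
	if (!(src1 & BITMAP_LAST_WORD_MASK(nbits)))
		return false;
	/* Unchanged: every bit of src1 below nbits must be set in src2. */
	return !((src1 & ~src2) & BITMAP_LAST_WORD_MASK(nbits));
}
```

With nbits = 4 and src2 = 0xf, the masks 0x10, 0x0 and ~0xfUL are now reported as not a subset, while an in-range mask such as 0x3 still is.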

Thanks
Yisheng Xie

>
> Will
>
>> On 2017/10/16 19:42, Tan Xiaojun wrote:
>>> Hi all,
>>>
>>> I ran LTP on a HiSilicon D05 board and got a failure in the testcase "migrate_pages01".
>>>
>>> Specifically, the sub-testcase "test_invalid_nodes" failed. It finds an invalid NUMA node and tries to migrate memory pages to that node via the "migrate_pages" syscall.
>>> The expected return value is "-1", but it actually returns "0".
>>>
>>> --------------------------------------------------------
>>> # ./migrate_pages01
>>> migrate_pages01 0 TINFO : test_empty_mask
>>> migrate_pages01 1 TPASS : expected ret success: returned value = 0
>>> migrate_pages01 0 TINFO : test_invalid_pid -1
>>> migrate_pages01 2 TPASS : expected ret success: returned value = -1
>>> migrate_pages01 3 TPASS : expected failure: TEST_ERRNO=ESRCH(3): No such process
>>> migrate_pages01 0 TINFO : test_invalid_pid unused pid
>>> migrate_pages01 4 TPASS : expected ret success: returned value = -1
>>> migrate_pages01 5 TPASS : expected failure: TEST_ERRNO=ESRCH(3): No such process
>>> migrate_pages01 0 TINFO : test_invalid_masksize
>>> migrate_pages01 6 TPASS : expected ret success: returned value = -1
>>> migrate_pages01 7 TPASS : expected failure: TEST_ERRNO=EINVAL(22): Invalid argument
>>> migrate_pages01 0 TINFO : test_invalid_mem -1
>>> migrate_pages01 8 TPASS : expected ret success: returned value = -1
>>> migrate_pages01 9 TPASS : expected failure: TEST_ERRNO=EFAULT(14): Bad address
>>> migrate_pages01 0 TINFO : test_invalid_mem invalid prot
>>> migrate_pages01 10 TPASS : expected ret success: returned value = -1
>>> migrate_pages01 11 TPASS : expected failure: TEST_ERRNO=EFAULT(14): Bad address
>>> migrate_pages01 0 TINFO : test_invalid_mem unmmaped
>>> migrate_pages01 12 TPASS : expected ret success: returned value = -1
>>> migrate_pages01 13 TPASS : expected failure: TEST_ERRNO=EFAULT(14): Bad address
>>> migrate_pages01 0 TINFO : test_invalid_nodes
>>> migrate_pages01 14 TFAIL : migrate_pages_common.c:45: unexpected failure - returned value = 0, expected: -1
>>> migrate_pages01 15 TFAIL : migrate_pages_common.c:55: call succeeded unexpectedly
>>> migrate_pages01 0 TINFO : test_invalid_perm
>>> migrate_pages01 16 TPASS : expected ret success: returned value = -1
>>> migrate_pages01 17 TPASS : expected failure: TEST_ERRNO=EPERM(1): Operation not permitted
>>> --------------------------------------------------------
>>>
>>> While debugging, I found something interesting: this case does not always fail.
>>>
>>> 1) If one or more NUMA nodes have no memory, this case passes, as shown below:
>>>
>>> --------------------
>>> available: 4 nodes (0-3)
>>> node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
>>> node 0 size: 65309 MB
>>> node 0 free: 61650 MB
>>> node 1 cpus: 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
>>> node 1 size: 65404 MB
>>> node 1 free: 61377 MB
>>> node 2 cpus: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
>>> node 2 size: 65401 MB
>>> node 2 free: 62316 MB
>>> node 3 cpus: 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
>>> node 3 size: 0 MB
>>> node 3 free: 0 MB
>>> node distances:
>>> node 0 1 2 3
>>> 0: 10 15 20 20
>>> 1: 15 10 20 20
>>> 2: 20 20 10 15
>>> 3: 20 20 15 10
>>> ---------------------
>>>
>>> The testcase picks node 3 and tries to migrate pages to it. The "migrate_pages" syscall returns -1, so the test passes.
>>>
>>> 2) In most cases, all nodes have memory, so the testcase picks a non-existent node such as node 4. The "migrate_pages" syscall should also return -1, but it actually returns 0.
>>> So the testcase fails.
>>>
>>> I think this is a problem in arm64, but I am not familiar with NUMA, so I am asking for your help.
>>>
>>> Thanks.
>>> Xiaojun.
>>>
>>>
>>> .
>>>
>>
>>
>
> .
>