Re: [PATCH v2 1/1] mm: fix interrupt disabled long time inside deferred_init_memmap()

From: Shile Zhang
Date: Tue Mar 03 2020 - 21:34:20 EST


Hi Kirill,

Thanks for your quickly reply!

On 2020/3/4 00:52, Kirill Tkhai wrote:
On 03.03.2020 19:15, Shile Zhang wrote:
When 'CONFIG_DEFERRED_STRUCT_PAGE_INIT' is set, 'pgdatinit' kthread will
initialise the deferred pages with local interrupts disabled. It is
introduced by commit 3a2d7fa8a3d5 ("mm: disable interrupts while
initializing deferred pages").

The local interrupt will be disabled long time inside
deferred_init_memmap(), depends on memory size.
On machine with NCPUS <= 2, the 'pgdatinit' kthread could be pined on
boot CPU, then the tick timer will stuck long time, which caused the
system wall time inaccuracy.

For example, the dmesg shown that:

[ 0.197975] node 0 initialised, 32170688 pages in 1ms

Obviously, 1ms is unreasonable.
Now, fix it by restore in the pending interrupts inside the while loop.
The reasonable demsg shown likes:

[ 1.069306] node 0 initialised, 32203456 pages in 894ms
The way I understand the original problem, that Pavel fixed:

we need disable irqs in deferred_init_memmap() since this function may be called
in parallel with deferred_grow_zone() called from interrupt handler. So, Pavel
added lock to fix the race.

In case of we temporary unlock the lock, interrupt still be possible,
so my previous proposition returns the problem back.

Now thought again, I think we have to just add:

pgdat_resize_unlock();
pgdat_resize_lock();

instead of releasing interrupts, since in case of we just release them with lock held,
a call of interrupt->deferred_grow_zone() bring us to a deadlock.

So, unlock the lock is must.

Yes, you're right! I missed this point.
Thanks for your comment!


Signed-off-by: Shile Zhang <shile.zhang@xxxxxxxxxxxxxxxxx>
---
mm/page_alloc.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3c4eb750a199..d3f337f2e089 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1809,8 +1809,12 @@ static int __init deferred_init_memmap(void *data)
* that we can avoid introducing any issues with the buddy
* allocator.
*/
- while (spfn < epfn)
+ while (spfn < epfn) {
nr_pages += deferred_init_maxorder(&i, zone, &spfn, &epfn);
+ /* let in any pending interrupts */
+ local_irq_restore(flags);
+ local_irq_save(flags);
+ }
zone_empty:
pgdat_resize_unlock(pgdat, &flags);
I think we need here something like below (untested):

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 79e950d76ffc..323afa9a4db5 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1828,7 +1828,7 @@ static int __init deferred_init_memmap(void *data)
{
pg_data_t *pgdat = data;
const struct cpumask *cpumask = cpumask_of_node(pgdat->node_id);
- unsigned long spfn = 0, epfn = 0, nr_pages = 0;
+ unsigned long spfn = 0, epfn = 0, nr_pages = 0, prev_nr_pages = 0;
unsigned long first_init_pfn, flags;
unsigned long start = jiffies;
struct zone *zone;
@@ -1869,8 +1869,18 @@ static int __init deferred_init_memmap(void *data)
* that we can avoid introducing any issues with the buddy
* allocator.
*/
- while (spfn < epfn)
+ while (spfn < epfn) {
nr_pages += deferred_init_maxorder(&i, zone, &spfn, &epfn);
+ /*
+ * Release interrupts every 1Gb to give a possibility
+ * a timer to advance jiffies.
+ */
+ if (nr_pages - prev_nr_pages > (1UL << (30 - PAGE_SHIFT))) {
+ prev_nr_pages = nr_pages;
+ pgdat_resize_unlock(pgdat, &flags);
+ pgdat_resize_lock(pgdat, &flags);
+ }
+ }
zone_empty:
pgdat_resize_unlock(pgdat, &flags);

(I believe the comment may be improved more).

Yeah, your patch is better!
I test your code and it works!
But it seems that 1G is still hold the interrupts too long, about 40ms in my env
with Intel(R) Xeon(R) 2.5GHz). I tried other size, it is OK to use 1024 pages (4MB),
which suggested by Andrew's before.

Could you please help to review it again?

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3c4eb750a199..5def66d3ffcd 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1768,7 +1768,7 @@ static int __init deferred_init_memmap(void *data)
Â{
ÂÂÂÂÂÂÂ pg_data_t *pgdat = data;
ÂÂÂÂÂÂÂ const struct cpumask *cpumask = cpumask_of_node(pgdat->node_id);
-ÂÂÂÂÂÂ unsigned long spfn = 0, epfn = 0, nr_pages = 0;
+ÂÂÂÂÂÂ unsigned long spfn = 0, epfn = 0, nr_pages = 0, prev_nr_pages = 0;
ÂÂÂÂÂÂÂ unsigned long first_init_pfn, flags;
ÂÂÂÂÂÂÂ unsigned long start = jiffies;
ÂÂÂÂÂÂÂ struct zone *zone;
@@ -1809,8 +1809,17 @@ static int __init deferred_init_memmap(void *data)
ÂÂÂÂÂÂÂÂ * that we can avoid introducing any issues with the buddy
ÂÂÂÂÂÂÂÂ * allocator.
ÂÂÂÂÂÂÂÂ */
-ÂÂÂÂÂÂ while (spfn < epfn)
+ÂÂÂÂÂÂ while (spfn < epfn) {
ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ nr_pages += deferred_init_maxorder(&i, zone, &spfn, &epfn);
+ÂÂÂÂÂÂÂÂÂÂÂÂÂÂ /*
+ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ * Restore pending interrupts every 1024 pages to give
+ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ * the chance tick timer to advance jiffies.
+ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ */
+ÂÂÂÂÂÂÂÂÂÂÂÂÂÂ if (nr_pages - prev_nr_pages > 1024) {
+ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ pgdat_resize_unlock(&flags);
+ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ pgdat_resize_lock(&flags);
+ÂÂÂÂÂÂÂÂÂÂÂÂÂÂ }
+ÂÂÂÂÂÂ }
Âzone_empty:
ÂÂÂÂÂÂÂ pgdat_resize_unlock(pgdat, &flags);