Re: [PATCH] mm: be more verbose for alloc_contig_range failures

From: David Hildenbrand
Date: Thu Mar 04 2021 - 12:25:41 EST


You want to debug something, so you try triggering it and capturing debug
data. There are not that many alloc_contig_range() users such that this
would really be an issue to isolate ...

cma_alloc() uses alloc_contig_range(), and cma_alloc() has lots of users.
It is even exported via dmabuf, so any userspace could trigger the
allocation on its own. Some of those users can tolerate the failure; for
the rest it could be critical. We shouldn't assume a limited set of
in-kernel use cases.

Assume you are debugging allocation failures. You either collect the data yourself or ask someone to send you that output. You care about any alloc_contig_range() allocation failures that shouldn't happen, don't you?



Strictly speaking: any allocation failure on ZONE_MOVABLE or CMA is
problematic (putting NORETRY logic and similar aside). So any such
page you hit is worth investigating and, therefore, worth getting logged for
debugging purposes.

If you believe every alloc_contig_range() failure is problematic

Every one where we should have guarantees, I guess: ZONE_MOVABLE or MIGRATE_CMA. On ZONE_NORMAL, there are no guarantees.

and there is no real-world example like the one I mentioned above,
I am happy to change this chunk to support dynamic debugging.
Okay?

+#if defined(CONFIG_DYNAMIC_DEBUG) || \
+	(defined(CONFIG_DYNAMIC_DEBUG_CORE) && defined(DYNAMIC_DEBUG_MODULE))
+static DEFINE_RATELIMIT_STATE(alloc_contig_ratelimit_state,
+		DEFAULT_RATELIMIT_INTERVAL, DEFAULT_RATELIMIT_BURST);
+int alloc_contig_ratelimit(void)
+{
+	return __ratelimit(&alloc_contig_ratelimit_state);
+}
+

^ do we need ratelimiting with dynamic debugging enabled?
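
To make the question concrete: once the dynamic debug site is switched on, the extra __ratelimit() gate would cap the dumps at DEFAULT_RATELIMIT_BURST per DEFAULT_RATELIMIT_INTERVAL. Below is a minimal userspace sketch of that interval/burst behaviour. It is only a model: the struct, the model_* names and the abstract "tick" clock are assumptions made for illustration, not the kernel's struct ratelimit_state or its locking.

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace model only; names and layout are hypothetical. */
struct model_ratelimit {
	long interval;	/* window length, in abstract ticks */
	int burst;	/* events allowed per window */
	long begin;	/* tick at which the current window started */
	int printed;	/* events allowed so far in this window */
	int missed;	/* events suppressed so far in this window */
	bool started;	/* false until the first event opens a window */
};

static void model_ratelimit_init(struct model_ratelimit *rs,
				 long interval, int burst)
{
	assert(interval > 0 && burst > 0);
	rs->interval = interval;
	rs->burst = burst;
	rs->begin = 0;
	rs->printed = 0;
	rs->missed = 0;
	rs->started = false;
}

/* Returns true if the caller may emit output at tick 'now'. */
static bool model_ratelimit(struct model_ratelimit *rs, long now)
{
	/* Open a fresh window on first use or once the interval elapsed. */
	if (!rs->started || now - rs->begin >= rs->interval) {
		rs->begin = now;
		rs->printed = 0;
		rs->missed = 0;
		rs->started = true;
	}

	if (rs->printed < rs->burst) {
		rs->printed++;
		return true;
	}

	rs->missed++;
	return false;
}
```

In this model, a burst of 25 failures inside one window yields 10 dumps when burst is 10, and the remaining 15 are dropped. Whether that cap is wanted on top of an opt-in debug site is exactly the open question.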

+void dump_migrate_failure_pages(struct list_head *page_list)
+{
+	DEFINE_DYNAMIC_DEBUG_METADATA(descriptor,
+			"migrate failure");
+	if (DYNAMIC_DEBUG_BRANCH(descriptor) &&
+			alloc_contig_ratelimit()) {
+		struct page *page;
+
+		WARN(1, "failed callstack");
+		list_for_each_entry(page, page_list, lru)
+			dump_page(page, "migration failure");

Are all pages on the list guaranteed to be problematic, or only the first entry? I assume all.

+	}
+}
+#else
+static inline void dump_migrate_failure_pages(struct list_head *page_list)
+{
+}
+#endif
+
 /* [start, end) must belong to a single zone. */
 static int __alloc_contig_migrate_range(struct compact_control *cc,
 					unsigned long start, unsigned long end)
@@ -8496,6 +8522,7 @@ static int __alloc_contig_migrate_range(struct compact_control *cc,
 				NULL, (unsigned long)&mtc, cc->mode, MR_CONTIG_RANGE);
 	}
 	if (ret < 0) {
+		dump_migrate_failure_pages(&cc->migratepages);
 		putback_movable_pages(&cc->migratepages);
 		return ret;
 	}



If that's the way dynamic debugging is configured/enabled (I still have to look into it), then yes, that goes in the right direction. As I said above, I assume you should dump only where we have some kind of guarantees.
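
A rough sketch of what "dump only where we have guarantees" could mean as a predicate, written as a standalone userspace model. The enums and the model_failure_worth_dumping() helper are hypothetical stand-ins for illustration; an in-kernel version would have to work on the real zone and pageblock migratetype of the range instead.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel's zone and migratetype enums. */
enum model_zone {
	MODEL_ZONE_DMA,
	MODEL_ZONE_NORMAL,
	MODEL_ZONE_MOVABLE,
	MODEL_NR_ZONES
};

enum model_migratetype {
	MODEL_MIGRATE_UNMOVABLE,
	MODEL_MIGRATE_MOVABLE,
	MODEL_MIGRATE_RECLAIMABLE,
	MODEL_MIGRATE_CMA,
	MODEL_NR_MIGRATETYPES
};

/*
 * A migration failure is only unexpected where movability is
 * guaranteed: anywhere in the movable zone, or in a CMA pageblock
 * regardless of zone. Everywhere else, failures can be legitimate.
 */
static bool model_failure_worth_dumping(enum model_zone zone,
					enum model_migratetype mt)
{
	assert(zone < MODEL_NR_ZONES && mt < MODEL_NR_MIGRATETYPES);
	return zone == MODEL_ZONE_MOVABLE || mt == MODEL_MIGRATE_CMA;
}
```

With such a check in front of the dump, a failure on a CMA pageblock in the normal zone would still be logged, while a failure on an ordinary movable pageblock in the normal zone would not.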

--
Thanks,

David / dhildenb