On 2025-05-08 06:45, David Hildenbrand wrote:
> On 07.03.25 21:22, Mathieu Desnoyers wrote:
>> I'm currently perfecting my understanding of the mm code and reviewing
>> pieces of it as I go, and stumbled on this:
>>
>> commit 27cacaad16c5 ("mm,memory_hotplug: drop unneeded locking")
>>
>> This commit removes all users of zone_span_writelock(), thus making
>> the inline useless, but leaves the now-useless
>> zone_span_seqbegin()/zone_span_seqretry() pair in place within
>> page_outside_zone_boundaries().
>>
>> So I'm confused. What's going on?
>>
>> And if this commit got things very wrong when removing the
>> seqlock, I wonder if there are cases where its partial
>> pgdat_resize_lock() removal can be an issue as well.
> I stumbled over that myself recently as well. I think I mentioned in the
> past that we should just store
>
> 	start_pfn + end_pfn
>
> instead of
>
> 	start_pfn + nr_pages
>
> Then, concurrent resizing could happen (and we could atomically read
> start_pfn / end_pfn).
>
> Right now, when adjusting start_pfn, we always also have to adjust
> nr_pages. A concurrent reader calculating end_pfn manually could see
> some crappy result.
>
> Having said that, I am not aware of issues in that area, but it all
> looks like only a partial cleanup to me.
I wonder if all callers of zone_spans_pfn() prevent concurrent modification
of the zone's start_pfn and nr_pages?
For instance, set_pfnblock_flags_mask() (called by set_pageblock_migratetype())
does:
VM_BUG_ON_PAGE(!zone_spans_pfn(page_zone(page), pfn), page);
If we look at zone_spans_pfn():
static inline unsigned long zone_end_pfn(const struct zone *zone)
{
return zone->zone_start_pfn + zone->spanned_pages;
}
static inline bool zone_spans_pfn(const struct zone *zone, unsigned long pfn)
{
return zone->zone_start_pfn <= pfn && pfn < zone_end_pfn(zone);
}
A concurrent updater which shrinks a zone by moving its start would have
to increment zone_start_pfn *and* decrement spanned_pages. If this happens
concurrently with the loads from zone_spans_pfn(), then it can observe
an intermediate state (only nr pages reduced or only zone start moved).
At least the first scenario can lead to a false-positive VM_BUG_ON().
Likewise if the updater expands the zone by moving its start left: if
the observer loads the updated start pfn without observing the nr pages
update, the derived end pfn is too small, which can also lead to a
false-positive VM_BUG_ON().
I notice that zone_intersects() also uses zone_end_pfn(). It is used for
instance by default_kernel_zone_for_pfn() without locks. In this case,
reading nr pages and start pfn concurrently with an update could cause
a false-positive match: for instance, if the updated start of the zone is
loaded together with the prior nr pages value (concurrent shrink), the
zone appears to extend past its real end, so a range entirely outside the
zone can falsely match the function's parameter range.
Another reader of those fields is update_pgdat_span(), which appears to
be called only from remove_pfn_range_from_zone with mem_hotplug_lock
held in write mode. So that one should be fine.
AFAIU, updates to nr pages and zone start pfn are done by:
- remove_pfn_range_from_zone (AFAIU always called with mem_hotplug_lock
in write mode),
- shrink_zone_span (called from remove_pfn_range_from_zone),
- resize_zone_range (__meminit function), called from move_pfn_range_to_zone()
also called with mem_hotplug_lock in write mode.
So I think your idea of keeping track of both zone_start_pfn and zone_end_pfn
would solve this specific issue. However, we'd have to carefully consider what
happens to users of spanned_pages (e.g. zone_is_empty()), because
spanned_pages would then have to be calculated as end - start, which can have
similar atomicity issues against concurrent updates.