Re: [PATCH v4 1/4] mm/pageblock: mitigation cmpxchg false sharing in pageblock flags

From: Mel Gorman
Date: Thu Sep 03 2020 - 05:37:04 EST


On Thu, Sep 03, 2020 at 04:32:54PM +0800, Alex Shi wrote:
>
>
> On 2020/9/3 3:24 PM, Mel Gorman wrote:
> > On Thu, Sep 03, 2020 at 03:01:20PM +0800, Alex Shi wrote:
> >> pageblock_flags is used as long, since every pageblock_flags is just 4
> >> bits, 'long' size will include 8(32bit machine) or 16 pageblocks' flags,
> >> that flag setting has to sync in cmpxchg with 7 or 15 other pageblock
> >> flags. It would cause long waiting for sync.
> >>
> >> If we could change the pageblock_flags variable as char, we could use
> >> char size cmpxchg, which just sync up with 2 pageblock flags. it could
> >> relief the false sharing in cmpxchg.
> >>
> >> Signed-off-by: Alex Shi <alex.shi@xxxxxxxxxxxxxxxxx>
> >
> > Page block types were not known to change at high frequency that would
> > cause a measurable performance drop. If anything, the performance hit
> > from pageblocks is the lookup paths which is a lot more frequent.
>
> Yes, it is not hot path. But it's still a meaningful points to reduce cmpxchg
> level false sharing which isn't right on logical.
>

Except there is no guarantee that false sharing was reduced. cmpxchg is
still used, only now with a byte as the comparison for the old value,
and in some cases the exchange itself will still be 32 bits wide.
It would be up to each architecture to see if that can be translated to a
better instruction, but it may not even matter. As the instruction will
carry the lock prefix, the bus will be locked, and bus locking is probably
done at cache line granularity, so there is a collision anyway while the
atomic update takes place.
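
To illustrate, here is a minimal userspace sketch using C11 atomics. It is
not the kernel's cmpxchg helpers or the patch itself, and the names
set_nibble_word() and set_nibble_byte() are made up for this example. Both
variants run the same read-modify-cmpxchg loop on a 4-bit field; only the
width of the compared value differs.

```c
#include <assert.h>
#include <stdatomic.h>

/*
 * Hypothetical helpers for illustration only (userspace C11, not kernel
 * code). Each sets one 4-bit field and returns how often the CAS retried.
 */

/* Word-wide variant: the compare covers every field packed in the long. */
static unsigned int set_nibble_word(_Atomic unsigned long *word,
				    unsigned int shift, unsigned long val)
{
	unsigned long old = atomic_load(word);
	unsigned long next;
	unsigned int retries = 0;

	assert(shift % 4 == 0 && shift < sizeof(unsigned long) * 8);
	for (;;) {
		next = (old & ~(0xfUL << shift)) | ((val & 0xfUL) << shift);
		/* On failure, 'old' is refreshed with the current value. */
		if (atomic_compare_exchange_weak(word, &old, next))
			return retries;
		retries++;
	}
}

/* Byte-wide variant: the compare covers only the two fields in the byte. */
static unsigned int set_nibble_byte(_Atomic unsigned char *byte,
				    unsigned int shift, unsigned char val)
{
	unsigned char old = atomic_load(byte);
	unsigned char next;
	unsigned int retries = 0;

	assert(shift == 0 || shift == 4);
	for (;;) {
		next = (unsigned char)((old & ~(0xfu << shift)) |
				       ((val & 0xfu) << shift));
		if (atomic_compare_exchange_weak(byte, &old, next))
			return retries;
		retries++;
	}
}
```

The byte variant can only fail its compare when one of the two fields in
that byte changes, so it may retry less often. Either way the atomic still
needs exclusive ownership of the cache line holding the flags, so updates
to neighbouring bytes in the same line can still collide at the hardware
level.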

End result -- reducing false sharing in this case is not guaranteed to help
performance and may not be detectable when it's a low frequency operation
but the code will behave differently depending on the architecture and
CPU family.

Your justification path measured the number of times a cmpxchg was retried
but it did not track how many pageblock updates there were or how many
potentially collided. As the workload is uncontrolled with respect to
pageblock updates, you do not know if the difference in retries is due to
a real reduction in collisions or a difference in the number of pageblock
updates that potentially collide. Either way, because the frequency of
the operation was so low relative to your target load, any difference
in performance would be indistinguishable from noise.
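
As a sketch of why the raw retry count is not enough on its own, the
retries need to be normalised by the number of updates. The struct and
function below are hypothetical, and the figures in the note that follows
are invented for illustration rather than taken from any measurement in
this thread.

```c
/* Hypothetical per-run counters; not something the patch collects. */
struct pb_run {
	unsigned long updates;	/* pageblock flag updates attempted */
	unsigned long retries;	/* cmpxchg retries observed */
};

/* Retries per update; 0.0 when the run made no updates at all. */
static double retry_rate(const struct pb_run *r)
{
	if (r->updates == 0)
		return 0.0;
	return (double)r->retries / (double)r->updates;
}
```

With made-up numbers, a run with 1000 retries over 100000 updates and a run
with 500 retries over 50000 updates have the same rate of 0.01 retries per
update, even though the second run shows half as many retries.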

I don't think it's worth the churn in this case for an impact that will
be very difficult to detect and variable across architectures and CPU
families.

--
Mel Gorman
SUSE Labs