Use PTE batching to optimize __collapse_huge_page_copy_succeeded().
On arm64, suppose khugepaged is scanning a pte-mapped 2MB THP for collapse.
Then, calling ptep_clear() for every pte will cause a TLB flush for every
contpte block. Instead, clear_ptes() does a contpte_try_unfold_partial()
which will flush the TLB only for the (if any) starting and ending contpte
block, if they partially overlap with the range khugepaged is looking at.
For all arches, there should be a benefit due to batching atomic operations
on mapcounts due to folio_remove_rmap_ptes() and saving some calls.
Signed-off-by: Dev Jain <dev.jain@xxxxxxx>