+ /* The above check should imply these. */
+ VM_WARN_ON_ONCE(folio_mapcount(folio) > folio_nr_pages(folio));
+ VM_WARN_ON_ONCE(!PageAnonExclusive(folio_page(folio, 0)));
This can trigger in one nasty case, where we can lose the PAE bit during
swapin (refault from the swapcache while the folio is under writeback, and
the device does not allow for modifying the data while under writeback).
Ugh, I wasn't aware of that. So maybe drop this second one?
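If that second assertion does get dropped, a minimal sketch of what might remain
(illustrative only, not the series' code):

	/* The above check should imply this. */
	VM_WARN_ON_ONCE(folio_mapcount(folio) > folio_nr_pages(folio));
	/*
	 * No PageAnonExclusive() assertion here: the PAE bit can legitimately
	 * be lost on a swapcache refault while the folio is under writeback,
	 * on devices that disallow modifying data under writeback.
	 */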
+
+ /*
+ * A pinned folio implies that it will be used for a duration longer
+ * than that over which the mmap_lock is held, meaning that another part
+ * of the kernel may be making use of this folio.
+ *
+ * Since we are about to manipulate index & mapping fields, we cannot
+ * safely proceed because whatever has pinned this folio may then
+ * incorrectly assume these do not change.
+ */
+ if (folio_maybe_dma_pinned(folio))
+ goto out;
As discussed, this can race with GUP-fast. So *maybe* we can just allow for
moving these.
I'm guessing you mean as discussed below? :P Or in the cover letter I've not
read yet? :P
Yeah, to be honest you shouldn't be fiddling with index, mapping anyway except
via rmap logic.
I will audit accesses of these fields just to be safe.
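To make the race concrete, an illustrative sketch (not the series' code; vma,
addr and the rmap call stand in for whatever the real helper actually does):

	if (folio_maybe_dma_pinned(folio))
		goto out;
	/*
	 * GUP-fast on another CPU holds neither the mmap_lock nor the PTL, so
	 * it can take a pin right here: the check above is only a snapshot.
	 */
	folio_move_anon_rmap(folio, vma);	/* rewrites folio->mapping */
	WRITE_ONCE(folio->index, linear_page_index(vma, addr));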
+
+ state.ptep = ptep_start;
+ for (; !pte_done(&state); pte_next(&state, nr_pages)) {
+ pte_t pte = ptep_get(state.ptep);
+
+ if (pte_none(pte) || !pte_present(pte)) {
+ nr_pages = 1;
What if we have
(a) A migration entry (possibly we might fail migration and simply remap the
original folio)
(b) A swap entry with a folio in the swapcache that we can refault.
I don't think we can simply skip these ...
Good point... will investigate these cases (one way they might be classified
rather than skipped is sketched after this hunk).
+ continue;
+ }
+
+ nr_pages = relocate_anon_pte(pmc, &state, undo);
+ if (!nr_pages) {
+ ret = false;
+ goto out;
+ }
+ }
+
+ ret = true;
+out:
+ pte_unmap_unlock(ptep_start, state.ptl);
+ return ret;
+}
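For the non-present cases raised above, a rough sketch of how they might be
classified in place of the pte_none/!pte_present block (the bail-outs are
placeholders; whether failing is the right response is exactly what needs
investigating):

	if (!pte_present(pte)) {
		swp_entry_t entry;

		nr_pages = 1;
		if (pte_none(pte))
			continue;

		entry = pte_to_swp_entry(pte);
		if (is_migration_entry(entry)) {
			/*
			 * Migration may fail and remap the original anon
			 * folio, so this cannot simply be skipped.
			 */
			ret = false;
			goto out;
		}
		if (!non_swap_entry(entry)) {
			/*
			 * Plain swap entry: the folio may still sit in the
			 * swapcache and be refaulted after the move.
			 */
			ret = false;
			goto out;
		}
		continue;
	}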
+
+static bool __relocate_anon_folios(struct pagetable_move_control *pmc, bool undo)
+{
+ pud_t *pudp;
+ pmd_t *pmdp;
+ unsigned long extent;
+ struct mm_struct *mm = current->mm;
+
+ if (!pmc->len_in)
+ return true;
+
+ for (; !pmc_done(pmc); pmc_next(pmc, extent)) {
+ pmd_t pmd;
+ pud_t pud;
+
+ extent = get_extent(NORMAL_PUD, pmc);
+
+ pudp = get_old_pud(mm, pmc->old_addr);
+ if (!pudp)
+ continue;
+ pud = pudp_get(pudp);
+
+ if (pud_trans_huge(pud) || pud_devmap(pud))
+ return false;
We don't support PUD-sized THP, why do we have to fail here?
This is just to be in line with other 'magical future where we have PUD THP'
stuff in mremap.c.
A later commit that adds huge folio support actually lets us handle these...
+
+ extent = get_extent(NORMAL_PMD, pmc);
+ pmdp = get_old_pmd(mm, pmc->old_addr);
+ if (!pmdp)
+ continue;
+ pmd = pmdp_get(pmdp);
+
+ if (is_swap_pmd(pmd) || pmd_trans_huge(pmd) ||
+ pmd_devmap(pmd))
+ return false;
Okay, this case could likely be handled later (present anon folio or
migration entry; everything else, we can skip).
Hmm, but how? The PMD cannot be traversed in this case?
'Present' migration entry? Migration entries are non-present right? :) Or is it
different at PMD?
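For what it's worth, PMD migration entries are also encoded as non-present,
swap-style PMDs (which is what is_swap_pmd() catches). A hedged sketch of how
the cases might be told apart (illustrative only; is_pmd_migration_entry()
relies on CONFIG_ARCH_ENABLE_THP_MIGRATION):

	if (is_pmd_migration_entry(pmd)) {
		/*
		 * Non-present, swap-style PMD: migration may fail and remap
		 * the original anon folio, so don't just skip it.
		 */
		return false;
	}
	if (pmd_trans_huge(pmd) || pmd_devmap(pmd)) {
		/* Present huge anon folio: left to a later patch. */
		return false;
	}
	if (is_swap_pmd(pmd)) {
		/* Any other swap-like PMD is unexpected for anon. */
		return false;
	}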
pmc.new = new_vma;
+ if (relocate_anon) {
+ lock_new_anon_vma(new_vma);
+ pmc.relocate_locked = new_vma;
+
+ if (!relocate_anon_folios(&pmc, /* undo= */false)) {
+ unsigned long start = new_vma->vm_start;
+ unsigned long size = new_vma->vm_end - start;
+
+ /* Undo if this fails. */
+ relocate_anon_folios(&pmc, /* undo= */true);
You'd assume this cannot fail, but I think it can: imagine concurrent
GUP-fast ...
Well, if we change the racy code to ignore DMA-pinned folios we should be OK,
right?
I really wish we could find a way to not require the fallback.
Yeah, the fallback is horrible but we really do need it. See the page table move
fallback code for nightmares also :)
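If the fallback stays, a minimal sketch of at least making a failed undo
visible (illustrative only, not the series' code):

	if (!relocate_anon_folios(&pmc, /* undo= */false)) {
		/*
		 * Undo on failure. The undo itself can fail (e.g. a folio
		 * picked up a GUP-fast pin between the two passes), so be
		 * loud rather than silently leaving folios pointing at the
		 * wrong anon_vma.
		 */
		if (!relocate_anon_folios(&pmc, /* undo= */true))
			VM_WARN_ON_ONCE(1);
	}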
We could also alternatively:
- Have some kind of anon_vma fragmentation where some folios in range reference
a different anon_vma that we link to the original VMA (quite possibly very
broken though).
- Keep track of folios somehow and separate them from the page table walk (but
  then we risk races)
- Have some way of telling the kernel that such a situation exists via a new
  object that folio->mapping can point to and that the rmap code recognises,
  essentially an 'anon_vma migration entry' which can fail.
I already considered combining this operation with the page table move
operation, but the locking gets horrible, the undo is categorically much
worse, and I'm not sure it's actually workable.