Re: [PATCH v1] sparc/mm: don't unconditionally set HW writable bit when setting PTE dirty on 64bit

From: David Hildenbrand
Date: Tue Jan 31 2023 - 03:53:43 EST


On 12.12.22 14:02, David Hildenbrand wrote:
On sparc64, there is no HW modified bit, therefore, SW tracks via a SW
bit if the PTE is dirty via pte_mkdirty(). However, pte_mkdirty()
currently also unconditionally sets the HW writable bit, which is wrong.

pte_mkdirty() is not supposed to make a PTE actually writable, unless the
SW writable bit (pte_write()) indicates that the PTE is not
write-protected. Fortunately, sparc64 also defines a SW writable bit.

For example, this already turned into a problem in the context of
THP splitting as documented in commit 624a2c94f5b7 ("Partly revert "mm/thp:
carry over dirty bit when thp splits on pmd") and might be an issue during
page migration in mm/migrate.c:remove_migration_pte() as well where we:
if (folio_test_dirty(folio) && is_migration_entry_dirty(entry))
pte = pte_mkdirty(pte);

But more general, anything like:
maybe_mkwrite(pte_mkdirty(pte), vma)
code is broken on sparc64, because it will unconditionally set the HW
writable bit even if the SW writable bit is not set.

Simple reproducer that will result in a writable PTE after ptrace
access, to highlight the problem and as an easy way to verify if it has
been fixed:

--------------------------------------------------------------------------
#include <fcntl.h>
#include <signal.h>
#include <unistd.h>
#include <string.h>
#include <errno.h>
#include <stdlib.h>
#include <sys/mman.h>

static void signal_handler(int sig)
{
if (sig == SIGSEGV)
printf("[PASS] SIGSEGV generated\n");
else
printf("[FAIL] wrong signal generated\n");
exit(0);
}

int main(void)
{
size_t pagesize = getpagesize();
char data = 1;
off_t offs;
int mem_fd;
char *map;
int ret;

mem_fd = open("/proc/self/mem", O_RDWR);
if (mem_fd < 0) {
fprintf(stderr, "open(/proc/self/mem) failed: %d\n", errno);
return 1;
}

map = mmap(NULL, pagesize, PROT_READ, MAP_PRIVATE|MAP_ANON, -1 ,0);
if (map == MAP_FAILED) {
fprintf(stderr, "mmap() failed: %d\n", errno);
return 1;
}

printf("original: %x\n", *map);

/* debug access */
offs = lseek(mem_fd, (uintptr_t) map, SEEK_SET);
ret = write(mem_fd, &data, 1);
if (ret != 1) {
fprintf(stderr, "pwrite(/proc/self/mem) failed with %d: %d\n", ret, errno);
return 1;
}
if (*map != data) {
fprintf(stderr, "pwrite(/proc/self/mem) not visible\n");
return 1;
}

printf("ptrace: %x\n", *map);

/* Install signal handler. */
if (signal(SIGSEGV, signal_handler) == SIG_ERR) {
fprintf(stderr, "signal() failed\n");
return 1;
}

/* Ordinary access. */
*map = 2;

printf("access: %x\n", *map);

printf("[FAIL] SIGSEGV not generated\n");

return 0;
}
--------------------------------------------------------------------------

Without this commit (sun4u in QEMU):
# ./reproducer
original: 0
ptrace: 1
access: 2
[FAIL] SIGSEGV not generated

Let's fix this by setting the HW writable bit only if both, the SW dirty
bit and the SW writable bit are set. This matches, for example, how
s390x handles pte_mkwrite() and pte_mkdirty() -- except, that they have
to clear the _PAGE_PROTECT bit.

We have to move pte_dirty() and pte_dirty() up. The code patching
mechanism and handling constants > 22bit is a bit special on sparc64.

With this commit (sun4u in QEMU):
# ./reproducer
original: 0
ptrace: 1
[PASS] SIGSEGV generated

This handling seems to have been in place forever.

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Cc: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
Cc: "David S. Miller" <davem@xxxxxxxxxxxxx>
Cc: Peter Xu <peterx@xxxxxxxxxx>
Cc: Hev <r@xxxxxx>
Cc: Anatoly Pugachev <matorola@xxxxxxxxx>
Cc: Raghavendra K T <raghavendra.kt@xxxxxxx>
Cc: Thorsten Leemhuis <regressions@xxxxxxxxxxxxx>
Cc: Mike Kravetz <mike.kravetz@xxxxxxxxxx>
Cc: "Kirill A. Shutemov" <kirill.shutemov@xxxxxxxxxxxxxxx>
Cc: Juergen Gross <jgross@xxxxxxxx>
Signed-off-by: David Hildenbrand <david@xxxxxxxxxx>
---

Ping

--
Thanks,

David / dhildenb