mlock'ed pages are paging out to shared mapped files?

From: Bob Walters
Date: Fri Nov 26 2010 - 11:26:06 EST


Please CC directly, I'm not subscribed.

I'm writing to determine whether observed mlock behavior is a bug or the intended implementation. The kernel seems to be paging out modifications made to locked pages of a shared (MAP_SHARED) memory-mapped regular file, on Linux 2.6.35.6-48 (Fedora 14 distro, x86_64). The man pages imply that this might be a bug. I have a repeatable process involving the following test:

// Link with -lcap (the capability calls come from libcap).
#include <sys/capability.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <iostream>

using namespace std;

static const int pages = 64;
static const size_t pagesize = getpagesize();
static char *addr = NULL;

// Make sure, on Linux, that the process has the capability to lock memory.
// See man capabilities(7)
void print_mlock_capability() {
    cap_t caps = cap_get_proc();
    if (caps == NULL) {
        perror("cap_get_proc()");
        exit(1);
    }

    cap_flag_value_t value;
    cap_get_flag(caps, CAP_IPC_LOCK, CAP_PERMITTED, &value);
    cout << "IPC_LOCK Permitted: " << (value == CAP_SET ? "set" : "clear") << endl;

    cap_get_flag(caps, CAP_IPC_LOCK, CAP_EFFECTIVE, &value);
    cout << "IPC_LOCK Effective: " << (value == CAP_SET ? "set" : "clear") << endl;

    cap_free(caps);
}

int main(const int argc, const char* argv[]) {
    print_mlock_capability();

    int fd = open("mincore_testmap", O_RDWR | O_CREAT, 0777);
    assert(fd != -1);

    // Extend the file to cover the mapping by writing one byte at offset 'length'.
    off_t length = pagesize*pages;
    char byte=0;
    ssize_t written = pwrite(fd, &byte, sizeof(char), length);
    assert(written != -1);

    int prot = PROT_READ|PROT_WRITE;
    //int flags = MAP_PRIVATE | MAP_NOCACHE;
    int flags = MAP_SHARED;
    //int flags = MAP_SHARED | MAP_NORESERVE;
    addr = (char*)mmap(0, length, prot, flags, fd, 0);
    // mmap reports failure with MAP_FAILED, not a null pointer.
    assert(addr != MAP_FAILED);

    cout << "Mapping at: " << (void*)addr << endl;
    // Zero the whole mapping and flush it so the file starts from a known state.
    memset(addr,0,pagesize*pages);
    msync(addr,length,MS_SYNC);

    cout << "Locking page 0 in memory" << endl;
    // lock page 0 into memory
    int rc = mlock(addr, pagesize);
    if (rc != 0) {
        perror("mlock");
        exit(1);
    }

    // modify page 0 and 1. 1 should page out, but not 0.
    memset(addr, 127, pagesize*2);

    cout << "Page 0 and 1 modified, but not synced. Waiting...." << endl;
    sleep( 300 );
}

Output:
IPC_LOCK Permitted: set
IPC_LOCK Effective: set
Mapping at: 0x7fb3b0db7000
Page 0 and 1 modified, but not synced. Waiting....

After allowing the process to enter the blocked state (sleep), I wait about 30-60 seconds, then do a sudden power-off by pulling the plug, not allowing the OS any opportunity to react. I reboot, and can confirm that after reboot, the modifications made to page 0 are seen on disk (via hexdump of the file mincore_testmap). More specifically: on occasion when doing this, I reboot and see neither page 0 nor page 1 on disk. That's observed if I kill the power quickly after the process reaches the sleep state. However, after a longer wait (10 seconds+), I always see 127s for both pages 0 and 1 on disk, never just page 1. I'm using a journaled ext4 filesystem. I have checked the file times after recovery to rule out the chance that I am seeing a copy from a previous run of this test.

I tried double-checking my permission to lock memory, and ended up running as root. capabilities(7) shows I have IPC_LOCK permissions. I set ulimit -l unlimited. The mlock return code is 0. I am running SELinux, and changed its configuration to permissive mode to try to rule it out. I saw no SELinux errors. I can't identify any reason why I would fail to lock the pages into memory.

Ultimately, I was unable to prevent mlock'ed pages from being paged out. I don't know if this occurs with swap as well as mapped files. The problem might be specific to the case of a MAP_SHARED regular file, but I don't know.

Is this a bug, or is this the intended behavior? Many resources (man/web) imply that mlock keeps a locked page from paging out to disk (swap specifically). The POSIX standard seems to imply that mlock must simply prevent eviction, but is not required to avoid paging out. It is only required to keep the pages memory resident until unlocked. It would be really nice, for my purposes, if there was some way to guarantee that modified pages of a mapped file do not page out prior to some point in time. As a side question: is there another mechanism which can ensure that?

Best Regards,
Bob



