Re: [PATCH 0/2] [RFC] Volatile ranges (v4)

From: Dmitry Adamushko
Date: Sat Jul 21 2012 - 07:16:57 EST

[ cc: lkml ]

>> > There is a property of shadow memory that I would like to exploit
>> > - any region of shadow memory can be reset to zero at any point
>> > w/o any bad consequences (it can lead to missed data
>> > races, but it's better than OOM kill).
>> > I've tried to execute madvise(MADV_DONTNEED) every X
>> > seconds for whole shadow memory. It almost works.
>> > The problem is that madvise() seems to be
>> > not atomic, occasionally I see missed writes, that's not acceptable,
>> Just to be sure, you mean that if you do, say
>> *ptr = 1; // [1]
>> ...
>> value = *ptr; // [2] value is 1 here
>> ...
>> *ptr = 2; // possibly from another thread, but we can be certain that
>> it's after [1], perhaps because we checked the content with [2]
>> ...
>> // madvise(..., MADV_DONTNEED); _might_ have been called
>> ...
>> value = *ptr;
>> so here we expect 'value' to be either 2 or 0 (zero page iff madvise()
>> did take place), but you get '1' occasionally?
>> Is that what you mean or something else?
> Yes, that's what I mean.
> Basically I observed inconsistent state of memory that must never happen. I
> had no other explanations except that the madvise() call works as a time
> machine. I executed madvise() every 3 seconds, and the inconsistencies
> happened exactly at the same times. When I turned off madvise(), the
> problems disappear.

Did you try disabling swapping? The "time machine" (if it's not a
problem somewhere else) should take old stuff from somewhere.

mmap's man indicates "zero-fill-on-demand pages for mappings without
an underlying file" and the comment in madvise_dontneed() says "Be
sure to free swap resources too", but then looking at the code in
zap_pte_ranges(), there are a few corner cases where swap entries seem
to be left over intact. In any case, given that you can reproduce it
easily, it'll be a quick check.

Also, it's a MAP_PRIVATE mapping, isn't it?

> I am not sure whether it's a bug or not, because the man says "For the time
> being, the application is finished with the given range", and we are
> obviously do not since we have concurrent accesses. However I would be great
> if it is "fixed".
>> > I need either zero pages or
>> > consistent pages.
>> > Your FADV_VOLATILE looks like it may be the solution.
>> > To summarize: I have a huge region of memory that
>> > I would like to mark as "volatile" at program
>> > startup. The region is anonymous (not backed by any file).
>> > The kernel must be able to take away
>> > any pages in the range, and then return zero pages later.
>> I guess that for the use-cases that people have considered so far,
>> users are supposed to mark regions NONVOLATILE before accessing them
>> again. If I understand correctly, that's not what you want to do...
> No, it's not what I want to do. I can't do any tracking during accesses.
> Ideally I just mark the range during startup, it's also possible to do some
> work on a periodic basis.
>> does it mean that your 'transactions' are always write-a-single-word?
>> i.e. you don't need to make sure that, say, in
>> ptr[0] = val_a;
>> ptr[1] = val_b;
>> ... // no accesses to ptr [0] and [1]
>> c = ptr[0];
>> d = ptr[1];
>> either c == val_a and d == val_b or both are 0?
> Exactly. Any transaction first issues N independent 8-byte atomic reads, and
> then optionally 1 atomic 8-byte write. Value of 0 especially means "no data
> here", because, well, I do not want to setup 40TB of memory to some other
> value :)
>> Also, the current implementation of volatile-ranges will try to 'zap'
>> the whole volatile range... not just some of its pages. Perhaps, it's
>> not something you need too. Of course, this can be addressed one way
>> or another.
> Well, yes, it's not ideal (but we had lived with MADV_DONTNEED for the whole
> range for some time). Ideally, kernel just takes pages away as it needs
> them.
>> Basically, in your specific case, the pages of your region should be,
>> kind of, swapped out to /dev/zero :-)) meaning that once the system
>> decides to swap such a page out, no actual swap is required and, once
>> the area is being accessed again, a zero-page is mapped into it.
> Yes, I believe that's how MADV_DONTNEED currently works (... or
> zero-fill-on-demand pages for mappings without an underlying file).
> Also the pages must be "swapped out" with higher prio.
> I think what I want is somewhat similar to page cache. Kind of best effort
> LRU caching.

-- Dmitry
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at
Please read the FAQ at