Re: [PATCH v3 01/17] mm: support madvise(MADV_FREE)

From: Andy Lutomirski
Date: Fri Nov 13 2015 - 14:46:38 EST


On Fri, Nov 13, 2015 at 12:13 AM, Daniel Micay <danielmicay@xxxxxxxxx> wrote:
> On 13/11/15 02:03 AM, Minchan Kim wrote:
>> On Fri, Nov 13, 2015 at 01:45:52AM -0500, Daniel Micay wrote:
>>>> And now I am thinking if we use access bit, we could implement MADV_FREE_UNDO
>>>> easily when we need it. Maybe, that's what you want. Right?
>>>
>>> Yes, but why the access bit instead of the dirty bit for that? It could
>>> always be made more strict (i.e. access bit) in the future, while going
>>> the other way won't be possible. So I think the dirty bit is really the
>>> more conservative choice since if it turns out to be a mistake it can be
>>> fixed without a backwards incompatible change.
>>
>> Absolutely true. That's why I insist on dirty bit until now although
>> I didn't tell the reason. But I thought you wanted to change for using
>> access bit for the future, too. It seems MADV_FREE start to bloat
>> over and over again before knowing real problems and usecases.
>> It's almost same situation with volatile ranges so I really want to
>> stop at proper point which maintainer should decide, I hope.
>> Without it, we will make the feature a lot heavy by just brain storming
>> and then causes lots of churn in MM code without real benefit.
>> It would be very painful for us.
>
> Well, I don't think you need more than a good API and an implementation
> with no known bugs, kernel security concerns or backwards compatibility
> issues. Configuration and API extensions are something for later (i.e.
> land a baseline, then submit stuff like sysctl tunables). Just my take
> on it though...
>

As long as it's anonymous MAP_PRIVATE only, then the security aspects
should be okay. MADV_DONTNEED seems to work on pretty much any VMA,
and there's been a long history of interesting bugs there.

As for dirty vs accessed, an argument in favor of going straight to
accessed is that it means that users can write code like this without
worrying about whether they have a kernel that uses the dirty bit:

    x = mmap(...);
    *x = 1;  /* mark it present */

    /* i'm done with it */
    *x = 1;
    madvise(x, ..., MADV_FREE);  /* madvise() takes (addr, length, advice) */

    /* wait a while */

    /* is it still there? */
    if (*x == 1) {
        /* use whatever was cached there */
    } else {
        /* reinitialize it */
        *x = 1;
    }

With the dirty bit, this will look like it works, but on occasion
users will lose the race where they probe *x to see if the data was
lost and then the data gets lost before the next write comes in.

Sure, that load from *x could be changed to RMW or users could do a
dummy write (e.g. x[1] = 1; if (*x == 1) ...), but people might forget
to do that, and the caching implications are a little bit worse.
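
For concreteness, the dummy-write variant might look roughly like the
sketch below. This is only an illustration under dirty-bit semantics:
the helper names (cache_create, cache_release, cache_reacquire) and the
MARKER/SCRATCH layout are made up for this example, and it assumes the
headers define MADV_FREE (otherwise the release step is a no-op). The
dummy write stores a constant to a separate slot, so it never copies a
possibly stale value back into the page.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical layout: word 0 says "contents are valid",
 * word 1 is a scratch slot that exists only to be written. */
enum { MARKER = 0, SCRATCH = 1 };

/* Map an anonymous MAP_PRIVATE region and mark it valid.
 * Returns NULL if mmap() fails. */
static volatile int *cache_create(size_t len)
{
    assert(len >= 2 * sizeof(int));
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;
    volatile int *x = p;
    x[MARKER] = 1;
    return x;
}

/* Tell the kernel it may discard the contents. The return value is
 * passed through; a kernel without MADV_FREE would reject the call. */
static int cache_release(volatile int *x, size_t len)
{
#ifdef MADV_FREE
    return madvise((void *)x, len, MADV_FREE);
#else
    (void)x;
    (void)len;
    return 0; /* no MADV_FREE in the headers: contents are simply kept */
#endif
}

/* Returns 1 if the old contents survived, 0 if they were lost and the
 * marker had to be reinitialized. */
static int cache_reacquire(volatile int *x)
{
    /* Dummy write first: it dirties the page before the probe, and
     * volatile keeps the compiler from dropping or moving the store. */
    x[SCRATCH] = 1;
    if (x[MARKER] == 1)
        return 1;
    x[MARKER] = 1;
    return 0;
}
```

A caller would run cache_release() when done with the buffer and
cache_reacquire() before touching it again, rebuilding the contents
whenever the latter returns 0.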

Note that switching to RMW is really, really dangerous. Doing:

    *x &= 1;
    if (*x == 1) ...;

is safe on x86 if the compiler generates:

    andl $1, (%[x]);
    cmpl $1, (%[x]);

but is unsafe if the compiler generates:

    movl (%[x]), %eax;
    andl $1, %eax;
    movl %eax, (%[x]);
    cmpl $1, %eax;

and even worse if the write is omitted when "provably" unnecessary.

OTOH, if switching to the accessed bit is too much of a mess, then
using the dirty bit at first isn't so bad.

--Andy

--
Andy Lutomirski
AMA Capital Management, LLC