Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged

From: Dave Chinner
Date: Wed Jan 09 2019 - 20:15:42 EST


On Wed, Jan 09, 2019 at 11:08:57AM +0100, Jiri Kosina wrote:
> On Wed, 9 Jan 2019, Dave Chinner wrote:
>
> > FWIW, I just realised that the easiest, most reliable way to invalidate
> > the page cache over a file range is simply to do an O_DIRECT read on it.
>
> Neat, good catch indeed. Still, it's only the invalidation part, but the
> residency check is the crucial one.
>
> > > Rationale has been provided by Daniel Gruss in this thread -- if the
> > > attacker is left with cache timing as the only available vector, he's
> > > going to be much more successful with mounting a hardware cache timing
> > > attack anyway.
> >
> > No, he said:
> >
> > "Restricting mincore() is sufficient to fix the hardware-agnostic
> > part."
> >
> > That's not correct - preadv2(RWF_NOWAIT) is also hardware agnostic and
> > provides exactly the same information about the page cache as mincore.
>
> Yeah, preadv2(RWF_NOWAIT) is in the same territory as mincore(), it has
> "just" been overlooked. I can't speak for Daniel, but I believe he might
> be ok with rephrasing the above as "Restricting mincore() and RWF_NOWAIT
> is sufficient ...".

Good luck with restricting RWF_NOWAIT. I eagerly await all the
fstests that exercise both the existing and new behaviours to
demonstrate they work correctly.
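
For reference, a minimal userspace sketch of the kind of residency probe
RWF_NOWAIT allows. It assumes Linux >= 4.14 and Python >= 3.7 (os.preadv
and os.RWF_NOWAIT); probe_cached() is a hypothetical helper name used
only for illustration, and filesystems without NOWAIT support will just
report "unsupported":

```python
import os


def probe_cached(path, offset=0, length=4096):
    """Probe page cache residency without triggering I/O.

    Returns True if at least the start of the range was served from the
    page cache, False if nothing could be returned without blocking (or
    the offset is at/after EOF), and None if the probe is unsupported.
    """
    if not hasattr(os, "preadv") or not hasattr(os, "RWF_NOWAIT"):
        return None  # platform lacks preadv2()/RWF_NOWAIT
    fd = os.open(path, os.O_RDONLY)
    try:
        buf = bytearray(length)
        try:
            # RWF_NOWAIT: fail with EAGAIN instead of waiting for disk I/O.
            n = os.preadv(fd, [buf], offset, os.RWF_NOWAIT)
        except BlockingIOError:
            return False  # EAGAIN: data not resident
        except OSError:
            return None  # e.g. EOPNOTSUPP or EINVAL on this fs/kernel
        return n > 0
    finally:
        os.close(fd)
```

Only read permission on the file is needed to call this, which is the
same access an ordinary read() already requires.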

> > Timed read/mmap access loops for cache observation are also hardware
> > agnostic, and on fast SSD based storage will have only marginally lower
> > bandwidth than preadv2(RWF_NOWAIT).
> >
> > Attackers will pick whatever leak vector we don't fix, so we either fix
> > them all (which I think is probably impossible without removing caching
> > altogether)
>
> We can't really fix the fact that it's possible to do the timing on the HW
> caches though.

We can't really fix the fact that it's possible to do the timing on
the page cache, either.
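
A rough sketch of a single timed read, assuming Python >= 3.7 on a
Unix-like system; timed_read_ns() is a hypothetical name, and the
threshold separating "cached" from "not cached" is deliberately left
out because it has to be calibrated per device:

```python
import os
import time


def timed_read_ns(path, offset=0, length=4096):
    """Time one pread() of `length` bytes at `offset`.

    Returns (elapsed_ns, bytes_read). A cached range typically completes
    far faster than one that has to go to storage. Note that the read
    itself pulls the data into the page cache, so each observation
    changes the state it measures.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        start = time.perf_counter_ns()
        data = os.pread(fd, length, offset)
        elapsed = time.perf_counter_ns() - start
    finally:
        os.close(fd)
    return elapsed, len(data)
```

Nothing here uses an interface that could be made privileged: it is a
plain read plus a clock.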

> > or we start thinking about how we need to isolate the page cache so that
> > information isn't shared across important security boundaries (e.g. page
> > cache contents are per-mount namespace).
>
> Umm, sorry for being dense, but how would that help that particular attack
> scenario on a system that doesn't really employ any namespacing?

What's your security boundary?

The "detect what code an app is running" exploit is based on
invalidating and then observing how shared, non-user-owned files
mapped with execute privileges change cache residency.

If the security boundary is within the local container, should users
inside that container be allowed to invalidate the cache of
executable files and libraries they don't own? In this case, we
can't stop observation, because that only requires read permissions
and high precision timing, hence the only thing that can be done
here is prevent non-owners from invalidating the page cache.

If the security boundary is a namespace or guest VM, then permission
checks don't work - the user may own the file within that container.
The problem now is that the page cache is observable and
controllable from both sides of the fence. Hence the only way to
prevent observation of the code being run in a different namespace
is to prevent the page being shared across both containers.

The exfiltration exploit requires the page cache to be observable
and controllable on both sides of the security boundary. Should
users be able to observe and control the cached pages accessed by a
different container? KSM page deduplication lessons say no. This is
an even harder problem, because page cache residency can be observed
from remote machines....

What scares me is that new features being proposed could make our
exposure a whole lot worse. e.g. the recent virtio-pmem ("fake-dax")
proposal will directly share host page cache pages into guest VMs w/
DAX capability. i.e. the guest directly accesses the host page
cache. This opens up the potential for host page cache timing
attacks from the guest VMs, and potential guest to guest
observation/exploitation is possible if the same files are mapped
into multiple guests....

IOWs, the two questions here are simply: "What's your security
boundary?" and "Is the page cache visible and controllable on both
sides?".

Cheers,

Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx