Re: [PATCH v2] fs: fsnotify: account fsnotify metadata to kmemcg

From: Amir Goldstein
Date: Mon Jan 22 2018 - 15:31:27 EST


On Fri, Jan 19, 2018 at 5:02 PM, Shakeel Butt <shakeelb@xxxxxxxxxx> wrote:
> On Wed, Nov 15, 2017 at 1:31 AM, Jan Kara <jack@xxxxxxx> wrote:
>> On Wed 15-11-17 01:32:16, Yang Shi wrote:
>>>
>>>
>>> On 11/14/17 1:39 AM, Michal Hocko wrote:
>>> >On Tue 14-11-17 03:10:22, Yang Shi wrote:
>>> >>
>>> >>
>>> >>On 11/9/17 5:54 AM, Michal Hocko wrote:
>>> >>>[Sorry for the late reply]
>>> >>>
>>> >>>On Tue 31-10-17 11:12:38, Jan Kara wrote:
>>> >>>>On Tue 31-10-17 00:39:58, Yang Shi wrote:
>>> >>>[...]
>>> >>>>>I do agree it is not fair and not neat to account to producer rather than
>>> >>>>>misbehaving consumer, but current memcg design looks not support such use
>>> >>>>>case. And, the other question is do we know who is the listener if it
>>> >>>>>doesn't read the events?
>>> >>>>
>>> >>>>So you never know who will read from the notification file descriptor but
>>> >>>>you can simply account that to the process that created the notification
>>> >>>>group and that is IMO the right process to account to.
>>> >>>
>>> >>>Yes, if the creator is de-facto owner which defines the lifetime of
>>> >>>those objects then this should be a target of the charge.
>>> >>>
>>> >>>>I agree that current SLAB memcg accounting does not allow to account to a
>>> >>>>different memcg than the one of the running process. However I *think* it
>>> >>>>should be possible to add such interface. Michal?
>>> >>>
>>> >>>We do have memcg_kmem_charge_memcg but that would require some plumbing
>>> >>>to hook it into the specific allocation path. I suspect it uses kmalloc,
>>> >>>right?
>>> >>
>>> >>Yes.
>>> >>
>>> >>I took a look at the implementation and the callsites of
>>> >>memcg_kmem_charge_memcg(). It looks it is called by:
>>> >>
>>> >>* charge kmem to memcg, but it is charged to the allocator's memcg
>>> >>* allocate new slab page, charge to memcg_params.memcg
>>> >>
>>> >>I think this is the plumbing you mentioned, right?
>>> >
>>> >Maybe I have misunderstood, but you are using slab allocator. So you
>>> >would need to force it to use a different charging context than current.
>>>
>>> Yes.
>>>
>>> >I haven't checked deeply but this doesn't look trivial to me.
>>>
>>> I agree. This is also what I explained to Jan and Amir in earlier
>>> discussion.
>>
>> And I also agree. But the fact that it is not trivial does not mean that it
>> should not be done...
>>
>
> I am currently working on directed or remote memcg charging for a
> different usecase and I think that would be helpful here as well.
>
> I have two questions though:
>
> 1) Is fsnotify_group the right structure to hold the reference to
> target mem_cgroup for charging?

I think it is. The process who set up the group and determined the unlimited
events queue size and did not consume the events from the queue in a timely
manner is the process to blame for the OOM situation.

> 2) Remote charging can trigger an OOM in the target memcg. In this
> usecase, I think, there should be security concerns if the events
> producer can trigger OOM in the memcg of the monitor. We can either
> change these allocations to use __GFP_NORETRY or some new gfp flag to
> not trigger oom-killer. So, is this valid concern or am I
> over-thinking?
>

I think that is a very valid concern, but not sure about the solution.
For an inotify listener and fanotify listener of class FAN_CLASS_NOTIF
(group->priority == 0), I think it makes sense to let oom-killer kill
the listener which is not keeping up with consuming events.

For fanotify listener of class FAN_CLASS_{,PRE_}CONTENT
(group->priority > 0) allowing an adversary to trigger oom-killer
and kill the listener could bypass permission event checks.

So we could use different allocation flags for permission events
or for groups with high priority or for groups that have permission
events in their mask, so an adversary could not use non-permission
events to oom-kill a listener that is also monitoring permission events.

Generally speaking, permission event monitors should not be
setting FAN_UNLIMITED_QUEUE and should not be causing oom
(because permission events are not queued they are blocking), but
there is nothing preventing a process to setup a FAN_CLASS_CONTENT
group with FAN_UNLIMITED_QUEUE and also monitor non-permission
events.

There is also nothing preventing a process from setting up one
FAN_CLASS_CONTENT listener for permission events and another
FAN_CLASS_NOTIF listener for non-permission event.
Maybe it is not wise, but we don't know that there are no such processes
out there.

I know I am throwing a lot of confusing information, but here is an argument
that might be simple enough:

If a process has setup a group with permission event mask,
then the administrator has granted this process the permission to
bring the system to a halt even if the process misbehaves
(because better halt the system than let it be compromised).
If that interpretation is acceptable then oom-killing a process
which has setup a group with permission event mask is not
acceptable.

Thanks,
Amir.