Re: [RFC PATCH] fs: fsnotify: account fsnotify metadata to kmemcg

From: Amir Goldstein
Date: Sun Oct 22 2017 - 04:24:31 EST


On Sat, Oct 21, 2017 at 12:07 AM, Yang Shi <yang.s@xxxxxxxxxxxxxxx> wrote:
>
>
> On 10/19/17 8:14 PM, Amir Goldstein wrote:
>>
>> On Fri, Oct 20, 2017 at 12:20 AM, Yang Shi <yang.s@xxxxxxxxxxxxxxx> wrote:
>>>
>>> We observed that some misbehaved user applications might consume a significant
>>> amount of fsnotify slabs silently. It'd be better to account those slabs in
>>> kmemcg so that we can get a heads-up before misbehaved applications use too
>>> much memory silently.
>>
>>
>> In what way do they misbehave? create a lot of marks? create a lot of
>> events?
>> Not reading events in their queue?
>
>
> It looks like both a lot of marks and events. I'm not sure if it is the latter
> case. If I knew more about the details of the behavior, I would have elaborated
> more in the commit log.

If you are not sure, do not refer to the user application as "misbehaved".
Is updatedb(8) a misbehaved application because it produces a lot of access
events?
It would be better to provide the dry facts of your setup and slab counters,
and to say that you are missing the information needed to analyse the
distribution of slab usage because of missing kmemcg accounting.


>
>> The latter case is more interesting:
>>
>> Process A is the one that asked to get the events.
>> Process B is the one that is generating the events and queuing them on
>> the queue that is owned by process A, who is also to blame if the queue
>> is not being read.
>
>
> I agree it is not fair to account the memory to the generator. But, afaik,
> accounting to a non-current memcg is not how memcg is designed to work. Please
> see below for some details.
>
>>
>> So why should process B be held accountable for memory pressure
>> caused by, say, an FAN_UNLIMITED_QUEUE queue that process A created and
>> doesn't read from?
>>
>> Is it possible to get an explicit reference to the memcg's events cache
>> at fsnotify_group creation time, store it in the group struct and then
>> allocate
>> events from the event cache associated with the group (the listener)
>> rather
>> than the cache associated with the task generating the event?
>
>
> I don't think the current memcg design can do this. Kmem accounting happens at
> allocation stage (when calling kmem_cache_alloc) and gets the associated memcg
> from the current task, so basically whoever does the allocation gets it
> accounted. If the producer is in a different memcg from the consumer, the
> allocation is just accounted to the producer's memcg, even though the problem
> might be caused by the consumer.
>
> However, afaik, both producer and consumer are typically in the same memcg.
> So, this might not be a big issue. But, I do admit such unfair accounting
> may happen.
>

That is a reasonable argument, but please note that fact in the commit
message and in a comment above the creation of the events cache, so that it
is clear that event slab accounting is mostly a heuristic.
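For illustration only (this is a sketch, not the actual patch under review;
cache and struct names follow the 2017-era fsnotify code), such accounting
presumably boils down to creating the event cache with SLAB_ACCOUNT, which
charges each allocation to the allocating task's memcg:

```c
/*
 * Sketch: with SLAB_ACCOUNT, each event allocation is charged to the
 * memcg of the task doing the allocation - usually the event producer,
 * not the listener - hence "mostly a heuristic".
 */
fanotify_event_cachep = kmem_cache_create("fanotify_event",
					  sizeof(struct fanotify_event_info),
					  0, SLAB_ACCOUNT, NULL);
```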

But I think there is another problem, not introduced by your change, but one
that could be amplified by it: when a non-permission event allocation fails,
the event is silently dropped, AFAICT, with no indication to the listener.
That seems like a bug to me, because there is a perfectly safe way to deal
with event allocation failure: queue the overflow event.

I am not going to be the one to determine if fixing this alleged bug is a
prerequisite for merging your patch, but I think enforcing memory limits on
event allocation could amplify that bug, so it should be fixed.

The upside is that with both your accounting fix and the ENOMEM-means-overflow
fix, it is going to be easy to write a test that verifies both of them:
- Run a listener in a memcg with limited kmem and an unlimited (or very
large) event queue
- Produce events inside the memcg without the listener reading them
- Read events and expect an OVERFLOW event

This is a simple variant of LTP tests inotify05 and fanotify05.

I realize that this is a user-visible behavior change and that the
documentation implies that an OVERFLOW event is not expected when using
FAN_UNLIMITED_QUEUE, but IMO no one will come shouting
if we stop silently dropping events, so it is better to fix this and update
the documentation.

Attached is a compile-tested patch that implements overflow on ENOMEM.
Hope it helps to test your patch; then we can merge both, accompanied
by LTP tests for inotify and fanotify.

Amir.
From 112ecd54045f14aff2c42622fabb4ffab9f0d8ff Mon Sep 17 00:00:00 2001
From: Amir Goldstein <amir73il@xxxxxxxxx>
Date: Sun, 22 Oct 2017 11:13:10 +0300
Subject: [PATCH] fsnotify: queue an overflow event on failure to allocate
event

In low memory situations, non-permission events are silently dropped.
It is better to queue an OVERFLOW event in that case to let the listener
know about the lost event.

With this change, an application can now get an FAN_Q_OVERFLOW event,
even if it used flag FAN_UNLIMITED_QUEUE on fanotify_init().

Signed-off-by: Amir Goldstein <amir73il@xxxxxxxxx>
---
 fs/notify/fanotify/fanotify.c        | 10 ++++++++--
 fs/notify/inotify/inotify_fsnotify.c |  8 ++++++--
 fs/notify/notification.c             |  3 ++-
 3 files changed, 16 insertions(+), 5 deletions(-)

diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c
index 2fa99aeaa095..412a32838f58 100644
--- a/fs/notify/fanotify/fanotify.c
+++ b/fs/notify/fanotify/fanotify.c
@@ -212,8 +212,14 @@ static int fanotify_handle_event(struct fsnotify_group *group,
 		 mask);
 
 	event = fanotify_alloc_event(inode, mask, data);
-	if (unlikely(!event))
-		return -ENOMEM;
+	if (unlikely(!event)) {
+		if (mask & FAN_ALL_PERM_EVENTS)
+			return -ENOMEM;
+
+		/* Queue an overflow event on failure to allocate event */
+		fsnotify_add_event(group, group->overflow_event, NULL);
+		return 0;
+	}
 
 	fsn_event = &event->fse;
 	ret = fsnotify_add_event(group, fsn_event, fanotify_merge);
diff --git a/fs/notify/inotify/inotify_fsnotify.c b/fs/notify/inotify/inotify_fsnotify.c
index 8b73332735ba..d1837da2ef15 100644
--- a/fs/notify/inotify/inotify_fsnotify.c
+++ b/fs/notify/inotify/inotify_fsnotify.c
@@ -99,8 +99,11 @@ int inotify_handle_event(struct fsnotify_group *group,
 			  fsn_mark);
 
 	event = kmalloc(alloc_len, GFP_KERNEL);
-	if (unlikely(!event))
-		return -ENOMEM;
+	if (unlikely(!event)) {
+		/* Queue an overflow event on failure to allocate event */
+		fsnotify_add_event(group, group->overflow_event, NULL);
+		goto oneshot;
+	}
 
 	fsn_event = &event->fse;
 	fsnotify_init_event(fsn_event, inode, mask);
@@ -116,6 +119,7 @@ int inotify_handle_event(struct fsnotify_group *group,
 		fsnotify_destroy_event(group, fsn_event);
 	}
 
+oneshot:
 	if (inode_mark->mask & IN_ONESHOT)
 		fsnotify_destroy_mark(inode_mark, group);

diff --git a/fs/notify/notification.c b/fs/notify/notification.c
index 66f85c651c52..5abd69976a47 100644
--- a/fs/notify/notification.c
+++ b/fs/notify/notification.c
@@ -111,7 +111,8 @@ int fsnotify_add_event(struct fsnotify_group *group,
 		return 2;
 	}
 
-	if (group->q_len >= group->max_events) {
+	if (group->q_len >= group->max_events ||
+	    event == group->overflow_event) {
 		ret = 2;
 		/* Queue overflow event only if it isn't already queued */
 		if (!list_empty(&group->overflow_event->list)) {
--
2.7.4