Re: [PATCH v2 0/3] Implement /proc/<pid>/totmaps

From: Michal Hocko
Date: Fri Aug 19 2016 - 05:01:10 EST


On Fri 19-08-16 11:26:34, Minchan Kim wrote:
> Hi Michal,
>
> On Thu, Aug 18, 2016 at 08:01:04PM +0200, Michal Hocko wrote:
> > On Thu 18-08-16 10:47:57, Sonny Rao wrote:
> > > On Thu, Aug 18, 2016 at 12:44 AM, Michal Hocko <mhocko@xxxxxxxxxx> wrote:
> > > > On Wed 17-08-16 11:57:56, Sonny Rao wrote:
> > [...]
> > > >> 2) User-space OOM handling -- we'd rather do a more graceful shutdown
> > > >> than let the kernel's OOM killer activate. We need to gather this
> > > >> information, and we'd like to be able to get it much faster than
> > > >> 400ms in order to make the decision.
> > > >
> > > > Global OOM handling in userspace is really dubious if you ask me. I
> > > > understand you want something better than SIGKILL, and in fact this is
> > > > already possible with the memory cgroup controller (btw. memcg will
> > > > give you cheap access to rss and to the amount of shared and
> > > > swapped-out memory as well). Anyway, if you are getting close to OOM,
> > > > your system will most probably be really busy, and chances are that
> > > > reading your new file will take much more time as well. I am also not
> > > > quite sure how pss is useful for oom decisions.
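
[For illustration, the memcg counters mentioned above can be read from a
v1 memory cgroup's memory.stat file. A minimal sketch, assuming a group
named "foo" under the usual mount point (both are assumptions; values
are reported in bytes):

#include <stdio.h>
#include <string.h>

int main(void)
{
        /* Hypothetical group name and mount point -- adjust for the
         * local cgroup layout. */
        const char *path = "/sys/fs/cgroup/memory/foo/memory.stat";
        char key[64];
        unsigned long long val;
        FILE *f = fopen(path, "r");

        if (!f) {
                perror("fopen");
                return 1;
        }
        /* memory.stat is a sequence of "<key> <value>" lines; the
         * "swap" key only appears with swap accounting enabled. */
        while (fscanf(f, "%63s %llu", key, &val) == 2) {
                if (!strcmp(key, "rss") || !strcmp(key, "swap"))
                        printf("%s: %llu bytes\n", key, val);
        }
        fclose(f);
        return 0;
}

No pte walk is needed for this: the values come from cached per-cgroup
counters, which is what makes the access cheap.]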
> > >
> > > I mentioned it before, but based on experience RSS just isn't good
> > > enough -- there's too much sharing going on in our use case to make
> > > the correct decision based on RSS. If RSS were good enough, simply
> > > put, this patch wouldn't exist.
> >
> > But that doesn't answer my question, I am afraid. So how exactly do you
> > use pss for oom decisions?
>
> My use case is not OOM decisions, but I agree it would be great if we
> could get *fast* smaps summary information.
>
> PSS is a really great tool for figuring out how processes consume memory,
> much more accurately than RSS does. We have been using it for per-process
> memory monitoring. Although it is not used for OOM decisions, it would be
> great if it were sped up, because we don't want to spend much CPU time on
> mere monitoring.
>
> For our use case, we don't need AnonHugePages, ShmemPmdMapped, Shared_Hugetlb,
> Private_Hugetlb, KernelPageSize, or MMUPageSize, because we never enable THP
> or hugetlb. Additionally, Locked can be derived from the vma flags, so we
> don't need it either. For plain monitoring we don't even need the address
> ranges, as long as we are not investigating in detail.
>
> Although none of this is severe overhead, why emit information that is
> useless to the reader? The file only bloats day by day. :( And userspace
> tools have to spend ever more time parsing it, which is pointless.
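
[For illustration, the monitoring pattern described above amounts to
summing the Pss: lines of /proc/<pid>/smaps. A minimal userspace sketch,
using /proc/self/smaps as a stand-in for the monitored pid:

#include <stdio.h>

int main(void)
{
        FILE *f = fopen("/proc/self/smaps", "r");
        char line[256];
        unsigned long long kb, pss = 0;

        if (!f) {
                perror("fopen");
                return 1;
        }
        /* Every vma contributes one "Pss: <n> kB" line; sum them all.
         * Generating this file is the pte-walk-heavy read the thread
         * wants sped up. */
        while (fgets(line, sizeof(line), f)) {
                if (sscanf(line, "Pss: %llu kB", &kb) == 1)
                        pss += kb;
        }
        fclose(f);
        printf("Pss: %llu kB\n", pss);
        return 0;
}

Every other line of each per-vma block is formatted by the kernel and
then parsed only to be thrown away, which is the overhead Minchan is
pointing at.]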

So far it doesn't really seem that the parsing is the biggest problem.
The major cycles killer is the output formatting, and that doesn't sound
like a problem we are unable to address. I would even argue that we want
to address it in as generic a way as possible.

> Having said that, I'm not a fan of creating a new stat knob for this,
> either. How about appending the summary information at the end of smaps?
> Then monitoring users could just open the file, lseek to the (end - 1),
> and read only the summary.

That might confuse existing parsers. Besides that, we already have
/proc/<pid>/statm, which gives cumulative numbers. I am not sure how
often it is used, or whether the pte walk would be too expensive for its
existing users, but that should be explored and evaluated before a new
file is created.
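
[For reference, a minimal sketch of reading the cumulative numbers statm
already exposes. Its seven fields (size, resident, shared, text, lib,
data, dt) are reported in pages, so they need scaling by the page size:

#include <stdio.h>
#include <unistd.h>

int main(void)
{
        unsigned long size, resident, shared;
        long psize = sysconf(_SC_PAGESIZE);
        FILE *f = fopen("/proc/self/statm", "r");

        if (!f) {
                perror("fopen");
                return 1;
        }
        /* Only the first three of the seven fields are read here. */
        if (fscanf(f, "%lu %lu %lu", &size, &resident, &shared) == 3)
                printf("size: %lu kB, rss: %lu kB, shared: %lu kB\n",
                       size * psize / 1024, resident * psize / 1024,
                       shared * psize / 1024);
        fclose(f);
        return 0;
}

Unlike smaps, these come from cached mm counters rather than a pte walk,
which is why the file is so cheap to read today.]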

The /proc became a dump of everything people found interesting just
because we were too quick to allow those additions. Do not repeat those
mistakes, please!
--
Michal Hocko
SUSE Labs