Re: [PATCH v2 1/2] smaps: fill missing fields for vma(VM_HUGETLB)

From: David Rientjes
Date: Mon Aug 10 2015 - 20:38:18 EST


On Fri, 7 Aug 2015, Naoya Horiguchi wrote:

> Currently smaps reports many zero fields for vma(VM_HUGETLB), which is
> inconvenient when we want to know per-task or per-vma base hugetlb usage.
> This patch enables these fields by introducing smaps_hugetlb_range().
>
> before patch:
>
> Size: 20480 kB
> Rss: 0 kB
> Pss: 0 kB
> Shared_Clean: 0 kB
> Shared_Dirty: 0 kB
> Private_Clean: 0 kB
> Private_Dirty: 0 kB
> Referenced: 0 kB
> Anonymous: 0 kB
> AnonHugePages: 0 kB
> Swap: 0 kB
> KernelPageSize: 2048 kB
> MMUPageSize: 2048 kB
> Locked: 0 kB
> VmFlags: rd wr mr mw me de ht
>
> after patch:
>
> Size: 20480 kB
> Rss: 18432 kB
> Pss: 18432 kB
> Shared_Clean: 0 kB
> Shared_Dirty: 0 kB
> Private_Clean: 0 kB
> Private_Dirty: 18432 kB
> Referenced: 18432 kB
> Anonymous: 18432 kB
> AnonHugePages: 0 kB
> Swap: 0 kB
> KernelPageSize: 2048 kB
> MMUPageSize: 2048 kB
> Locked: 0 kB
> VmFlags: rd wr mr mw me de ht
>

I think this will lead to breakage, unfortunately, specifically for users
who are concerned with resource management.

An example: we use memcg hierarchies to charge memory for individual jobs,
specific users, and system overhead. Memcg is a cgroup, so this is done
for an aggregate of processes, and we often have to monitor their memory
usage. Each process isn't assigned to its own memcg, and I don't believe
common users of memcg assign individual processes to their own memcgs.

When a memcg is out of memory, we need to track the memory usage of
processes attached to its memcg hierarchy to determine what is unexpected,
either as a result of a new rollout or because of a memory leak. To do
that, we use the rss exported by smaps that is now changed with this
patch. By using smaps rather than /proc/pid/status, we can report where
memory usage is unexpected.
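
For illustration only, the kind of per-process accounting described above
can be sketched as follows. This is a minimal sketch, not the actual
tooling referred to in this message: the function names
sum_smaps_field() and rss_of_pid() are hypothetical, and the sample text
is an excerpt of the before/after output quoted from the patch
description. It shows how a monitor that sums the Rss field sees a
different total for the same hugetlb mapping once the patch is applied.

```python
def sum_smaps_field(smaps_text, field="Rss"):
    """Sum one field (in kB) over all mappings in smaps-format text."""
    total_kb = 0
    prefix = field + ":"
    for line in smaps_text.splitlines():
        # Field lines look like "Rss:      18432 kB"
        if line.startswith(prefix):
            total_kb += int(line.split()[1])
    return total_kb


def rss_of_pid(pid):
    """Hypothetical helper: total Rss in kB reported by /proc/<pid>/smaps."""
    with open(f"/proc/{pid}/smaps") as f:
        return sum_smaps_field(f.read())


# Excerpts of the hugetlb vma output quoted in the patch description.
BEFORE_PATCH = """\
Size: 20480 kB
Rss: 0 kB
Pss: 0 kB
VmFlags: rd wr mr mw me de ht
"""

AFTER_PATCH = """\
Size: 20480 kB
Rss: 18432 kB
Pss: 18432 kB
VmFlags: rd wr mr mw me de ht
"""

print(sum_smaps_field(BEFORE_PATCH))  # 0
print(sum_smaps_field(AFTER_PATCH))   # 18432
```

Any consumer that compares such a total against a memcg's charged usage
would see 18432 kB appear for this mapping that was not counted before.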

This would cause our process that manages all memcgs on our systems to
break. Perhaps I haven't been convincing enough about this in my previous
messages, but it's quite an obvious userspace regression.

This memory was not included in rss originally because memory in the
hugetlb persistent pool is always resident. Unmapping the memory does not
free memory. For this reason, hugetlb memory has always been treated as
its own type of memory.

It would have been arguable back when hugetlbfs was introduced whether it
should be included. I'm afraid the ship has sailed on that, since a decade
has passed and it would break userspace to change existing metrics that
already have clearly defined semantics.