Re: committed memory, mmaps and shms

From: Michal Hocko
Date: Thu Mar 12 2015 - 08:41:01 EST


[CCing MM maling list]

On Wed 11-03-15 19:10:44, Marcos Dione wrote:
>
> Hi everybody. First, I hope this is the right list for such
> questions; I searched in the list of lists[1] for a MM specific one, but
> didn't find any. Second, I'm not subscribed, so please CC me and my other
> address when answering.
>
> I'm trying to figure out how Linux really accounts for memory, both
> globally and for each individual process. Most user's first approach to
> memory monitoring is running free (no pun intended):
>
> $ free
> total used free shared buffers cached
> Mem: 396895176 395956332 938844 0 8972 356409952
> -/+ buffers/cache: 39537408 357357768
> Swap: 8385788 8385788 0
>
> This reports 378GiB of RAM, 377 used; of those 8MiB in buffers,
> 339GiB in cache, leaving only 38Gib for processes (for some reason this

I am not sure I understand your math here. 339G in the cache should be
reclaimable (be careful about the shmem though). It is the rest which
might be harder to reclaim.

> value is not displayed, which should probably be a warning to what is to
> come); and 1GiB free. So far all seems good.
>
> Now, this machine has (at least) a 108 GiB shm. All this memory is
> clearly counted as cache. This is my first surprise. shms are not cache
> of anything on disk, but spaces of shared memory (duh); at most, their
> pages can end up in swap, but not in a file somewhere. Maybe I'm not
> correctly interpreting the meaning of (what is accounted as) cache.

shmem (tmpfs) is a in memory filesystem. Pages backing shmem mappings
are maintained in the page cache. Their backing storage is swap as you
said. So from a conceptual point of vew this makes a lot of sense.
I can completely understand why this might be confusing for users,
though. The value simply means something else.
I think it would make more sense to add something like easily
reclaimable chache to the output of free (pagecache-shmem-dirty
basically). That would give an admin a better view on immediatelly
re-usable memory.

I will skip over the following section but keep it here for the mm
mailing list (TL;DR right now).

> The next tool in the toolbox is ps:
>
> $ ps ux | grep 27595
> USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
> osysops 27595 49.5 12.7 5506723020 50525312 ? Sl 05:20 318:02 otf_be v2.9.0.13 : FQ_E08AS FQ_E08-FQDSIALT #1 [processing daemon lib, msg type: undefined]
>
> This process is not only attached to that shm, it's also attached to
> 5TiB of mmap'ed files (128 LMDB databases), for a total of 5251GiB. For
> context, know that another 9 processes do the same. This tells me that
> shms and mmaps are counted as part of their virtual size, which makes
> sense. Of those, only 48GiB are resident... but a couple of paragraphs
> before I said that there were only 38GiB used by processes. Clearly some
> part of each individual process' RSS also counts at least some part of
> the mmaps. /proc/27595/smaps has more info:
>
> $ cat /proc/27595/smaps | awk 'BEGIN { count= 0; } /Rss/ { count = count + $2; print } /Pss/ { print } /Swap/ { print } /^Size/ { print } /-/ { print } END { print count }'
> [...]
> 7f2987e92000-7f3387e92000 rw-s 00000000 fc:11 3225448420 /instant/LMDBMedium_0000000000/data.mdb
> Size: 41943040 kB
> Rss: 353164 kB
> Pss: 166169 kB
> Swap: 0 kB
> [...]
> 7f33df965000-7f4f1cdcc000 rw-s 00000000 00:04 454722576 /SYSV00000000 (deleted)
> Size: 114250140 kB
> Rss: 5587224 kB
> Pss: 3856206 kB
> Swap: 0 kB
> [...]
> 51652180
>
> Notice that the sum is not the same as the one reported before; maybe
> because I took them in different points of time while redacting this
> mail. So this confirms that a process' RSS value includes shms and mmaps,
> at least the resident part. In the case of the mmaps, the resident part
> must be the part that currently sits on the cache; in the case of the
> shms, I suppose it's the part that has ever been used. An internal tool
> tels me that currently 24GiB of that shm is in use, but only 5 are
> reported as part of that process' RSS. Maybe is that process' used part?
>
> And now I reach to what I find more confusing (uninteresting values
> removed):
>
> $ cat /proc/meminfo
> MemTotal: 396895176 kB
> MemFree: 989392 kB
> Buffers : 8448 kB
> Cached: 344059556 kB
> SwapTotal: 8385788 kB
> SwapFree: 0 kB
> Mapped: 147188944 kB
> Shmem: 109114792 kB
> CommitLimit: 206833376 kB
> Committed_AS: 349194180 kB
> VmallocTotal: 34359738367 kB
> VmallocUsed: 1222960 kB
> VmallocChunk: 34157188704 kB
>
> Again, values might vary due to timing. Mapped clearly includes Shmem
> but not mmaps; in theory 36GiB are 'pure' (not shm'ed, not mmap'ed)
> process memory, close to what I calculated before. Again, this is not
> segregated, which again makes us wonder why. Probably it's more like "It
> doesn't make sense to do it".
>
> Last but definitely not least, Committed_AS is 333GiB, close to the
> total mem. man proc says it's «The amount of memory presently allocated
> on the system. The committed memory is a sum of all of the memory which
> has been allocated by processes, even if it has not been "used" by them
> as of yet». What is not clear is if this counts or not mmaps (I think it
> doesn't, or it would be either 5TiB or 50TiB, depending on whether you
> count each attachment to each shm) and/or/neither shms (once, multiple
> times?). In a rough calculation, the 83 procs using the same 108GiB shm
> account for 9TiB, so at least it's not counting it multiple times.
>
> While we're at it, I would like to know what VmallocTotal (32TiB) is
> accounting. The explanation in man proc («Total size of vmalloc memory
> area.», where vmalloc seems to be a kernel internal function to «allocate
> a contiguous memory region in the virtual address space») means not much
> for me. At some point I thought it should be the sum of all VSSs, but
> that clocks at 50TiB for me, so it isn't. Maybe I should just ignore it.
>
> Short version:
>
> * Why 'pure' mmalloc'ed memory is ever reported? Does it make sense to
> talk about it?

This is simply private anonymous memory. And you can see it as such in
/proc/<pid>/[s]maps

> * Why shms shows up in cache? What does cache currently mean/hold?

Explained above I hope (it is an in-memory filesystem).

> * What does the RSS value means for the shms in each proc's smaps file?
> And for mmaps?

The amount of shmem backed pages mapped in to the user address space.

> * Is my conclusion about Shmem being counted into Mapped correct?

Mapped will tell you how much page cache is mapped via pagetable to a
process. So it is a subset of pagecache. same as Shmem is a subset. Note
that shmem doesn't have to be mapped anywhere (e.g. simply read a file
on tmpfs filesystem - it will be in the pagecache but not mapped).

> * What is actually counted in Committed_AS? Does it count shms or mmaps?
> How?

This depends on the overcommit configuration. See
Documentation/sysctl/vm.txt for more information.

> * What is VmallocTotal?

Vmalloc areas are used by _kernel_ to map larger physically
non-contiguous memory areas. More on that e.g. here
http://www.makelinux.net/books/lkd2/ch11lev1sec5. You can safely ignore
it.

--
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/