On Fri 10-06-22 12:58:53, Christian König wrote:
[SNIP]
> > I do realize this is a long term problem and there is a demand for some
> > solution at least. I am not sure how to deal with shared resources
> > myself. The best approximation I can come up with is to limit the scope
> > of the damage into a memcg context. One idea I was playing with (but
> > never convinced myself it is really worth it) is to allow a new mode of
> > the oom victim selection for the global oom event.

And just for clarity: I have mentioned the global oom event here, but the
concept could be extended to the per-memcg oom killer as well.

> > It would be an opt-in
> > and the victim would be selected from the biggest leaf memcg (or kill
> > the whole memcg if it has group_oom configured).
> >
> > That would address at least some of the accounting issue because charges
> > are better tracked than per process memory consumption. It is a crude
> > and ugly hack and it doesn't solve the underlying problem as shared
> > resources are not guaranteed to be freed when processes die but maybe it
> > would be just slightly better than the existing scheme which is clearly
> > lagging behind existing userspace.
>
> Well, what is so bad at the approach of giving each process holding a
> reference to some shared memory its equal amount of badness even when the
> processes belong to different memory control groups?

I am not claiming this is wrong per se. It is just an approximation and
it can surely be wrong in some cases (e.g. in those workloads where the
shared memory is mostly owned by one process while the shared content is
consumed by many).

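To put rough numbers on that case, here is a tiny userspace sketch. It
assumes that "equal amount of badness" means an even split of the shared
object's size across all processes holding a reference; that reading, and
the helper names shared_share() and task_badness_even_split(), are
assumptions made for illustration and are not taken from the patch series.

```c
#include <stddef.h>

/*
 * Hypothetical model: every holder of a shared object is charged an
 * even share of its size, in pages. Not kernel code.
 */
static unsigned long shared_share(unsigned long shared_pages,
				  size_t nr_holders)
{
	/* No holders means nobody is charged. */
	return nr_holders ? shared_pages / nr_holders : 0;
}

/* Badness of one task: its private pages plus its even share. */
static unsigned long task_badness_even_split(unsigned long private_pages,
					     unsigned long shared_pages,
					     size_t nr_holders)
{
	return private_pages + shared_share(shared_pages, nr_holders);
}
```

With a 100000 page shared object held by 10 processes, an owner with 1000
private pages scores 11000 while a consumer with 5000 private pages scores
15000. The consumer would be picked first, although killing it does not
release the shared object while the other holders are still alive.
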
The primary question is whether it actually helps much, what kind of
scenarios it can help with, and whether we can actually do better for
those. Also do not forget that shared file memory is not the only thing
to care about. What about the kernel memory used on behalf of processes?

Just consider the above-mentioned memcg driven model. It doesn't really
require chasing specific files and doing some arbitrary math to share the
responsibility. It has a clear accounting and responsibility model.

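A rough userspace sketch of that selection, only to make the idea concrete.
This is not kernel code: struct task_model, struct memcg_model,
select_victim_memcg() and select_victim_tasks() are made-up names, and
picking the task with the largest RSS inside a non-grouped memcg is an
assumption, since the mail does not say how a single task would be chosen.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a task; rss_pages is per-process consumption. */
struct task_model {
	int pid;
	unsigned long rss_pages;
};

/* Hypothetical model of a leaf memcg. */
struct memcg_model {
	const char *name;
	unsigned long charged_pages;	/* pages charged to this memcg */
	bool oom_group;			/* models memory.oom.group */
	struct task_model *tasks;
	size_t nr_tasks;
};

/* Return the leaf memcg with the largest charge, or NULL if none. */
static const struct memcg_model *
select_victim_memcg(const struct memcg_model *leaves, size_t nr)
{
	const struct memcg_model *victim = NULL;

	for (size_t i = 0; i < nr; i++) {
		if (!leaves[i].nr_tasks)
			continue;	/* nothing to kill in here */
		if (!victim || leaves[i].charged_pages > victim->charged_pages)
			victim = &leaves[i];
	}
	return victim;
}

/*
 * Fill pids[] with the tasks to kill and return how many were stored.
 * With oom_group set the whole memcg goes; otherwise (assumption) only
 * the task with the largest RSS.
 */
static size_t select_victim_tasks(const struct memcg_model *cg,
				  int *pids, size_t max)
{
	size_t n = 0;

	if (!cg || !cg->nr_tasks || !max)
		return 0;

	if (cg->oom_group) {
		for (size_t i = 0; i < cg->nr_tasks && n < max; i++)
			pids[n++] = cg->tasks[i].pid;
		return n;
	}

	const struct task_model *big = &cg->tasks[0];
	for (size_t i = 1; i < cg->nr_tasks; i++)
		if (cg->tasks[i].rss_pages > big->rss_pages)
			big = &cg->tasks[i];
	pids[0] = big->pid;
	return 1;
}
```

Note that the memcg is chosen by charged_pages, not by the sum of its
tasks' RSS. A memcg whose tasks look small individually but which has a
large charge (for example from shared or kernel memory charged to it)
would still be selected.
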
It shares the same underlying problem that the oom killing is not
resource aware and therefore there is no guarantee that memory really
gets freed. But it allows sane configurations where shared resources do
not cross memcg boundaries at least. With that in mind and the group_oom
(memory.oom.group) semantic you can get at least some semi-sane
guarantees. Is it perfect? No, not by any means. But I would expect it to
be more predictable.

Maybe we can come up with a saner model, but just going with per file
stats sounds like a hard to predict and hard to debug approach to me. OOM
killing is a very disruptive operation and having random tasks killed
just because they have mapped a few pages from a shared resource sounds
like a terrible thing to debug and explain to users.

> If you really think that this would be a hard problem for upstreaming we
> could as well keep the behavior for memcg as it is for now. We would just
> need to adjust the parameters to oom_badness() a bit.

Say we ignore the memcg side of things for now. How does it help long
term? Special casing the global oom is not all that hard but any future
change would very likely be disruptive with some semantic implications
AFAICS.