Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

From: Andy Nelson
Date: Fri Nov 04 2005 - 16:04:33 EST



Hi,

>can you think of any reason why the boot-time-configured hugetlb zone
>would be inadequate for your needs?

I am not enough of a kernel-level person or sysadmin to know for certain,
but I still have big worries about consecutive jobs that run on the
same resources but want extremely different page behavior. If what
you are suggesting can cause all previous history on those resources
to be forgotten, and then reset to whatever it is that I want when I
start my run, then yes, it would be fine for me. In some sense this is
perhaps what I was asking for in my original message, when I talked
about using batch schedulers, cpusets and friends to encapsulate
regions of resources that could be reset to nice states at user-
specified intervals, like when the batch scheduler releases one job
and another job starts.
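
To make that concrete, below is roughly what I imagine the batch
scheduler doing, sketched in C against the cpuset pseudo-filesystem.
I am assuming it is mounted at /dev/cpuset; the job name, cpu and node
numbers, and pid are all made up, and I may well have interface
details wrong.

#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>

/* Sketch only: carve out a cpuset for one batch job. Assumes the
 * cpuset pseudo-filesystem is mounted at /dev/cpuset; "job42" and
 * the cpu/node numbers are illustrative. */
static void write_str(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");
	if (!f) { perror(path); exit(1); }
	fprintf(f, "%s\n", val);
	fclose(f);
}

int main(void)
{
	mkdir("/dev/cpuset/job42", 0755);
	write_str("/dev/cpuset/job42/cpus", "0-31");   /* 32 cpus       */
	write_str("/dev/cpuset/job42/mems", "0-3");    /* 4 mem nodes   */
	write_str("/dev/cpuset/job42/tasks", "12345"); /* the job's pid */
	return 0;
}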


The issues I can still think of that HPC people will need addressed
are these (some points are clearly related to each other, but anyway):


1) How do zones play with NUMA? Does setting up resource management this
way mean that various kernel things that help me access my memory
(hellifino what I'm talking about here--things like tables and lists
of pages that I own and how to access them, etc., I suppose--whatever
it is that kernels don't get rid of when someone else's job ends and
before mine starts) actually get allocated in some other zone halfway
across the machine? This is going to kill me on latency grounds.
Can it be set up so that this reserved special kernel zone is somewhere
close by? If it is bigger than the next guy to get my resources wants,
can it be deleted and reset once my job is finished, so his job can run?
This is what I would hope for, and I expect that something like
cpusets/memsets would help to do it.
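
As an example of what I mean by close by, here is a sketch using
libnuma as I understand it: put my working set on the node next to my
cpus. The node number is made up, and the interface details may be off.

#include <numa.h>    /* libnuma; link with -lnuma */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
	if (numa_available() < 0) {
		fprintf(stderr, "no NUMA support\n");
		return 1;
	}
	/* 1GB of working set, placed on node 0, next to my cpus.
	 * My worry is the kernel's own bookkeeping for these pages
	 * ending up in some zone halfway across the machine. */
	size_t sz = 1UL << 30;
	double *a = numa_alloc_onnode(sz, 0);
	if (a == NULL) {
		fprintf(stderr, "allocation failed\n");
		return 1;
	}
	/* ... compute on a[] ... */
	numa_free(a, sz);
	return 0;
}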

2) How do zones play with merging small pages into big pages, splitting
big pages into small ones, or deleting whatever page environment was
there in favor of a reset of those resources to some initial state? If
someone runs a small-page job right after my big-page job, will they
be stuck with big pages? If I run a big-page job right after their
small-page job, will I be stuck with small pages?

In each case, will it simply say 'no can do' and die? If this setup
just means that some jobs can't be run, or can't be run right after
something else, it will not fly.
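
The only way I know of to even ask that question from userland today
is to go poking at /proc/meminfo, something like the sketch below
(assuming the HugePages_Free field that I understand current kernels
report):

#include <stdio.h>

/* Sketch: see how many big pages are actually free before a job
 * commits to them. Returns -1 if we can't tell. */
static int free_huge_pages(void)
{
	char line[128];
	int n = -1;
	FILE *f = fopen("/proc/meminfo", "r");
	if (!f)
		return -1;
	while (fgets(line, sizeof(line), f))
		if (sscanf(line, "HugePages_Free: %d", &n) == 1)
			break;
	fclose(f);
	return n;
}

int main(void)
{
	printf("big pages free: %d\n", free_huge_pages());
	return 0;
}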

3) How does any sort of fallback scheme work? If I can't have all of my
big pages, maybe I'll settle for some small ones and some big ones.
Can I have them? If I can't have them and die instead, zones like
this will not fly.
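
What I mean by falling back is roughly the sketch below. I am assuming
hugetlbfs mounted at /mnt/huge, which is how I understand big pages
get handed out today, and that the length is a multiple of the big
page size; details may be off.

#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Sketch: try big pages first, settle for small ones instead of
 * dying. Assumes hugetlbfs is mounted at /mnt/huge and len is a
 * multiple of the huge page size. */
static void *alloc_big_or_small(size_t len)
{
	int fd = open("/mnt/huge/myjob", O_CREAT | O_RDWR, 0600);
	if (fd >= 0) {
		void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
			       MAP_SHARED, fd, 0);
		close(fd);
		if (p != MAP_FAILED)
			return p;	/* got big pages */
	}
	/* No big pages to be had: fall back to small ones.
	 * Caller should still check for MAP_FAILED. */
	return mmap(NULL, len, PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
}

int main(void)
{
	void *a = alloc_big_or_small(1UL << 30);
	return a == MAP_FAILED;
}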

Points 2 and 3 mostly have to do with this question: does system
performance degrade over time for different constituencies of users,
or can it stay up stably, serving everyone equally and well for a
long time?

4) How does any of this stuff play with interactive management? It is
not going to fly if sysadmins have to get involved on a
daily/regular basis, or even at much more than a cursory level of
turning something on once when the machine is purchased.

5) How does any of this stuff play with me having to rewrite my code to
use nonstandard language features? If I can't run using standard
Fortran, standard C, and maybe (for some folks) standard C++ or Java,
it won't fly.

6) What about text vs. data pages? I'm talking here about executable
code vs. whatever that code operates on. Do they get to have
different-sized pages? Do they get allocated from sensible places on
the machine, as in reasonably separate from each other but not in some
faraway zone over the rainbow?

7) If OSes/hardware ever get decent support for lots and lots of page
sizes (like MIPS and SPARC have now) rather than a couple, will the
infrastructure be able to give me whichever size I ask for, or will
I only get to choose between a couple, even if perhaps settable at
boot time? Extensibility like this will be a requirement long term,
of course.

8) What if I want 32 cpus and 64GB of memory on a machine, get it,
finish using it, and then the next jobs in line request, say, 8 cpus
and 16GB of memory, 4 cpus and 16GB of memory, and 20 cpus and 4GB
of memory? Will the zone system be able to handle such dynamically
changing things?


What I would need to see is that these sorts of issues can be handled
gracefully by the OS, perhaps with the help of some userland or
privileged-userland hints that would come from things like the batch
scheduler, or an env variable to set my preferred page size or other
things about memory policy.
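
By hints I mean nothing fancier than something like this; the variable
name is pure invention on my part, just to illustrate:

#include <stdlib.h>
#include <string.h>

/* Hypothetical: the batch scheduler or user exports a page-size
 * preference, and the allocator (like the fallback sketch above)
 * consults it. HPC_PAGESIZE_HINT is a made-up name. */
static int want_big_pages(void)
{
	const char *p = getenv("HPC_PAGESIZE_HINT");
	return p && strcmp(p, "big") == 0;
}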


Thanks,

Andy

PS to Linus: I have secured access to a dual-cpu, dual-core AMD box.
I have to talk to someone who is not here today to see about turning
on large pages. We'll see how that goes, probably some time next week.
If it is possible, you'll see some benchmarks then.
