Re: RFC: Memory Tiering Kernel Interfaces (v2)

From: Aneesh Kumar K V
Date: Fri May 27 2022 - 05:27:31 EST


On 5/25/22 10:57 PM, Wei Xu wrote:
On Wed, May 25, 2022 at 3:01 AM Aneesh Kumar K V
<aneesh.kumar@xxxxxxxxxxxxx> wrote:

On 5/25/22 2:33 PM, Ying Huang wrote:
On Tue, 2022-05-24 at 22:32 -0700, Wei Xu wrote:
On Tue, May 24, 2022 at 1:24 AM Ying Huang <ying.huang@xxxxxxxxx> wrote:

On Tue, 2022-05-24 at 00:04 -0700, Wei Xu wrote:
On Thu, May 19, 2022 at 8:06 PM Ying Huang <ying.huang@xxxxxxxxx> wrote:


...


OK. Just to confirm. Does this mean that we will have fixed device ID,
for example,

GPU memtier255
DRAM (with CPU) memtier0
PMEM memtier1

When we add a new memtier, it can be memtier254, or memtier2? The rank
value will determine the real demotion order.

I think you may need to send v3 to make sure everyone is on the same
page.


What we have implemented which we will send as RFC shortly is below.

kvaneesh@ubuntu-guest:~$ cd /sys/devices/system/
kvaneesh@ubuntu-guest:/sys/devices/system$ pwd
/sys/devices/system
kvaneesh@ubuntu-guest:/sys/devices/system$ ls
clockevents clocksource container cpu edac memory memtier mpic node power
kvaneesh@ubuntu-guest:/sys/devices/system$ cd memtier/
kvaneesh@ubuntu-guest:/sys/devices/system/memtier$ pwd
/sys/devices/system/memtier
kvaneesh@ubuntu-guest:/sys/devices/system/memtier$ ls
default_rank max_rank memtier1 power uevent
kvaneesh@ubuntu-guest:/sys/devices/system/memtier$ cat default_rank
1
kvaneesh@ubuntu-guest:/sys/devices/system/memtier$ cat max_rank
3

For flexibility, we don't want max_rank to be interpreted as the
number of memory tiers. Also, we want to leave spaces in rank values
to allow new memtiers to be inserted when needed. So I'd suggest
making max_rank a much larger value (e.g. 255).

kvaneesh@ubuntu-guest:/sys/devices/system/memtier$ cd memtier1/
kvaneesh@ubuntu-guest:/sys/devices/system/memtier/memtier1$ ls
nodelist power rank subsystem uevent
kvaneesh@ubuntu-guest:/sys/devices/system/memtier/memtier1$ cat nodelist
0-3
kvaneesh@ubuntu-guest:/sys/devices/system/memtier/memtier1$ cat rank
1
kvaneesh@ubuntu-guest:/sys/devices/system/memtier/memtier1$ cd ../../node/node1/
kvaneesh@ubuntu-guest:/sys/devices/system/node/node1$ cat memtier
1
kvaneesh@ubuntu-guest:/sys/devices/system/node/node1$
root@ubuntu-guest:/sys/devices/system/node/node1# echo 0 > memtier
root@ubuntu-guest:/sys/devices/system/node/node1# cat memtier
0
root@ubuntu-guest:/sys/devices/system/node/node1# cd ../../memtier/
root@ubuntu-guest:/sys/devices/system/memtier# ls
default_rank max_rank memtier0 memtier1 power uevent
root@ubuntu-guest:/sys/devices/system/memtier# cd memtier0/
root@ubuntu-guest:/sys/devices/system/memtier/memtier0# cat nodelist
1
root@ubuntu-guest:/sys/devices/system/memtier/memtier0# cat rank
0
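
For anyone who wants to poke at this layout from user space, here is a minimal sketch of reading it back. It assumes only what the session above shows: memtierN directories, each with a "rank" file and a "nodelist" file in the usual kernel list format (e.g. "0-3"). The function names and the `root` parameter are made up for illustration, and the interface itself is still an RFC and may change.

```python
import os


def parse_nodelist(text):
    """Expand a kernel-style list such as "0-3" or "0,2-3" into node IDs."""
    nodes = []
    for part in text.strip().split(","):
        if not part:
            continue  # an empty nodelist means the tier has no nodes
        lo, _, hi = part.partition("-")
        # A single ID like "1" has no upper bound, so reuse the lower one.
        nodes.extend(range(int(lo), int(hi or lo) + 1))
    return nodes


def read_memtiers(root):
    """Return {tier_name: (rank, [node IDs])} for each memtierN dir under root.

    `root` would be /sys/devices/system/memtier on a kernel carrying the
    RFC patches; it is a parameter so the code can run against a mock tree.
    """
    tiers = {}
    for name in sorted(os.listdir(root)):
        tier_dir = os.path.join(root, name)
        if not (name.startswith("memtier") and os.path.isdir(tier_dir)):
            continue  # skips default_rank, max_rank, power, uevent
        with open(os.path.join(tier_dir, "rank")) as f:
            rank = int(f.read())
        with open(os.path.join(tier_dir, "nodelist")) as f:
            nodes = parse_nodelist(f.read())
        tiers[name] = (rank, nodes)
    return tiers


if __name__ == "__main__":
    root = "/sys/devices/system/memtier"
    if os.path.isdir(root):
        for name, (rank, nodes) in read_memtiers(root).items():
            print(name, "rank", rank, "nodes", nodes)
    else:
        print("no memtier directory here; this kernel lacks the RFC interface")
```

Taking the sysfs root as a parameter keeps the parsing testable on machines that do not have the interface at all.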

It looks like the example here demonstrates the dynamic creation of
memtier0. If so, how is the rank of memtier0 determined? If we want
to support creating new memtiers at runtime, I think an explicit
interface that specifies both device ID and rank is preferred to avoid
implicit dependencies between device IDs and ranks.


Right now, to keep things simple, there is a 1:1 relationship between memory tier and rank value, i.e.:

memory tier    rank
memtier0       100
memtier1       200
memtier2       300

Currently we are limiting this to a maximum of 3 tiers, so the mapping above is straightforward. Once we really get dynamic tier creation, we should create a new memory tier with the highest possible rank value and, once the memory tier is established, modify the rank to the desired value. There will be a kernel interface to add a node to a memory tier with a specific rank value, so drivers can do that if required.
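
As a rough user-space model of that flow (not the code from the patch series), the sketch below creates a tier at the highest rank, then moves it to its desired rank, and shows that ordering follows rank rather than device ID. `TierTable`, its methods, and `ASSUMED_MAX_RANK` are hypothetical names. The ceiling of 1000 is an arbitrary placeholder, since the thread has not settled on a maximum. The model only sorts by rank and takes no position on which end of the ordering is the fastest tier.

```python
# Hypothetical ceiling for rank values; the thread has not settled on one.
ASSUMED_MAX_RANK = 1000


class TierTable:
    """Toy model: tier ID -> rank, with at most one tier per rank value."""

    def __init__(self):
        self.rank = {}   # tier ID -> rank value
        self.nodes = {}  # tier ID -> set of NUMA node IDs

    def _check_rank_free(self, rank, ignore=None):
        for tier_id, used in self.rank.items():
            if tier_id != ignore and used == rank:
                raise ValueError(f"rank {rank} already used by memtier{tier_id}")

    def create_tier(self, tier_id):
        """A new tier starts out at the highest possible rank."""
        if tier_id in self.rank:
            raise ValueError(f"memtier{tier_id} already exists")
        self._check_rank_free(ASSUMED_MAX_RANK)
        self.rank[tier_id] = ASSUMED_MAX_RANK
        self.nodes[tier_id] = set()

    def set_rank(self, tier_id, rank):
        """Move an established tier to its desired rank."""
        if not 0 <= rank <= ASSUMED_MAX_RANK:
            raise ValueError("rank out of range")
        self._check_rank_free(rank, ignore=tier_id)
        self.rank[tier_id] = rank

    def add_node(self, node, tier_id):
        """A node belongs to exactly one tier, so drop it from any other."""
        for members in self.nodes.values():
            members.discard(node)
        self.nodes[tier_id].add(node)

    def by_rank(self):
        """Tier IDs sorted by rank value, independent of the IDs themselves."""
        return sorted(self.rank, key=self.rank.get)


if __name__ == "__main__":
    table = TierTable()
    for tier_id, rank in ((0, 100), (1, 200), (2, 300)):
        table.create_tier(tier_id)
        table.set_rank(tier_id, rank)

    table.create_tier(3)      # lands at ASSUMED_MAX_RANK first
    table.set_rank(3, 150)    # then slots into the gap between 100 and 200
    table.add_node(4, 3)
    print(table.by_rank())    # prints [0, 3, 1, 2]
```

Leaving gaps between rank values (100, 200, 300) is what lets the new tier slot in without renumbering the existing ones.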

I haven't implemented that yet because I was hoping we could get to it later, when we really start requiring dynamic tier support.

I will share the patch series we have been working with. I have yet to add the documentation, but I will not wait for it to be complete, so that we can get some early testing/feedback.

-aneesh