RE: [PATCH 4/4] zone_reclaim_mode is always 0 by default

From: Zhang, Yanmin
Date: Mon May 18 2009 - 23:39:47 EST


>>-----Original Message-----
>>From: KOSAKI Motohiro [mailto:kosaki.motohiro@xxxxxxxxxxxxxx]
>>Sent: May 19, 2009 10:54
>>To: Wu, Fengguang
>>Cc: kosaki.motohiro@xxxxxxxxxxxxxx; LKML; linux-mm; Andrew Morton; Rik van
>>Riel; Christoph Lameter; Zhang, Yanmin
>>Subject: Re: [PATCH 4/4] zone_reclaim_mode is always 0 by default
>>
>>> On Wed, May 13, 2009 at 12:08:12PM +0900, KOSAKI Motohiro wrote:
>>> > Subject: [PATCH] zone_reclaim_mode is always 0 by default
>>> >
>>> > Current linux policy is, if the machine has large remote node distance,
>>> > zone_reclaim_mode is enabled by default because we've be able to assume

>>
>>ok, I will explain the zone reclaim design and its performance tendency.
>>
>>Firstly, we can roughly classify the linux ecosystem:
>> - HPC
>> - high-end server
>> - volume server
>> - desktop
>> - embedded
>>
>>they are separated mainly by typical workload.
>>
>>Secondly, zone_reclaim means "I dislike remote node access much more than
>>disk access".
>>it fits HPC workloads very well, because
>> - an HPC workload typically runs as many processes (or threads) as there
>> are cpus.
>> IOW, the workload typically uses memory equally on each node.
>> - an HPC workload is typically a CPU-bound job. CPU migration is rare.
>> - an HPC workload is typically long-lived. (possibly >1 year)
>> IOW, a remote node allocation leads to _very_ _very_ many remote node
>> accesses.
>>
>>but zone_reclaim doesn't fit the typical server workload.
>> - a server workload often creates a thread pool, and some threads sleep
>> until a request is received.
>> IOW, when a thread wakes up, it might move to another cpu.
>> the node distance tendency doesn't make sense for a workload with weak
>> cpu locality.
>>
>>Plus, the disk cache is the file server's identity. we shouldn't think it's
>>unimportant.
>>Plus, DB software can consume almost all system memory and (in general) RDB
>>data is harder to split equally than in hpc.
>>
>>desktop workload is special. desktop people can run various workloads beyond
>>our assumptions. So, we shouldn't make any workload assumption about desktop
>>people.
>>However, AFAIK almost all desktop software uses memory as UMA.
>>
>>we don't need to care about embedded. it is typically UMA.
>>
>>
>>IOW, the benefit of zone reclaim depends on "strong cpu locality" and
>>"workload is cpu bound" and "thread is long lived".
>>but many workloads don't fulfill the above requirements. IOW, zone reclaim is
>>a workload-dependent feature (as Wu said).
>>
>>
>>In general, a workload-dependent feature doesn't fit as a default option.
>>we can't know what workload the end-user runs anyway.
>>
>>Fortunately (or Unfortunately), typical workload and machine size had
>>significant correlation.
>>Thus, the current default setting calculation worked well in past days.
[YM] Your analysis is clear and deep.

>>
>>Now, it is broken. What should we do?
>>Yanmin, We know 99% of linux people use intel cpus and you are one of the
>>most hard-working repeated-testing guys on lkml and you have tested a lot.
[YM] It's very easy to reproduce them on my machines. :) Sometimes that is only
because the issues exist solely on machines with lots of cpus, while other
community developers have no such environments.

>>May I ask your tested machine and benchmark?
[YM] Usually I run lots of benchmark tests against the latest kernel, but this
issue was first reported by a customer. The customer runs apache on Nehalem
machines to access lots of files, so the issue is a file-server example.

BTW, I found that many fio test cases show a big drop after I upgraded the BIOS
of one Nehalem machine. Checking the vmstat data, I found that almost half of
the memory is always free. It's also related to zone_reclaim_mode, because the
new BIOS changes the node distance to a large value. I use
numactl --interleave=all to work around the problem temporarily.
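
To check whether a machine is in the situation described above, a minimal
sketch along the following lines might help. It assumes the usual Linux paths
/sys/devices/system/node/node*/distance and /proc/sys/vm/zone_reclaim_mode.
The threshold of 20 and the "greater than" comparison are assumptions standing
in for the kernel's RECLAIM_DISTANCE check, which may differ between kernel
versions and architectures. The function names are illustrative only.

```python
from pathlib import Path

# Assumption: stand-in for the kernel's RECLAIM_DISTANCE; the real value
# depends on kernel version and architecture.
ASSUMED_RECLAIM_DISTANCE = 20


def parse_distance_row(text):
    """Parse one sysfs 'distance' file, e.g. '10 21' -> [10, 21]."""
    return [int(tok) for tok in text.split()]


def read_node_distances(root="/sys/devices/system/node"):
    """Return one distance row per NUMA node found under root (may be empty)."""
    return [parse_distance_row(p.read_text())
            for p in Path(root).glob("node[0-9]*/distance")]


def any_distance_above(rows, threshold=ASSUMED_RECLAIM_DISTANCE):
    """True if any node-to-node distance exceeds the threshold."""
    return any(d > threshold for row in rows for d in row)


def read_zone_reclaim_mode(path="/proc/sys/vm/zone_reclaim_mode"):
    """Return the current setting as an int, or None if it cannot be read."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


if __name__ == "__main__":
    rows = read_node_distances()
    print("node distances:", rows)
    print("distance above assumed threshold:", any_distance_above(rows))
    print("zone_reclaim_mode:", read_zone_reclaim_mode())
```

If the reported mode is nonzero and that is not wanted, it can be cleared at
runtime with "sysctl -w vm.zone_reclaim_mode=0".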

I have no HPC environment.

>>
>>if zone_reclaim=0 tendency workloads are more common than zone_reclaim=1
>>tendency workloads,
>> we can drop our fears and we would prioritize your opinion, of course.
[YM] So it seems only file servers have the issue currently.

Yanmin
