Re: 4.6.2 frequent crashes under memory + IO pressure

From: Tetsuo Handa
Date: Thu Jun 23 2016 - 07:26:54 EST


Johannes Stezenbach wrote:
> What is your opinion about older kernels (4.4, 4.5) working?
> I think I've seen some OOM messages with the older kernels,
> java was killed and I restarted the build to complete it.
> A full bisect would take more than a day, I don't think
> I have the time for it.
> Since I use dm-crypt + lvm, should we add more Cc or do
> you think it is an mm issue?

I have no idea.

> > > Below I'm pasting some log snippets, let me know if you like
> > > it so much you want more of it ;-/ The total log is about 1.7MB.
> >
> > Yes, I'd like to browse it. Could you send it to me?
>
> Did you get any additional insights from it?

I found

[ 2245.660712] DMA free:4kB min:32kB
[ 2245.707031] DMA32 free:0kB min:6724kB
[ 2245.757597] Normal free:24kB min:928kB
[ 2245.806515] DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
[ 2245.816359] DMA32: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
[ 2245.826378] Normal: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB

[ 2317.853951] DMA free:0kB min:32kB
[ 2317.900460] DMA32 free:0kB min:6724kB
[ 2317.951574] Normal free:0kB min:928kB
[ 2318.000808] DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
[ 2318.010713] DMA32: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
[ 2318.020767] Normal: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB

which shows that memory reserves were completely depleted. So, please try commit
78ebc2f7146156f4
("mm,writeback: don't use memory reserves for wb_start_writeback") on your 4.6.2
kernel. As far as I know, passing the mem=4G option will have an equivalent effect.
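
To look for the same pattern elsewhere in the 1.7MB log, a minimal Python sketch
like the one below could flag every zone line whose free memory is below its min
watermark. It assumes the abbreviated "ZONE free:NkB min:NkB" form quoted above;
the function name zones_below_min() and the SAMPLE text are made up for
illustration and are not part of any kernel tooling.

```python
import re

# Matches zone lines such as "[ 2245.660712] DMA free:4kB min:32kB".
# Assumption: "min:" directly follows "free:" on the same line.
ZONE_RE = re.compile(r"(\w+) free:(\d+)kB min:(\d+)kB")


def zones_below_min(log_text):
    """Return (zone, free_kb, min_kb) for each line where free < min."""
    hits = []
    for line in log_text.splitlines():
        m = ZONE_RE.search(line)
        if m is None:
            continue  # not a zone summary line
        zone = m.group(1)
        free_kb = int(m.group(2))
        min_kb = int(m.group(3))
        if free_kb < min_kb:
            hits.append((zone, free_kb, min_kb))
    return hits


# Three of the lines quoted above, used as sample input.
SAMPLE = """\
[ 2245.660712] DMA free:4kB min:32kB
[ 2245.707031] DMA32 free:0kB min:6724kB
[ 2245.757597] Normal free:24kB min:928kB
"""

if __name__ == "__main__":
    for zone, free_kb, min_kb in zones_below_min(SAMPLE):
        print(f"{zone}: free {free_kb}kB is below min {min_kb}kB")
```

Every zone in the sample is reported, which matches the reading above that the
reserves were used up.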

Since you think you saw OOM messages with the older kernels, I assume that the OOM
killer was invoked on your 4.6.2 kernel. The OOM reaper in Linux 4.6 and Linux 4.7
will not help if the OOM-killed process was between down_write(&mm->mmap_sem) and
up_write(&mm->mmap_sem).

I was not able to confirm whether the OOM-killed process (I guess it was java)
was holding mm->mmap_sem for write, because either
/proc/sys/kernel/hung_task_warnings dropped to 0 before traces of the java
threads were printed, or the console became unusable due to the
"delayed: kcryptd_crypt, ..." line. Anyway, I think that kmallocwd will report it.
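
Before reproducing, it may help to check how many hung task warnings are still
left. The following is only a sketch: it reads the sysctl file named above, and
the helper names remaining_hung_task_warnings() and describe() are hypothetical.
It assumes the documented sysctl semantics, where 0 means no further warnings
are printed and -1 means an unlimited number.

```python
from pathlib import Path

# The sysctl file mentioned above; present only when the hung task
# detector is built into the kernel.
HUNG_TASK_WARNINGS = Path("/proc/sys/kernel/hung_task_warnings")


def remaining_hung_task_warnings(path=HUNG_TASK_WARNINGS):
    """Return the remaining warning count as an int, or None if unreadable."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


def describe(remaining):
    """Turn the raw counter into a short human-readable status."""
    if remaining is None:
        return "hung_task_warnings is not available"
    if remaining < 0:
        return "unlimited hung task warnings"
    if remaining == 0:
        return "exhausted: further hung tasks will not be reported"
    return f"{remaining} hung task warning(s) left"


if __name__ == "__main__":
    print(describe(remaining_hung_task_warnings()))
```

If the counter is already 0 before the test starts, traces of the stalled
threads will not show up in the log at all.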

> > It is sad that we haven't merged kmallocwd which will report
> > which memory allocations are stalling
> > ( http://lkml.kernel.org/r/1462630604-23410-1-git-send-email-penguin-kernel@xxxxxxxxxxxxxxxxxxx ).
>
> Would you like me to try it? It wouldn't prevent the hang, though,
> just print better debug output to serial console, right?
> Or would it OOM kill some process?

Yes, but for bisection purposes, please try commit 78ebc2f7146156f4 without
applying kmallocwd. If that commit helps avoid the flood of allocation
failure warnings, we can consider backporting it. If that commit does not
help, I think you are reporting a new location where we should not use
memory reserves.

kmallocwd will not OOM kill any process. kmallocwd will not prevent the hang.
kmallocwd just prints information about threads which are stalling inside a
memory allocation request.