Re: [ANNOUNCE] Minneapolis Cluster Summit, July 29-30

From: Daniel Phillips
Date: Sat Jul 10 2004 - 15:50:58 EST


On Saturday 10 July 2004 13:59, Steven Dake wrote:
> > I'm not saying you're wrong, but I can think of an advantage you
> > didn't mention: a service living in the kernel will inherit the
> > PF_MEMALLOC state of the process that called it, that is, a VM
> > cache flushing task. A userspace service will not. A cluster
> > block device in the kernel may need to invoke some service in
> > userspace at an inconvenient time.
> >
> > For example, suppose somebody spills coffee into a network node
> > while another network node is in PF_MEMALLOC state, busily trying
> > to write out dirty file data to it. The kernel block device now
> > needs to yell to the user space service to go get it a new network
> > connection. But the userspace service may need to allocate some
> > memory to do that, and, whoops, the kernel won't give it any
> > because it is in PF_MEMALLOC state. Now what?
>
> Overload conditions that have caused the kernel to run low on memory
> are a difficult problem, even for kernel components. Currently
> openais includes "memory pools" which preallocate data structures.
> While that work is not yet complete, the intent is to ensure every
> data area is preallocated so the openais executive (the thing that
> does all of the work) doesn't ever request extra memory once it
> becomes operational.
>
> This, of course, leads to problems in the following system calls which
> openais uses extensively:
> sys_poll
> sys_recvmsg
> sys_sendmsg
>
> which require allocating memory with GFP_KERNEL, which can then
> fail, returning ENOMEM to userland. The openais protocol
> currently can handle low memory failures in recvmsg and sendmsg.
> This is because it uses a protocol designed to operate on lossy
> networks.
>
> The poll system call problem will be rectified by utilizing
> sys_epoll_wait which does not allocate any memory (the poll data is
> preallocated).
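
As a rough illustration of the approach described in the quoted text, here
is a minimal C sketch of a service that preallocates its buffers and its
epoll event array at startup and treats a low-memory failure from sendmsg
as ordinary packet loss. This is not openais code: the names pool_get,
pool_put, send_lossy and wait_ready, and the sizes, are hypothetical and
chosen only for this example. It also assumes descriptors are registered
with epoll_ctl once at startup, since registration itself may allocate.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

#define POOL_SLOTS 64     /* hypothetical sizing, not taken from openais */
#define SLOT_SIZE  1024
#define MAX_EVENTS 16

/* Fixed pool: storage lives inside the struct, so no malloc at run time. */
struct pool {
    unsigned char slots[POOL_SLOTS][SLOT_SIZE];
    int free_list[POOL_SLOTS];    /* stack of free slot indices */
    int free_count;
};

static void pool_init(struct pool *p)
{
    for (int i = 0; i < POOL_SLOTS; i++)
        p->free_list[i] = i;
    p->free_count = POOL_SLOTS;
}

/* Returns a slot, or NULL when the pool is exhausted (caller backs off). */
static void *pool_get(struct pool *p)
{
    if (p->free_count == 0)
        return NULL;
    return p->slots[p->free_list[--p->free_count]];
}

/* Returns a slot previously handed out by pool_get. */
static void pool_put(struct pool *p, void *slot)
{
    ptrdiff_t off = (unsigned char *)slot - &p->slots[0][0];
    p->free_list[p->free_count++] = (int)(off / SLOT_SIZE);
}

/* Event array supplied by the caller of epoll_wait, allocated once. */
static struct epoll_event events[MAX_EVENTS];

static int wait_ready(int epfd, int timeout_ms)
{
    return epoll_wait(epfd, events, MAX_EVENTS, timeout_ms);
}

/*
 * Returns 1 if the message was sent, 0 if it was dropped for a transient
 * reason (the protocol is assumed to retransmit), -1 on a hard error.
 */
static int send_lossy(int fd, const void *buf, size_t len)
{
    struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };
    struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1 };

    if (sendmsg(fd, &msg, MSG_DONTWAIT) >= 0)
        return 1;
    if (errno == ENOMEM || errno == ENOBUFS ||
        errno == EAGAIN || errno == EWOULDBLOCK)
        return 0;
    return -1;
}
```

A caller would run pool_init once, register its sockets with epoll_ctl,
and then loop on wait_ready, taking buffers from pool_get and returning
them with pool_put. None of this removes the allocations the kernel
makes on the service's behalf inside the system calls themselves.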

But if the user space service is sitting in the kernel's dirty memory
writeout path, you have a real problem: the low memory condition may
never get resolved, leaving your userspace service unresponsive.
Meanwhile, whoever is generating the dirty memory just keeps spinning
and spinning, generating more of it, ensuring that if the system does
survive the first incident, there's another, worse traffic jam coming
down the pipe. To trigger this deadlock, a kernel filesystem or block
device module just has to lose its cluster connection(s) at the wrong
time.

> I hope that helps at least answer that some r&d is underway to solve
> this particular overload problem in userspace.

I'm certain there's a solution, but until it is demonstrated and proved,
any userspace cluster services must be regarded with narrow squinty
eyes.

> > Though I admit I haven't read through the whole code tree, there
> > doesn't seem to be a distributed lock manager there. Maybe that is
> > because it's so tightly coded I missed it?
>
> There is as yet no implementation of the SAF AIS dlock API in
> openais. The work requires about 4 weeks of development for someone
> well-skilled. I'd expect a contribution for this API in the
> timeframes that make GFS interesting.

I suspect you have underestimated the amount of development time
required.

> I'd invite you, or others interested in these sorts of services, to
> contribute that code, if interested.

Humble suggestion: try grabbing the Red Hat (Sistina) DLM code and see
if you can hack it to do what you want. Just write a kernel module
that exports the DLM interface to userspace in the desired form.

http://sources.redhat.com/cluster/dlm/

Regards,

Daniel