Re: [PATCH] netpoll can lock up on low memory.

From: Steven Rostedt
Date: Fri Aug 05 2005 - 21:34:15 EST


On Fri, 2005-08-05 at 18:53 -0700, Matt Mackall wrote:
> On Fri, Aug 05, 2005 at 08:23:55PM -0400, Steven Rostedt wrote:
[...]
> > If you need to really get the data out, then the design should be
> > changed. Have some return value showing the failure, check for
> > oops_in_progress or whatever, and try again after turning interrupts
> > back on, and getting to a point where the system can free up memory
> > (write to swap, etc). Just a busy loop without ever getting a skb is
> > just bad.
>
> Why, pray tell, do you think there will be a second chance after
> re-enabling interrupts? How does this work when we're panicking or
> oopsing where we most care? How does this work when the netpoll client
> is the kernel debugger and the machine is completely stopped because
> we're tracing it?

What I meant was to check for an oops and maybe then don't break out.
Otherwise let the system try to reclaim memory. Since this is locked
when the alloc_skb called with GFP_ATOMIC and fails.

>
> As for busy loops, let me direct you to the "poll" part of the name.
> It is in fact the whole point.

In the kernel I would think that a poll would probe for an event and let
the system continue if the event hasn't arrived. Not block all
activities until an event has arrived.

>
> > > > So even a long timeout would not do? So you don't even get a message to
> > > > the console?
> > >
> > > In general, there's no way to measure time here. And if we're
> > > using netconsole, what makes you think there's any other console?
> >
> > Why assume that there isn't another console? The screen may be used
> > with netconsole, you just lose whatever has been scrolled too far.
>
> Yes, there may be another console, but we should by no means depend on
> that being the case. We should in fact assume it's not.
>
> > > > > > Also, as Andi told me, the printk here would probably not show up
> > > > > > anyway if this happens with netconsole.
> > > > >
> > > > > That's fine. But in fact, it does show up occassionally - I've seen
> > > > > it.
> > > >
> > > > Then maybe what Andi told me is not true ;-)
> > > >
> > > > Oh, and did your machine crash when you saw it? Have you seen it with
> > > > the e1000 driver?
> > >
> > > No and no. Most of my own testing is done with tg3.
> > >
> >
> > If you saw the message and the system didn't crash, then that's proof
> > that if the driver is not working properly, you would have lock up the
> > system, and the system was _not_ in a state that it _had_ to get the
> > message out.
>
> Let me be more precise. I've seen it in the middle of an oops dump,
> where it complained, then made further progress, and then died. In
> other words, the code works. And I've since upped the pool size.

OK, this is more clear than what you said previously. When I asked if
the system crashed, I should have asked if the system was crashing. I
thought that you meant that you saw this in normal activity with no
oops.

So, if anything, this discussion has pointed out that the e1000 has a
problem with its netpoll. I wrote an earlier patch, but since I don't
own a e1000, someone will need to test it, or at least check to see if
it looks OK.

-- Steve


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/