Re: [PATCH v1 00/12] netoops support

From: Mike Waychison
Date: Thu Nov 04 2010 - 13:38:26 EST

Next message: André Luis Pereira dos Santos - BSRSoft: "[PATCH 1/1] security: Reordering the boot message security framework 2.6.37-rc1"
Previous message: Herbert Xu: "[PATCH 3/4] crypto: algif_hash - User-space interface for hash operations"
In reply to: Américo Wang: "Re: [PATCH v1 00/12] netoops support"
Next in thread: Américo Wang: "Re: [PATCH v1 00/12] netoops support"

Américo Wang wrote:

On Wed, Nov 03, 2010 at 06:18:41PM -0700, Mike Waychison wrote:
Matt Mackall wrote:
On Wed, 2010-11-03 at 13:29 -0700, Mike Waychison wrote:
Mike Waychison wrote:
FWIW, another semantic difference between netconsole and netoops (that
I had missed in the last email) is filtering: we really do want to get
the whole log when a crash happens, debug messages and all.
Netconsole is subject to console filtering (which we _do_ want as
debug messages going out the uart slows the whole world down).
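
To make the filtering difference concrete, here is a rough user-space sketch. It assumes the usual console rule that a message is printed only when its level is numerically lower than the console loglevel; the struct and function names are made up for illustration and are not from the patchset.

```c
#include <stdbool.h>
#include <stddef.h>

/* Illustrative log record; lower level means more severe (0 = emerg, 7 = debug). */
struct log_msg {
    int level;
    const char *text;
};

/* Console path: a message is shown only if it is more severe than the threshold. */
static bool console_would_print(const struct log_msg *m, int console_loglevel)
{
    return m->level < console_loglevel;
}

/* Number of messages a filtered console stream would carry. */
static size_t count_console(const struct log_msg *log, size_t n, int console_loglevel)
{
    size_t shown = 0;
    for (size_t i = 0; i < n; i++)
        if (console_would_print(&log[i], console_loglevel))
            shown++;
    return shown;
}

/* Crash-dump path (hypothetical): no filtering, every record goes out. */
static size_t count_dump(const struct log_msg *log, size_t n)
{
    (void)log;
    return n;
}
```

With a console loglevel of 4, a debug-level record is dropped from the console stream but is still counted by the dump path.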

netconsole and netoops _do_ have bits in common, for instance the
handling of NETDEV events and source+target configuration. I'd rather
those bits become common between the two than figure out how to jam
the semantics we need into netconsole.

Hi Matt,

I've been reading through the netconsole driver in response to
Greg's comments on this thread, and it is definitely more robust
in terms of configuration and handling of network device events
than the netoops driver I proposed.

I've been following the discussion to see if it went anywhere
interesting..

What are your thoughts on extending netconsole with the same sort
of semantics that are in the netoops patchset?

My first thought is that it's a bit unfortunate that some of the
netconsole configgy bits weren't implemented in a generic way that would
be applicable to other netpoll clients. Some people have never gotten it
into their heads that netconsole isn't the only client.

I'd still like to have blit-dmesg-to-the-network-on-oops
semantics, which seems doable by having a per-target flag for
streaming of console messages (enabled by default) and a flag to
emit a structured full dmesg dump (disabled by default).
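
A minimal sketch of the two per-target flags described here, written as plain user-space C. All names (target_opts, stream_console, dump_on_oops, target_wants) are hypothetical and are not taken from netconsole or from the netoops patches; only the defaults come from the text above.

```c
#include <stdbool.h>

/* Hypothetical per-target settings; names are illustrative only. */
struct target_opts {
    bool stream_console;  /* forward console messages as they are printed */
    bool dump_on_oops;    /* emit a structured full dmesg dump on an oops */
};

/* Defaults as described above: streaming enabled, full dump disabled. */
static struct target_opts target_opts_default(void)
{
    struct target_opts o = { .stream_console = true, .dump_on_oops = false };
    return o;
}

enum target_event { EV_CONSOLE_MSG, EV_OOPS };

/* Returns true if this target should receive output for the given event. */
static bool target_wants(const struct target_opts *o, enum target_event ev)
{
    switch (ev) {
    case EV_CONSOLE_MSG:
        return o->stream_console;
    case EV_OOPS:
        return o->dump_on_oops;
    }
    return false;
}
```

A target left at its defaults behaves like today's netconsole; flipping both flags would give a dump-only target.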

I'd actually like to see you go forward with netoops. It's clear to me
that it's a different beast and complexifying netconsole with a bunch of
weird new options doesn't really sit well. If that means abstracting
some of the sysfs crap from netconsole, great.

I'd be happy to take a stab at this. This solves most of the ABI
reservations that I have with this v1 patchset.

Looking at netconsole, it appears to lack some locking for data
consistency, and it appears that we will deadlock if we ever get a
NETDEV_UNREGISTER event (due to recursively grabbing the rtnl in
netpoll_cleanup). I have a couple patches I've been hacking on this
afternoon that should clear those issues up.

You might want to look at net-next-2.6, it has some fixes
from Neil.

Excellent, yes, 3b410a31 fixes the recursive rtnl deadlock I was referring to.

I'm thinking of pushing all the target handling options down into
net/core/netpoll.c. I'll probably expose this interface as "struct
netpoll_targets" where ->lock and ->list could be completely exposed
to clients. netconsole would then get a lot smaller as would
netoops.
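
Roughly the shape being proposed, as a user-space analogue. This is only a sketch: the real thing would presumably use a kernel spinlock and a struct list_head, whereas this uses a C11 atomic_flag and a singly linked list, and every name below is invented for illustration.

```c
#include <stdatomic.h>
#include <stddef.h>

/* One configured destination; fields are placeholders. */
struct target {
    struct target *next;
    const char *name;
    int enabled;
};

/* Shared container: both ->lock and ->list are visible to clients. */
struct targets {
    atomic_flag lock;     /* stand-in for a kernel spinlock */
    struct target *list;  /* stand-in for a struct list_head */
};

#define TARGETS_INIT { ATOMIC_FLAG_INIT, NULL }

static void targets_lock(struct targets *t)
{
    while (atomic_flag_test_and_set_explicit(&t->lock, memory_order_acquire))
        ; /* spin until the flag is released */
}

static void targets_unlock(struct targets *t)
{
    atomic_flag_clear_explicit(&t->lock, memory_order_release);
}

/* Common code owns insertion, so each client does not reimplement it. */
static void targets_add(struct targets *t, struct target *nt)
{
    targets_lock(t);
    nt->next = t->list;
    t->list = nt;
    targets_unlock(t);
}

/* A client walks the exposed list under the exposed lock. */
static int targets_count_enabled(struct targets *t)
{
    int n = 0;

    targets_lock(t);
    for (struct target *p = t->list; p; p = p->next)
        n += (p->enabled != 0);
    targets_unlock(t);
    return n;
}
```

Each client would then keep only its own send path and iterate the shared list, rather than carrying a private copy of the target bookkeeping.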

That said, I don't think netoops is an ideal name, given how closely
bound oops _events_ are with their textual output. Presumably it covers
events other than oopsen like panics too.

True. We call this code 'netdump' or 'network_dumper' internally,
but I figured it'd be better to follow current conventions with
ramoops and mtdoops already in the tree. I don't really care what
it's called in the end :)

"netdump" was used by a utility that does crash dumping over the network.
It is deprecated now, since we have kdump.

Yup. If you go back far enough, I think this code was gutted out of that utility long, long ago, hence the name.

Regarding rolling oopses: lots of machines regularly survive
oopses, so I think you ought to consider rate-limiting them (to a
configurable rate with a very low default) rather than suppressing
all but the first.
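
For reference, the kind of rate limit being suggested could look roughly like the fixed-window sketch below. The struct, its fields, and the time units are assumptions made for illustration; nothing here is taken from the patchset or from the kernel's own ratelimit helpers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical fixed-window limiter for oops reports. */
struct oops_ratelimit {
    uint64_t interval;      /* window length, in caller-defined time units */
    unsigned burst;         /* reports allowed per window */
    uint64_t window_start;  /* start time of the current window */
    unsigned sent;          /* reports already sent in this window */
};

/* Returns true if a report at time 'now' may be sent, and accounts for it. */
static bool oops_report_allowed(struct oops_ratelimit *rl, uint64_t now)
{
    if (now - rl->window_start >= rl->interval) {
        /* The previous window has expired: start a new one. */
        rl->window_start = now;
        rl->sent = 0;
    }
    if (rl->sent >= rl->burst)
        return false;
    rl->sent++;
    return true;
}
```

Setting burst to 1 with a long interval gives the "very low default"; an interval longer than the machine will ever stay up is effectively the same as suppressing all but the first.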

The trouble with Oopses is just that: We don't know whether we can
safely survive them or not and it's a total gamble each time we do
Oops. We can't programmatically know how crapped out the machine is,
so historically we've erred on the side of not allowing bad things to continue
happening once someone notices something wrong.

It's easier for us to just shoot the machine in the head
(panic_on_oops) and move on than corrupt data or dead-lock in weird
ways at some later point in time. This is definitely not the
behaviour I would want nor expect from my desktop or phone, but for
the cluster, it's just safer.

We also have pause_on_oops, or we can invent an oops_once.

