Re: (PLEASE READ THIS) Re: weird 3c590 problems

Peter T. Breuer (ptb@it.uc3m.es)
Mon, 27 Apr 1998 16:45:43 +0200 (MET DST)


"A month of sundays ago Michael H. Warfield wrote:"
>
> Sorry for interjecting here but this hits one of my button pushes...
>
> Peter T. Breuer enscribed thusly:
>
> > "A month of sundays ago David S. Miller wrote:"
> > >
> > > From: "Peter T. Breuer" <ptb@it.uc3m.es>
> > >
> > > I have been following every driver since 0.30all and I dropped in 0.99
> > > yesterday without making any difference. kernel is 2.0.25.
> > >
> > > No chance to ever try a newer kernel? The TCP has many problems, both
>
> > I can't take chances in a production environment. Any single failure
> > would take days to locate and require enormous work in updates whilst

> We've been using Linux for many of our critical systems at
> Internet Security Systems, for the last three years. I've been personally
> using it at other sites (including security firewalls) for about five years.
>
> I've got an engineering department with several dozen engineers plus
> QA people, support people, etc, etc, etc... I've never lost "days to locate"

I agree. Unfortunately it is not that _simple_ here. All our machines
are required to teach both windows and unix based programs. That means
for one thing that I have to have dmsdos available for the kernel. That
means at once that I have to patch. Dmsdos does not come in patchable
chunks. I have to revert every version of dmsdos, isolate the
interactions between it and other things, then patch forward again.
I can't learn new kernel code that fast. I have to stick with the code
I know. Then there is ntfs support. Then there is the fact that we run atm
cards. OK - anybody seen atm support in a mainstream kernel yet? Nope.
OK, I patch. Then there is the sound support - awe64 in a mainstream
kernel yet? There are other things needed too that I don't recall
offhand.

It is _clearly_ the better choice here to maintain the stable kernel
code base that we have and patch it minimally to support new features
as we need them. Every effort is made NOT to put anything into the
kernel base and to have it all loaded as modules, but it is not easy.
Sure, maybe I broke something, but I can tell you if I did. I can just
check the changes.

What I want to avoid is throwing new code in slap bang wallop. I am
considering going to 2.0.28, and maybe to 2.0.32. I never saw any
evidence that any other kernel was more stable than what I already had.

If my tcp stack is broken - and it may be - tell me why the same code
and card and net works fine in a PP200 but not in a P200? I have a
theory (could the first foof bug fix perhaps work well in a 686 but
not in a 586?).

> enormous efforts into backporting and are taking serious risks in a
> production environment to avoid something that I have not seen arise in
> the entire time I've been using Linux in production and in engineering
> environments. And I make sure I keep up to date with the latest stable

I respect those views, but similarly, I have seen the results of going
with the newest kernels and systems. There are other departments in the
U. here with different distributions and kernels. I even teach in
their labs. Their installations are broken. Mine is not. Their labs
take 20 minutes to load netscape from a loaded scsi server nfs mounting
40 different systems via redhat 4.2 (insanely mounted via bootp and an
nfs root). Mine take the usual 10s. Their labs have broken dosemu. Mine
works, etc. They have constant breakdowns. I have huge uptimes.
They lose entire fs's (hey, that was my work ..!). I do not.

> kernels on those systems running production kernels.

> I do patch. I patched for ping'o'death and I patched for teardrop and
> now nestea. Last I looked, the prerelease kernels were at 2.0.34-pre11 but

Nestea? Oh oh.

> When a production kernel comes out, I get it installed on a couple
> of our high stress systems immediately, retaining a history of previous

Ditto. But that isn't the same as having a troupe of monkeys trying it
all the time, which is what a production environment really is.

> kernels as rapid fall backs (rarely needed). Once the new kernel has
> performed in our environment for a few days (which some days is a toxic

A few days! I try them for weeks! Then slowly percolate the changes out
wider and wider. At each stage new problems show up. Usually they are
solved by upgrades of other stuff. Occasionally they result in
retracts.

> The 2.0.33 kernel has been out since December. If it wasn't stable, I
> think we would know about it by now. The only versions we run at less than

I don't agree. I had the pleasure of installing redhat 4.2 on a friend's
machine last Friday. They still haven't corrected the fact that more
than 9 hda partitions cause the installation to fail, the instructions
are still incomprehensible (why the heck is there a message about
"modules" in the init screen, when an installer won't know what they
are? Which directory are we supposed to go to in an ftp install?).
There are misspellings all over the place. They haven't corrected for
24x cdroms, etc. etc. I.e. my conclusion is that you can get a kernel
or distribution out there for ages and nobody is going to complain about
errors. Most people won't know they have a fault. The experts will fix
it themselves.

> 2.0.33 are test systems we want to deliberately blow up (we are a security
> company). It's really pretty amazing that occasionally we still run into

OK.

> > Here's the current patches list to my 2.0.25 kernel:
>
> > oboe:/usr/src/linux/patches% cd ../init/patches/
> > output-patch       memory-leak     ufs         raid
> > apm-fix            pnp-sb          byteorder   qnx
> > buffer-correction  eepro100        foof        gcc-2.8.0
> > aic7xxx            3c509           ntfs        atm
> > mailcfg            3c59x-options   pci-update  2048
> > tcp-close          stack           autofs      3c59x
> > e2compr            frag-overlap    dmsdos      ide-floppy+scsi
> > fat32              removable-scsi  lm78
> > menuconfig         nfs-locking     paride
> > csum               sound_modules   paudio
>
> Good grief! I thought you said you couldn't take chances in a
> production environment! The last thing you want in a production environment
> is a patchwork quilt of "a little bit of this and a little bit of that".
> After a while you will eventually find some little interaction between this
> patch or that patch or you will find that a patch for 2.0.33 doesn't quite

I know. And are you telling me those things don't exist in the
production kernel? If you look, you will see that almost all those are
module based add-ons. The rest are extremely localized, except for the
things I am forced to add on. Looking at that list, I only see

memory-leak, stack, foof, frag-overlap, csum, buffer-correction and qnx

as fundamental changes. All except qnx are small. I am worried about
stack and foof and frag-overlap. qnx was an adventure without any
rationale, but I love it. The change was only to the task scheduler, so it
is "conceptually" small.

> do the same thing that it was designed to do or what you expect it to.
> We each run production environments different from one another, but if
> I caught any of my engineers building up a toxic waste dump of patches
> like this and not upgrading to the latest kernel, he would have some

But these patches have been looked at and verified by ME. The real code
has been looked at and hacked by at least 10 different very competent
people. I am not as competent as the authors in their own code, but I
communicate with almost every one of them, and I have contributed
code in most of the non-networking areas. I am probably about 1/2 as
competent as each of them in their own domain, and that's good enough
to be able to spot some things. And most of all, I am very aware of
the dangers. I do not alter basic kernel structures. I do not alter
their length, the order of elements, etc. etc. I know what to look
for. If you want to read papers on software maintenance, read some of
mine ...

> very serious questions to answer.

And I can answer them.

> I can't totally be sure from the list you've provided, but some
> of the names imply that you might have included some of Solar Designer's
> security patches (stack - is that his non-executable stack patch or is that
> something else from another version?). So, you see, your terse names may

Yes. That is the non-exec stack patch.

> not even give a good indication of just what you have done on that kernel.

True. The contents of these say more.

> I get into debates like this on a weekly basis with our MarketDroids
> over patching vs upgrading for our products. Patching is fine for fast
> TEMPORARY fixes for immediate problems. Patching must ALWAYS be followed

I agree.

> by appropriate upgrades in a timely manner as upgrades become available or
> your production environment becomes an unsupported and unsupportable
> environment. You are there now. You have a problem and the description

But I can't go forward without losing the patches that are STILL not
available in production kernels. That is a lot of work.

> of your configuration indicates that it is impractical to try and support
> the configuration you've got. I'm sure Dave is very good, but I doubt he's

Well, let's see. I am going to try and isolate this. At the moment it
is either the tcp stack that I've broken (and we can spot that), or it is
hardware (and we can spot that).

> much in the telepath department. I doubt it would be worth his time to
> assemble such a configuration to track down a problem which could be already

True.

> I can also quote you horror story after horror story of administrators
> who have done exactly what you have done and paid the price. Support people
> refer to some of these calls as "why on my shift" calls. We had one such

Of course. It's hard. I wouldn't support wholesale changes made to my
code that way. But I am willing to try this with another kernel. Before
crying wolf, let me isolate this. I already pointed out that changing
only the MOTHERBOARD + CPU fixes the problem. Now let's narrow it down.

> his configuration was screwed? Who knows... Who cares... He wasted a
> load of our time and his time for trying to nickel and dime patches and not

Whilst my stuff may not be right, it's not wrong. Yes, I know about
support. You don't have to tell me.

> > > correctness and performance wise, in such an old kernel.

> next comment is "can you fix it in the version I've got"? "No. The fix is
> to upgrade to the latest version." (Note this is a "free" upgrade we are
> talking about here) "No. I want it fixed in this version". "There is no

Let me try it with the latest kernel. If it works, has the problem gone
away? Or is it that the latest kernel has a bug that prevents my bug
being manifested? Who knows.

> > 1) I were not using the 2.0.* 3c905 drivers specially written by donald
> > FOR the 2.0.* series
> > 2) the same drivers work perfectly in the same kernels in different
> > mobos on 10BT nets, as far as I can see. I have them installed in
> > over 100 machines without problems. It's the 100BT net that hurts.
> > 3) it is not a question of "performance". The machines are only
> > transmitting at 1.5Mb/s (mega BITs) in tcp on the 100BT net. That's
> > a disaster, not a performance issue. They work fine (i.e. 8.5Mb/s)
> > on a 10BT net. Same mobo, same kernel, same driver.
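
To put rough numbers on point 3: the two throughput figures can be read
as a fraction of the nominal link rate. The following is only a minimal
sketch in Python. The helper name `utilization` and the nominal rates of
10 and 100 Mb/s for 10BT and 100BT are assumptions made for the
illustration; only the 8.5 Mb/s and 1.5 Mb/s measurements come from the
message.

```python
def utilization(measured_mbps, link_mbps):
    """Fraction of the nominal link rate actually achieved (hypothetical helper)."""
    return measured_mbps / link_mbps


# Measured tcp throughput from the message, paired with the assumed nominal link rate.
cases = {
    "10BT": (8.5, 10.0),
    "100BT": (1.5, 100.0),
}

for name, (measured, link) in cases.items():
    share = utilization(measured, link)
    print(f"{name}: {measured} Mb/s of {link} Mb/s nominal = {share:.1%}")
```

On those assumptions the same board, kernel and driver fill about 85% of
the 10BT link but only about 1.5% of the 100BT link, and the absolute
rate on the faster net is lower than on the slower one.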

> If you've got a system where you can run tests on and not interrupt
> production then I would assume you've got a system you can test the

I agree. But I have to write classes for tomorrow. And the techies
can't tie shoelaces together, still less make reliable observations.
So please wait. I have been able to make tests only at weekends.

> latest kernel on. You can always retain previous kernels and switch
> back and forth just with a reboot. Sooner or later, you are going to

I have several kernels available. I agree.

> to the latest kernel. Best do it now while you have some control over
> the situation rather than later with your users nipping at your heels
> because some patch that you need doesn't work with the patchwork kernel

I already have those problems - of course. That's where the work comes
in. The atm patches diverged from the kernel I have about a year ago.
I have had to mirror the development since then.

> > > Later,
> > > David S. Miller
> > > davem@dm.cobaltmicro.com
>
> > Peter
>
> Mike
> --
> Michael H. Warfield | (770) 985-6132 | mhw@WittsEnd.com
> (The Mad Wizard) | (770) 925-8248 | http://www.wittsend.com/mhw/
> NIC whois: MHW9 | An optimist believes we live in the best of all
> PGP Key: 0xDF1DD471 | possible worlds. A pessimist is sure of it!

Peter
