Fault tolerance/bad patch, [was Re: [PATCH 29/30] [PATCH] PCI Hotplug: fake NULL pointer dereferences in IBM Hot Plug Controller Driver]

From: Linas Vepstas
Date: Fri Jun 23 2006 - 11:03:40 EST


Hi,

On Mon, Jun 19, 2006 at 02:43:35PM -0700, Greg KH wrote:
> From: Eric Sesterhenn <snakebyte@xxxxxx>
>
> Remove checks for value, since the hotplug core always provides
> a valid value.
>
> - if (hotplug_slot && value) {
> + if (hotplug_slot) {

This may be the wrong place to bring up a philosphical issue,
but this example was just too good to pass up. This patch violates
the general dictates of high-reliability/fault-tolerant programming.

If someone in the future changes the hotplug core so that it
sometimes returns a null value, this code will potentially crash
and/or do other bad things (corrupt, invalid state, etc.)
This means that this routine will no longer be "robust" in the face of
changes in other parts of the kernel.

I can hear the objections:
-- Performance. B.S. This routine is not performance critical, it will
get called once a week, once a month or less often; a few extra
cycles are utterly irrelevant.

-- Many eyes/shallow bugs. A patch that breaks things would be rejected.
B.S. Patches that break things get into the kernel all the time.

-- If its broken, testing will find it and fix it. B.S. The hotplug
routines are lightly tested, infrequently tested. There's not much
hardware that does hotplug, and its expensive. Home enthsiasts
won't find this.

Worse, the null-value condition could be a rare corner case that won't
show up during "normal" operation, i.e. during "normal" hotplug testing.
It may only get triggered in some obscuare case e.g. bad pci card
or sysadmin error, or due to error in a user-land tool.

Hotplug ops are rare; they typically are not done until they are needed
(e.g. because the card failed), and that is a really bad time to discover
that you've crashed the kernel. The whole point of hotplug is to *not*
have to reboot.

(If I may blather and pompously pontificate some more: This is also
why Linux could never be used in life-critical/safety-critical apps,
like health care, machine control, automoitive, aviation, satellites,
or the Mars rover. Code that controls life-critical apps has zillions
of safety checks for situations that "can't possibly happen". But
every insider knows "can't happen" does happen: take the Hyabusa
asteroid mission as a recent example. My goal in pontificating is not
to reject this one patch, but to change the cowboy attitude that makes
people think that patches like this are OK.)

Thus, I can't begin to imagine why anyone would want to remove
robustness with a patch like this, and gain absolutely nothing in
return.


--linas

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/