[cgl_discussion] RE: Some Initial Comments on DDH-Spec-0.5h.pdf

Andy Pfiffer andyp at osdl.org
Tue Sep 24 16:54:22 PDT 2002


> Problem Statement:
> =================
> Drivers fail and have bugs. The bugs are fixed, new features are added,
> the cycle repeats.
> 
> We are trying to do two things with the current doc:
> 
> 1.  Provide guidelines to help the driver creation process, so that
>     drivers stabilize faster and trouble-case errors are called out
>     to developers, who can then watch for them.
> 
> 2.  In the event of a failure or indication that a failure will occur,
>     enable drivers and user space applications to take actions to reduce
>     the amount of time required for recovery.

Much better.  These are clear statements.  So, do we take a "done" on
The Guidelines when the edited version appears in the kernel's
Documentation/ tree?
;^)

The thought also crossed my mind that with some heavy lifting in
#define's, one might be able to throw a config switch that would flag
some, but not all, driver idioms that go against the guidelines
(pointer indirection into a memory-mapped device being one of the hard
cases).  Some might consider it "training wheels for driver writers";
others might consider it a useful tool.
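
Something along these lines, perhaps (a user-space sketch only;
CONFIG_GUIDELINE_CHECKS and REG_READ32 are names I just made up, and
the statement-expression trick is GCC-specific):

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

/* Pretend this comes from the build configuration. */
#define CONFIG_GUIDELINE_CHECKS 1

#if CONFIG_GUIDELINE_CHECKS
/*
 * Checked register read: traps the obviously-bogus case (a NULL
 * pointer) at the call site.  Per the caveat above, a macro can't
 * tell a wild pointer from legitimate indirection into a
 * memory-mapped device, so it only catches the easy mistakes.
 */
#define REG_READ32(p)							\
	({								\
		volatile uint32_t *_p = (p);				\
		if (_p == NULL) {					\
			fprintf(stderr, "REG_READ32: NULL pointer "	\
				"at %s:%d\n", __FILE__, __LINE__);	\
			abort();					\
		}							\
		*_p;							\
	})
#else
/* Checks compiled away: no cost when the switch is off. */
#define REG_READ32(p)	(*(volatile uint32_t *)(p))
#endif

int main(void)
{
	uint32_t fake_status = 0x1;

	printf("status = 0x%x\n", REG_READ32(&fake_status));
	return 0;
}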

Breaking the problem apart this way shows where the infrastructure
should be developed and applied.

> In some instances, hardware provides statistics indicating that a
> failure WILL occur in the near term.

Ahhh.  I didn't think about that.  My experience with true hardware
failures is largely limited to cooling fans that won't spin at power-on,
power supplies that smoke at power-on, disk drives that fail to spin up
at power-on, and (rarely) drives that develop errors after either a few
days or a few years of use.  And also: the cable, card, or chip that
wasn't properly mated to the connector.


> Running routine device diagnostics may be helpful in this as well.
> Right now there is no standard mechanism to expose that data to user
> space applications.  (Hooking into driverfs sounds like the ultimate
> solution.)

Just so I understand a little about these new-fangled widgets that can
predict their own demise: is the usage model something like "offline
the device; perform diagnostics; if it has failed or is failing,
replace the device and bring it back online, or else leave it disabled
and call the service team"?
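
Put as code, my guess at that flow (the state names and the function
are my invention, not anything from the spec):

/* After offlining the device and running the diagnostics: */
enum dev_state { DEV_ONLINE, DEV_OFFLINE, DEV_DISABLED };

static enum dev_state next_state(int diags_passed, int was_replaced)
{
	if (diags_passed || was_replaced)
		return DEV_ONLINE;	/* healthy (or swapped): bring it back */
	return DEV_DISABLED;		/* leave it down; call the service team */
}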



> > 	Generic question: why not focus the "hardening effort" on the
> > 	edges of the kernel interfaces, rather than on a driver-by-driver
> > 	basis?  Specifically: why not put the "professional paranoia"
> > 	into all of the kernel code that calls into drivers, and all
> > 	of the routines commonly called by drivers?  One could move
> > 	from a model of "this driver is hardened" to "all drivers
> > 	are suspect until proven otherwise."  Wouldn't that address
> > 	90% of the perceived problem up front, rather than spending 100%
> > 	effort to "harden" one driver at a time?
> 
> Great suggestion. Any more suggestions on what kernel code we can
> apply our "professional paranoia" to?

Off the top of my head, no.  Much like the checking that occurs at the
user<-->kernel boundary, if you're truly suspicious of a driver, at
least some checking could occur at the edges between drivers and the
rest of the kernel.

The problem with this is that there are plenty of arguments
(performance, complexity, clutter, etc.) for why adding that kind of
checking is a bad idea for a general-use operating system.  Worse, the
fluidity of in-kernel interfaces would make this an ongoing effort.

On the other hand, if it were a configurable build option, it could be a
way to protect against some errors in a large body of code with a
comparatively small amount of effort.
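
By way of illustration only (the ops structure, the CONFIG_ name, and
checked_read are all hypothetical, not an existing kernel interface):

#include <stddef.h>
#include <errno.h>

struct driver_ops {
	long (*read)(void *dev, char *buf, size_t len);
};

#ifdef CONFIG_SUSPECT_DRIVERS
/*
 * The core validates what it can on the way in and sanity-checks the
 * result on the way out, much like the checks at the user<-->kernel
 * boundary.
 */
static long checked_read(struct driver_ops *ops, void *dev,
			 char *buf, size_t len)
{
	long ret;

	if (ops == NULL || ops->read == NULL || buf == NULL)
		return -EINVAL;
	ret = ops->read(dev, buf, len);
	if (ret > (long)len)	/* driver claims more than was asked for */
		return -EIO;
	return ret;
}
#else
/* Checks compiled out for a general-use build. */
#define checked_read(ops, dev, buf, len) \
	((ops)->read((dev), (buf), (len)))
#endif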

In an ideal world, one might be able to come up with an abstraction
layer and a set of interfaces for marshalling interactions across the
edges safely.  I'm not willing to tilt at that windmill, however. ;^)

> I never said that our doc was a final solution, just a starting point
> for a discussion.

Mission: Successful. ;^)


> The big exception being "fault injection testing". I see value in 
> keeping FI testing.

Another area where I'm certain I'm ignorant.  Does this work along the
lines of supplying erroneous input (data, status bits, kernel requests,
etc.) into the driver to see what happens?
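
If so, I imagine something like the following (the names and the 5%
rate are invented; a real FI harness surely does this more cleverly):

#include <stdint.h>
#include <stdlib.h>

static int inject_faults;	/* e.g. flipped on by a test harness */

/*
 * Wrap the driver's view of the hardware and occasionally hand back
 * garbage, to see whether the driver copes or falls over.
 */
static uint32_t fi_read32(volatile uint32_t *reg)
{
	uint32_t val = *reg;

	if (inject_faults && (rand() % 100) < 5)	/* ~5% of reads */
		val ^= 1u << (rand() % 32);		/* flip one status bit */
	return val;
}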

Andy
