[cgl_discussion] Re: [Hardeneddrivers-discuss] RE: Some Initial Comments on DDH-Spec-0.5h.pdf

Jeff Garzik jgarzik at pobox.com
Tue Sep 24 17:29:02 PDT 2002


Andy Pfiffer wrote:
>>Problem Statement:
>>=================
>>Drivers fail and have bugs. The bugs are fixed, new features are added,
>>the cycle repeats.
>>
>>We are trying to do two things with the current doc:
>>
>>1.  Provide guidelines to help the driver creation process such that
>>    drivers stabilize faster and that trouble-case errors are called
>>    out to developers so they watch for them.
>>
>>2.  In the event of a failure or indication that a failure will occur,
>>    enable drivers and user space applications to take actions to reduce
>>    the amount of time required for recovery.
> 
> 
> Much better.  These are clear statements.  So, do we take a "done" on
> The Guidelines when the edited version appears in kernel/Documentation?
> ;^)
> 
> The thought also crossed my mind that with some heavy lifting in
> #define's, one might be able to throw a config switch that would flag
> some, but not all (think pointer indirection into a memory-mapped
> device) driver idioms that go against the guidelines.  Some might
> consider it "training wheels for driver writers", others might consider
> it a useful tool.

No need for a switch...  if certain facets of a driver go against 
guidelines, that's either (a) a bug or (b) intentional.  Either way, a 
switch won't do you much good.


>>In some instances, hardware provides statistics that are indicative
>>that a failure WILL occur in the near term future.
> 
> 
> Ahhh.  I didn't think about that.  My experience with true hardware
> failures is largely limited to cooling fans that won't spin at power-on,
> power supplies that smoke at power-on, disk drives that fail to spin up
> at power-on, and (rarely) drives that develop errors after either a few
> days or a few years of use.  And also: the cable, card, or chip that
> wasn't properly mated to the connector.

FWIW, problem prediction doesn't really exist in a lot of commodity 
hardware.  Many commodity disk drives do support it (the SMART command 
set), but fault-prediction hardware IMO tends to be more expensive and 
enterprise-level.


>>Great suggestion. Anymore suggestions on what kernel code we can apply
>>our "professional paranoia" to?
> 
> 
> Off the top of my head, no.  Much like the checking that occurs at the
> user<-->kernel boundary, if you're truly suspicious of a driver, at
> least some checking could occur at the edges between drivers and the
> rest of the kernel.

Depends on how you define suspicious...  security-wise this is an awful 
gauge.  But for protecting against programming errors, additional 
runtime debugging checks are useful, and some can already be found 
under CONFIG_DEBUG_KERNEL.


> The problem with this is that there are plenty of arguments
> (performance, complexity, clutter, etc.) for why adding that kind of
> checking is a bad idea for a general-use operating system.  Worse, the
> fluidity of in-kernel interfaces would make this an on-going effort.
> 
> On the other hand, if it were a configurable build option, it could be a
> way to protect against some errors in a large body of code with a
> comparatively small amount of effort.

such as the stuff under CONFIG_DEBUG_KERNEL?  :)


>>The big exception being "fault injection testing". I see value in 
>>keeping FI testing.
> 
> 
> Another area where I'm certain I'm ignorant.  Does this work along the
> lines of supplying erroneous input (data, status bits, kernel requests,
> etc.) into the driver to see what happens?

Fault injection is a subset of testing tools.  It's just another piece 
of the kernel toolkit.

However, protecting against certain types of errors in the kernel is 
just way too expensive from a LOC perspective.  If you check for 
errors that only occur during fault injection, you're just making the 
code denser and more complex for absolutely no value.  Checking 
every bit of every field in a NIC's RX DMA descriptor for sanity, for 
example, is pretty silly.  You simply cannot protect against every 
random, crazy thing the hardware might do (such as the classic 
single-bit flip in a bad RAM chip) without overly burdening the kernel 
code.

	Jeff





