[Ksummit-discuss] [CORE TOPIC] Device error handling / reporting / isolation

Wed May 14 01:24:54 UTC 2014

On Thu, 2014-05-08 at 13:37 +0100, David Woodhouse wrote:
> I'd like to have a discussion about handling device errors.
> 
> IOMMUs are becoming more common, and we've seen some failure modes where
> we just end up with an endless stream of fault reports from a given
> device, and the kernel can do nothing else.

 .../...

I'm definitely interested in this, and would also nominate Gavin Shan
from IBM, who is our EEH expert for the kernel.

To cut a long story short, we have an extensive set of HW facilities
in our PCI host bridges to detect errors and freeze all operations
in and out of devices upon detection of errors, in order to prevent
propagation of bad data.

In addition, we have a recovery process involving the few drivers
that support the corresponding hooks. We could describe the process,
though it can be fairly convoluted.

For devices that don't have the hooks, we fall back to simulating an
unplug of the device (unbinding the driver), followed by a reset and
a re-bind.
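
To make the two paths concrete, here is a minimal user-space C sketch of
the decision only. Every name in it (struct device, struct err_hooks,
recover(), the demo hooks) is hypothetical and made up for illustration;
it is not the EEH code or the kernel's PCI API, and the real hook-based
recovery has more steps than the two callbacks assumed here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model, not kernel code: which path recovered the device. */
enum recovery_path { RECOVERED_BY_DRIVER, RECOVERED_BY_REBIND, RECOVERY_FAILED };

struct device;

/* Assumed shape of the driver recovery hooks (illustrative only). */
struct err_hooks {
    bool (*error_detected)(struct device *dev); /* true: driver can recover */
    bool (*resume)(struct device *dev);         /* true: driver is back up  */
};

struct device {
    bool frozen;                   /* I/O blocked by the host bridge      */
    bool bound;                    /* a driver is attached                */
    int resets;                    /* number of resets performed          */
    const struct err_hooks *hooks; /* NULL if the driver has no hooks     */
};

/* Sample hooks for a cooperative driver. */
static bool demo_error_detected(struct device *dev) { (void)dev; return true; }
static bool demo_resume(struct device *dev) { (void)dev; return true; }
const struct err_hooks demo_hooks = { demo_error_detected, demo_resume };

enum recovery_path recover(struct device *dev)
{
    assert(dev != NULL);

    /* On error detection, freeze all traffic in and out of the device. */
    dev->frozen = true;

    if (dev->hooks && dev->hooks->error_detected && dev->hooks->resume) {
        /* Driver supports the hooks: let it drive its own recovery. */
        if (!dev->hooks->error_detected(dev))
            return RECOVERY_FAILED; /* assumption: no further fallback here */
        dev->frozen = false;
        return dev->hooks->resume(dev) ? RECOVERED_BY_DRIVER : RECOVERY_FAILED;
    }

    /* No hooks: simulate an unplug (unbind), reset, then re-bind. */
    dev->bound = false;
    dev->resets++;
    dev->frozen = false;
    dev->bound = true;
    return RECOVERED_BY_REBIND;
}
```

Calling recover() on a device with hooks == NULL takes the unbind, reset,
re-bind path; with &demo_hooks it goes through the driver callbacks and
no reset is counted in this simplified model.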

Cheers,
Ben.