[Ksummit-discuss] [CORE TOPIC] Device error handling / reporting / isolation

Bjorn Helgaas bhelgaas at google.com
Thu May 8 18:03:39 UTC 2014


On Thu, May 8, 2014 at 6:37 AM, David Woodhouse <dwmw2 at infradead.org> wrote:
> I'd like to have a discussion about handling device errors.
>
> IOMMUs are becoming more common, and we've seen some failure modes where
> we just end up with an endless stream of fault reports from a given
> device, and the kernel can do nothing else.
>
> We may have various options for shutting it up — a PCI function level
> reset, power cycling the offending device, or maybe just configuring the
> IOMMU to *ignore* further errors from it, which would at least let the
> system get on with doing something useful (and if we do, when do we
> re-enable reporting?).
>
> But I absolutely don't want us to be implementing policies like that in
> an individual IOMMU driver; this needs to be handled by generic device
> code. Once upon a time I might have said PCI code, but this is actually
> relevant for non-PCI devices too.
>
> I want the IOMMU to report errors, and let the system do the appropriate
> thing. Which requires some discussion about what the "appropriate thing"
> can be in various circumstances, and indeed what options are available
> to us on various platforms.

I'm interested in this discussion, too.

