[Ksummit-discuss] [CORE TOPIC] Device error handling / reporting / isolation

James Bottomley James.Bottomley at HansenPartnership.com
Thu May 8 19:56:34 UTC 2014


On Thu, 2014-05-08 at 13:37 +0100, David Woodhouse wrote:
> I'd like to have a discussion about handling device errors.
> 
> IOMMUs are becoming more common, and we've seen some failure modes where
> we just end up with an endless stream of fault reports from a given
> device, and the kernel can do nothing else.

This is when the addresses being sent by the bus don't have IOTLB
entries?

> We may have various options for shutting it up — a PCI function level
> reset, power cycling the offending device, or maybe just configuring the
> IOMMU to *ignore* further errors from it, which would at least let the
> system get on with doing something useful (and if we do, when do we
> re-enable reporting?).
> 
> But I absolutely don't want us to be implementing policies like that in
> an individual IOMMU driver; this needs to be handled by generic device
> code. Once upon a time I might have said PCI code, but this is actually
> relevant for non-PCI devices too.

Right, with my PARISC hat on, our IOMMUs sit adjacent to the CPUs.  The
PCI busses (if we have any) are a couple of layers down.

> I want the IOMMU to report errors, and let the system do the appropriate
> thing. Which requires some discussion about what the "appropriate thing"
> can be in various circumstances, and indeed what options are available
> to us on various platforms.
> 
> Participants would be those working with IOMMUs on various platforms,
> including Jörg Rödel, myself, and hopefully someone with a fairly
> intimate knowledge of EEH as used on POWER systems.
> 
> We probably also want KVM folks to weigh in on how, if at all, they'd
> want errors on assigned devices to be reported to guests.
> 
> I strongly suspect that once we start looking at it, we'll find other
> triggers than "IOMMU faults" for starting to isolate and reset
> misbehaving devices. Interrupt storms perhaps being one of them — we've
> never been particularly robust to those, either.

I'd be interested, if only to make sure that whatever's agreed to
isn't just Intel-IOMMU-centric.

James




