[Ksummit-discuss] [CORE TOPIC] Device error handling / reporting / isolation

Benjamin Herrenschmidt benh at kernel.crashing.org
Wed May 14 01:50:08 UTC 2014


On Tue, 2014-05-13 at 12:27 +0100, David Woodhouse wrote:
> You probably don't want to completely isolate it in that case. If it's
> doing some bad DMA *and* it's also doing some good DMA to display its
> framebuffer, why stop the latter?

I don't think you can go to that level of granularity. We certainly
can't on power.

Propagation of bad data due to faulty adapters or simple bit flips
is a really big issue on servers, and the policy for us is simple: on the
first "hint" of an error, block *all* traffic to and from the adapter.

Then the driver can get into the dance to figure out what's up (we can
selectively enable MMIO under driver control to try to get at diagnostic
registers for example) and reset / reconfigure things.
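
To make that sequence concrete, here is a rough sketch of the policy as a
tiny state machine in C. It is only a toy model, not kernel code: every
name in it (adapter_state, report_error, enable_mmio, reset_adapter) is
made up for illustration, and it assumes a single adapter with no
concurrency.

```c
#include <stdbool.h>

/* Hypothetical model of the "freeze on first error" policy. */
enum adapter_state {
    ADAPTER_OK,        /* MMIO and DMA both flow */
    ADAPTER_FROZEN,    /* all traffic to and from the adapter is blocked */
    ADAPTER_MMIO_ONLY  /* driver re-enabled MMIO for diagnostics; DMA still blocked */
};

struct adapter {
    enum adapter_state state;
};

static bool mmio_allowed(const struct adapter *a)
{
    return a->state != ADAPTER_FROZEN;
}

static bool dma_allowed(const struct adapter *a)
{
    return a->state == ADAPTER_OK;
}

/* First hint of an error: block everything, whatever the current state. */
static void report_error(struct adapter *a)
{
    a->state = ADAPTER_FROZEN;
}

/* Driver asks for MMIO back to read diagnostic registers.
 * Only meaningful while frozen; returns false otherwise. */
static bool enable_mmio(struct adapter *a)
{
    if (a->state != ADAPTER_FROZEN)
        return false;
    a->state = ADAPTER_MMIO_ONLY;
    return true;
}

/* After a reset / reconfigure, traffic is allowed again. */
static void reset_adapter(struct adapter *a)
{
    a->state = ADAPTER_OK;
}
```

The point of the model is that DMA never comes back on its own: it is
only re-enabled by an explicit reset, while MMIO can be opened up earlier
under driver control.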

> The Intel IOMMU at least can be configured to avoid reporting faults for
> a given device (well, requester-id). So valid transactions still happen,
> while invalid transactions are still blocked. But silently, without
> bothering the host with the details and causing a fault-IRQ storm.

I would argue against that sort of policy. At least in server contexts.

It could well be that this is appropriate for laptops/desktops, I don't know,
but once an adapter starts doing bad DMAs, I don't think you can really trust
much of anything coming out of it anymore.

Cheers,
Ben.



