[Ksummit-discuss] [CORE TOPIC] Device error handling / reporting / isolation

Matthew Wilcox willy6545 at gmail.com
Fri May 9 17:58:03 UTC 2014


I'm hearing a bunch of FUD around NVMe hotplug but precious little in the
way of bug reports! Keith Busch has been doing a stellar job of fixing up
the bugs that he's found, but I have seen precisely zero hotplug bugs
reported to the NVMe mailing list. So put up or shut up.
On 2014-05-09 1:49 PM, "Roland Dreier" <roland at kernel.org> wrote:

> On Thu, May 8, 2014 at 5:37 AM, David Woodhouse <dwmw2 at infradead.org>
> wrote:
> > I'd like to have a discussion about handling device errors.
> >
> > IOMMUs are becoming more common, and we've seen some failure modes where
> > we just end up with an endless stream of fault reports from a given
> > device, and the kernel can do nothing else.
> >
> > We may have various options for shutting it up — a PCI function level
> > reset, power cycling the offending device, or maybe just configuring the
> > IOMMU to *ignore* further errors from it, which would at least let the
> > system get on with doing something useful (and if we do, when do we
> > re-enable reporting?).
>
> I think there's a more general problem that's worth talking about
> here.  In addition to IOMMU faults, there are lots of other PCI errors
> that can happen, and we have some small number of drivers that have
> been "hardened" to try and recover from these errors.  However, even
> for these "hardened" drivers it seems pretty easy to hit deadlocks
> when the driver tries to tear down and reinitialize things.
>
> So I wonder if we can do better without proliferating error handling
> tentacles into all sorts of low-level drivers ("did we just read
> 0xffffffff here?  how about here?  are we in the middle of error
> recovery?  how about now?").
>
> One context where this is becoming a real concern is with NVMe drives.
> These are SSDs that (may) look like normal 2.5" drives, but use PCIe
> rather than SATA or SAS to connect to the host.  Since they look like
> normal drives, it's natural to put them into hot-pluggable JBODs, but
> it turns out we react much worse to PCIe surprise removal than, say,
> SAS hotplug.
>
>  - R.