[Bugme-new] [Bug 32682] New: libata freeze randomly hangs disk I/O unless libata.dma=0 is set

bugzilla-daemon at bugzilla.kernel.org
Mon Apr 4 23:19:23 PDT 2011


https://bugzilla.kernel.org/show_bug.cgi?id=32682

           Summary: libata freeze randomly hangs disk I/O unless
                    libata.dma=0 is set
           Product: IO/Storage
           Version: 2.5
    Kernel Version: 2.6.38.2
          Platform: All
        OS/Version: Linux
              Tree: Mainline
            Status: NEW
          Severity: normal
          Priority: P1
         Component: Serial ATA
        AssignedTo: jgarzik at pobox.com
        ReportedBy: robin at rainton.com
        Regression: No


There seem to be numerous reports on this topic, but as far as I can tell none
of them mention DMA.

I have seen this problem with the following kernels:

2.6.28.10
2.6.35.9
2.6.38.2

This is on a 64-bit CentOS 5.5 distro. Some parts of the distro are not
compatible with the newer kernels, but I was experimenting to see whether the
problem was corrected in a newer kernel. Sadly it seems not. The hardware is a
GeForce 8200 motherboard (6 x SATA) plus a SiL3114 PCI card (4 x SATA).

The SiL3114 has 2 x 200GB Seagate drives in RAID 1 (boot + root).

The GeForce has 6 x 500GB Samsung drives in RAID 5.

I have seen the following types of message in relation to every drive in this
box, i.e. on both controllers. While some people suspect these errors are
caused by faulty hardware, there is no way that every drive and every port can
be bad. The system has a 450W PSU, and estimates based on the UPS it is
connected to indicate a power draw of well under 200W (the CPU in this system
is an AMD 4850e).

Moreover, hardware problems can effectively be excluded, as the fault seems to
be fixed by adding the following kernel argument at boot: libata.dma=0
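
For reference, the workaround is applied by appending it to the kernel line in
grub.conf on this CentOS 5.5 box. The entry below is only illustrative (the
root= device and file names are from this system, not something the bug
depends on):

title CentOS (2.6.38.2, libata.dma=0 workaround)
        root (hd0,0)
        kernel /vmlinuz-2.6.38.2 ro root=/dev/md0 libata.dma=0
        initrd /initrd-2.6.38.2.img

As I understand it, libata.dma is a bitmask (1 = disk, 2 = ATAPI, 4 = CF), so
0 disables DMA everywhere; throughput suffers badly, but the hangs stop.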

The timing of the fault is seemingly random. System load does not seem to play
a part.

The problem manifests itself as a lockup/freeze of processes that are
performing or waiting for disk I/O (processes not touching the disks carry on
just fine). Inspecting the logs after such an event shows messages like the
ones below.

Although the logging covers only a few seconds, the I/O hang appears to last a
lot longer (30 seconds or more).

Note that in this system, with 10 SATA ports and 8 drives, the 'ata7' below
could equally be any of the other port numbers; the errors are not tied to a
single port.

Apr  4 03:09:11 plex kernel: ata7.00: exception Emask 0x0 SAct 0xf SErr 0x0
action 0x6 frozen
Apr  4 03:09:11 plex kernel: ata7.00: failed command: READ FPDMA QUEUED
Apr  4 03:09:11 plex kernel: ata7.00: cmd 60/50:00:4d:14:bb/00:00:0a:00:00/40
tag 0 ncq 40960 in
Apr  4 03:09:11 plex kernel:          res 40/00:00:00:00:00/00:00:00:00:00/00
Emask 0x4 (timeout)
Apr  4 03:09:11 plex kernel: ata7.00: status: { DRDY }
Apr  4 03:09:11 plex kernel: ata7.00: failed command: READ FPDMA QUEUED
Apr  4 03:09:11 plex kernel: ata7.00: cmd 60/80:08:cd:12:bb/00:00:0a:00:00/40
tag 1 ncq 65536 in
Apr  4 03:09:11 plex kernel:          res 40/00:00:00:00:00/00:00:00:00:00/00
Emask 0x4 (timeout)
Apr  4 03:09:11 plex kernel: ata7.00: status: { DRDY }
Apr  4 03:09:11 plex kernel: ata7.00: failed command: READ FPDMA QUEUED
Apr  4 03:09:11 plex kernel: ata7.00: cmd 60/80:10:4d:13:bb/00:00:0a:00:00/40
tag 2 ncq 65536 in
Apr  4 03:09:11 plex kernel:          res 40/00:00:00:00:00/00:00:00:00:00/00
Emask 0x4 (timeout)
Apr  4 03:09:11 plex kernel: ata7.00: status: { DRDY }
Apr  4 03:09:11 plex kernel: ata7.00: failed command: READ FPDMA QUEUED
Apr  4 03:09:11 plex kernel: ata7.00: cmd 60/08:18:8d:00:dd/00:00:0a:00:00/40
tag 3 ncq 4096 in
Apr  4 03:09:11 plex kernel:          res 40/00:00:00:00:00/00:00:00:00:00/00
Emask 0x4 (timeout)
Apr  4 03:09:11 plex kernel: ata7.00: status: { DRDY }
Apr  4 03:09:11 plex kernel: ata7: hard resetting link
Apr  4 03:09:11 plex kernel: ata7: SATA link up 3.0 Gbps (SStatus 123 SControl
300)
Apr  4 03:09:11 plex kernel: ata7.00: configured for UDMA/133
Apr  4 03:09:11 plex kernel: ata7.00: device reported invalid CHS sector 0
Apr  4 03:09:11 plex last message repeated 3 times
Apr  4 03:09:11 plex kernel: ata7: EH complete
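
Decoding those taskfile dumps with a quick throwaway script (nothing
authoritative; the field order is assumed to be the
command/feature:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device
layout that libata's error handler prints) shows the failed commands are just
ordinary NCQ reads:

def decode_tf(tf):
    # tf is the hex portion of a "cmd" line, e.g. "60/50:00:4d:14:bb/00:00:0a:00:00/40"
    cmd, lo, hi, dev = tf.split("/")
    feat, nsect, lbal, lbam, lbah = (int(x, 16) for x in lo.split(":"))
    hfeat, hnsect, hlbal, hlbam, hlbah = (int(x, 16) for x in hi.split(":"))
    lba = (hlbah << 40) | (hlbam << 32) | (hlbal << 24) | (lbah << 16) | (lbam << 8) | lbal
    nbytes = ((hfeat << 8) | feat) * 512   # for NCQ the sector count is in the FEATURE field
    tag = nsect >> 3                        # and the tag is in bits 7:3 of SECTOR COUNT
    return int(cmd, 16), lba, nbytes, tag

print(decode_tf("60/50:00:4d:14:bb/00:00:0a:00:00/40"))
# -> (96, 180032589, 40960, 0): READ FPDMA QUEUED (0x60), LBA 0x0abb144d,
#    40960 bytes, tag 0, which matches "tag 0 ncq 40960 in" on the first
#    failed command above.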

Earlier kernels produced slightly different output:

Apr  3 04:54:05 plex kernel: ata6.00: NCQ disabled due to excessive errors
Apr  3 04:54:05 plex kernel: ata6.00: exception Emask 0x0 SAct 0x7ff SErr 0x0
action 0x6 frozen
Apr  3 04:54:05 plex kernel: ata6.00: cmd 60/68:00:65:c0:ac/00:00:01:00:00/40
tag 0 ncq 53248 in
Apr  3 04:54:05 plex kernel:          res 40/00:00:00:00:00/00:00:00:00:00/00
Emask 0x4 (timeout)
Apr  3 04:54:05 plex kernel: ata6.00: status: { DRDY }
Apr  3 04:54:05 plex kernel: ata6.00: cmd 60/a0:08:2d:bd:ac/00:00:01:00:00/40
tag 1 ncq 81920 in
Apr  3 04:54:05 plex kernel:          res 40/00:00:00:00:00/00:00:00:00:00/00
Emask 0x4 (timeout)
Apr  3 04:54:05 plex kernel: ata6.00: status: { DRDY }
Apr  3 04:54:05 plex kernel: ata6.00: cmd 60/00:10:cd:bd:ac/01:00:01:00:00/40
tag 2 ncq 131072 in
Apr  3 04:54:05 plex kernel:          res 40/00:00:00:00:00/00:00:00:00:00/00
Emask 0x4 (timeout)
Apr  3 04:54:05 plex kernel: ata6.00: status: { DRDY }
Apr  3 04:54:05 plex kernel: ata6.00: cmd 60/00:18:cd:c0:ac/01:00:01:00:00/40
tag 3 ncq 131072 in
Apr  3 04:54:05 plex kernel:          res 40/00:00:00:00:00/00:00:00:00:00/00
Emask 0x4 (timeout)
Apr  3 04:54:05 plex kernel: ata6.00: status: { DRDY }
Apr  3 04:54:05 plex kernel: ata6.00: cmd 60/00:20:cd:be:ac/01:00:01:00:00/40
tag 4 ncq 131072 in
Apr  3 04:54:05 plex kernel:          res 40/00:00:00:00:00/00:00:00:00:00/00
Emask 0x4 (timeout)
Apr  3 04:54:05 plex kernel: ata6.00: status: { DRDY }
Apr  3 04:54:05 plex kernel: ata6.00: cmd 60/98:28:cd:bf:ac/00:00:01:00:00/40
tag 5 ncq 77824 in
Apr  3 04:54:05 plex kernel:          res 40/00:00:00:00:00/00:00:00:00:00/00
Emask 0x4 (timeout)
Apr  3 04:54:05 plex kernel: ata6.00: status: { DRDY }
Apr  3 04:54:05 plex kernel: ata6.00: cmd 60/f0:30:cd:c1:ac/00:00:01:00:00/40
tag 6 ncq 122880 in
Apr  3 04:54:05 plex kernel:          res 40/00:00:00:00:00/00:00:00:00:00/00
Emask 0x4 (timeout)
Apr  3 04:54:05 plex kernel: ata6.00: status: { DRDY }
Apr  3 04:54:05 plex kernel: ata6.00: cmd 60/10:38:bd:c2:ac/00:00:01:00:00/40
tag 7 ncq 8192 in
Apr  3 04:54:05 plex kernel:          res 40/00:00:00:00:00/00:00:00:00:00/00
Emask 0x4 (timeout)
Apr  3 04:54:05 plex kernel: ata6.00: status: { DRDY }
Apr  3 04:54:05 plex kernel: ata6.00: cmd 60/00:40:cd:c2:ac/01:00:01:00:00/40
tag 8 ncq 131072 in
Apr  3 04:54:05 plex kernel:          res 40/00:00:00:00:00/00:00:00:00:00/00
Emask 0x4 (timeout)
Apr  3 04:54:05 plex kernel: ata6.00: status: { DRDY }
Apr  3 04:54:05 plex kernel: ata6.00: cmd 60/28:48:cd:c3:ac/00:00:01:00:00/40
tag 9 ncq 20480 in
Apr  3 04:54:05 plex kernel:          res 40/00:00:00:00:00/00:00:00:00:00/00
Emask 0x4 (timeout)
Apr  3 04:54:05 plex kernel: ata6.00: status: { DRDY }
Apr  3 04:54:05 plex kernel: ata6.00: cmd 60/60:50:cd:bc:ac/00:00:01:00:00/40
tag 10 ncq 49152 in
Apr  3 04:54:05 plex kernel:          res 40/00:00:00:00:00/00:00:00:00:00/00
Emask 0x4 (timeout)
Apr  3 04:54:05 plex kernel: ata6.00: status: { DRDY }
Apr  3 04:54:05 plex kernel: ata6: hard resetting link
Apr  3 04:54:06 plex kernel: ata6: SATA link up 3.0 Gbps (SStatus 123 SControl
300)
Apr  3 04:54:06 plex kernel: ata6.00: configured for UDMA/133
Apr  3 04:54:06 plex kernel: ata6: EH complete
Apr  3 04:54:06 plex kernel: sd 5:0:0:0: [sdf] 976773168 512-byte hardware
sectors: (500 GB/465 GiB)
Apr  3 04:54:06 plex kernel: sd 5:0:0:0: [sdf] Write Protect is off
Apr  3 04:54:06 plex kernel: sd 5:0:0:0: [sdf] Write cache: enabled, read
cache: enabled, doesn't support DPO or FUA
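
For completeness, NCQ can also be switched off per device at runtime, which
would be a narrower experiment than libata.dma=0 (sdf here is just the device
from the log above):

echo 1 > /sys/block/sdf/device/queue_depth   # a queue depth of 1 effectively disables NCQ for sdf
cat /sys/block/sdf/device/queue_depth        # confirm it took

That has not been tried on this system; so far only libata.dma=0 is confirmed
to avoid the hangs.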
