[Bugme-new] [Bug 12553] New: System slows down to a crawl during and after prolonged disk usage
bugme-daemon at bugzilla.kernel.org
Tue Jan 27 11:40:31 PST 2009
http://bugzilla.kernel.org/show_bug.cgi?id=12553
Summary: System slows down to a crawl during and after prolonged disk usage
Product: IO/Storage
Version: 2.5
KernelVersion: 2.6.28.1
Platform: All
OS/Version: Linux
Tree: Mainline
Status: NEW
Severity: high
Priority: P1
Component: Other
AssignedTo: io_other at kernel-bugs.osdl.org
ReportedBy: matti.niemenmaa+kernelbugs at iki.fi
Latest working kernel version: none known; I have only tested the 2.6.28 series
and can't easily try anything else, since I'm on ext4.
Earliest failing kernel version: 2.6.28
Distribution: Arch Linux
Hardware Environment: Intel Core 2 Q9550 (x86-64), 8 GB RAM, Asus P5Q-E
motherboard, Seagate Barracuda 7200.11 1.5 TB disk.
Software Environment: "none". Reproduced at runlevel 1 with practically nothing
running.
Sustained disk usage over a period of time (e.g. downloading torrents, watching
video, or simply copying a lot of data from one partition to another) eventually
makes Linux slow down to the point of being unusable. The slowdown persists even
after the I/O has finished and nothing is using the disk. A reboot fixes the
problem temporarily.
This could be related to Bug 7372 or Bug 12309, but I think the key here is
that I get system-wide poor performance even after the I/O is done: when both
'top' and 'iotop' report that, for all practical purposes, absolutely nothing
is going on.
What happens is that literally /everything/ takes more CPU time. A good
benchmark I found for testing this is 'time man man >/dev/null': it typically
reports around 0.10 seconds of user time, but once the slowdown has set in it
can take over 8 seconds. An 'strace -tt' doesn't point at any particular
culprit: every individual operation takes many times longer.
The Windows installation I occasionally dual-boot into has no problems at all
in this regard.
I have no idea of the actual cause of the problem; I can only describe the
symptoms. I filed this under IO/Storage as my best guess.
Steps to reproduce:
I ran an experiment at runlevel 1. First I dd'd a 500 MB file from /dev/zero.
Then I started two loops: 'time strace -fttT man man >/dev/null; sleep 1' for
benchmarking, and 'cp zero-file zero-file2; rm zero-file2' to reproduce the
problem. I also ran 'vmstat 1' in case it produced anything useful. I left all
of this running for a few hours and logged everything.
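Condensed into commands, the setup above looks like the following sketch (file
names and sizes are those from the report; the loops run until interrupted, and
each loop is meant to run in its own shell):

```shell
# Run everything at runlevel 1 to minimize interference.

# Create the 500 MB test file once:
dd if=/dev/zero of=zero-file bs=1M count=500

# Shell 1: benchmark loop, one timed run per second; 'time' and strace
# both write to stderr, so log by redirecting stderr if desired.
while :; do time strace -fttT man man >/dev/null; sleep 1; done

# Shell 2: the I/O load that triggers the slowdown.
while :; do cp zero-file zero-file2; rm zero-file2; done

# Shell 3: log memory and block-I/O counters once per second.
vmstat 1
```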
The experiment started at around 9:20 local time. The 'time man man' logs show
that for a long time every run completed within 0.15 seconds of user time, most
below 0.10. Three hours later, at 12:21:03, the following starts to appear:
0.10user 0.13system 0:00.23elapsed 102%CPU (0avgtext+0avgdata 0maxresident)k
0.07user 0.10system 0:02.79elapsed 6%CPU (0avgtext+0avgdata 0maxresident)k
2.60user 0.44system 0:07.92elapsed 38%CPU (0avgtext+0avgdata 0maxresident)k
3.15user 0.51system 0:05.74elapsed 63%CPU (0avgtext+0avgdata 0maxresident)k
4.54user 0.68system 0:12.46elapsed 41%CPU (0avgtext+0avgdata 0maxresident)k
5.05user 0.53system 0:11.19elapsed 49%CPU (0avgtext+0avgdata 0maxresident)k
4.28user 0.55system 0:11.39elapsed 42%CPU (0avgtext+0avgdata 0maxresident)k
0.10user 0.14system 0:03.60elapsed 6%CPU (0avgtext+0avgdata 0maxresident)k
0.08user 0.16system 0:00.82elapsed 30%CPU (0avgtext+0avgdata 0maxresident)k
The times spiked very high for a while there. (Normally I'd reboot around this
point, since doing almost anything is painfully slow.) This spiking goes on and
on. An hour later I arrived at the scene and killed the 'cp' loop. About 10
minutes later, this happened:
0.16user 0.16system 0:03.16elapsed 10%CPU (0avgtext+0avgdata 0maxresident)k
0.36user 0.15system 0:03.62elapsed 14%CPU (0avgtext+0avgdata 0maxresident)k
1.49user 0.22system 0:01.99elapsed 86%CPU (0avgtext+0avgdata 0maxresident)k
3.12user 0.28system 0:05.33elapsed 63%CPU (0avgtext+0avgdata 0maxresident)k
4.03user 0.50system 0:06.45elapsed 70%CPU (0avgtext+0avgdata 0maxresident)k
4.61user 0.51system 0:08.89elapsed 57%CPU (0avgtext+0avgdata 0maxresident)k
4.40user 0.58system 0:09.18elapsed 54%CPU (0avgtext+0avgdata 0maxresident)k
5.31user 0.54system 0:09.94elapsed 58%CPU (0avgtext+0avgdata 0maxresident)k
5.43user 0.55system 0:10.71elapsed 55%CPU (0avgtext+0avgdata 0maxresident)k
5.36user 0.57system 0:10.03elapsed 59%CPU (0avgtext+0avgdata 0maxresident)k
5.31user 0.70system 0:07.93elapsed 75%CPU (0avgtext+0avgdata 0maxresident)k
5.30user 0.66system 0:09.76elapsed 61%CPU (0avgtext+0avgdata 0maxresident)k
5.35user 0.75system 0:09.37elapsed 65%CPU (0avgtext+0avgdata 0maxresident)k
The times just stayed up from then on. After 10 minutes of times ranging from 5
to 8 seconds of user time, I ended the experiment.
I omitted the 'pagefaults' lines from the 'time' output: they are all nearly
identical, 1416 outputs and around 7700 minor page faults, with everything else
at zero.
Attached are two straces: the first run, and the first run that took over 7
seconds of user time, which was among the last of all the runs. I'll also
attach my kernel .config, a boot dmesg, and 'lspci -vvv' output; I don't know
what other information would be useful.
Here's the ver_linux output for the kernel I ran my experiment on:
Linux niðavellir 2.6.28.1-deewiant #20 SMP PREEMPT Fri Jan 23 22:20:54 EET
2009 x86_64 Intel(R) Core(TM)2 Quad CPU Q9550 @ 2.83GHz GenuineIntel GNU/Linux
Gnu C 4.3.2
Gnu make 3.81
binutils 2.19.0.20081119
util-linux 2.14
mount support
module-init-tools 3.5
e2fsprogs 1.41.3
pcmciautils 015
PPP 2.4.4
Linux C Library 2.9
Dynamic linker (ldd) 2.9
Linux C++ Library 6.0.10
Procps 3.2.7
Net-tools 1.60
Kbd 1.13
Sh-utils 6.12
Modules Loaded