[Bugme-new] [Bug 41602] New: DRBD: possible deadlock in Ahead mode

bugzilla-daemon at bugzilla.kernel.org bugzilla-daemon at bugzilla.kernel.org
Tue Aug 23 04:29:38 PDT 2011


https://bugzilla.kernel.org/show_bug.cgi?id=41602

           Summary: DRBD: possible deadlock in Ahead mode
           Product: IO/Storage
           Version: 2.5
    Kernel Version: 3.0.1
          Platform: All
        OS/Version: Linux
              Tree: Mainline
            Status: NEW
          Severity: normal
          Priority: P1
         Component: Other
        AssignedTo: io_other at kernel-bugs.osdl.org
        ReportedBy: cbay at excellency.fr
        Regression: No


My primary server started blocking all IO for no apparent reason. Connection
with the (long distance) secondary server was working as normal.

drbdadm disconnect didn't work either (it blocked, too.) I had to reboot the
server. DRBD is on top of mdadm, and both my disks were working properly.

I've never had any issue with 2.6.39 (and, before that, with previous versions,
although Ahead mode was not available.)


Configuration:

common {

  syncer {
    rate 100M;
    al-extents 3833;
    verify-alg md5;
    use-rle;
  }

  handlers {
    pri-on-incon-degr "echo O > /proc/sysrq-trigger ; halt -f";
    pri-lost-after-sb "echo O > /proc/sysrq-trigger ; halt -f";
  }

  disk {
    on-io-error detach;
  }

  net {
    ping-timeout 50;

    after-sb-0pri disconnect;
    after-sb-1pri discard-secondary;
    after-sb-2pri disconnect;
    rr-conflict disconnect;

    sndbuf-size 10M;
    rcvbuf-size 10M;

    max-buffers 2048;

    cram-hmac-alg "sha1";
    shared-secret "foobar";

    on-congestion pull-ahead;
    congestion-fill 10M;
  }

  startup {
    wfc-timeout 15;
    outdated-wfc-timeout 5;
    degr-wfc-timeout 5;
  }

  protocol A;
}


dmesg on the primary:

Aug 23 04:30:44 kernel: [741635.718352] block drbd0: Congestion-fill threshold
reached
Aug 23 04:30:44 kernel: [741635.718588] block drbd0: conn( Connected -> Ahead )
pdsk( UpToDate -> Consistent )
Aug 23 04:30:46 kernel: [741637.892460] block drbd0: helper command:
/sbin/drbdadm before-resync-source minor-0
Aug 23 04:30:46 kernel: [741637.893614] block drbd0: helper command:
/sbin/drbdadm before-resync-source minor-0 exit code 0 (0x0)
Aug 23 04:30:46 kernel: [741637.893666] block drbd0: conn( Ahead -> SyncSource
) pdsk( Consistent -> Inconsistent )
Aug 23 04:30:46 kernel: [741637.893714] block drbd0: Began resync as SyncSource
(will sync 92388 KB [23097 bits set]).
Aug 23 04:30:46 kernel: [741637.895396] block drbd0: updated sync UUID
31D068DFA6A5A469:0001000000000000:18EFD27FBECBDF53:18EED27FBECBDF53
Aug 23 04:30:55 kernel: [741646.209057] block drbd0: Resync done (total 8 sec;
paused 0 sec; 11548 K/sec)
Aug 23 04:30:55 kernel: [741646.209113] block drbd0: updated UUIDs
31D068DFA6A5A469:0000000000000000:0001000000000000:18EFD27FBECBDF53
Aug 23 04:30:55 kernel: [741646.209164] block drbd0: conn( SyncSource ->
Connected ) pdsk( Inconsistent -> UpToDate )
Aug 23 04:30:55 kernel: [741646.210922] block drbd0: bitmap WRITE of 0 pages
took 0 jiffies
Aug 23 04:30:55 kernel: [741646.211049] block drbd0: 0 KB (0 bits) marked
out-of-sync by on disk bit-map.
Aug 23 04:31:06 kernel: [741657.464005] block drbd0: Congestion-fill threshold
reached
Aug 23 04:31:06 kernel: [741657.464045] block drbd0: conn( Connected -> Ahead )
pdsk( UpToDate -> Consistent )
Aug 23 04:31:26 kernel: [741676.975599] block drbd0: [drbd0_worker/4150]
sock_sendmsg time expired, ko = 4294967295
Aug 23 04:31:32 kernel: [741682.963823] block drbd0: [drbd0_worker/4150]
sock_sendmsg time expired, ko = 4294967294
Aug 23 04:31:38 kernel: [741688.952046] block drbd0: [drbd0_worker/4150]
sock_sendmsg time expired, ko = 4294967293
Aug 23 04:31:44 kernel: [741694.940272] block drbd0: [drbd0_worker/4150]
sock_sendmsg time expired, ko = 4294967292
Aug 23 04:31:50 kernel: [741700.928492] block drbd0: [drbd0_worker/4150]
sock_sendmsg time expired, ko = 4294967291
Aug 23 04:31:56 kernel: [741706.916721] block drbd0: [drbd0_worker/4150]
sock_sendmsg time expired, ko = 4294967290
Aug 23 04:32:02 kernel: [741712.904945] block drbd0: [drbd0_worker/4150]
sock_sendmsg time expired, ko = 4294967289
Aug 23 04:32:08 kernel: [741718.893169] block drbd0: [drbd0_worker/4150]
sock_sendmsg time expired, ko = 4294967288
Aug 23 04:32:14 kernel: [741724.881395] block drbd0: [drbd0_worker/4150]
sock_sendmsg time expired, ko = 4294967287
Aug 23 04:32:20 kernel: [741730.869619] block drbd0: [drbd0_worker/4150]
sock_sendmsg time expired, ko = 4294967286
Aug 23 04:32:26 kernel: [741736.857844] block drbd0: [drbd0_worker/4150]
sock_sendmsg time expired, ko = 4294967285
Aug 23 04:32:32 kernel: [741742.846068] block drbd0: [drbd0_worker/4150]
sock_sendmsg time expired, ko = 4294967284
Aug 23 04:32:38 kernel: [741748.834294] block drbd0: [drbd0_worker/4150]
sock_sendmsg time expired, ko = 4294967283
Aug 23 04:32:44 kernel: [741754.822521] block drbd0: [drbd0_worker/4150]
sock_sendmsg time expired, ko = 4294967282
Aug 23 04:32:50 kernel: [741760.810742] block drbd0: [drbd0_worker/4150]
sock_sendmsg time expired, ko = 4294967281
Aug 23 04:32:56 kernel: [741766.798967] block drbd0: [drbd0_worker/4150]
sock_sendmsg time expired, ko = 4294967280
Aug 23 04:33:02 kernel: [741772.787189] block drbd0: [drbd0_worker/4150]
sock_sendmsg time expired, ko = 4294967279
Aug 23 04:33:08 kernel: [741778.775416] block drbd0: [drbd0_worker/4150]
sock_sendmsg time expired, ko = 4294967278
Aug 23 04:33:14 kernel: [741784.763640] block drbd0: [drbd0_worker/4150]
sock_sendmsg time expired, ko = 4294967277
Aug 23 04:33:20 kernel: [741790.751865] block drbd0: [drbd0_worker/4150]
sock_sendmsg time expired, ko = 4294967276
Aug 23 04:33:26 kernel: [741796.740091] block drbd0: [drbd0_worker/4150]
sock_sendmsg time expired, ko = 4294967275
Aug 23 04:33:32 kernel: [741802.728315] block drbd0: [drbd0_worker/4150]
sock_sendmsg time expired, ko = 4294967274
Aug 23 04:33:38 kernel: [741808.716539] block drbd0: [drbd0_worker/4150]
sock_sendmsg time expired, ko = 4294967273
Aug 23 04:33:44 kernel: [741814.704764] block drbd0: [drbd0_worker/4150]
sock_sendmsg time expired, ko = 4294967272
Aug 23 04:33:50 kernel: [741820.692987] block drbd0: [drbd0_worker/4150]
sock_sendmsg time expired, ko = 4294967271
Aug 23 04:33:52 kernel: [741823.148165] INFO: task jbd2/drbd0-8:4296 blocked
for more than 120 seconds.
Aug 23 04:33:52 kernel: [741823.148202] "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 23 04:33:52 kernel: [741823.148253] jbd2/drbd0-8    D 0000000000000001    
0  4296      2 0x00000000
Aug 23 04:33:52 kernel: [741823.148289]  ffff88023ecc0730 0000000000000046
ffff88022f28a010 ffff880233ad12e0
Aug 23 04:33:52 kernel: [741823.148342]  0000000000012480 ffff88022f28bfd8
ffff88022f28bfd8 0000000000012480
Aug 23 04:33:52 kernel: [741823.148395]  0000000000000001 ffff880233ad12e0
ffff880231d54f00 ffffffff8103d37b
Aug 23 04:33:52 kernel: [741823.148448] Call Trace:
Aug 23 04:33:52 kernel: [741823.148474]  [<ffffffff8103d37b>] ?
__wake_up_common+0x5b/0x90
Aug 23 04:33:52 kernel: [741823.148504]  [<ffffffff81010a62>] ?
read_tsc+0x12/0x40
Aug 23 04:33:52 kernel: [741823.148531]  [<ffffffff81071a5b>] ?
ktime_get_ts+0x6b/0xd0
Aug 23 04:33:52 kernel: [741823.148560]  [<ffffffff810a8e50>] ?
__lock_page+0x70/0x70
Aug 23 04:33:52 kernel: [741823.148589]  [<ffffffff8150eae7>] ?
io_schedule+0x97/0xf0
Aug 23 04:33:52 kernel: [741823.148616]  [<ffffffff810a8e59>] ?
sleep_on_page+0x9/0x10
Aug 23 04:33:52 kernel: [741823.148643]  [<ffffffff8150f3e0>] ?
__wait_on_bit+0x50/0x80
Aug 23 04:33:52 kernel: [741823.148671]  [<ffffffff810a90dc>] ?
wait_on_page_bit+0x6c/0x80
Aug 23 04:33:52 kernel: [741823.148699]  [<ffffffff81068690>] ?
autoremove_wake_function+0x30/0x30
Aug 23 04:33:52 kernel: [741823.148729]  [<ffffffff810b2b2a>] ?
pagevec_lookup_tag+0x1a/0x30
Aug 23 04:33:52 kernel: [741823.148757]  [<ffffffff810a9d1b>] ?
filemap_fdatawait_range+0xdb/0x150
Aug 23 04:33:52 kernel: [741823.148788]  [<ffffffff811aa92f>] ?
jbd2_journal_commit_transaction+0x85f/0x1340
Aug 23 04:33:52 kernel: [741823.148833]  [<ffffffff811ae6c3>] ?
kjournald2+0xb3/0x210
Aug 23 04:33:52 kernel: [741823.148860]  [<ffffffff81068660>] ?
wake_up_bit+0x40/0x40
Aug 23 04:33:52 kernel: [741823.148887]  [<ffffffff811ae610>] ?
commit_timeout+0x10/0x10
Aug 23 04:33:52 kernel: [741823.148916]  [<ffffffff810681e6>] ?
kthread+0x96/0xb0
Aug 23 04:33:52 kernel: [741823.148944]  [<ffffffff81517de4>] ?
kernel_thread_helper+0x4/0x10
Aug 23 04:33:52 kernel: [741823.148973]  [<ffffffff81068150>] ?
kthread_worker_fn+0x190/0x190
Aug 23 04:33:52 kernel: [741823.149001]  [<ffffffff81517de0>] ?
gs_change+0x13/0x13
Aug 23 04:33:52 kernel: [741823.149031] INFO: task mysqld:25472 blocked for
more than 120 seconds.
Aug 23 04:33:52 kernel: [741823.149059] "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 23 04:33:52 kernel: [741823.149104] mysqld          D 0000000000000001    
0 25472   4633 0x00000000
Aug 23 04:33:52 kernel: [741823.149139]  ffff88023ecc0730 0000000000000086
ffff8801d600a010 ffff88023397af60
Aug 23 04:33:52 kernel: [741823.149192]  0000000000012480 ffff8801d600bfd8
ffff8801d600bfd8 0000000000012480
Aug 23 04:33:52 kernel: [741823.149245]  0000000000000001 ffff88023397af60
0000000000000000 0000000000000000
Aug 23 04:33:52 kernel: [741823.149297] Call Trace:
Aug 23 04:33:52 kernel: [741823.149321]  [<ffffffff81009729>] ?
__switch_to+0xe9/0x350
Aug 23 04:33:52 kernel: [741823.149348]  [<ffffffff810a8e50>] ?
__lock_page+0x70/0x70
Aug 23 04:33:52 kernel: [741823.149377]  [<ffffffff8150eae7>] ?
io_schedule+0x97/0xf0
Aug 23 04:33:52 kernel: [741823.149404]  [<ffffffff810a8e59>] ?
sleep_on_page+0x9/0x10
Aug 23 04:33:52 kernel: [741823.149431]  [<ffffffff8150f3e0>] ?
__wait_on_bit+0x50/0x80
Aug 23 04:33:52 kernel: [741823.149458]  [<ffffffff810a90dc>] ?
wait_on_page_bit+0x6c/0x80
Aug 23 04:33:52 kernel: [741823.149486]  [<ffffffff81068690>] ?
autoremove_wake_function+0x30/0x30
Aug 23 04:33:52 kernel: [741823.149515]  [<ffffffff810a949d>] ?
grab_cache_page_write_begin+0xad/0xf0
Aug 23 04:33:52 kernel: [741823.149547]  [<ffffffff8117ab2a>] ?
ext4_da_write_begin+0x14a/0x200
Aug 23 04:33:52 kernel: [741823.149575]  [<ffffffff811a8b98>] ?
jbd2_journal_stop+0x1b8/0x260
Aug 23 04:33:52 kernel: [741823.149604]  [<ffffffff810a8124>] ?
generic_file_buffered_write+0x104/0x240
Aug 23 04:33:52 kernel: [741823.149634]  [<ffffffff810aa2b8>] ?
__generic_file_aio_write+0x278/0x460
Aug 23 04:33:52 kernel: [741823.149664]  [<ffffffff810aa4f8>] ?
generic_file_aio_write+0x58/0xd0
Aug 23 04:33:52 kernel: [741823.149693]  [<ffffffff8116f821>] ?
ext4_file_write+0x71/0x2e0
Aug 23 04:33:52 kernel: [741823.149722]  [<ffffffff810e9a20>] ?
do_sync_write+0xc0/0x100
Aug 23 04:33:52 kernel: [741823.149752]  [<ffffffff8128d30c>] ?
security_file_permission+0x1c/0xb0
Aug 23 04:33:52 kernel: [741823.149781]  [<ffffffff810ea66b>] ?
vfs_write+0xcb/0x180
Aug 23 04:33:52 kernel: [741823.149808]  [<ffffffff810ea7c1>] ?
sys_pwrite64+0xa1/0xb0
Aug 23 04:33:52 kernel: [741823.149835]  [<ffffffff81516d52>] ?
system_call_fastpath+0x16/0x1b
Aug 23 04:33:52 kernel: [741823.149864] INFO: task flush-147:0:4858 blocked for
more than 120 seconds.
Aug 23 04:33:52 kernel: [741823.149892] "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 23 04:33:52 kernel: [741823.149937] flush-147:0     D 0000000000000001    
0  4858      2 0x00000000
Aug 23 04:33:52 kernel: [741823.149971]  ffff88023ecc0730 0000000000000046
ffff88022891a010 ffff8802320f7450
Aug 23 04:33:52 kernel: [741823.150024]  0000000000012480 ffff88022891bfd8
ffff88022891bfd8 0000000000012480
Aug 23 04:33:52 kernel: [741823.150077]  0000000000000001 ffff8802320f7450
ffff880231d45000 ffff880231d45958
Aug 23 04:33:52 kernel: [741823.150130] Call Trace:
Aug 23 04:33:52 kernel: [741823.150152]  [<ffffffff810ab1d4>] ?
mempool_alloc+0x64/0x160
Aug 23 04:33:52 kernel: [741823.150184]  [<ffffffffa006cf55>] ?
bm_change_bits_to+0x85/0x150 [drbd]
Aug 23 04:33:52 kernel: [741823.150214]  [<ffffffff8150f1bd>] ?
schedule_timeout+0x1ed/0x220
Aug 23 04:33:52 kernel: [741823.150243]  [<ffffffff8103da33>] ?
__wake_up+0x43/0x70
Aug 23 04:33:52 kernel: [741823.150270]  [<ffffffff8104019a>] ?
select_task_rq_fair+0x47a/0x9c0
Aug 23 04:33:52 kernel: [741823.150299]  [<ffffffff8150ee1a>] ?
wait_for_common+0xfa/0x1a0
Aug 23 04:33:52 kernel: [741823.150327]  [<ffffffff81045c90>] ?
try_to_wake_up+0x260/0x260
Aug 23 04:33:52 kernel: [741823.150356]  [<ffffffffa00271b1>] ?
lc_get+0x41/0x250 [lru_cache]
Aug 23 04:33:52 kernel: [741823.150387]  [<ffffffffa0086b84>] ?
drbd_al_begin_io+0xf4/0x220 [drbd]
Aug 23 04:33:52 kernel: [741823.150418]  [<ffffffffa00875f0>] ?
drbd_al_read_log+0x3a0/0x3a0 [drbd]
Aug 23 04:33:52 kernel: [741823.150448]  [<ffffffff81119884>] ?
bio_alloc_bioset+0x54/0xf0
Aug 23 04:33:52 kernel: [741823.150476]  [<ffffffff811186d1>] ?
__bio_clone+0x21/0x70
Aug 23 04:33:52 kernel: [741823.150505]  [<ffffffffa0082a6e>] ?
drbd_make_request_common+0xc6e/0x1660 [drbd]
Aug 23 04:33:52 kernel: [741823.150550]  [<ffffffff8103da33>] ?
__wake_up+0x43/0x70
Aug 23 04:33:52 kernel: [741823.150579]  [<ffffffffa008378d>] ?
drbd_make_request+0x32d/0x10c0 [drbd]
Aug 23 04:33:52 kernel: [741823.150609]  [<ffffffff8117afab>] ?
ext4_io_submit+0x2b/0x50
Aug 23 04:33:52 kernel: [741823.150636]  [<ffffffff8117735d>] ?
mpage_da_map_and_submit+0x59d/0x9f0
Aug 23 04:33:52 kernel: [741823.150667]  [<ffffffff812b1f30>] ?
throtl_find_tg+0x40/0x60
Aug 23 04:33:52 kernel: [741823.150694]  [<ffffffff812b2526>] ?
blk_throtl_bio+0x396/0x4d0
Aug 23 04:33:52 kernel: [741823.150723]  [<ffffffff812a361b>] ?
generic_make_request+0x2db/0x3d0
Aug 23 04:33:52 kernel: [741823.150751]  [<ffffffff810ab1d4>] ?
mempool_alloc+0x64/0x160
Aug 23 04:33:52 kernel: [741823.151823]  [<ffffffff812a3781>] ?
submit_bio+0x71/0xf0
Aug 23 04:33:52 kernel: [741823.151851]  [<ffffffff81113bb8>] ?
submit_bh+0xe8/0x150
Aug 23 04:33:52 kernel: [741823.151878]  [<ffffffff81116da0>] ?
__block_write_full_page+0x1e0/0x3d0
Aug 23 04:33:52 kernel: [741823.151907]  [<ffffffff81116540>] ?
end_buffer_async_read+0x160/0x160
Aug 23 04:33:52 kernel: [741823.151936]  [<ffffffff81119ac0>] ?
I_BDEV+0x10/0x10
Aug 23 04:33:52 kernel: [741823.151963]  [<ffffffff810b088a>] ?
__writepage+0xa/0x40
Aug 23 04:33:52 kernel: [741823.151992]  [<ffffffff810b1231>] ?
write_cache_pages+0x1e1/0x410
Aug 23 04:33:52 kernel: [741823.152021]  [<ffffffff810b0880>] ?
get_zeroed_page+0x10/0x10
Aug 23 04:33:52 kernel: [741823.152049]  [<ffffffff810b14a8>] ?
generic_writepages+0x48/0x70
Aug 23 04:33:52 kernel: [741823.152078]  [<ffffffff8110d220>] ?
writeback_single_inode+0xe0/0x200
Aug 23 04:33:52 kernel: [741823.152107]  [<ffffffff8110d88e>] ?
writeback_sb_inodes+0xae/0x190
Aug 23 04:33:52 kernel: [741823.152136]  [<ffffffff8110dfff>] ?
writeback_inodes_wb+0xbf/0x140
Aug 23 04:33:52 kernel: [741823.152165]  [<ffffffff8110e33d>] ?
wb_writeback+0x2bd/0x310
Aug 23 04:33:52 kernel: [741823.152193]  [<ffffffff8110e58b>] ?
wb_do_writeback+0x1fb/0x210
Aug 23 04:33:52 kernel: [741823.152222]  [<ffffffff8110e62a>] ?
bdi_writeback_thread+0x8a/0x170
Aug 23 04:33:52 kernel: [741823.152250]  [<ffffffff8110e5a0>] ?
wb_do_writeback+0x210/0x210
Aug 23 04:33:52 kernel: [741823.152279]  [<ffffffff810681e6>] ?
kthread+0x96/0xb0
Aug 23 04:33:52 kernel: [741823.152305]  [<ffffffff81517de4>] ?
kernel_thread_helper+0x4/0x10
Aug 23 04:33:52 kernel: [741823.152334]  [<ffffffff81068150>] ?
kthread_worker_fn+0x190/0x190
Aug 23 04:33:52 kernel: [741823.152363]  [<ffffffff81517de0>] ?
gs_change+0x13/0x13

(and so on, until the reboot)

Nothing suspicious in the secondary's dmesg. Both nodes were still marked as
connected.

-- 
Configure bugmail: https://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug.


More information about the Bugme-new mailing list