[Bugme-new] [Bug 12569] New: ocfs2 dlmmaster kernel bug

bugme-daemon at bugzilla.kernel.org bugme-daemon at bugzilla.kernel.org
Wed Jan 28 17:27:38 PST 2009


http://bugzilla.kernel.org/show_bug.cgi?id=12569

           Summary: ocfs2 dlmmaster kernel bug
           Product: File System
           Version: 2.5
     KernelVersion: 2.6.25.18-0.2
          Platform: All
        OS/Version: Linux
              Tree: Mainline
            Status: NEW
          Severity: blocking
          Priority: P1
         Component: Other
        AssignedTo: fs_other at kernel-bugs.osdl.org
        ReportedBy: barzog at telecom.by


Latest working kernel version: none
Earliest failing kernel version: 2.6.25.18-0.2
Distribution: openSUSE 11.0
Hardware Environment: HP DL 140G4/6Gb RAM/amd64
Software Environment:
Problem Description:

I'have 3 identical hardware servers with LAMP. Two with openSUSE 11.0 and one
with Gentoo which using some SAN storage through FC cards with help of OCFS2.
Randomly, but at least 2 times per days all servers fail and all operations on
ocfs2 filesystem blocks with "Permission denied" error. Only hardware reboot at
least 2 nodes helps. Today I've tracked some messages regarding this problem.
At the begin I've saw that all my servers begin to answer error 500 at any
requests. I've stopped all web stuff (apache/nginx) using ocfs2 and unmount all
ocfs2 shares. At one node during unmount of one of the ocfs2 shares I've saw
following:
falcon-cl3:~ # umount /home/apache/users/D1
Segmentation fault
Jan 29 01:31:49 falcon-cl3 kernel: (19786,3):dlm_empty_lockres:2783 ERROR:
lockres W0000000000000002ad047e83635789 still has local locks!
Jan 29 01:31:49 falcon-cl3 kernel: ------------[ cut here ]------------
Jan 29 01:31:49 falcon-cl3 kernel: kernel BUG at fs/ocfs2/dlm/dlmmaster.c:2784!
Jan 29 01:31:49 falcon-cl3 kernel: invalid opcode: 0000 [1] SMP
Jan 29 01:31:49 falcon-cl3 kernel: last sysfs file:
/sys/devices/system/cpu/cpu3/cache/index2/shared_cpu_map
Jan 29 01:31:49 falcon-cl3 kernel: CPU 3
Jan 29 01:31:49 falcon-cl3 kernel: Modules linked in: xt_pkttype xt_TCPMSS
xt_tcpudp ipt_LOG xt_limit ocfs2 8021q ipt_REJECT xt_state iptable_mangle
iptable_nat nf_nat iptable_filter nf_conntrack_netbios_ns nf_conntrack_ipv4
nf_conntrack ip_tables ip6_tables x_tables microcode fuse ocfs2_dlmfs ocfs2_dlm
ocfs2_nodemanager configfs ext2 loop dm_mod qla2xxx firmware_class
scsi_transport_fc i2c_i801 iTCO_wdt iTCO_vendor_support scsi_tgt i2c_core
shpchp pci_hotplug sg i5000_edac tg3 edac_core usbhid hid ff_memless raid456
async_xor async_memcpy async_tx xor raid0 sd_mod ehci_hcd uhci_hcd usbcore edd
raid1 ext3 mbcache jbd ata_piix ahci libata scsi_mod dock
Jan 29 01:31:49 falcon-cl3 kernel: Pid: 19786, comm: umount Tainted: G        N
2.6.25.18-0.2-default #1
Jan 29 01:31:49 falcon-cl3 kernel: RIP: 0010:[<ffffffff88328f78>] 
[<ffffffff88328f78>] :ocfs2_dlm:dlm_empty_lockres+0xe0d/0xea2
Jan 29 01:31:49 falcon-cl3 kernel: RSP: 0018:ffff81010e865b28  EFLAGS: 00010292
Jan 29 01:31:49 falcon-cl3 kernel: RAX: 000000000000006a RBX: 000000000000001f
RCX: ffff81006d0d3800
Jan 29 01:31:49 falcon-cl3 kernel: RDX: 0000000100000000 RSI: 0000000000000096
RDI: 0000000000000296
Jan 29 01:31:49 falcon-cl3 kernel: RBP: ffff81010e865c58 R08: ffffffff80607ee0
R09: ffff810001031280
Jan 29 01:31:49 falcon-cl3 kernel: R10: 0000000000000000 R11: 00001243dc075bdd
R12: 0000000000000001
Jan 29 01:31:49 falcon-cl3 kernel: R13: ffff81006d0d3800 R14: ffff81006d0d3800
R15: ffff81006d0d3898
Jan 29 01:31:49 falcon-cl3 kernel: FS:  00007f18e6ace6f0(0000)
GS:ffff8101ef0d9a40(0000) knlGS:0000000000000000
Jan 29 01:31:49 falcon-cl3 kernel: CS:  0010 DS: 0000 ES: 0000 CR0:
000000008005003b
Jan 29 01:31:49 falcon-cl3 kernel: CR2: 00007f019d6a74a8 CR3: 0000000104898000
CR4: 00000000000006a0
Jan 29 01:31:49 falcon-cl3 kernel: DR0: 0000000000000000 DR1: 0000000000000000
DR2: 0000000000000000
Jan 29 01:31:49 falcon-cl3 kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0
DR7: 0000000000000400
Jan 29 01:31:49 falcon-cl3 kernel: Process umount (pid: 19786, threadinfo
ffff81010e864000, task ffff8101650080c0)
Jan 29 01:31:49 falcon-cl3 kernel: Stack:  ffff81010d125320 ffffffff00000001
0000000000000000 ffff81010e865bf8
Jan 29 01:31:49 falcon-cl3 kernel:  0000000000000092 ffff81010e865c10
ffff8101ec18b400 ffffffd900000000
Jan 29 01:31:50 falcon-cl3 kernel:  ffff8101ec18b45c 0000000000000000
ffff81006d0d3828 000000038022e7eb
Jan 29 01:31:50 falcon-cl3 kernel: Call Trace:
Jan 29 01:31:50 falcon-cl3 kernel:  [<ffffffff8831f6b6>]
:ocfs2_dlm:dlm_unregister_domain+0xe0/0x3d2
Jan 29 01:31:50 falcon-cl3 kernel:  [<ffffffff88404c17>]
:ocfs2:ocfs2_dlm_shutdown+0x78/0x94
Jan 29 01:31:50 falcon-cl3 kernel:  [<ffffffff8841e412>]
:ocfs2:ocfs2_dismount_volume+0x1e1/0x2ec
Jan 29 01:31:50 falcon-cl3 kernel:  [<ffffffff8842084e>]
:ocfs2:ocfs2_put_super+0x1b/0x20
Jan 29 01:31:50 falcon-cl3 kernel:  [<ffffffff802a2558>]
generic_shutdown_super+0x63/0xf1
Jan 29 01:31:50 falcon-cl3 kernel:  [<ffffffff802a25fb>]
kill_block_super+0x15/0x2a
Jan 29 01:31:50 falcon-cl3 kernel:  [<ffffffff802a26dc>]
deactivate_super+0x69/0x7e
Jan 29 01:31:50 falcon-cl3 kernel:  [<ffffffff802b6e9b>]
mntput_no_expire+0x55/0x85
Jan 29 01:31:50 falcon-cl3 kernel:  [<ffffffff802b7626>] sys_umount+0x311/0x343
Jan 29 01:31:50 falcon-cl3 kernel:  [<ffffffff8020bffa>]
system_call_after_swapgs+0x8a/0x8f
Jan 29 01:31:50 falcon-cl3 kernel:  [<00007f18e6438147>]
Jan 29 01:31:50 falcon-cl3 kernel:
Jan 29 01:31:50 falcon-cl3 kernel:
Jan 29 01:31:50 falcon-cl3 kernel: Code: df 0a 00 00 45 8b 4d 14 48 c7 c1 b0 fa
32 88 48 c7 c7 98 20 33 88 65 8b 14 25 24 00 00 00 48 89 04 24 89 d2 31 c0 e8
37 e4 11 f8 <0f> 0b eb fe bf 64 00 00 00 e8 b1 79 f1 f7 e9 f3 f3 ff ff 48 8b
Jan 29 01:31:50 falcon-cl3 kernel: RIP  [<ffffffff88328f78>]
:ocfs2_dlm:dlm_empty_lockres+0xe0d/0xea2
Jan 29 01:31:50 falcon-cl3 kernel:  RSP <ffff81010e865b28>
Jan 29 01:31:50 falcon-cl3 kernel: ---[ end trace 7004c8862242a9a5 ]---
Jan 29 01:31:50 falcon-cl3 kernel: ------------[ cut here ]------------
Jan 29 01:31:50 falcon-cl3 kernel: WARNING: at kernel/exit.c:892
do_exit+0x41/0x6ec()
Jan 29 01:31:50 falcon-cl3 kernel: Modules linked in: xt_pkttype xt_TCPMSS
xt_tcpudp ipt_LOG xt_limit ocfs2 8021q ipt_REJECT xt_state iptable_mangle
iptable_nat nf_nat iptable_filter nf_conntrack_netbios_ns nf_conntrack_ipv4
nf_conntrack ip_tables ip6_tables x_tables microcode fuse ocfs2_dlmfs ocfs2_dlm
ocfs2_nodemanager configfs ext2 loop dm_mod qla2xxx firmware_class
scsi_transport_fc i2c_i801 iTCO_wdt iTCO_vendor_support scsi_tgt i2c_core
shpchp pci_hotplug sg i5000_edac tg3 edac_core usbhid hid ff_memless raid456
async_xor async_memcpy async_tx xor raid0 sd_mod ehci_hcd uhci_hcd usbcore edd
raid1 ext3 mbcache jbd ata_piix ahci libata scsi_mod dock
Jan 29 01:31:50 falcon-cl3 kernel: Pid: 19786, comm: umount Tainted: G      D N
2.6.25.18-0.2-default #1
Jan 29 01:31:50 falcon-cl3 kernel:
Jan 29 01:31:50 falcon-cl3 kernel: Call Trace:
Jan 29 01:31:50 falcon-cl3 kernel:  [<ffffffff8020d696>] dump_trace+0xc4/0x576
Jan 29 01:31:50 falcon-cl3 kernel:  [<ffffffff8020db88>] show_trace+0x40/0x57
Jan 29 01:31:50 falcon-cl3 kernel:  [<ffffffff80447249>] dump_stack+0x72/0x7b
Jan 29 01:31:50 falcon-cl3 kernel:  [<ffffffff80237527>]
warn_on_slowpath+0x58/0x80
Jan 29 01:31:50 falcon-cl3 kernel:  [<ffffffff8023ade6>] do_exit+0x41/0x6ec
Jan 29 01:31:50 falcon-cl3 kernel:  [<ffffffff8044a37f>] oops_begin+0x0/0xa0
Jan 29 01:31:50 falcon-cl3 kernel:  [<ffffffff8020dfd9>] die+0x5d/0x66
Jan 29 01:31:50 falcon-cl3 kernel:  [<ffffffff8044a8aa>] do_trap+0x110/0x11f
Jan 29 01:31:50 falcon-cl3 kernel:  [<ffffffff8020e70c>]
do_invalid_op+0xa0/0xa9
Jan 29 01:31:50 falcon-cl3 kernel:  [<ffffffff80449ca9>] error_exit+0x0/0x60
Jan 29 01:31:50 falcon-cl3 kernel:  [<ffffffff88328f78>]
:ocfs2_dlm:dlm_empty_lockres+0xe0d/0xea2
Jan 29 01:31:50 falcon-cl3 kernel:  [<ffffffff8831f6b6>]
:ocfs2_dlm:dlm_unregister_domain+0xe0/0x3d2
Jan 29 01:31:50 falcon-cl3 kernel:  [<ffffffff88404c17>]
:ocfs2:ocfs2_dlm_shutdown+0x78/0x94
Jan 29 01:31:50 falcon-cl3 kernel:  [<ffffffff8841e412>]
:ocfs2:ocfs2_dismount_volume+0x1e1/0x2ec
Jan 29 01:31:50 falcon-cl3 kernel:  [<ffffffff8842084e>]
:ocfs2:ocfs2_put_super+0x1b/0x20
Jan 29 01:31:50 falcon-cl3 kernel:  [<ffffffff802a2558>]
generic_shutdown_super+0x63/0xf1
Jan 29 01:31:50 falcon-cl3 kernel:  [<ffffffff802a25fb>]
kill_block_super+0x15/0x2a
Jan 29 01:31:50 falcon-cl3 kernel:  [<ffffffff802a26dc>]
deactivate_super+0x69/0x7e
Jan 29 01:31:50 falcon-cl3 kernel:  [<ffffffff802b6e9b>]
mntput_no_expire+0x55/0x85
Jan 29 01:31:50 falcon-cl3 kernel:  [<ffffffff802b7626>] sys_umount+0x311/0x343
Jan 29 01:31:50 falcon-cl3 kernel:  [<ffffffff8020bffa>]
system_call_after_swapgs+0x8a/0x8f
Jan 29 01:31:50 falcon-cl3 kernel:  [<00007f18e6438147>]
Jan 29 01:31:50 falcon-cl3 kernel:
Jan 29 01:31:50 falcon-cl3 kernel: ---[ end trace 7004c8862242a9a5 ]---

And all operation on this share on other nodes fails. Software reboot using
reboot command failed, so I rebooted used power switch. During initial mount in
logs I've saw following:
Jan 29 02:27:07 falcon-cl1 kernel: o2net: connection to node falcon-cl3 (num 1)
at 172.16.2.127:7777 has been idle for 30.0 seconds, shutting it down.
Jan 29 02:27:07 falcon-cl1 kernel: (10235,3):o2net_idle_timer:1414 here are
some times that might help debug the situation: (tmr 1233188797.269265 now
1233188827.266240 dr 1233188797.269248 adv 1233188797.269268:1233188797.269268
func (9df3a87d:513) 1233187854.268609:1233187854.268622)
Jan 29 02:27:07 falcon-cl1 kernel: o2net: no longer connected to node
falcon-cl3 (num 1) at 172.16.2.127:7777
Jan 29 02:27:07 falcon-cl1 kernel: (9637,3):dlm_do_master_request:1409 ERROR:
link to 1 went down!
Jan 29 02:27:07 falcon-cl1 kernel: (10234,1):dlm_do_master_request:1409 ERROR:
link to 1 went down!
Jan 29 02:27:07 falcon-cl1 kernel: (10234,1):dlm_get_lock_resource:986 ERROR:
status = -112
Jan 29 02:27:07 falcon-cl1 kernel: (9637,3):dlm_get_lock_resource:986 ERROR:
status = -112
Jan 29 02:27:07 falcon-cl1 kernel: (10233,1):dlm_do_master_request:1409 ERROR:
link to 1 went down!
Jan 29 02:27:07 falcon-cl1 kernel: (10233,1):dlm_get_lock_resource:986 ERROR:
status = -112
Jan 29 02:27:07 falcon-cl1 kernel: (9635,3):dlm_do_master_request:1409 ERROR:
link to 1 went down!
Jan 29 02:27:07 falcon-cl1 kernel: (9635,3):dlm_get_lock_resource:986 ERROR:
status = -107
Jan 29 02:27:39 falcon-cl1 kernel:  rport-8:0-2: blocked FC remote port time
out: removing rport
Jan 29 02:27:40 falcon-cl1 kernel: (9273,1):ocfs2_dlm_eviction_cb:108 device
(8,64): dlm has evicted node 1
Jan 29 02:27:40 falcon-cl1 kernel: (3482,1):dlm_get_lock_resource:913
558A175E1CC94570940D9BEFCAE48B8E:O00000000000000022b918c00000000: at least one
node (1) to recover before lock mastery can begin
Jan 29 02:27:41 falcon-cl1 kernel: (3482,1):dlm_get_lock_resource:967
558A175E1CC94570940D9BEFCAE48B8E:O00000000000000022b918c00000000: at least one
node (1) to recover before lock mastery can begin
Jan 29 02:27:42 falcon-cl1 kernel: (10234,1):dlm_restart_lock_mastery:1292
ERROR: node down! 1
Jan 29 02:27:42 falcon-cl1 kernel: (10234,1):dlm_wait_for_lock_mastery:1109
ERROR: status = -11
Jan 29 02:27:42 falcon-cl1 kernel: (10233,1):dlm_restart_lock_mastery:1292
ERROR: node down! 1
Jan 29 02:27:42 falcon-cl1 kernel: (10233,1):dlm_wait_for_lock_mastery:1109
ERROR: status = -11
Jan 29 02:27:42 falcon-cl1 kernel: (9637,3):dlm_restart_lock_mastery:1292
ERROR: node down! 1
Jan 29 02:27:42 falcon-cl1 kernel: (9637,3):dlm_wait_for_lock_mastery:1109
ERROR: status = -11
Jan 29 02:27:42 falcon-cl1 kernel: (9635,3):dlm_restart_lock_mastery:1292
ERROR: node down! 1
Jan 29 02:27:42 falcon-cl1 kernel: (9635,3):dlm_wait_for_lock_mastery:1109
ERROR: status = -11
Jan 29 02:27:42 falcon-cl1 kernel: (3451,3):ocfs2_dlm_eviction_cb:108 device
(8,64): dlm has evicted node 1
Jan 29 02:27:43 falcon-cl1 kernel: (9637,3):dlm_get_lock_resource:967
558A175E1CC94570940D9BEFCAE48B8E:O0000000000000000a7761100000000: at least one
node (1) to recover before lock mastery can begin
Jan 29 02:27:43 falcon-cl1 kernel: (10234,1):dlm_get_lock_resource:967
558A175E1CC94570940D9BEFCAE48B8E:M00000000000000022b919100000000: at least one
node (1) to recover before lock mastery can begin
Jan 29 02:27:43 falcon-cl1 kernel: (10233,3):dlm_get_lock_resource:967
558A175E1CC94570940D9BEFCAE48B8E:O0000000000000000d5bf2100000000: at least one
node (1) to recover before lock mastery can begin
Jan 29 02:27:43 falcon-cl1 kernel: (9635,3):dlm_get_lock_resource:967
558A175E1CC94570940D9BEFCAE48B8E:O0000000000000000b9c6f100000000: at least one
node (1) to recover before lock mastery can begin






Steps to reproduce:


-- 
Configure bugmail: http://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug, or are watching someone who is.


More information about the Bugme-new mailing list