[Openais] Corosync UDP ports

Mon Mar 15 02:53:55 PDT 2010

Hi All,

In a test that we started last week, we have two Pacemaker+Corosync
clusters, each with three hosts, where all six hosts are on the same
network(s). The two clusters are identically configured, with one
exception: the mcastport is 688 for one and 689 for the other.

This morning I found the clusters in a strange state: none of the
hosts could see any of the others, i.e. the Pacemaker output looked as if
Corosync wasn't running on the other nodes, although the network was
fine, as I could easily verify with ping etc.

I then noticed in the lsof output that Corosync seems to also use the
port below the configured mcastport, which leads me to my questions:

Is this normal? It doesn't seem to be documented in
http://corosync.org/doku.php?id=faq:configure_openais or in
corosync.conf(5).
Is this overlap created by the additional port a likely cause for the
cluster conking out?
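
To make the suspected overlap concrete, here is a minimal Python sketch. It assumes that each Corosync instance binds both the configured mcastport and the port directly below it, which is what the lsof output at the end of this message suggests for the cluster configured with 688. The function names are hypothetical and only serve the illustration.

```python
def ports_in_use(mcastport):
    """Ports one cluster would occupy.

    Assumption: Corosync binds mcastport and mcastport - 1,
    as observed in the lsof output for mcastport 688.
    """
    return {mcastport - 1, mcastport}


def overlapping_ports(mcastport_a, mcastport_b):
    """Ports that two clusters on the same network would share."""
    return sorted(ports_in_use(mcastport_a) & ports_in_use(mcastport_b))


if __name__ == "__main__":
    # The two clusters from this test: 688 and 689 share port 688.
    print(overlapping_ports(688, 689))  # [688]
    # Leaving a gap of at least two between the ports avoids the overlap.
    print(overlapping_ports(688, 690))  # []
```

Under that assumption, adjacent mcastport values always collide on one port, while values that differ by two or more do not.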

Thanks, Colin

PS: I'm in the process of trying to revive the cluster;
/etc/init.d/corosync stop didn't work, but a few "kill -9" and "rm -f
/var/lib/heartbeat/crm/*" commands later I'm up-and-running again on
2x2 of the 2x3 nodes with the same config as previously, looking fine
so far...

root@h001:~# dpkg -l | grep corosync
ii  corosync      1.2.0-0ubuntu1  Standards-based cluster framework (daemon an
ii  libcorosync4  1.2.0-0ubuntu1  Standards-based cluster framework (libraries
root@h001:~# cat /etc/corosync/corosync.conf
totem {
        version: 2
        consensus: 1500
        vsftype: none
        clear_node_high_bit: yes
        secauth: off
        threads: 0
        rrp_mode: passive
        interface {
                ringnumber: 0
                bindnetaddr: 192.168.50.32
                broadcast: yes
                mcastport: 688   <=== 689 for the other cluster
        }
        interface {
                ringnumber: 1
                bindnetaddr: 192.168.52.32
                broadcast: yes
                mcastport: 688   <=== 689 for the other cluster
        }
}
amf {
        mode: disabled
}
service {
        ver:       0
        name:      pacemaker
}
aisexec {
        user:   root
        group:  root
}
logging {
        fileline: off
        to_stderr: yes
        to_logfile: no
        to_syslog: yes
        syslog_facility: daemon
        debug: on
        timestamp: on
        logger_subsys {
                subsys: AMF
                debug: off
                tags: enter|leave|trace1|trace2|trace3|trace4|trace6
        }
}
root@h001:~# lsof -n | grep corosync | grep UDP
corosync  17688  root   5u  IPv4  89563  0t0  UDP 255.255.255.255:688
corosync  17688  root   6u  IPv4  89564  0t0  UDP 192.168.50.40:687
corosync  17688  root   7u  IPv4  89565  0t0  UDP 192.168.50.40:688
corosync  17688  root   8u  IPv4  89612  0t0  UDP 255.255.255.255:688
corosync  17688  root   9u  IPv4  89613  0t0  UDP 192.168.52.40:687
corosync  17688  root  10u  IPv4  89614  0t0  UDP 192.168.52.40:688
root@h001:~#
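
For checking the other hosts the same way, the distinct UDP ports can be pulled out of lsof output with a short script. This is only a sketch: the sample text reuses three lines from the output above with the whitespace normalized, and the helper name `udp_ports` is hypothetical.

```python
import re

# Three lines taken from the lsof output above (whitespace normalized).
LSOF_SAMPLE = """\
corosync  17688  root  5u  IPv4  89563  0t0  UDP 255.255.255.255:688
corosync  17688  root  6u  IPv4  89564  0t0  UDP 192.168.50.40:687
corosync  17688  root  7u  IPv4  89565  0t0  UDP 192.168.50.40:688
"""


def udp_ports(lsof_text):
    """Return the sorted distinct UDP port numbers found in lsof output."""
    ports = set()
    for line in lsof_text.splitlines():
        # Match a trailing "UDP <address>:<port>" field.
        match = re.search(r"UDP\s+\S+:(\d+)\s*$", line)
        if match:
            ports.add(int(match.group(1)))
    return sorted(ports)


if __name__ == "__main__":
    print(udp_ports(LSOF_SAMPLE))  # [687, 688]
```

The same function could be fed the full output of `lsof -n | grep corosync | grep UDP` from each node to compare which ports the two clusters actually hold.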