[Openais] Question regarding cluster fail-over time and GATHER states

Herwin Kleinjan herwin.kleinjan at one2many.eu
Fri Mar 12 00:44:34 PST 2010


Hello,

We are currently looking into ways to speed up the fail-over process on
our two-node cluster. It is a RHEL 5.4 cluster running on HP ProLiant
servers with iLO-based power fencing. For shared storage we use a
fiber-based storage array.

Fail-over time might be improved in several places, one of them being
openais or its configuration. During testing, whenever one node fails or
loses power, the other node detects this and starts the fail-over process:

Mar 12 08:46:02 donald01 openais[5301]: [TOTEM] The token was lost in the OPERATIONAL state.
Mar 12 08:46:02 donald01 openais[5301]: [TOTEM] Receive multicast socket recv buffer size (288000 bytes).
Mar 12 08:46:02 donald01 openais[5301]: [TOTEM] Transmit multicast socket send buffer size (288000 bytes).
Mar 12 08:46:02 donald01 openais[5301]: [TOTEM] entering GATHER state from 2.
Mar 12 08:46:07 donald01 openais[5301]: [TOTEM] entering GATHER state from 0.
Mar 12 08:46:07 donald01 openais[5301]: [TOTEM] Creating commit token because I am the rep.
Mar 12 08:46:07 donald01 openais[5301]: [TOTEM] Saving state aru 44 high seq received 44
Mar 12 08:46:07 donald01 openais[5301]: [TOTEM] Storing new sequence id for ring 74
Mar 12 08:46:07 donald01 openais[5301]: [TOTEM] entering COMMIT state.
Mar 12 08:46:07 donald01 openais[5301]: [TOTEM] entering RECOVERY state.
Mar 12 08:46:07 donald01 openais[5301]: [TOTEM] position [0] member 10.227.180.101:
Mar 12 08:46:07 donald01 openais[5301]: [TOTEM] previous ring seq 112 rep 10.227.180.101
Mar 12 08:46:07 donald01 openais[5301]: [TOTEM] aru 44 high delivered 44 received flag 1
Mar 12 08:46:07 donald01 openais[5301]: [TOTEM] Did not need to originate any messages in recovery.
Mar 12 08:46:07 donald01 openais[5301]: [TOTEM] Sending initial ORF token
Mar 12 08:46:07 donald01 openais[5301]: [CLM  ] CLM CONFIGURATION CHANGE
Mar 12 08:46:07 donald01 openais[5301]: [CLM  ] New Configuration:
Mar 12 08:46:07 donald01 openais[5301]: [CLM  ]         r(0) ip(10.227.180.101)
Mar 12 08:46:07 donald01 openais[5301]: [CLM  ] Members Left:
Mar 12 08:46:07 donald01 kernel: dlm: closing connection to node 2
Mar 12 08:46:07 donald01 openais[5301]: [CLM  ]         r(0) ip(10.227.180.102)
Mar 12 08:46:07 donald01 clurgmgrd[7525]: <info> State change: donald02 DOWN
Mar 12 08:46:07 donald01 openais[5301]: [CLM  ] Members Joined:
Mar 12 08:46:07 donald01 openais[5301]: [CLM  ] CLM CONFIGURATION CHANGE
Mar 12 08:46:07 donald01 openais[5301]: [CLM  ] New Configuration:
Mar 12 08:46:07 donald01 openais[5301]: [CLM  ]         r(0) ip(10.227.180.101)
Mar 12 08:46:07 donald01 openais[5301]: [CLM  ] Members Left:
Mar 12 08:46:07 donald01 openais[5301]: [CLM  ] Members Joined:
Mar 12 08:46:07 donald01 openais[5301]: [SYNC ] This node is within the primary component and will provide service.
Mar 12 08:46:07 donald01 openais[5301]: [TOTEM] entering OPERATIONAL state.
Mar 12 08:46:07 donald01 openais[5301]: [CLM  ] got nodejoin message 10.227.180.101
Mar 12 08:46:07 donald01 openais[5301]: [CPG  ] got joinlist message from node 1


As the above /var/log/messages excerpt shows, there is a 5-second window
at the beginning in which apparently nothing happens (08:46:02-08:46:07),
between the two "entering GATHER state" messages. How can I reduce or
remove this delay so that the fail-over completes more quickly?
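
For reference, the gap can be measured from the log itself. The following
is only a minimal Python sketch, not part of openais: the names LOG,
totem_transitions and largest_gap are made up for illustration, the sample
is an abbreviated copy of the excerpt above with the wrapped lines
rejoined, and the year is an assumption because syslog lines do not carry
one.

```python
import re
from datetime import datetime

# Abbreviated copy of the /var/log/messages excerpt (wrapped lines rejoined).
LOG = """\
Mar 12 08:46:02 donald01 openais[5301]: [TOTEM] The token was lost in the OPERATIONAL state.
Mar 12 08:46:02 donald01 openais[5301]: [TOTEM] entering GATHER state from 2.
Mar 12 08:46:07 donald01 openais[5301]: [TOTEM] entering GATHER state from 0.
Mar 12 08:46:07 donald01 openais[5301]: [TOTEM] entering COMMIT state.
Mar 12 08:46:07 donald01 openais[5301]: [TOTEM] entering RECOVERY state.
Mar 12 08:46:07 donald01 openais[5301]: [TOTEM] entering OPERATIONAL state.
"""

# Matches the syslog timestamp and the TOTEM state name in "entering X state".
STATE_RE = re.compile(
    r"^(\w{3} +\d+ \d\d:\d\d:\d\d) \S+ openais\[\d+\]: "
    r"\[TOTEM\] entering (\w+) state"
)


def totem_transitions(text, year=2010):
    """Return (timestamp, state) for every TOTEM state transition in text.

    The year is an assumption; syslog timestamps omit it. Month names are
    parsed with %b, so an English/C locale is assumed as well.
    """
    events = []
    for line in text.splitlines():
        match = STATE_RE.match(line)
        if match:
            stamp = datetime.strptime(
                f"{year} {match.group(1)}", "%Y %b %d %H:%M:%S"
            )
            events.append((stamp, match.group(2)))
    return events


def largest_gap(events):
    """Return (delta, from_state, to_state) for the longest pause."""
    gaps = [
        (later[0] - earlier[0], earlier[1], later[1])
        for earlier, later in zip(events, events[1:])
    ]
    return max(gaps, key=lambda gap: gap[0])


if __name__ == "__main__":
    delta, before, after = largest_gap(totem_transitions(LOG))
    print(f"{delta} between {before} and {after}")
```

Run against the full log, this shows where the time goes between state
transitions. With one-second syslog resolution the measured gap is only
accurate to about a second.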

My current /etc/ais/openais.conf is still the installed default:
totem {
	version: 2
	secauth: off
	threads: 0
	interface {
		ringnumber: 0
		bindnetaddr: 192.168.2.0
		mcastaddr: 226.94.1.1
		mcastport: 5405
	}
}

logging {
	debug: off
	timestamp: on
}

amf {
	mode: disabled
}



