[Openais] Question regarding cluster fail-over time and GATHER states
Herwin Kleinjan
herwin.kleinjan at one2many.eu
Fri Mar 12 00:44:34 PST 2010
Hello,
We are looking into ways to speed up the fail-over process on our two-node
cluster. It is a RHEL 5.4 cluster running on HP ProLiant servers with
iLO-based power fencing. For shared storage we use a fiber-based storage
array.
There are several places where fail-over time might be improved, one of them
being openais or its configuration. During testing, whenever one node fails
or its power is disconnected, the other node detects this and starts the
fail-over process:
Mar 12 08:46:02 donald01 openais[5301]: [TOTEM] The token was lost in the OPERATIONAL state.
Mar 12 08:46:02 donald01 openais[5301]: [TOTEM] Receive multicast socket recv buffer size (288000 bytes).
Mar 12 08:46:02 donald01 openais[5301]: [TOTEM] Transmit multicast socket send buffer size (288000 bytes).
Mar 12 08:46:02 donald01 openais[5301]: [TOTEM] entering GATHER state from 2.
Mar 12 08:46:07 donald01 openais[5301]: [TOTEM] entering GATHER state from 0.
Mar 12 08:46:07 donald01 openais[5301]: [TOTEM] Creating commit token because I am the rep.
Mar 12 08:46:07 donald01 openais[5301]: [TOTEM] Saving state aru 44 high seq received 44
Mar 12 08:46:07 donald01 openais[5301]: [TOTEM] Storing new sequence id for ring 74
Mar 12 08:46:07 donald01 openais[5301]: [TOTEM] entering COMMIT state.
Mar 12 08:46:07 donald01 openais[5301]: [TOTEM] entering RECOVERY state.
Mar 12 08:46:07 donald01 openais[5301]: [TOTEM] position [0] member 10.227.180.101:
Mar 12 08:46:07 donald01 openais[5301]: [TOTEM] previous ring seq 112 rep 10.227.180.101
Mar 12 08:46:07 donald01 openais[5301]: [TOTEM] aru 44 high delivered 44 received flag 1
Mar 12 08:46:07 donald01 openais[5301]: [TOTEM] Did not need to originate any messages in recovery.
Mar 12 08:46:07 donald01 openais[5301]: [TOTEM] Sending initial ORF token
Mar 12 08:46:07 donald01 openais[5301]: [CLM ] CLM CONFIGURATION CHANGE
Mar 12 08:46:07 donald01 openais[5301]: [CLM ] New Configuration:
Mar 12 08:46:07 donald01 openais[5301]: [CLM ] r(0) ip(10.227.180.101)
Mar 12 08:46:07 donald01 openais[5301]: [CLM ] Members Left:
Mar 12 08:46:07 donald01 kernel: dlm: closing connection to node 2
Mar 12 08:46:07 donald01 openais[5301]: [CLM ] r(0) ip(10.227.180.102)
Mar 12 08:46:07 donald01 clurgmgrd[7525]: <info> State change: donald02 DOWN
Mar 12 08:46:07 donald01 openais[5301]: [CLM ] Members Joined:
Mar 12 08:46:07 donald01 openais[5301]: [CLM ] CLM CONFIGURATION CHANGE
Mar 12 08:46:07 donald01 openais[5301]: [CLM ] New Configuration:
Mar 12 08:46:07 donald01 openais[5301]: [CLM ] r(0) ip(10.227.180.101)
Mar 12 08:46:07 donald01 openais[5301]: [CLM ] Members Left:
Mar 12 08:46:07 donald01 openais[5301]: [CLM ] Members Joined:
Mar 12 08:46:07 donald01 openais[5301]: [SYNC ] This node is within the primary component and will provide service.
Mar 12 08:46:07 donald01 openais[5301]: [TOTEM] entering OPERATIONAL state.
Mar 12 08:46:07 donald01 openais[5301]: [CLM ] got nodejoin message 10.227.180.101
Mar 12 08:46:07 donald01 openais[5301]: [CPG ] got joinlist message from node 1
As you can see from the above /var/log/messages excerpt, there is a 5-second
window at the beginning (08:46:02 to 08:46:07), between the two GATHER state
entries, in which apparently nothing happens. How can I reduce or remove this
delay so that the fail-over completes more quickly?
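To make the pause easy to spot in longer logs, here is a minimal Python sketch that finds the widest gap between consecutive syslog entries. The helper names parse_ts and largest_gap are hypothetical, the sample lines are copied from the excerpt above, and the year is an assumption because syslog does not record it:

```python
from datetime import datetime

# Assumption: each line starts with "Mon DD HH:MM:SS". Syslog omits the year,
# so one is supplied here (2010, matching the date of this message).
def parse_ts(line, year=2010):
    stamp = " ".join(line.split()[:3])
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")

def largest_gap(lines):
    """Return (gap_seconds, line_before, line_after) for the widest pause."""
    best = (0.0, None, None)
    for prev, cur in zip(lines, lines[1:]):
        gap = (parse_ts(cur) - parse_ts(prev)).total_seconds()
        if gap > best[0]:
            best = (gap, prev, cur)
    return best

# A few lines taken from the excerpt above.
sample = [
    "Mar 12 08:46:02 donald01 openais[5301]: [TOTEM] The token was lost in the OPERATIONAL state.",
    "Mar 12 08:46:02 donald01 openais[5301]: [TOTEM] entering GATHER state from 2.",
    "Mar 12 08:46:07 donald01 openais[5301]: [TOTEM] entering GATHER state from 0.",
    "Mar 12 08:46:07 donald01 openais[5301]: [TOTEM] entering COMMIT state.",
]

gap, before, after = largest_gap(sample)
print(gap)     # seconds between the two entries
print(before)  # last entry before the pause
print(after)   # first entry after the pause
```

Run against these lines, it points at the interval between the two GATHER state entries.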
My current /etc/ais/openais.conf is still the installed default:
totem {
    version: 2
    secauth: off
    threads: 0
    interface {
        ringnumber: 0
        bindnetaddr: 192.168.2.0
        mcastaddr: 226.94.1.1
        mcastport: 5405
    }
}
logging {
    debug: off
    timestamp: on
}
amf {
    mode: disabled
}