[Openais] Question regarding cluster fail-over time and GATHER states

Mon Mar 15 02:45:56 PDT 2010

> -----Original Message-----
> From: Herwin Kleinjan [mailto:herwin.kleinjan at one2many.eu]
> Sent: Monday, 15 March 2010 9:37
> To: sdake at redhat.com
> Cc: openais at lists.osdl.org
> Subject: RE: [Openais] Question regarding cluster fail-over time and GATHER states
> 
> > -----Original Message-----
> > From: Steven Dake [mailto:sdake at redhat.com]
> > Sent: Friday, 12 March 2010 17:19
> > To: Herwin Kleinjan
> > Cc: openais at lists.osdl.org
> > Subject: Re: [Openais] Question regarding cluster fail-over time and GATHER states
> >
> > On Fri, 2010-03-12 at 09:44 +0100, Herwin Kleinjan wrote:
> > > Hello,
> > >
> > > Currently we are looking into possibilities to speed up the fail-over
> > > process on our dual node cluster. This is a RHEL 5.4 cluster running on
> > > HP Proliant servers with iLO based power fencing. For shared storage we
> > > use a fiber based storage array.
> > >
> > > There are some parts where fail-over time might be improved, one of
> > > them relating to openais or its configuration. During testing, whenever
> > > one node fails or its power is disconnected, the other node detects
> > > this and the fail-over process is started:
> > >
> > > Mar 12 08:46:02 donald01 openais[5301]: [TOTEM] The token was lost in the OPERATIONAL state.
> > > Mar 12 08:46:02 donald01 openais[5301]: [TOTEM] Receive multicast socket recv buffer size (288000 bytes).
> > > Mar 12 08:46:02 donald01 openais[5301]: [TOTEM] Transmit multicast socket send buffer size (288000 bytes).
> > > Mar 12 08:46:02 donald01 openais[5301]: [TOTEM] entering GATHER state from 2.
> > > Mar 12 08:46:07 donald01 openais[5301]: [TOTEM] entering GATHER state from 0.
> > > Mar 12 08:46:07 donald01 openais[5301]: [TOTEM] Creating commit token because I am the rep.
> > > Mar 12 08:46:07 donald01 openais[5301]: [TOTEM] Saving state aru 44 high seq received 44
> > > Mar 12 08:46:07 donald01 openais[5301]: [TOTEM] Storing new sequence id for ring 74
> > > Mar 12 08:46:07 donald01 openais[5301]: [TOTEM] entering COMMIT state.
> > > Mar 12 08:46:07 donald01 openais[5301]: [TOTEM] entering RECOVERY state.
> > > Mar 12 08:46:07 donald01 openais[5301]: [TOTEM] position [0] member 10.227.180.101:
> > > Mar 12 08:46:07 donald01 openais[5301]: [TOTEM] previous ring seq 112 rep 10.227.180.101
> > > Mar 12 08:46:07 donald01 openais[5301]: [TOTEM] aru 44 high delivered 44 received flag 1
> > > Mar 12 08:46:07 donald01 openais[5301]: [TOTEM] Did not need to originate any messages in recovery.
> > > Mar 12 08:46:07 donald01 openais[5301]: [TOTEM] Sending initial ORF token
> > > Mar 12 08:46:07 donald01 openais[5301]: [CLM  ] CLM CONFIGURATION CHANGE
> > > Mar 12 08:46:07 donald01 openais[5301]: [CLM  ] New Configuration:
> > > Mar 12 08:46:07 donald01 openais[5301]: [CLM  ]         r(0) ip(10.227.180.101)
> > > Mar 12 08:46:07 donald01 openais[5301]: [CLM  ] Members Left:
> > > Mar 12 08:46:07 donald01 kernel: dlm: closing connection to node 2
> > > Mar 12 08:46:07 donald01 openais[5301]: [CLM  ]         r(0) ip(10.227.180.102)
> > > Mar 12 08:46:07 donald01 clurgmgrd[7525]: <info> State change: donald02 DOWN
> > >
> > > Mar 12 08:46:07 donald01 openais[5301]: [CLM  ] Members Joined:
> > > Mar 12 08:46:07 donald01 openais[5301]: [CLM  ] CLM CONFIGURATION CHANGE
> > > Mar 12 08:46:07 donald01 openais[5301]: [CLM  ] New Configuration:
> > > Mar 12 08:46:07 donald01 openais[5301]: [CLM  ]         r(0) ip(10.227.180.101)
> > > Mar 12 08:46:07 donald01 openais[5301]: [CLM  ] Members Left:
> > > Mar 12 08:46:07 donald01 openais[5301]: [CLM  ] Members Joined:
> > > Mar 12 08:46:07 donald01 openais[5301]: [SYNC ] This node is within the primary component and will provide service.
> > > Mar 12 08:46:07 donald01 openais[5301]: [TOTEM] entering OPERATIONAL state.
> > > Mar 12 08:46:07 donald01 openais[5301]: [CLM  ] got nodejoin message 10.227.180.101
> > > Mar 12 08:46:07 donald01 openais[5301]: [CPG  ] got joinlist message from node 1
> > >
> > >
> > > As you can see from the above /var/log/messages excerpt, there is a
> > > 5 second time frame at the beginning in which apparently nothing is
> > > happening (08:46:02-08:46:07). I was wondering how I could reduce or
> > > remove this delay so that the fail-over process will be done more
> > > quickly.
> > >
> > > My current /etc/ais/openais.conf is still the installed default:
> > > totem {
> > > 	version: 2
> > > 	secauth: off
> > > 	threads: 0
> > > 	interface {
> > > 		ringnumber: 0
> > > 		bindnetaddr: 192.168.2.0
> > > 		mcastaddr: 226.94.1.1
> > > 		mcastport: 5405
> > > 	}
> > > }
> > >
> > > logging {
> > > 	debug: off
> > > 	timestamp: on
> > > }
> > >
> > > amf {
> > > 	mode: disabled
> > > }
> > >
> > >
> > > From /etc/cluster/cluster.conf some lines relating to openais:
> > > <cman expected_votes="1" two_node="1" hello_timer="1" deadnode_timeout="3"/>
> > > <totem token="3000"/>
> > >
> > > I suspect that fine-tuning additional openais configuration parameters
> > > will do the trick, but I am not sure. If more information is needed
> > > please let me know; any useful advice would be greatly appreciated!
> > >
> > > Best regards,
> > > Herwin
> > >
> > When a cluster is started with cman, /etc/ais/openais.conf is not used.
> >
> > While changing timing parameters is not supported by Red Hat, they can
> > be modified via overrides.  The 5 second time window you see is a result
> > of the "consensus" timeout parameter.
> >
> > consensus should be at least 2 * token.
> >
> > You might try
> > <totem token="500" consensus="1500" retransmits_before_loss_const="8"/>
> >
> > With qdisk, there may be other implications on timer settings.
> >
> > You may have to override retransmits_before_loss_const as well.  The
> > token timeout is divided by retrans_before_loss and used to calculate
> > the token_retransmit parameter.  The smallest value for any timer can be
> > 30 milliseconds because of limitations of the Linux timer
> > implementation.
> >
> > I believe retransmits_before_loss_const is something like 20 for cman,
> > so in this case 500/20 = 25 msec (less than 30 msec) which will cause
> > aisexec to fail to start.
> >
> > A safe value for retransmits_before_loss might be 8-10.
> >
> > Again, not supported by Red Hat support, and YMMV.
> >
> > Let us know how it goes.
> >
> > Timer parameters need to be the same on all nodes.
> >
> > Regards
> > -steve
> > >
> > > _______________________________________________
> > > Openais mailing list
> > > Openais at lists.linux-foundation.org
> > > https://lists.linux-foundation.org/mailman/listinfo/openais
> 
> Thanks for your reply Steve!
> 
> I have followed your recommendations and ended up with the following
> configuration in /etc/cluster/cluster.conf:
> 
>         <cman expected_votes="1" two_node="1" hello_timer="1" deadnode_timeout="3"/>
>         <logging syslog_facility="local4">
>                 <logger ident="CPG" debug="on"/>
>                 <logger ident="CMAN" debug="on"/>
>         </logging>
>         <totem token="500" consensus="1000" retransmits_before_loss="10"/>
> 
> And this seems to work. Failover time is now about 16-20 seconds, which
> includes fencing the failed node, assigning IP addresses and mounting the
> filesystem required by the cluster service. However, I had expected to see
> more logging in /var/log/messages because of the options in <logging/>, but
> it is the same as without these options...
> 
> While I was adding the consensus parameter and thought a little more about
> the token parameter, I was wondering how these timers relate to the
> hello_timer and deadnode_timeout parameters in the <cman /> line. I have
> googled quite extensively but could not find any detailed documentation or
> explanation of how these parameters could or should be configured for
> different use cases. Any recommended reading (either online or in a book)
> would also be appreciated.
> 
> Best regards,
> Herwin
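
For reference, the timer arithmetic from Steve's reply quoted above can be
sketched in a few lines of Python. This is only an illustration under the
assumptions stated in that reply (token_retransmit is the token timeout
divided by the retransmit count, no timer may drop below 30 msec, and
consensus should be at least 2 * token). The function names are made up for
this example and are not part of openais or cman.

```python
# Sketch only: the rules below come from the quoted reply, not from the
# openais sources. All names here are hypothetical helpers.
MIN_TIMER_MS = 30  # smallest usable timer value according to the reply


def token_retransmit_ms(token_ms, retransmits_before_loss):
    """Token timeout divided by the retransmit count, in milliseconds."""
    return token_ms / retransmits_before_loss


def check_totem_timers(token_ms, consensus_ms, retransmits_before_loss):
    """Return a list of human-readable problems; empty means no rule is broken."""
    problems = []
    retransmit = token_retransmit_ms(token_ms, retransmits_before_loss)
    if retransmit < MIN_TIMER_MS:
        problems.append(
            "token_retransmit is %.1f msec, below the %d msec floor"
            % (retransmit, MIN_TIMER_MS)
        )
    if consensus_ms < 2 * token_ms:
        problems.append(
            "consensus %d msec is less than 2 * token (%d msec)"
            % (consensus_ms, 2 * token_ms)
        )
    return problems


if __name__ == "__main__":
    # Values from the cluster.conf quoted above: 500 / 10 = 50 msec.
    print(check_totem_timers(500, 1000, 10))
    # Retransmit count of 20 (the value guessed for cman in the reply):
    # 500 / 20 = 25 msec, which is below the floor.
    print(check_totem_timers(500, 1500, 20))
```

With token=500, consensus=1000 and a retransmit count of 10 the check
reports nothing, while token=500 with a retransmit count of 20 is flagged
because 500/20 = 25 msec.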

*Doh* please ignore the remark on logging, I forgot to restart the syslog
facility...

What I did see now, though, is that the cluster service monitoring scripts
are only called once every 10 secs. I tried to lower that to 5 secs (as that
would be the minimum allowed value) by changing parameters in ip.sh, fs.sh
and script.sh in /usr/share/cluster, but somehow the new values are not
accepted and the checks are still run every 10 secs... Suggestions anyone?

Best regards,
Herwin