[Openais] checkpoint disappears after node reset
Henry Fung
henryfung_00 at yahoo.com
Wed Sep 26 15:45:48 PDT 2007
Sorry for the confusion. No, the problem still exists: when the
rep with the lowest node id leaves, the checkpoint is lost. You
asked for the parameters I was using, so I provided them. My
parameters, however, may cause you grief if you run them on your
aisexec, because I had made some modifications to mine so that it
supports a very large section size. I believe the problem has
nothing to do with section size. It may have something to do with
using more than one section, but it is more likely related to the
rep and sync handling.
This is the working case:
Jul 15 23:07:51.836461 [CLM ] CLM CONFIGURATION
CHANGE
Jul 15 23:07:51.836518 [CLM ] New Configuration:
Jul 15 23:07:51.836645 [CLM ] r(0) ip(192.168.18.1)
Jul 15 23:07:51.836777 [CLM ] r(0) ip(192.168.18.2)
Jul 15 23:07:51.837220 [CLM ] Members Left:
Jul 15 23:07:51.837291 [CLM ] Members Joined:
Jul 15 23:07:51.837370 [CLM ] r(0) ip(192.168.18.2)
Jul 15 23:07:51.837455 [SYNC ] This node is within the
primary component and will provide service.
Jul 15 23:07:51.837624 [TOTEM] entering OPERATIONAL
state.
Jul 15 23:07:51.839356 [CLM ] got nodejoin message
192.168.18.1
Jul 15 23:07:51.840166 [CLM ] got nodejoin message
192.168.18.2
Jul 15 23:09:24.719179 [TOTEM] entering GATHER state
from 12. <==== after switchover, we got the following
messages
Jul 15 23:09:24.770819 [TOTEM] Creating commit token
because I am the rep. <====== distinct message
Jul 15 23:09:24.770901 [TOTEM] Saving state aru 1c63b
high seq received 1c63b
Jul 15 23:09:24.771051 [TOTEM] entering COMMIT state.
This is the non-working case:
Jun 12 1:03:24.345311 [CLM ] r(0) ip(192.168.18.1)
Jun 12 1:03:24.345396 [SYNC ] This node is within the
primary component and will provide service.
Jun 12 1:03:24.345584 [TOTEM] entering OPERATIONAL
state.
Jun 12 1:03:24.350664 [CLM ] got nodejoin message
192.168.18.1
Jun 12 1:03:24.351017 [CLM ] got nodejoin message
192.168.18.2
Jun 12 1:04:08.008984 [TOTEM] entering GATHER state
from 12. <===== after switchover
Jun 12 1:04:08.111724 [TOTEM] Saving state aru 3965
high seq received 3965
Jun 12 1:04:08.111905 [TOTEM] entering COMMIT state.
Jun 12 1:04:08.114907 [TOTEM] entering RECOVERY
state.
Jun 12 1:04:08.115087 [TOTEM] position [0] member
192.168.18.1:
Jun 12 1:04:08.115159 [TOTEM] previous ring seq 4 rep
192.168.18.1
Jun 12 1:04:08.115218 [TOTEM] aru 9 high delivered 9
received flag 0
Jun 12 1:04:08.115295 [TOTEM] position [1] member
192.168.18.2:
Jun 12 1:04:08.115360 [TOTEM] previous ring seq 44
rep 192.168.18.1
Jun 12 1:04:08.115418 [TOTEM] aru 3965 high delivered
3965 received flag 0
Jun 12 1:04:08.115492 [TOTEM] Did not need to
originate any messages in recovery.
Jun 12 1:04:08.115602 [TOTEM] Storing new sequence id
for ring 30
Jun 12 1:04:08.126962 [CLM ] CLM CONFIGURATION
CHANGE
Jun 12 1:04:08.127042 [CLM ] New Configuration:
Jun 12 1:04:08.127193 [CLM ] r(0) ip(192.168.18.1)
Jun 12 1:04:08.127332 [CLM ] r(0) ip(192.168.18.2)
Jun 12 1:04:08.128538 [CLM ] Members Left:
Jun 12 1:04:08.128622 [CLM ] Members Joined:
Jun 12 1:04:08.128713 [SYNC ] This node is within the
primary component and will provide service.
Jun 12 1:04:08.128815 [CLM ] CLM CONFIGURATION
CHANGE
Jun 12 1:04:08.128871 [CLM ] New Configuration:
Jun 12 1:04:08.129017 [CLM ] r(0) ip(192.168.18.1)
Jun 12 1:04:08.129148 [CLM ] r(0) ip(192.168.18.2)
Jun 12 1:04:08.129736 [CLM ] Members Left:
Jun 12 1:04:08.129812 [CLM ] Members Joined:
Jun 12 1:04:08.129904 [SYNC ] This node is within the
primary component and will provide service.
Jun 12 1:04:08.130085 [TOTEM] entering OPERATIONAL
state.
Jun 12 1:04:08.158486 [CLM ] got nodejoin message
192.168.18.1
Jun 12 1:04:08.158825 [CLM ] got nodejoin message
192.168.18.2
Jun 12 1:04:09.159900 [CKPT ] sync_refcount_increment
cnt 1
Jun 12 1:04:13.149899 [TOTEM] The token was lost in
the OPERATIONAL state. <====== this comes very late
Jun 12 1:04:13.159180 [TOTEM] Receive multicast
socket recv buffer size (262142 bytes).
Jun 12 1:04:13.159256 [TOTEM] Transmit multicast
socket send buffer size (262142 bytes).
Jun 12 1:04:13.166017 [TOTEM] entering GATHER state
from 2.
Jun 12 1:04:13.967010 [TOTEM] entering GATHER state
from 0.
Jun 12 1:04:13.967155 [TOTEM] Creating commit token
because I am the rep. <======== this comes very late
Jun 12 1:04:13.967237 [TOTEM] Saving state aru 2fa0
high seq received 2fa0
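To put a number on "comes very late", the gap between the first "entering GATHER state" line and the next "Creating commit token" line can be measured from the log. Below is a minimal sketch in C, assuming the log lines look like the excerpts above (month, day, H:MM:SS.microseconds, then the message on the same line). The helper names log_seconds and gather_to_commit_token are hypothetical and are not part of openais.

```c
#include <stdio.h>
#include <string.h>

/* Parse the time field of a line such as
 * "Jun 12 1:04:08.008984 [TOTEM] entering GATHER state".
 * Returns seconds since midnight, or -1.0 for lines without a timestamp
 * (for example the wrapped continuation lines in the excerpts).
 * Assumption: the excerpt does not cross midnight. */
static double log_seconds(const char *line)
{
    char month[4];
    int day, h, m;
    double s;

    if (sscanf(line, "%3s %d %d:%d:%lf", month, &day, &h, &m, &s) != 5)
        return -1.0;
    return h * 3600.0 + m * 60.0 + s;
}

/* Seconds from the first "entering GATHER state" line to the first later
 * "Creating commit token" line; -1.0 if either line is missing. */
static double gather_to_commit_token(const char *const *lines, size_t n)
{
    double gather = -1.0;
    size_t i;

    for (i = 0; i < n; i++) {
        double t = log_seconds(lines[i]);

        if (t < 0.0)
            continue; /* no timestamp on this line */
        if (gather < 0.0 && strstr(lines[i], "entering GATHER state"))
            gather = t;
        else if (gather >= 0.0 && strstr(lines[i], "Creating commit token"))
            return t - gather;
    }
    return -1.0;
}
```

Fed the timestamps from the two excerpts, this gives roughly 0.05 seconds for the working case and roughly 5.96 seconds for the non-working case.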
Henry
--- Steven Dake <sdake at redhat.com> wrote:
> So the root of the problem was that the max message
> size is set to 1 mB?
>
> By changing this parameter checkpoints are now
> working properly for you?
>
> Regards
> -steve
> On Wed, 2007-09-26 at 14:56 -0700, Henry Fung wrote:
> > Steve,
> > Not sure if you got the chance to try my parameters.
> > I am using many sections. Also, I modified the aisexec
> > to accommodate my big section size and checkpoint size,
> > and things are working properly.
> > Henry
> >
> > SaCkptCheckpointHandleT checkpointHandle = 0;
> > SaCkptHandleT ckptHandle;
> > SaVersionT version = { 'B', 1, 1 }; /* Release, Major, Minor */
> > SaNameT checkpointName = { 11, "switchdata\0" };
> > SaCkptCheckpointCreationAttributesT checkpointCreationAttributes = {
> > /* using SA_CKPT_WR_ACTIVE_REPLICA | SA_CKPT_CHECKPOINT_COLLOCATED,
> >    i.e., async, atomic, collocated, seems to cause partial recovery
> >    (reads with param error) intermittently. */
> > SA_CKPT_WR_ALL_REPLICAS, /* sync */
> > 100000000, /* checkpointSize, <= maxSections * maxSectionSize */
>
>