[Openais] checkpoint disappears after node reset

Henry Fung henryfung_00 at yahoo.com
Wed Sep 26 15:45:48 PDT 2007


Sorry for the confusion. No, the problem still exists:
when the rep (the node with the lowest node id) leaves,
the checkpoint is lost. You asked for the parameters I
was using, so I provided them. They may cause you grief
if you run them on your aisexec, however, because I
modified mine to support a very large section size. I
believe the problem has nothing to do with section
size. It may have something to do with using more than
one section, but it is more likely related to the rep
and sync handling.

This is the working case:

Jul 15 23:07:51.836461 [CLM  ] CLM CONFIGURATION
CHANGE

Jul 15 23:07:51.836518 [CLM  ] New Configuration:

Jul 15 23:07:51.836645 [CLM  ]  r(0) ip(192.168.18.1)

Jul 15 23:07:51.836777 [CLM  ]  r(0) ip(192.168.18.2)

Jul 15 23:07:51.837220 [CLM  ] Members Left:

Jul 15 23:07:51.837291 [CLM  ] Members Joined:

Jul 15 23:07:51.837370 [CLM  ]  r(0) ip(192.168.18.2)

Jul 15 23:07:51.837455 [SYNC ] This node is within the
primary component and will provide service.

Jul 15 23:07:51.837624 [TOTEM] entering OPERATIONAL
state.

Jul 15 23:07:51.839356 [CLM  ] got nodejoin message
192.168.18.1

Jul 15 23:07:51.840166 [CLM  ] got nodejoin message
192.168.18.2

Jul 15 23:09:24.719179 [TOTEM] entering GATHER state
from 12.  <==== after switchover, we got the following
messages

Jul 15 23:09:24.770819 [TOTEM] Creating commit token
because I am the rep. <====== distinct message

Jul 15 23:09:24.770901 [TOTEM] Saving state aru 1c63b
high seq received 1c63b

Jul 15 23:09:24.771051 [TOTEM] entering COMMIT state.

This is the non-working case:

Jun 12  1:03:24.345311 [CLM  ]  r(0) ip(192.168.18.1)

Jun 12  1:03:24.345396 [SYNC ] This node is within the
primary component and will provide service.

Jun 12  1:03:24.345584 [TOTEM] entering OPERATIONAL
state.

Jun 12  1:03:24.350664 [CLM  ] got nodejoin message
192.168.18.1

Jun 12  1:03:24.351017 [CLM  ] got nodejoin message
192.168.18.2

Jun 12  1:04:08.008984 [TOTEM] entering GATHER state
from 12. <===== after switchover

Jun 12  1:04:08.111724 [TOTEM] Saving state aru 3965
high seq received 3965

Jun 12  1:04:08.111905 [TOTEM] entering COMMIT state.

Jun 12  1:04:08.114907 [TOTEM] entering RECOVERY
state.

Jun 12  1:04:08.115087 [TOTEM] position [0] member
192.168.18.1:

Jun 12  1:04:08.115159 [TOTEM] previous ring seq 4 rep
192.168.18.1

Jun 12  1:04:08.115218 [TOTEM] aru 9 high delivered 9
received flag 0

Jun 12  1:04:08.115295 [TOTEM] position [1] member
192.168.18.2:

Jun 12  1:04:08.115360 [TOTEM] previous ring seq 44
rep 192.168.18.1

Jun 12  1:04:08.115418 [TOTEM] aru 3965 high delivered
3965 received flag 0

Jun 12  1:04:08.115492 [TOTEM] Did not need to
originate any messages in recovery.

Jun 12  1:04:08.115602 [TOTEM] Storing new sequence id
for ring 30

Jun 12  1:04:08.126962 [CLM  ] CLM CONFIGURATION
CHANGE

Jun 12  1:04:08.127042 [CLM  ] New Configuration:

Jun 12  1:04:08.127193 [CLM  ]  r(0) ip(192.168.18.1)

Jun 12  1:04:08.127332 [CLM  ]  r(0) ip(192.168.18.2)

Jun 12  1:04:08.128538 [CLM  ] Members Left:

Jun 12  1:04:08.128622 [CLM  ] Members Joined:

Jun 12  1:04:08.128713 [SYNC ] This node is within the
primary component and will provide service.

Jun 12  1:04:08.128815 [CLM  ] CLM CONFIGURATION
CHANGE

Jun 12  1:04:08.128871 [CLM  ] New Configuration:

Jun 12  1:04:08.129017 [CLM  ]  r(0) ip(192.168.18.1)

Jun 12  1:04:08.129148 [CLM  ]  r(0) ip(192.168.18.2)

Jun 12  1:04:08.129736 [CLM  ] Members Left:

Jun 12  1:04:08.129812 [CLM  ] Members Joined:

Jun 12  1:04:08.129904 [SYNC ] This node is within the
primary component and will provide service.

Jun 12  1:04:08.130085 [TOTEM] entering OPERATIONAL
state.

Jun 12  1:04:08.158486 [CLM  ] got nodejoin message
192.168.18.1

Jun 12  1:04:08.158825 [CLM  ] got nodejoin message
192.168.18.2

Jun 12  1:04:09.159900 [CKPT ] sync_refcount_increment
cnt 1

Jun 12  1:04:13.149899 [TOTEM] The token was lost in
the OPERATIONAL state. <====== this comes very late

Jun 12  1:04:13.159180 [TOTEM] Receive multicast
socket recv buffer size (262142 bytes).

Jun 12  1:04:13.159256 [TOTEM] Transmit multicast
socket send buffer size (262142 bytes).

Jun 12  1:04:13.166017 [TOTEM] entering GATHER state
from 2.

Jun 12  1:04:13.967010 [TOTEM] entering GATHER state
from 0.

Jun 12  1:04:13.967155 [TOTEM] Creating commit token
because I am the rep. <======== this comes very late

Jun 12  1:04:13.967237 [TOTEM] Saving state aru 2fa0
high seq received 2fa0
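
To put a number on "very late", the gap between the first
"entering GATHER state" line after switchover and the
"Creating commit token because I am the rep." line can be
computed from the timestamps. The following is only a minimal
sketch: log_time_us and log_delta_us are hypothetical helper
names, not part of aisexec, and it assumes both lines were
logged on the same day in the "Mon DD HH:MM:SS.uuuuuu" format
shown in the logs.

```c
#include <assert.h>
#include <stdio.h>

/* Parse the time-of-day field of an aisexec log line such as
 * "Jun 12  1:04:08.008984 [TOTEM] entering GATHER state"
 * and return microseconds since midnight, or -1 on a parse failure.
 * Assumes the fractional part always has six digits, as in the logs. */
long long log_time_us(const char *line)
{
    char month[4];
    int day, hh, mm, ss;
    long us;

    if (sscanf(line, "%3s %d %d:%d:%d.%6ld",
               month, &day, &hh, &mm, &ss, &us) != 6)
        return -1;
    return (((long long)hh * 60 + mm) * 60 + ss) * 1000000LL + us;
}

/* Microseconds elapsed between two log lines from the same day. */
long long log_delta_us(const char *first, const char *second)
{
    long long a = log_time_us(first);
    long long b = log_time_us(second);

    assert(a >= 0 && b >= 0);   /* both lines must parse */
    return b - a;
}
```

Applied to the timestamps above, the working case goes from
GATHER (23:09:24.719179) to the commit token (23:09:24.770819)
in about 52 ms, while the non-working case takes about 5.96 s
(1:04:08.008984 to 1:04:13.967155).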



Henry

--- Steven Dake <sdake at redhat.com> wrote:

> So the root of the problem was that the max message
> size is set to 1 mB?
> 
> By changing this parameter checkpoints are now
> working properly for you?
> 
> Regards
> -steve
> On Wed, 2007-09-26 at 14:56 -0700, Henry Fung wrote:
> > Steve,
> > Not sure if you got the chance to try my parameters.
> > I am using many sections. Also, I modified the aisexec
> > to accommodate my big section size and checkpoint size,
> > and things are working properly.
> > Henry
> > 
> > SaCkptCheckpointHandleT checkpointHandle = 0;
> > SaCkptHandleT ckptHandle;
> > SaVersionT version = { 'B', 1, 1 }; /* Release, Major, Minor */
> > SaNameT checkpointName = { 11, "switchdata\0" };
> > SaCkptCheckpointCreationAttributesT checkpointCreationAttributes = {
> >         /* using SA_CKPT_WR_ACTIVE_REPLICA |
> >            SA_CKPT_CHECKPOINT_COLLOCATED, i.e., async, atomic,
> >            collocated seems to cause partial recovery (reads
> >            with param error) intermittently. */
> >         SA_CKPT_WR_ALL_REPLICAS, /* sync */
> >         100000000, /* checkpointSize, <= maxSections * maxSectionSize */
> 
> 
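
The comment on checkpointSize in the quoted attributes states
the constraint checkpointSize <= maxSections * maxSectionSize.
A minimal sketch of that check follows. The struct is a
hypothetical stand-in for the size fields of
SaCkptCheckpointCreationAttributesT (the real type comes from
saCkpt.h), and ckpt_sizes_consistent is an illustrative helper
name, not an AIS call. The quoted message does not show the
maxSections or maxSectionSize values, so any values passed in
are assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the size-related creation attributes. */
struct ckpt_sizes {
    uint64_t checkpoint_size;   /* total bytes across all sections */
    uint32_t max_sections;      /* number of sections allowed */
    uint64_t max_section_size;  /* bytes allowed per section */
};

/* Return true when checkpoint_size <= max_sections * max_section_size.
 * The comparison is done by division so the product cannot overflow. */
bool ckpt_sizes_consistent(const struct ckpt_sizes *s)
{
    assert(s != NULL);
    if (s->max_sections == 0 || s->max_section_size == 0)
        return s->checkpoint_size == 0;

    /* checkpoint_size <= n * m  is equivalent to  ceil(checkpoint_size / m) <= n */
    uint64_t sections_needed = s->checkpoint_size / s->max_section_size
                             + (s->checkpoint_size % s->max_section_size != 0);
    return sections_needed <= s->max_sections;
}
```

For example, a checkpointSize of 100000000 passes with 100
sections of 1000000 bytes each (illustrative values only) and
fails with 99 such sections.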






