[Openais] checkpoint disappears after node reset
Henry Fung
henryfung_00 at yahoo.com
Wed Sep 26 14:56:10 PDT 2007
Steve,
Not sure if you got the chance to try my parameters.
I am using many sections. Also, I modified the aisexec
to accommodate my big section size and checkpoint size,
and things are working properly.
Henry
SaCkptCheckpointHandleT checkpointHandle = 0;
SaCkptHandleT ckptHandle;
SaVersionT version = { 'B', 1, 1 }; /* Release, Major, Minor */
SaNameT checkpointName = { 11, "switchdata\0" };
SaCkptCheckpointCreationAttributesT checkpointCreationAttributes = {
    /* Using SA_CKPT_WR_ACTIVE_REPLICA | SA_CKPT_CHECKPOINT_COLLOCATED,
       i.e., async, atomic, collocated, seems to cause partial recovery
       (reads with param error) intermittently. */
    SA_CKPT_WR_ALL_REPLICAS, /* sync */
    100000000, /* checkpointSize, <= maxSections * maxSectionSize */
    /* SA_TIME_ONE_SECOND * 300, (previous retentionDuration) */
    SA_TIME_MAX, /* retentionDuration */
    50, /* maxSections */
    CKPT_SECTION_SIZE, /* maxSectionSize */
    24 /* maxSectionIdSize */
};
#define SECTIONID_SAVCFG {7, (unsigned char *)"SAVCFG"}
SaCkptSectionIdT savcfgSectionId = SECTIONID_SAVCFG;
SaCkptSectionCreationAttributesT savcfgSectionCreationAttributes = {
    &savcfgSectionId,
    SA_TIME_END
};
error = saCkptCheckpointOpen (ckptHandle,
    &checkpointName,
    &checkpointCreationAttributes,
    SA_CKPT_CHECKPOINT_CREATE | SA_CKPT_CHECKPOINT_READ |
        SA_CKPT_CHECKPOINT_WRITE | SA_CKPT_WR_ALL_REPLICAS,
    0,
    &checkpointHandle);
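
The comment on checkpointSize above states the constraint
checkpointSize <= maxSections * maxSectionSize. A minimal sketch of
that arithmetic follows, in plain C with no openais headers. The
helper names ckpt_min_section_size and ckpt_size_fits are hypothetical
and not part of any SA Forum or openais API, and the value of
CKPT_SECTION_SIZE is not given in this post, so it is left as a
parameter.

```c
#include <stdint.h>

/* Hypothetical helper: smallest maxSectionSize (in bytes) that lets
 * max_sections sections hold checkpoint_size bytes in total. */
uint64_t ckpt_min_section_size(uint64_t checkpoint_size, uint32_t max_sections)
{
    if (checkpoint_size == 0)
        return 0;
    if (max_sections == 0)
        return UINT64_MAX; /* no sections can hold any data */
    /* Divide and round up instead of multiplying, so nothing overflows. */
    return checkpoint_size / max_sections
        + (checkpoint_size % max_sections != 0 ? 1 : 0);
}

/* Hypothetical helper: returns 1 when
 * checkpoint_size <= max_sections * max_section_size, else 0. */
int ckpt_size_fits(uint64_t checkpoint_size, uint32_t max_sections,
                   uint64_t max_section_size)
{
    if (max_sections == 0)
        return checkpoint_size == 0;
    return ckpt_min_section_size(checkpoint_size, max_sections)
        <= max_section_size;
}
```

With the values above (checkpointSize 100000000, maxSections 50), the
constraint holds only if maxSectionSize is at least 2000000 bytes.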
--- Steven Dake <sdake at redhat.com> wrote:
> Henry
>
> I'll give it a rerun tonight.
>
> Can you tell me your checkpoint creation parameters? Are you using
> sections? Can you give me your timeouts on the expiration of the
> checkpoints?
>
> Regards
> -steve
> On Tue, 2007-09-25 at 15:22 -0700, Henry Fung wrote:
> > Steve,
> > I used 0.80.2, and moving to 0.80.3 does not help.
> > There is no core dump. The problem always happens when the first
> > member of the ring drops dead (through a system init 6, e.g.).
> > I tried various hacks to no avail, including:
> > 1. open the checkpoint on the standby node early
> > 2. use SA_TIME_END
> > Reading of the checkpoint on the standby may succeed for a short
> > while until the writing node is completely gone. After that,
> > further reads fail with error code 2 (or 6). Therefore, I am sure
> > things are working properly prior to the node reset.
> >
> > My guess is that it has something to do with the selection of the
> > ring representative preferring the lowest node id: either the
> > standby node does not become the rep soon enough after the rep
> > drops dead, or the new rep somehow deletes the checkpoint during
> > some sync stage.
>
>
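
The numeric error codes mentioned in the quoted message (2 and 6) and
the "param error" noted in the code comment can be decoded with a
small lookup. This is a minimal sketch, assuming the numeric values of
the SA Forum AIS SaAisErrorT enumeration (SA_AIS_OK = 1, counting up
from there); check them against the saAis.h shipped with your openais
version. The function name ais_error_name is hypothetical, and it
takes a plain int so that it compiles without the openais headers.

```c
#include <stddef.h>

/* Hypothetical helper: map a numeric SaAisErrorT value to its name.
 * Values assumed from the SA Forum AIS specification. */
const char *ais_error_name(int code)
{
    switch (code) {
    case 1:  return "SA_AIS_OK";
    case 2:  return "SA_AIS_ERR_LIBRARY";
    case 3:  return "SA_AIS_ERR_VERSION";
    case 4:  return "SA_AIS_ERR_INIT";
    case 5:  return "SA_AIS_ERR_TIMEOUT";
    case 6:  return "SA_AIS_ERR_TRY_AGAIN";
    case 7:  return "SA_AIS_ERR_INVALID_PARAM";
    case 8:  return "SA_AIS_ERR_NO_MEMORY";
    case 9:  return "SA_AIS_ERR_BAD_HANDLE";
    case 10: return "SA_AIS_ERR_BUSY";
    case 11: return "SA_AIS_ERR_ACCESS";
    case 12: return "SA_AIS_ERR_NOT_EXIST";
    default: return "unknown";
    }
}
```

Under that assumption, code 2 is SA_AIS_ERR_LIBRARY and code 6 is
SA_AIS_ERR_TRY_AGAIN, while the "param error" would correspond to
SA_AIS_ERR_INVALID_PARAM (7).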