[Openais] checkpoint disappears after node reset

Henry Fung henryfung_00 at yahoo.com
Wed Sep 26 14:56:10 PDT 2007


Steve,
Not sure if you got the chance to try my parameters.
I am using many sections. Also, I modified aisexec
to accommodate my big section size and checkpoint size,
and things are working properly.
Henry

SaCkptCheckpointHandleT checkpointHandle = 0;
SaCkptHandleT ckptHandle;
SaVersionT version = { 'B', 1, 1 }; /* Release, Major, Minor */
SaNameT checkpointName = { 11, "switchdata\0" };
SaCkptCheckpointCreationAttributesT checkpointCreationAttributes = {
        /* using SA_CKPT_WR_ACTIVE_REPLICA | SA_CKPT_CHECKPOINT_COLLOCATED,
           i.e., async, atomic, collocated, seems to cause partial recovery
           (reads with param error) intermittently. */
        SA_CKPT_WR_ALL_REPLICAS, /* sync */
        100000000, /* checkpointSize, <= maxSections * maxSectionSize */
//        SA_TIME_ONE_SECOND * 300, /* retentionDuration */
        SA_TIME_MAX,
        50, /* maxSections */
        CKPT_SECTION_SIZE, /* maxSectionSize */
        24 /* maxSectionIdSize */
};
#define SECTIONID_SAVCFG {7, (unsigned char *)"SAVCFG"}
SaCkptSectionIdT savcfgSectionId = SECTIONID_SAVCFG;
SaCkptSectionCreationAttributesT savcfgSectionCreationAttributes = {
        &savcfgSectionId,
        SA_TIME_END
};

error = saCkptCheckpointOpen (ckptHandle,
                                &checkpointName,
                                &checkpointCreationAttributes,
                                SA_CKPT_CHECKPOINT_CREATE|SA_CKPT_CHECKPOINT_READ|SA_CKPT_CHECKPOINT_WRITE|SA_CKPT_WR_ALL_REPLICAS,
                                0,
                                &checkpointHandle);
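
The comment on checkpointSize above states a constraint: checkpointSize
should be no larger than maxSections * maxSectionSize. A minimal sketch of
that arithmetic follows. The post does not give the value of
CKPT_SECTION_SIZE, so the 2000000 used here is purely a hypothetical
placeholder, and the helper name ckpt_size_fits is made up for illustration.

```c
#include <stdint.h>

/* Hypothetical value: the original post does not say what
 * CKPT_SECTION_SIZE is defined as. */
#define SKETCH_CKPT_SECTION_SIZE 2000000ULL

/* Returns 1 when checkpoint_size <= max_sections * max_section_size,
 * 0 otherwise. Uses 64-bit arithmetic so the product cannot wrap
 * for values of this magnitude. */
static int ckpt_size_fits(uint64_t checkpoint_size,
                          uint64_t max_sections,
                          uint64_t max_section_size)
{
    return checkpoint_size <= max_sections * max_section_size;
}
```

With the posted values (checkpointSize 100000000, maxSections 50), the check
passes only when maxSectionSize is at least 2000000 bytes, since
50 * 2000000 = 100000000.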


--- Steven Dake <sdake at redhat.com> wrote:

> Henry
> 
> I'll give it a rerun tonight.
> 
> Can you tell me your checkpoint creation parameters?
> Are you using sections?  Can you give me your timeouts
> on the expiration of the checkpoints?
> 
> Regards
> -steve
> On Tue, 2007-09-25 at 15:22 -0700, Henry Fung wrote:
> > Steve,
> > I used 0.80.2, and moving to 0.80.3 does not help.
> > There is no core dump. The problem always happens when
> > the first member of the ring drops dead (through a
> > system init 6, e.g.). I tried various hacks to no avail,
> > including:
> > 1. open the checkpoint on the standby node early
> > 2. use SA_TIME_END
> > Reading of the checkpoint on the standby may succeed
> > for a short while until the writing node is completely
> > gone. Then further reads fail with error code 2 (or 6).
> > Therefore, I am sure things are working properly prior
> > to the node reset.
> >
> > My guess is that it has something to do with selecting
> > the ring representative, which prefers the lowest node id:
> > either the standby node does not become the rep soon
> > enough after the rep drops dead, or the new rep somehow
> > deletes the checkpoint during some sync stage.
> > 
> 
> 
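
The numeric error codes in the quoted report (2 and 6) and the "param error"
mentioned in the code comment can be read against the SaAisErrorT enumeration
in saAis.h. The sketch below mirrors the first few values locally so it
compiles without the AIS headers. It assumes the reported numbers are
SaAisErrorT values and that the installed header uses this numbering; the
enum and function names are hypothetical.

```c
#include <string.h>

/* Local mirror of the first SaAisErrorT values from saAis.h.
 * Assumption: the installed header uses the same numbering.
 * The SKETCH_ names are invented for this example. */
enum sketch_ais_err {
    SKETCH_AIS_OK                = 1,
    SKETCH_AIS_ERR_LIBRARY       = 2,
    SKETCH_AIS_ERR_VERSION       = 3,
    SKETCH_AIS_ERR_INIT          = 4,
    SKETCH_AIS_ERR_TIMEOUT       = 5,
    SKETCH_AIS_ERR_TRY_AGAIN     = 6,
    SKETCH_AIS_ERR_INVALID_PARAM = 7
};

/* Map a numeric code to the corresponding SaAisErrorT name;
 * codes outside the mirrored subset yield "UNKNOWN". */
static const char *ais_err_name(int code)
{
    switch (code) {
    case SKETCH_AIS_OK:                return "SA_AIS_OK";
    case SKETCH_AIS_ERR_LIBRARY:       return "SA_AIS_ERR_LIBRARY";
    case SKETCH_AIS_ERR_VERSION:       return "SA_AIS_ERR_VERSION";
    case SKETCH_AIS_ERR_INIT:          return "SA_AIS_ERR_INIT";
    case SKETCH_AIS_ERR_TIMEOUT:       return "SA_AIS_ERR_TIMEOUT";
    case SKETCH_AIS_ERR_TRY_AGAIN:     return "SA_AIS_ERR_TRY_AGAIN";
    case SKETCH_AIS_ERR_INVALID_PARAM: return "SA_AIS_ERR_INVALID_PARAM";
    default:                           return "UNKNOWN";
    }
}

/* In this subset, only TRY_AGAIN signals that the caller may
 * simply retry the same call later. */
static int ais_err_is_retryable(int code)
{
    return code == SKETCH_AIS_ERR_TRY_AGAIN;
}
```

Under these assumptions, code 2 would be SA_AIS_ERR_LIBRARY and code 6 would
be SA_AIS_ERR_TRY_AGAIN, so a reader might retry on 6 but not on 2.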



       

