[Openais] CKPT: bug, ckpt_id not synced
Steven Dake
sdake at redhat.com
Wed Sep 6 06:09:34 PDT 2006
Hans,
I have been working on this problem for a while. Unfortunately it goes
deeper than this, but this change has been in my working patch for a while.
Even with this patch, I don't think synchronization of checkpoints is
guaranteed to work 100% since the unlink changes.
Yes, I take care of the whitetank and picacho branches.
Regards
-steve
On Wed, 2006-09-06 at 07:55 +0200, Hans Feldt wrote:
> Committed revision 1238.
>
> The bug seems to exist in the whitetank branch as well. Steven, will you
> take care of that?
>
> The test case below was impossible to run before the fix. I don't
> understand why no one has seen this problem before.
>
> Regards,
> Hans
>
> Hans Feldt wrote:
> >
> > Here's my test case:
> > - start first node
> > - create checkpoint, write random binary data to it (default section or
> > a new section) and compute MD5 sum for data.
> > - start 2nd node
> > - read checkpoint & check data integrity with MD5 sum on 2nd node
> > - kill first node & start it again
> > - read checkpoint & check data integrity with MD5 sum on first node
> >
> > It would be good to have some CKPT regression testing like the above,
> > but with more of everything (nodes, writers, readers, checkpoints,
> > sections, data size, synchronizations).
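
The integrity check at the heart of the test case above can be sketched in plain C. This is only an illustration under stated assumptions: `fake_section`, `section_write_random`, `section_sync` and `section_verify` are hypothetical names, the "checkpoint" is an in-memory buffer rather than a real CKPT section, and an FNV-1a hash stands in for MD5 to avoid a crypto library dependency. A real regression test would go through the CKPT API on separate nodes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a checkpoint section; not an openais type. */
struct fake_section {
	unsigned char data[256];
	size_t len;
};

/* FNV-1a 64-bit hash, used here in place of MD5 (an assumption for brevity). */
static uint64_t fnv1a64(const unsigned char *buf, size_t len)
{
	uint64_t h = 0xcbf29ce484222325ULL;	/* FNV offset basis */
	size_t i;

	for (i = 0; i < len; i++) {
		h ^= buf[i];
		h *= 0x100000001b3ULL;		/* FNV prime */
	}
	return h;
}

/* Fill the section with pseudo-random bytes; return the digest of what was written. */
static uint64_t section_write_random(struct fake_section *s, size_t len, unsigned seed)
{
	size_t i;

	if (len > sizeof(s->data))
		len = sizeof(s->data);
	srand(seed);
	for (i = 0; i < len; i++)
		s->data[i] = (unsigned char)(rand() & 0xff);
	s->len = len;
	return fnv1a64(s->data, s->len);
}

/* Stand-in for synchronization: copy the section as a joining node would receive it. */
static void section_sync(struct fake_section *dst, const struct fake_section *src)
{
	assert(src->len <= sizeof(dst->data));
	memcpy(dst->data, src->data, src->len);
	dst->len = src->len;
}

/* Return 1 if the section's current contents still match the expected digest. */
static int section_verify(const struct fake_section *s, uint64_t expected)
{
	return fnv1a64(s->data, s->len) == expected;
}
```

In terms of the steps above: the first node calls `section_write_random` and keeps the digest, the second node's copy is produced by `section_sync`, and each read is followed by `section_verify` against the saved digest.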
> >
> > Regards,
> > Hans
> >
> >
> > ------------------------------------------------------------------------
> >
> > Index: ckpt.c
> > ===================================================================
> > --- ckpt.c (revision 1237)
> > +++ ckpt.c (working copy)
> > @@ -1005,6 +1005,7 @@
> > &checkpoint_section->section_descriptor,
> > sizeof(mar_ckpt_section_descriptor_t));
> >
> > + request_exec_sync_state.ckpt_id = checkpoint->ckpt_id;
> > request_exec_sync_state.nodeid = this_ip->nodeid;
> >
> > for (i = 0; i < PROCESSOR_COUNT_MAX; i++) {
> > @@ -1017,10 +1018,12 @@
> >
> > log_printf (LOG_LEVEL_DEBUG, "New Sync State Message Values\n");
> > for (i = 0; i < PROCESSOR_COUNT_MAX; i ++) {
> > - log_printf (LOG_LEVEL_DEBUG,"Index %d has proc %s and count %d\n",
> > - i,
> > - totempg_ifaces_print (request_exec_sync_state.ckpt_refcnt[i].nodeid),
> > - request_exec_sync_state.ckpt_refcnt[i].count);
> > + if (request_exec_sync_state.ckpt_refcnt[i].nodeid) {
> > + log_printf (LOG_LEVEL_DEBUG,"Index %d has proc %s and count %d\n",
> > + i,
> > + totempg_ifaces_print (request_exec_sync_state.ckpt_refcnt[i].nodeid),
> > + request_exec_sync_state.ckpt_refcnt[i].count);
> > + }
> > }
> >
> > iovecs[0].iov_base = (char *)&request_exec_sync_state;
> > @@ -1108,6 +1111,8 @@
> > iovecs[2].iov_base = ((char*)checkpoint_section->section_data + recovery_section_data_offset);
> > iovecs[2].iov_len = newSectionSize;
> > request_exec_sync_section.header.size += iovecs[2].iov_len;
> > + request_exec_sync_section.ckpt_id = checkpoint->ckpt_id;
> > +
> > /*
> > * Check to see if we can queue the new message and if you can
> > * then mcast the message else break and create callback.
> > @@ -2014,10 +2019,12 @@
> > log_printf (LOG_LEVEL_DEBUG, "recovery_checkpoint_open %s\n", checkpoint_name->value);
> > log_printf (LOG_LEVEL_DEBUG, "recovery_checkpoint_open refcnt Values\n");
> > for (i = 0; i < PROCESSOR_COUNT_MAX; i ++) {
> > - log_printf (LOG_LEVEL_DEBUG,"Index %d has proc %s and count %d\n",
> > - i,
> > - totempg_ifaces_print (ref_cnt[i].nodeid),
> > - ref_cnt[i].count);
> > + if (ref_cnt[i].nodeid) {
> > + log_printf (LOG_LEVEL_DEBUG,"Index %d has proc %s and count %d\n",
> > + i,
> > + totempg_ifaces_print (ref_cnt[i].nodeid),
> > + ref_cnt[i].count);
> > + }
> > }
> >
> >
> >
> >
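
To illustrate why the two added `ckpt_id` assignments matter, here is a minimal sketch. It assumes that the receiving node identifies a checkpoint by its name together with its `ckpt_id`; that is an assumption about the surrounding code, not something shown in this diff. All types and functions below (`demo_checkpoint`, `demo_sync_msg`, `demo_find`, `demo_build_sync_msg`) are hypothetical and much simpler than the real structures in ckpt.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified checkpoint record. */
struct demo_checkpoint {
	char name[32];
	uint32_t ckpt_id;
};

/* Hypothetical, simplified sync message. */
struct demo_sync_msg {
	char name[32];
	uint32_t ckpt_id;
};

/* Look up a checkpoint by name and id (assumed receiver behaviour). */
static struct demo_checkpoint *demo_find(struct demo_checkpoint *tbl, size_t n,
	const char *name, uint32_t ckpt_id)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (strcmp(tbl[i].name, name) == 0 && tbl[i].ckpt_id == ckpt_id)
			return &tbl[i];
	}
	return NULL;
}

/*
 * Build a sync message for a checkpoint. With copy_id == 0 the id is left
 * at zero, which mimics the sender forgetting to set ckpt_id.
 */
static void demo_build_sync_msg(struct demo_sync_msg *msg,
	const struct demo_checkpoint *ckpt, int copy_id)
{
	assert(msg != NULL && ckpt != NULL);
	memset(msg, 0, sizeof(*msg));
	strncpy(msg->name, ckpt->name, sizeof(msg->name) - 1);
	if (copy_id)
		msg->ckpt_id = ckpt->ckpt_id;
}
```

With two checkpoints that share a name but carry different ids, a message built with `copy_id` set resolves to the intended entry, while one built without it matches nothing in this sketch.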
> > ------------------------------------------------------------------------
> >
> > _______________________________________________
> > Openais mailing list
> > Openais at lists.osdl.org
> > https://lists.osdl.org/mailman/listinfo/openais
>