[Openais] Logs during reconfiguration (node lost)

Steven Dake sdake at mvista.com
Tue Feb 22 11:35:42 PST 2005


On Mon, 2005-02-21 at 10:19, Kristen Smith wrote:
> Hi Steve,
> 
> We had some traffic running this weekend (5+1) and one of the nodes
> died (the same aisexec: ../include/sq.h:152: sq_item_get: Assertion
> `sq_position >= 0' failed. that is already reported). In looking
> through the logs when this happened, I am confused about something and
> maybe you can clear this up for me.
> 
> We had 6 nodes (47.104.22.82 - 47.104.22.87) - the failure occurred on
> .84. The reconfig looks the same on 4 of the remaining nodes and
> different on another one. The logs are shown below. 
> 
> My questions are:
> 
> 1) why do all but .86 think that .84 AND .86 went away - .84 died, so
> that makes sense, but why .86 as well?
> 2) why does .86 think all other nodes went away and it is all by
> itself?
> 3) both .82 and .86 think they are the rep and create new commit
> tokens - I guess this is because .86 thinks it is in a cluster by
> itself and .82 was the original rep.
> 

Kristen
I saw this problem a few weeks ago.  I spent about a day debugging it
and came to the conclusion that either the algorithm is flawed in this
respect, or there is some bug in the membership algorithm.  Basically
what happens is that processor 86 thinks all other processors have
failed (everything except the local processor), while the rest of the
processors have reached a new consensus.  More specifically, when all
other processors are marked failed, memb_consensus_agreed() returns
true.  It does this because the local processor's consensus bit is set,
and after subtracting the failed set from the proc set, only the local
processor remains.  Hence, there is only one processor in the consensus
set, and consensus is trivially reached.  Since the lowest
configuration id is the local processor, this causes a commit token to
be created by memb_state_commit_token_create().  What makes this even
worse is that the rest of the processors think 86 is valid (not in the
failed set but in the proc set), so they send their commit token to it.
Hence, both sides enter the commit state but they are not in consensus.
This should be looked at in more detail, but right now it's pretty low
priority.
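
To make the degenerate case above concrete, here is a minimal sketch of
the check as described.  It is not the actual openais code: the bitmask
representation, the names proc_bit(), consensus_agreed_sketch() and
live_count_sketch(), and the mapping of nodes .82-.87 to bits 0-5 are
all assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model: one bit per processor (.82 -> bit 0 ... .87 -> bit 5). */
typedef uint32_t proc_set_t;

static proc_set_t proc_bit(unsigned idx)
{
    assert(idx < 32);
    return (proc_set_t)1u << idx;
}

/*
 * Sketch of the consensus test: every processor in (proc set minus
 * failed set) must have its consensus bit set.
 */
static bool consensus_agreed_sketch(proc_set_t proc_set,
                                    proc_set_t failed_set,
                                    proc_set_t consensus_set)
{
    proc_set_t live = proc_set & ~failed_set;
    return (live & ~consensus_set) == 0;
}

/* Number of processors left after removing the failed set. */
static unsigned live_count_sketch(proc_set_t proc_set, proc_set_t failed_set)
{
    proc_set_t live = proc_set & ~failed_set;
    unsigned n = 0;
    while (live != 0) {
        n += live & 1u;
        live >>= 1;
    }
    return n;
}
```

With the proc set 0x3F (all six nodes), a failed set containing
everything except bit 4 (.86), and only bit 4 set in the consensus set,
consensus_agreed_sketch() returns true while live_count_sketch()
returns 1, which is the one-processor "consensus" seen on .86.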

This was one cause of the commit token assert we were seeing earlier.  I
fixed this assert by ensuring that the entire ring id, instead of just
the ring seq no, is used when deciding whether to accept the commit
token.  So the above case is fixed, but this probably just masks the
original problem.
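
As a rough illustration of why the whole ring id matters, the sketch
below compares two ring ids both ways.  The struct layout, the field
names and the function names are hypothetical and do not come from the
openais source; the only thing taken from the description above is that
a ring id carries more than the sequence number (here assumed to be the
representative's identity plus the sequence number).

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical ring id: who created the ring, plus its sequence number. */
struct ring_id_sketch {
    uint32_t rep;   /* assumed: identifier of the representative */
    uint64_t seq;   /* ring sequence number */
};

/* Weak check: only the sequence number is compared. */
static bool ring_seq_equal_sketch(const struct ring_id_sketch *a,
                                  const struct ring_id_sketch *b)
{
    return a->seq == b->seq;
}

/* Full check: representative and sequence number must both match. */
static bool ring_id_equal_sketch(const struct ring_id_sketch *a,
                                 const struct ring_id_sketch *b)
{
    return a->rep == b->rep && a->seq == b->seq;
}
```

If two representatives each create a commit token and happen to use the
same sequence number, the weak check treats the tokens as belonging to
the same ring, while the full check tells them apart.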

I believe we are ok for now until I have more cycles to look at this
problem in detail.

Thanks
-steve


> Also, this is just the beginning of the reconfiguration at this time -
> all nodes do multiple reconfigurations after this one caused by the
> failure. I can send all logs along later if you want. Eventually
> (within a second or so after this initial reconfig), all the nodes
> wind up seeing each other and the ring is reformed in a 5+0 scenario.
> 

Yes, at least this part works properly.

> Thanks,
> Kristen
> 
> Here are the logs when the failure occurred:
> 
> .82:
> Feb 19  2:24:23 [NOTICE  ] [GMI  ] Creating commit token because I am
> the rep.
> Feb 19  2:24:23 [NOTICE  ] [GMI  ] Storing new sequence id for ring
> 4228
> Feb 19  2:24:23 [NOTICE  ] [GMI  ] entering COMMIT state.
> Feb 19  2:24:23 [NOTICE  ] [GMI  ] entering GATHER state.
> Feb 19  2:24:23 [NOTICE  ] [GMI  ] entering GATHER state.
> Feb 19  2:24:24 [NOTICE  ] [GMI  ] Creating commit token because I am
> the rep.
> Feb 19  2:24:24 [NOTICE  ] [GMI  ] Storing new sequence id for ring
> 4232
> Feb 19  2:24:24 [NOTICE  ] [GMI  ] entering COMMIT state.
> Feb 19  2:24:24 [NOTICE  ] [GMI  ] entering RECOVERY state.
> Feb 19  2:24:24 [NOTICE  ] [GMI  ] Sending initial ORF token
> Feb 19  2:24:24 [NOTICE  ] [CLM  ] CLM CONFIGURATION CHANGE
> Feb 19  2:24:24 [NOTICE  ] [CLM  ] New Configuration:
> Feb 19  2:24:24 [NOTICE  ] [CLM  ]      47.104.22.82
> Feb 19  2:24:24 [NOTICE  ] [CLM  ]      47.104.22.83
> Feb 19  2:24:24 [NOTICE  ] [CLM  ]      47.104.22.85
> Feb 19  2:24:24 [NOTICE  ] [CLM  ]      47.104.22.87
> Feb 19  2:24:24 [NOTICE  ] [CLM  ] Members Left:
> Feb 19  2:24:24 [NOTICE  ] [CLM  ]      47.104.22.84
> Feb 19  2:24:24 [NOTICE  ] [CLM  ]      47.104.22.86
> Feb 19  2:24:24 [NOTICE  ] [CLM  ] Members Joined:
> Feb 19  2:24:24 [NOTICE  ] [CLM  ] CLM CONFIGURATION CHANGE
> Feb 19  2:24:24 [NOTICE  ] [CLM  ] New Configuration:
> Feb 19  2:24:24 [NOTICE  ] [CLM  ]      47.104.22.82
> Feb 19  2:24:24 [NOTICE  ] [CLM  ]      47.104.22.83
> Feb 19  2:24:24 [NOTICE  ] [CLM  ]      47.104.22.85
> Feb 19  2:24:24 [NOTICE  ] [CLM  ]      47.104.22.87
> Feb 19  2:24:24 [NOTICE  ] [CLM  ] Members Left:
> Feb 19  2:24:24 [NOTICE  ] [CLM  ] Members Joined:
> Feb 19  2:24:24 [NOTICE  ] [GMI  ] entering OPERATIONAL state.
> 
> .83, .85, .87:
> Feb 19  2:14:56 [NOTICE  ] [GMI  ] entering GATHER state.
> Feb 19  2:14:56 [NOTICE  ] [GMI  ] entering GATHER state.
> Feb 19  2:14:57 [NOTICE  ] [GMI  ] Storing new sequence id for ring
> 4232
> Feb 19  2:14:57 [NOTICE  ] [GMI  ] entering COMMIT state.
> Feb 19  2:14:57 [NOTICE  ] [GMI  ] entering RECOVERY state.
> Feb 19  2:14:57 [NOTICE  ] [CLM  ] CLM CONFIGURATION CHANGE
> Feb 19  2:14:57 [NOTICE  ] [CLM  ] New Configuration:
> Feb 19  2:14:57 [NOTICE  ] [CLM  ]      47.104.22.82
> Feb 19  2:14:57 [NOTICE  ] [CLM  ]      47.104.22.83
> Feb 19  2:14:57 [NOTICE  ] [CLM  ]      47.104.22.85
> Feb 19  2:14:57 [NOTICE  ] [CLM  ]      47.104.22.87
> Feb 19  2:14:57 [NOTICE  ] [CLM  ] Members Left:
> Feb 19  2:14:57 [NOTICE  ] [CLM  ]      47.104.22.84
> Feb 19  2:14:57 [NOTICE  ] [CLM  ]      47.104.22.86
> Feb 19  2:14:57 [NOTICE  ] [CLM  ] Members Joined:
> Feb 19  2:14:57 [NOTICE  ] [CLM  ] CLM CONFIGURATION CHANGE
> Feb 19  2:14:57 [NOTICE  ] [CLM  ] New Configuration:
> Feb 19  2:14:57 [NOTICE  ] [CLM  ]      47.104.22.82
> Feb 19  2:14:57 [NOTICE  ] [CLM  ]      47.104.22.83
> Feb 19  2:14:57 [NOTICE  ] [CLM  ]      47.104.22.85
> Feb 19  2:14:57 [NOTICE  ] [CLM  ]      47.104.22.87
> Feb 19  2:14:57 [NOTICE  ] [CLM  ] Members Left:
> Feb 19  2:14:57 [NOTICE  ] [CLM  ] Members Joined:
> Feb 19  2:14:57 [NOTICE  ] [GMI  ] entering OPERATIONAL state.
> 
> .86:
> Feb 19  2:20:31 [NOTICE  ] [GMI  ] The token was lost in state 1 from
> timer 270f
> Feb 19  2:20:31 [NOTICE  ] [GMI  ] entering GATHER state.
> Feb 19  2:20:31 [NOTICE  ] [GMI  ] entering GATHER state.
> Feb 19  2:20:31 [NOTICE  ] [GMI  ] entering GATHER state.
> Feb 19  2:20:31 [NOTICE  ] [GMI  ] entering GATHER state.
> Feb 19  2:20:31 [NOTICE  ] [GMI  ] entering GATHER state.
> Feb 19  2:20:31 [NOTICE  ] [GMI  ] entering GATHER state.
> Feb 19  2:20:31 [NOTICE  ] [GMI  ] Creating commit token because I am
> the rep.
> Feb 19  2:20:31 [NOTICE  ] [GMI  ] Storing new sequence id for ring
> 4236
> Feb 19  2:20:31 [NOTICE  ] [GMI  ] entering COMMIT state.
> Feb 19  2:20:31 [NOTICE  ] [GMI  ] entering RECOVERY state.
> Feb 19  2:20:31 [NOTICE  ] [GMI  ] Sending initial ORF token
> Feb 19  2:20:31 [NOTICE  ] [CLM  ] CLM CONFIGURATION CHANGE
> Feb 19  2:20:31 [NOTICE  ] [CLM  ] New Configuration:
> Feb 19  2:20:31 [NOTICE  ] [CLM  ]      47.104.22.86
> Feb 19  2:20:31 [NOTICE  ] [CLM  ] Members Left:
> Feb 19  2:20:31 [NOTICE  ] [CLM  ]      47.104.22.82
> Feb 19  2:20:31 [NOTICE  ] [CLM  ]      47.104.22.83
> Feb 19  2:20:31 [NOTICE  ] [CLM  ]      47.104.22.84
> Feb 19  2:20:31 [NOTICE  ] [CLM  ]      47.104.22.85
> Feb 19  2:20:31 [NOTICE  ] [CLM  ]      47.104.22.87
> Feb 19  2:20:31 [NOTICE  ] [CLM  ] Members Joined:
> Feb 19  2:20:31 [NOTICE  ] [CLM  ] CLM CONFIGURATION CHANGE
> Feb 19  2:20:31 [NOTICE  ] [CLM  ] New Configuration:
> Feb 19  2:20:31 [NOTICE  ] [CLM  ]      47.104.22.86
> Feb 19  2:20:31 [NOTICE  ] [CLM  ] Members Left:
> Feb 19  2:20:31 [NOTICE  ] [CLM  ] Members Joined:
> Feb 19  2:20:31 [NOTICE  ] [GMI  ] entering OPERATIONAL state.
> 