[Openais] [corosync] [patch] - Fix problems with long token timeout and cpg

David Teigland teigland at redhat.com
Wed Jul 1 11:46:03 PDT 2009


On Wed, Jul 01, 2009 at 06:21:14PM +0200, Jan Friesse wrote:
> The included patch should fix
> https://bugzilla.redhat.com/show_bug.cgi?id=506255 .
> 
> David, I hope it will fix the problem for you.
> 
> It's based on the simple idea of adding a node startup timestamp at the end
> of cpg_join (and joinlist) calls. If the timestamp is larger than the old
> timestamp we know, the node was restarted and we didn't notice -> deliver a
> leave event and then a join event. If the timestamp is the same (or, in
> special cases, lower) -> a new cpg app joined -> send only a join event.
> 
> Of course, the patch isn't so simple. Cpg_join messages are always sent as
> larger messages with a timestamp (btw. the timestamp is a 64-bit value,
> because I expect a l(o^64)ng life for corosync ;) ). On delivery, we test
> whether the message is larger than a standard message. If it is -> we have
> a ts -> use it.
> 
> The bigger problem was joinlist, because it's an array ... you will see in
> the source. The solution is to send a special entry with pid 0 (a process
> should never have pid 0) and the timestamp encoded in the name
> (ugly, but it seems to work).
> 
> Please comment, if you can.
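
For readers following the thread, the scheme described in the quoted text can be sketched roughly as below. This is only an illustration of the idea, not the code from the patch: the names classify_join, join_entry, make_ts_entry and read_ts_entry, the 128-byte name field, the hex encoding of the timestamp, and the convention that a stored timestamp of 0 means "node not seen before" are all assumptions made for the example.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* All names below are hypothetical; they are not corosync's structures. */
enum join_action { JOIN_ONLY, LEAVE_THEN_JOIN };

/*
 * Decide which events to deliver when a join carrying a startup
 * timestamp arrives. known_ts is the timestamp previously recorded
 * for that node; 0 is assumed here to mean "no timestamp recorded".
 */
static enum join_action classify_join(uint64_t known_ts, uint64_t msg_ts)
{
    /* Larger timestamp: the node restarted and nobody noticed. */
    if (known_ts != 0 && msg_ts > known_ts)
        return LEAVE_THEN_JOIN;
    /* Same (or lower) timestamp: just another cpg app joining. */
    return JOIN_ONLY;
}

#define ENTRY_NAME_LEN 128 /* assumed size of the name field */

struct join_entry {
    uint32_t pid;
    char name[ENTRY_NAME_LEN];
};

/* Build the special joinlist entry: pid 0, timestamp encoded in name. */
static void make_ts_entry(struct join_entry *e, uint64_t ts)
{
    assert(e != NULL);
    e->pid = 0;
    snprintf(e->name, sizeof e->name, "%016" PRIx64, ts);
}

/* Return 1 and fill *ts if e is a timestamp entry, 0 otherwise. */
static int read_ts_entry(const struct join_entry *e, uint64_t *ts)
{
    if (e->pid != 0)
        return 0; /* an ordinary process entry */
    return sscanf(e->name, "%" SCNx64, ts) == 1;
}
```

A receiver would call read_ts_entry on each joinlist entry, and pass any timestamp it finds to classify_join together with the value it has stored for the sending node.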

This isn't specifically a cpg bug/problem, it's a problem with
corosync/openais in general.  When a node joins the cluster before others have
recognized it failed, the other nodes should immediately recognize it has
previously failed and process a complete failure for it.

Dave


