[Openais] [corosync] [patch] - Fix problems with long token timeout and cpg

Steven Dake sdake at redhat.com
Fri Jul 10 07:15:55 PDT 2009


Honza,

This patch looks alright.  I think we also need a further unique value
which represents the cpd address.  The cpd address and nodeid together are
unique comparators for finding cpds.  This can then be used to match
leave messages uniquely with the original join messages in the
case where we have more than one handle joined to the same process group
within a process.
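
A minimal sketch of the kind of comparator I mean.  The struct and
function names here are made up for illustration and are not in the
tree; the cpd address is treated as an opaque 64-bit token that is
only meaningful together with the nodeid it came from.

```c
#include <stdint.h>

/* Hypothetical identity of one cpg_join call (names are illustrative). */
struct join_key {
    uint32_t nodeid;    /* node the joining process runs on */
    uint64_t cpd_addr;  /* address of the cpd on that node, used only as a token */
};

/* Two keys refer to the same join only if both fields match. */
static int join_key_equal(const struct join_key *a, const struct join_key *b)
{
    return a->nodeid == b->nodeid && a->cpd_addr == b->cpd_addr;
}
```

With a key like this, two handles in the same process compare as
different, and the same address on two nodes does as well.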

Test case attached (which currently fails).

I've looked briefly at solving the problem and found three things.
First, do_proc_join's call to process_info_find should be placed
inside the message_handler_req_exec_cpg_joinlist while loop to allow a
new proc join to be added to the list when cpg_join is called from the
library.

Second, the while loop in message_handler_req_exec_cpg_procleave should
probably break immediately after it executes list_del.
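
Roughly what I have in mind for the loop, as a standalone sketch.  It
uses its own tiny circular list rather than our list helpers, and
proc_entry, node_del and remove_first_match are illustrative names,
not the real ones.

```c
#include <stddef.h>
#include <stdint.h>

/* Minimal circular doubly linked list with a sentinel head. */
struct node { struct node *prev, *next; };

struct proc_entry {
    struct node link;   /* first member, so a node pointer can be cast back */
    uint32_t nodeid;
    uint32_t pid;
};

static void node_init(struct node *h) { h->prev = h->next = h; }

static void node_add_tail(struct node *n, struct node *h)
{
    n->prev = h->prev;
    n->next = h;
    h->prev->next = n;
    h->prev = n;
}

static void node_del(struct node *n)
{
    n->prev->next = n->next;
    n->next->prev = n->prev;
    n->prev = n->next = n;
}

static size_t list_count(const struct node *h)
{
    size_t count = 0;
    const struct node *iter;
    for (iter = h->next; iter != h; iter = iter->next)
        count++;
    return count;
}

/* Unlink at most one matching entry; returns it, or NULL if none matched. */
static struct proc_entry *remove_first_match(struct node *head,
                                             uint32_t nodeid, uint32_t pid)
{
    struct proc_entry *removed = NULL;
    struct node *iter = head->next;

    while (iter != head) {
        struct proc_entry *pe = (struct proc_entry *)iter;
        if (pe->nodeid == nodeid && pe->pid == pid) {
            node_del(iter);
            removed = pe;
            break;      /* iter is no longer on the list, so stop here */
        }
        iter = iter->next;
    }
    return removed;
}
```

Without the break the loop would keep walking from an entry that has
just been unlinked, and one leave could also remove a second entry
that happens to match.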

Third, the search for cpds in the join and leave handling needs
another comparison which includes the cpd's address from the join
message contents.  Otherwise it sets the state to UNJOINED (because it
matches the first cpd in the list) and causes further leaves to fail
after 10-20 are processed.
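
To illustrate the mismatch, here is a small sketch with two handles
from one process.  Everything in it (cpd_sketch, the two state names,
both lookup functions) is hypothetical and only mirrors the shape of
the problem, not the actual code.

```c
#include <stddef.h>
#include <stdint.h>

enum cpd_state_sketch { SKETCH_JOINED, SKETCH_UNJOINED };  /* illustrative */

struct cpd_sketch {
    uint32_t pid;
    uint64_t addr;                 /* per-handle token from the join message (assumed) */
    enum cpd_state_sketch state;
};

/* Lookup by pid only: with two handles in one process this always
 * returns the first entry, whichever handle actually left. */
static struct cpd_sketch *find_by_pid(struct cpd_sketch *v, size_t n, uint32_t pid)
{
    size_t i;
    for (i = 0; i < n; i++)
        if (v[i].pid == pid)
            return &v[i];
    return NULL;
}

/* Lookup that also compares the address carried in the message. */
static struct cpd_sketch *find_by_pid_and_addr(struct cpd_sketch *v, size_t n,
                                               uint32_t pid, uint64_t addr)
{
    size_t i;
    for (i = 0; i < n; i++)
        if (v[i].pid == pid && v[i].addr == addr)
            return &v[i];
    return NULL;
}
```

A leave for the second handle found with find_by_pid would flip the
first entry to the unjoined state, which is the behaviour I'm seeing.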

Regards
-steve


On Wed, 2009-07-01 at 18:21 +0200, Jan Friesse wrote:
> Included patch should fix
> https://bugzilla.redhat.com/show_bug.cgi?id=506255 .
> 
> David, I hope it will fix the problem for you.
> 
> It's based on the simple idea of adding the node startup timestamp at the
> end of cpg_join (and joinlist) calls. If the timestamp is larger than the
> old timestamp we know, the node was restarted and we didn't notice ->
> deliver leave event and then join event. If the timestamp is the same (or
> in special cases lower) -> new cpg app joined -> send only join event.
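
As I read it, the decision comes down to something like the sketch
below.  The enum and function names are mine, not from the patch, and
the "special cases lower" branch is simply folded into the join-only
case here.

```c
#include <stdint.h>

enum join_action {
    DELIVER_JOIN_ONLY,        /* same node incarnation, a new cpg app joined */
    DELIVER_LEAVE_THEN_JOIN   /* node restarted without us noticing */
};

/* known_ts: startup timestamp we last recorded for the node.
 * msg_ts:   startup timestamp carried in the incoming join. */
static enum join_action classify_join(uint64_t known_ts, uint64_t msg_ts)
{
    return (msg_ts > known_ts) ? DELIVER_LEAVE_THEN_JOIN : DELIVER_JOIN_ONLY;
}
```
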
> 
> Of course, the patch isn't so simple. Cpg_join messages are always sent as
> larger messages with a timestamp (btw. the timestamp is a 64-bit value,
> because I expect l(o^64)ng life of corosync ;) ). On delivery, we test
> whether the message is larger than a standard message. If it is -> we
> have ts -> use it.
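
A sketch of the size test as I understand it.  The two message
layouts are invented for the example and do not match the real wire
structs; only the "is it long enough to carry a timestamp" check is
the point.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical layouts: the extended form is the base form plus a timestamp. */
struct join_msg_base {
    uint32_t pid;
    uint32_t reason;
};

struct join_msg_ts {
    struct join_msg_base base;
    uint64_t startup_ts;
};

/* Returns 1 and stores the timestamp if the buffer is long enough to be
 * the extended form, otherwise returns 0 and leaves *ts untouched. */
static int join_msg_get_ts(const void *buf, size_t len, uint64_t *ts)
{
    struct join_msg_ts msg;

    if (len < sizeof(msg))
        return 0;
    memcpy(&msg, buf, sizeof(msg));   /* copy out to avoid unaligned access */
    *ts = msg.startup_ts;
    return 1;
}
```
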
> 
> A bigger problem was joinlist, because it's an array, ... you will see in
> the source. The solution is to send a special entry with pid 0 (a process
> should never have pid 0) and the timestamp encoded in the name
> (ugly, but it seems to work).
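
For the record, this is how I picture the special entry.  The entry
layout, the 128-byte name size and both helpers are assumptions for
the sketch, and it copies the timestamp in host byte order, which the
real code may not want.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical joinlist entry; the real one has a different layout. */
struct joinlist_entry_sketch {
    uint32_t pid;
    uint32_t name_len;
    char name[128];
};

/* Build the marker entry: pid 0, timestamp bytes stored in the name field. */
static void make_ts_entry(struct joinlist_entry_sketch *e, uint64_t ts)
{
    memset(e, 0, sizeof(*e));
    e->pid = 0;
    e->name_len = sizeof(ts);
    memcpy(e->name, &ts, sizeof(ts));
}

/* Returns 1 and stores the timestamp if e is a marker entry, else 0. */
static int read_ts_entry(const struct joinlist_entry_sketch *e, uint64_t *ts)
{
    if (e->pid != 0 || e->name_len != sizeof(*ts))
        return 0;
    memcpy(ts, e->name, sizeof(*ts));
    return 1;
}
```
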
> 
> Please comment, if you can.
> 
> Regards,
>   Honza
-------------- next part --------------
A non-text attachment was scrubbed...
Name: corosync-trunk-stress-cpgjoinleave.patch
Type: text/x-patch
Size: 4791 bytes
Desc: not available
Url : http://lists.linux-foundation.org/pipermail/openais/attachments/20090710/b4b65b97/attachment.bin 