[Openais] whitetank/flatiron node id mismatch on big-endian platforms
Steven Dake
sdake at redhat.com
Tue Mar 16 02:10:25 PDT 2010
On Tue, 2010-03-09 at 21:17 +0100, Dejan Muhamedagic wrote:
> Hello,
>
> An upgrade of one node from whitetank (openais) to flatiron
> (corosync) made that node incapable of joining the cluster again.
> During the upgrade the other nodes were running. The issue turned
> out to be that between the two releases the node ids are
> calculated differently on big endian platforms (this was on
> s390x). Actually, openais/corosync didn't find this a problem,
> but pacemaker couldn't match the old and the new node id.
>
> Reverting the following patch fixed the issue:
>
I wanted to think about this more before responding, so please excuse
the delay.
My initial thought is that we put this patch in to fix a serious problem
with the endianness of nodeids coming from the cpg service (and affecting
all other clusters). The root of the problem was that the endianness of
the nodeid coming out of cpg was never known, since it wasn't stored in
any particular byte order.
I still think this patch is correct, but it does indeed break backward
compatibility. One option is to default to the pre-patch behavior if
compatibility:whitetank is set.
I'll have to think through how to do that in a non-ABI breaking way. A
patch that someone else writes to do this is also a good solution :-).
Another option, which can be done in parallel, is to maintain a revert
of the patch in your packages until we get this problem sorted out.
Would you file a bugzilla against the corosync package in Fedora
rawhide? That is where I track Fedora bugs, so I won't lose track of it.
Regards
-steve
> Index: branches/flatiron/exec/totemip.c
> ===================================================================
> --- branches/flatiron/exec/totemip.c (revision 2428)
> +++ branches/flatiron/exec/totemip.c (revision 2429)
> @@ -376,6 +376,9 @@
> */
> totemip_sockaddr_to_totemip_convert((struct sockaddr_storage *)sockaddr_in, boundto);
> boundto->nodeid = sockaddr_in->sin_addr.s_addr;
> +#if __BYTE_ORDER == __BIG_ENDIAN
> + boundto->nodeid = swab32 (boundto->nodeid);
> +#endif
>
> if (ioctl(id_fd, SIOCGLIFFLAGS, &lifreq[i]) < 0) {
> printf ("couldn't do ioctl\n");
> @@ -614,6 +617,9 @@
> if (ipaddr.family == AF_INET && ipaddr.nodeid == 0) {
> unsigned int nodeid = 0;
> memcpy (&nodeid, ipaddr.addr, sizeof (int));
> +#if __BYTE_ORDER == __BIG_ENDIAN
> + nodeid = swab32 (nodeid);
> +#endif
> if (mask_high_bit) {
> nodeid &= 0x7FFFFFFF;
> }
>
> With flatiron the nodeids now appear the same on both big- and
> little-endian platforms, but this regression prevents rolling
> upgrades of single nodes. Also, the ids are in reversed byte order:
> for instance, 192.168.100.13 gets the id 224700608 (hex 0D64A8C0).
>
> There is some discussion at the Novell bugzilla:
> https://bugzilla.novell.com/show_bug.cgi?id=584976
>
> Thanks,
>
> Dejan