[Openais] whitetank/flatiron node id mismatch on big-endian platforms

Dejan Muhamedagic dejan at suse.de
Fri Mar 26 02:17:39 PDT 2010


Hi,

On Tue, Mar 16, 2010 at 02:10:25AM -0700, Steven Dake wrote:
> On Tue, 2010-03-09 at 21:17 +0100, Dejan Muhamedagic wrote:
> > Hello,
> > 
> > An upgrade of one node from whitetank (openais) to flatiron
> > (corosync) made that node incapable of joining the cluster again.
> > During the upgrade the other nodes were running. The issue turned
> > out to be that between the two releases the node ids are
> > calculated differently on big endian platforms (this was on
> > s390x). Actually, openais/corosync didn't find this a problem,
> > but pacemaker couldn't match the old and the new node id.
> > 
> > Reverting the following patch fixed the issue:
> > 
> I wanted to think about this more before responding, so please excuse
> the delay.
> 
> My initial thought is that we put this patch in to fix a serious problem
> with endianness coming from the cpg service for nodeids (and affecting
> all other clusters).  The root of the problem was that the endianness of
> the nodeid was never known coming out of cpg since it wasn't stored in
> any particular byte order.
> 
> I still think this patch is correct but it does indeed break backward
> compat.  One option is to default to the pre-patch behavior if
> compatibility:whitetank is set.

That sounds like a good idea.

> I'll have to think through how to do that in a non-ABI breaking way.  A
> patch that someone else writes to do this is also a good solution :-).

I'm not sure how that would break the ABI. I looked a bit into how
to make a patch, but the compatibility information is static in
main.c (minimum_sync_mode).
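
To make the idea concrete, here is a minimal sketch of such a switch,
not a tested patch. The function name nodeid_from_ipv4 and the
whitetank_compat parameter are hypothetical; how the setting would
actually reach totemip.c is exactly the open question above. The
endianness is probed at run time only to keep the sketch
self-contained; the real code uses __BYTE_ORDER at compile time.

```c
#include <stdint.h>
#include <string.h>

/* Reverse the four bytes of a 32-bit value (stand-in for swab32). */
static uint32_t swap32_sketch(uint32_t x)
{
    return (x >> 24) | ((x >> 8) & 0x0000FF00u) |
           ((x << 8) & 0x00FF0000u) | (x << 24);
}

/* Run-time endianness probe, used here instead of __BYTE_ORDER. */
static int host_is_big_endian(void)
{
    const uint16_t probe = 1;
    uint8_t first_byte;

    memcpy(&first_byte, &probe, 1);
    return first_byte == 0;
}

/*
 * Derive a nodeid from an IPv4 address given in network byte order.
 * whitetank_compat is a hypothetical flag: when non-zero, the swap
 * added in revision 2429 is skipped, which is the pre-patch behavior.
 */
static uint32_t nodeid_from_ipv4(const uint8_t addr[4],
                                 int whitetank_compat,
                                 int mask_high_bit)
{
    uint32_t nodeid = 0;

    memcpy(&nodeid, addr, sizeof(nodeid));
    if (host_is_big_endian() && !whitetank_compat) {
        nodeid = swap32_sketch(nodeid);
    }
    if (mask_high_bit) {
        nodeid &= 0x7FFFFFFFu;
    }
    return nodeid;
}
```

With the flag clear, 192.168.100.13 gives 224700608 on any host. With
the flag set, a big-endian host keeps the unswapped value 0xC0A8640D
instead, which is what the remaining whitetank nodes would expect.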

> Another/parallel option is to maintain a revert of the patch in your
> packages until we get this problem sorted out.

Yes, that's what we'll do.

> Would you file a bugzilla against the corosync package in fedora
> rawhide?  It is where I track fedora bugs so I don't lose track of it.

https://bugzilla.redhat.com/show_bug.cgi?id=577129

Cheers,

Dejan

> Regards
> -steve
> 
> > Index: branches/flatiron/exec/totemip.c
> > ===================================================================
> > --- branches/flatiron/exec/totemip.c    (revision 2428)
> > +++ branches/flatiron/exec/totemip.c    (revision 2429)
> > @@ -376,6 +376,9 @@
> >               */
> >              totemip_sockaddr_to_totemip_convert((struct sockaddr_storage *)sockaddr_in, boundto);
> >              boundto->nodeid = sockaddr_in->sin_addr.s_addr;
> > +#if __BYTE_ORDER == __BIG_ENDIAN
> > +            boundto->nodeid = swab32 (boundto->nodeid);
> > +#endif
> > 
> >              if (ioctl(id_fd, SIOCGLIFFLAGS, &lifreq[i]) < 0) {
> >                  printf ("couldn't do ioctl\n");
> > @@ -614,6 +617,9 @@
> >      if (ipaddr.family == AF_INET && ipaddr.nodeid == 0) {
> >                  unsigned int nodeid = 0;
> >                  memcpy (&nodeid, ipaddr.addr, sizeof (int));
> > +#if __BYTE_ORDER == __BIG_ENDIAN
> > +        nodeid = swab32 (nodeid);
> > +#endif
> >          if (mask_high_bit) {
> >                          nodeid &= 0x7FFFFFFF;
> >          }
> > 
> > With flatiron the nodeids now do appear the same on both big and
> > little endian platforms, but this regression prevents rolling
> > upgrades of single nodes. Also, the ids are in reversed byte order:
> > for instance, 192.168.100.13 gets the id 224700608 (hex 0D64A8C0).
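
The arithmetic behind that example can be checked with a small
host-independent sketch. It assumes the pre-patch code is the same
memcpy of the address bytes without the swap, as the diff above
suggests. The helper names are made up for illustration and
swap32_sketch only stands in for swab32.

```c
#include <stdint.h>

/* Reverse the four bytes of a 32-bit value (stand-in for swab32). */
static uint32_t swap32_sketch(uint32_t x)
{
    return (x >> 24) | ((x >> 8) & 0x0000FF00u) |
           ((x << 8) & 0x00FF0000u) | (x << 24);
}

/* What memcpy() of the address bytes yields on a little-endian host. */
static uint32_t id_as_little_endian(const uint8_t a[4])
{
    return (uint32_t)a[0] | ((uint32_t)a[1] << 8) |
           ((uint32_t)a[2] << 16) | ((uint32_t)a[3] << 24);
}

/* What the same memcpy() yields on a big-endian host, before any swap. */
static uint32_t id_as_big_endian(const uint8_t a[4])
{
    return ((uint32_t)a[0] << 24) | ((uint32_t)a[1] << 16) |
           ((uint32_t)a[2] << 8) | (uint32_t)a[3];
}
```

For 192.168.100.13 the address bytes are C0 A8 64 0D. Read
little-endian they give 0x0D64A8C0 = 224700608. Read big-endian they
give 0xC0A8640D = 3232261133, and swapping that value gives 224700608
again, which is why an unswapped big-endian node and a swapped one
disagree about the same address.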
> > 
> > There is some discussion at the Novell bugzilla:
> > https://bugzilla.novell.com/show_bug.cgi?id=584976
> > 
> > Thanks,
> > 
> > Dejan

