[Openais] corosync trunk - patch to fix retransmit messages

Steven Dake sdake at redhat.com
Sat Mar 20 12:54:49 PDT 2010


HJ Lee identified a problem which is described in more detail below.  A
patch is attached to resolve it.

If the last message in the total order is not received by a processor, that
processor is still active, and no new messages are originated for
fail_to_recv_const (default = 50) token rotations, fail to recv is triggered
improperly.  The reason is that the wrong value is used when determining the
range of messages that should be checked for recovery, resulting in an "off by
X" error, where X is the number of messages at the end of the total order that
the processor has not received.

Example:
Processor A sends A=1 B=2 C=3
Processor B sends D=4 E=5 F=6
Processor C sends G=7 H=8 I=9

Processor A receives A(1), B(2), C(3), D(4), E(5), F(6), G(7), H(8), I(9)
Processor C receives A(1), B(2), C(3), D(4), E(5), F(6), G(7), H(8), I(9)
Processor B receives A(1), B(2), D(4), then has some transient fault in the
kernel which allows it to receive UDP packets but temporarily disrupts its
multicast transmit.

Processor B should request C, E, F, G, H, I to be added to the retransmit
list.

In the current code and example, processor B has a high_seq_received (the
highest sequence number the processor has received so far) of 4 and a
token->seq of 9.  It uses high_seq_received (4) - my_aru (the highest sequence
number up to which everything has been received, which is 2).  This gives a
range of 2, which requests recovery of missing messages in 3-4.  In this
example, totem will only recover C(3) but not E-I (the messages at the end of
the ordering).

Instead, the retransmit list should have a range of 7 (token->seq - processor's
my_aru).  This will request retransmissions of messages that are missing on the
local processor from 3-9.
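
The arithmetic for processor B in the example can be checked with a small
sketch.  This is not the totem code; it is an illustrative Python model, and
the helper name missing_in_range is hypothetical.  It only compares the two
upper bounds discussed above (high_seq_received versus token->seq).

```python
def missing_in_range(received, my_aru, upper):
    """Sequence numbers in (my_aru, upper] that are absent from `received`.

    Hypothetical helper for illustration only; not part of corosync.
    """
    return [seq for seq in range(my_aru + 1, upper + 1) if seq not in received]


# Processor B's state in the example: it holds A(1), B(2), D(4).
received = {1, 2, 4}
my_aru = 2             # everything up to 2 has been received
high_seq_received = 4  # highest sequence number seen locally
token_seq = 9          # highest sequence number in the total order

# Current behaviour: range = high_seq_received - my_aru = 2, covering 3-4.
old_request = missing_in_range(received, my_aru, high_seq_received)

# Proposed behaviour: range = token_seq - my_aru = 7, covering 3-9.
new_request = missing_in_range(received, my_aru, token_seq)

print(old_request)  # [3]
print(new_request)  # [3, 5, 6, 7, 8, 9]
```

With the larger range, the request covers C, E, F, G, H and I, which matches
the list processor B should ask for.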

If no new messages arrive within the fail_to_recv_const window to increase
high_seq_received on the processor, fail to recv occurs.
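
The effect of the stall can be illustrated with a toy model.  This is a
simplified sketch and not how totem implements the check: the function name
first_fail_to_recv, the per-rotation (my_aru, token_seq) pairs, and the exact
threshold comparison are all assumptions made for illustration.  It counts
consecutive token rotations in which a processor is behind the token and its
my_aru does not advance.

```python
def first_fail_to_recv(states, fail_to_recv_const=50):
    """Return the 1-based rotation at which the stall counter reaches the
    threshold, or None if it never does.

    `states` is a sequence of (my_aru, token_seq) pairs, one per token
    rotation.  Toy model only; the real totem logic may differ.
    """
    stalled = 0
    last_aru = None
    for rotation, (my_aru, token_seq) in enumerate(states, start=1):
        behind = my_aru < token_seq
        if behind and my_aru == last_aru:
            stalled += 1   # still behind and no progress since last rotation
        else:
            stalled = 0    # caught up, or made progress
        last_aru = my_aru
        if stalled >= fail_to_recv_const:
            return rotation
    return None


# Current behaviour: B recovers only C(3), so my_aru moves from 2 to 4 and
# then stays there while token->seq remains 9.
buggy = [(2, 9)] + [(4, 9)] * 60

# Proposed behaviour: B requests 3-9, so my_aru reaches 9 after recovery.
fixed = [(2, 9)] + [(9, 9)] * 60

print(first_fail_to_recv(buggy))
print(first_fail_to_recv(fixed))
```

In this model the first run reaches the threshold because my_aru never moves
past 4, while the second run never does because the processor catches up.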

-------------- next part --------------
A non-text attachment was scrubbed...
Name: corosync-trunk-fix-retransmit.patch
Type: text/x-patch
Size: 477 bytes
Desc: not available
Url : http://lists.linux-foundation.org/pipermail/openais/attachments/20100320/0bb70b47/attachment.bin 