[Openais] [Corosync] Corosync does not retransmit the lost mcast message

hj lee kerdosa at gmail.com
Fri Mar 19 15:45:54 PDT 2010


Hi Steve,

I added and changed some log messages, so my log won't match the source
tree. Anyway, I think I found the problem. This issue seems to happen
easily when multicast messages are sent infrequently. The problem is that the
rtr field is filled based on my_high_seq_received! It should be set based on
the token->seq value.

Let's assume a very simple case: just one mcast message (seq 77) was lost on
node2.

In node1:
all messages are received up to 77.
token seq = 77
my_aru = 77
my_high_seq_received = 77

In node2:
message 77 was lost.
my_aru = 76
token seq = 77
my_high_seq_received = 76


Once node2 gets into this state, it does not set the rtr field for the lost
message 77. Then my_aru_count keeps increasing and corosync enters
"FAILED TO RECEIVE" and gather. The totem spec says clearly that if the token
seq is greater than my_aru, this processor has lost some messages and should
set the rtr field to request retransmission.
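
As a rough illustration of the rule from the spec, here is a minimal sketch in C. It is not the corosync code: struct rtr_list, fill_rtr(), RTR_MAX and the node2_has() callback are hypothetical names made up for this example, and sequence-number wraparound is ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified types; NOT the real totemsrp.c structures. */
#define RTR_MAX 16

struct rtr_list {
    uint32_t seq[RTR_MAX];  /* sequence numbers to request again */
    size_t count;           /* number of valid entries in seq[] */
};

/* Assumed callback: returns nonzero if this node holds message `seq`. */
typedef int (*have_msg_fn)(uint32_t seq);

/*
 * Rule from the spec as described above: if the token seq is greater
 * than my_aru, every message in (my_aru, token_seq] that this node does
 * not hold is added to the retransmit request list.
 * Wraparound of sequence numbers is ignored in this sketch.
 */
void fill_rtr(uint32_t token_seq, uint32_t my_aru,
              have_msg_fn have, struct rtr_list *out)
{
    assert(out != NULL);
    out->count = 0;
    if (token_seq <= my_aru) {
        return;  /* nothing missing from this node's point of view */
    }
    uint32_t range = token_seq - my_aru;
    for (uint32_t i = 0; i < range && out->count < RTR_MAX; i++) {
        uint32_t s = my_aru + 1 + i;
        if (!have(s)) {
            out->seq[out->count++] = s;
        }
    }
}

/* node2 from the example: it holds everything up to 76, and 77 was lost. */
int node2_has(uint32_t seq)
{
    return seq <= 76;
}
```

With the node2 values (token seq = 77, my_aru = 76) this yields a single rtr entry for seq 77, which is the request that never gets made in the failing case.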

The related code is in orf_token_rtr() at totemsrp.c.

range = instance->my_high_seq_received - instance->my_aru;

The above line should be changed to:

range = orf_token->seq - instance->my_aru;
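
To make the difference concrete, the two range computations can be compared on the node states listed earlier. This is only a sketch: struct node_state and the two helper functions are hypothetical and stand in for the instance and token fields used in orf_token_rtr().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the relevant fields of the totemsrp instance. */
struct node_state {
    uint32_t my_aru;
    uint32_t my_high_seq_received;
};

/* Current behaviour: range derived from my_high_seq_received. */
uint32_t range_from_high_seq(const struct node_state *n)
{
    assert(n != NULL);
    return n->my_high_seq_received - n->my_aru;
}

/* Proposed behaviour: range derived from the token's seq. */
uint32_t range_from_token(const struct node_state *n, uint32_t token_seq)
{
    assert(n != NULL);
    return token_seq - n->my_aru;
}
```

For node2 (my_aru = 76, my_high_seq_received = 76, token seq = 77) the first computation gives 0, so no rtr entry is generated, while the second gives 1 and covers the lost message 77. For node1 (all three values 77) both give 0.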

What was the reason for introducing my_high_seq_received? The original spec
does not have this variable.

Thanks
hj


On Fri, Mar 19, 2010 at 9:59 AM, Steven Dake <sdake at redhat.com> wrote:

> can you please attach the logs from the last configuration change until
> the failure?
>
> It would really help me understand the condition so i can generate a
> reproducer.
>
> Thanks
> -steve
>
>