[Openais] [Corosync] Corosync does not retransmit the lost mcast message
hj lee
kerdosa at gmail.com
Fri Mar 19 15:45:54 PDT 2010
Hi Steve,
I added and changed some log messages, so my log won't match the source
tree. Anyway, I think I found the problem. This issue seems to happen
easily when multicast messages are sent infrequently. The problem is that
the rtr field is filled based on my_high_seq_received! It should be set
based on the token->seq value.
Let's assume a very simple case: just one mcast message (seq 77) was lost
on node2.
In node1:
all messages are received up to 77.
token seq = 77
my_aru = 77
my_high_seq_received = 77
In node2:
message 77 was lost.
my_aru = 76
token seq = 77
my_high_seq_received = 76
Once node2 gets into this state, it never sets the rtr field for the lost
message 77, because my_high_seq_received equals my_aru and the computed
range is 0. Then my_aru_count keeps increasing, and corosync enters
"FAILED TO RECEIVE" and then gather. The totem spec says clearly that if the
token seq is greater than my_aru, this processor has lost some messages and
should set the rtr field to request retransmission.
The related code is in orf_token_rtr() in totemsrp.c:
    range = instance->my_high_seq_received - instance->my_aru;
The above line should be changed to:
    range = orf_token->seq - instance->my_aru;
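To make the difference concrete, here is a minimal standalone sketch of the two range computations. It is not the real totemsrp.c code: struct node_state and both function names are made up for illustration, and it assumes every sequence number inside the range is missing, which the real function would have to check.

```c
#include <stdint.h>

/* Hypothetical, simplified per-node state. The field names mirror the
 * variables quoted in this mail, not the real totemsrp.c structures. */
struct node_state {
    uint32_t my_aru;               /* highest seq received with no gaps */
    uint32_t my_high_seq_received; /* highest seq seen in a received mcast */
};

/* Current behaviour: the range comes from my_high_seq_received, so a lost
 * message that is the newest one on the ring is never noticed. */
uint32_t rtr_range_current(const struct node_state *n)
{
    return n->my_high_seq_received - n->my_aru;
}

/* Proposed behaviour: the range comes from the token's seq, so any gap
 * between my_aru and the token is requested for retransmission. */
uint32_t rtr_range_proposed(const struct node_state *n, uint32_t token_seq)
{
    return token_seq - n->my_aru;
}
```

With node2's state from the example (my_aru = 76, my_high_seq_received = 76) and token seq = 77, rtr_range_current() returns 0, so nothing is requested, while rtr_range_proposed() returns 1, which covers seq 77. For node1 both return 0.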
What was the reason for introducing my_high_seq_received? The original spec
does not have this variable.
Thanks
hj
On Fri, Mar 19, 2010 at 9:59 AM, Steven Dake <sdake at redhat.com> wrote:
> can you please attach the logs from the last configuration change until
> the failure?
>
> It would really help me understand the condition so i can generate a
> reproducer.
>
> Thanks
> -steve
>
>