[Openais] [Corosync] Corosync does not retransmit the lost mcast message

Sat Mar 20 12:15:26 PDT 2010

On Sat, 2010-03-20 at 12:46 -0600, hj lee wrote:
> 
> 
> On Fri, Mar 19, 2010 at 7:42 PM, Steven Dake <sdake at redhat.com> wrote:
>         On Fri, 2010-03-19 at 16:45 -0600, hj lee wrote:
>         > Hi Steve,
>         >
>         > I added and changed some log messages, so my log won't match the
>         > source tree. Anyway, I think I found the problem. This issue seems
>         > to happen easily when multicast messages are sent infrequently.
>         > The problem is that the rtr field is filled based on
>         > my_high_seq_received! It should be set based on the token->seq
>         > value.
>         >
>         
>         
>         I did notice this inconsistency and was thinking along these lines
>         too, but I wanted to see your log to see if some other events were
>         occurring related to oldring_state_save()/restore().  Another
>         possibility is some sort of misfeeding to or from the
>         regular/recovery queue. (i.e., do you have more than this
>         retransmission bug?)
> 
> If I have time, I will try to post some logs. I have some issues in
> recovery also. When corosync enters GATHER/RECOVERY mode for
> whatever reason, there are cases where it just keeps looping through
> GATHER/COMMIT/RECOVERY/OPERATIONAL again and again ... I haven't
> had the time to debug this kind of issue; right now I am just trying to
> prevent whatever errors I can first.
> 
>         
>         > Let's assume a very simple case: just one mcast message (seq 77)
>         > was lost in node2.
>         >
>         > In node1:
>         > all messages are received up to 77.
>         > token seq = 77
>         > my_aru = 77
>         > my_high_seq_received = 77
>         >
>         > in node2:
>         > message 77 was lost.
>         > my_aru = 76
>         > token seq = 77
>         > my_high_seq_received = 76
>         >
>         >
>         > Once node2 gets into this state, it does not set the rtr field for
>         > the lost message 77. Then my_aru_count keeps increasing and
>         > corosync enters "FAILED TO RECEIVE" and gather. The totem spec says
>         > clearly that if the token seq is greater than my_aru, this
>         > processor has lost some messages and should set the rtr field to
>         > request retransmission.
>         >
>         > The related code is in orf_token_rtr() at totemsrp.c.
>         >
>         > range = instance->my_high_seq_received - instance->my_aru;
>         >
>         > Above line should be changed to
>         >
>         > range = orf_token->seq - instance->my_aru;
>         >
>         
>         
>         Ya good catch.  More totem experts = win for the community ;)
>  
> Thank you very much. Anyway, this may trigger an extra retransmission of
> a message that is already on the way but hasn't been received yet. So
> yesterday I was thinking of using the token seq only if my_aru_count is
> greater than 5 or some such number.
> 

I don't believe that is the case.  The only time this problem really
occurs is when the last message was not received by a particular
processor.  In that case, range will be short by the number of trailing
messages missed in a row, triggering the fail to recv state.
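
To make the off-by-the-tail effect concrete, here is a minimal sketch in C
that replays the node2 example quoted above. It is not the totemsrp.c code:
struct node_state and the three function names are hypothetical, only the
fields my_aru and my_high_seq_received and the two range expressions come
from this thread, and sequence-number wraparound is ignored.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical, simplified view of one processor's state; the real
 * totemsrp instance holds many more fields. */
struct node_state {
	unsigned int my_aru;               /* all messages up to here received */
	unsigned int my_high_seq_received; /* highest seq actually seen */
};

/* Range as computed by the original line quoted from orf_token_rtr(). */
unsigned int range_from_high_seq(const struct node_state *n)
{
	return n->my_high_seq_received - n->my_aru;
}

/* Range as computed by the proposed change (token seq instead). */
unsigned int range_from_token_seq(const struct node_state *n,
				  unsigned int token_seq)
{
	/* Assumption of this sketch: no wraparound, so seq >= my_aru. */
	assert(token_seq >= n->my_aru);
	return token_seq - n->my_aru;
}

/* Replays the node2 example: message 77 lost, token seq = 77. */
void show_node2_example(void)
{
	struct node_state node2 = { 76, 76 };
	unsigned int token_seq = 77;

	printf("old range = %u, new range = %u\n",
	       range_from_high_seq(&node2),
	       range_from_token_seq(&node2, token_seq));
}
```

Calling show_node2_example() from a main() prints "old range = 0, new range
= 1": with the old expression node2 sees nothing to request, while the
token-based range covers the lost message 77.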

On receipt of the token, we flush the file descriptor related to mcast
recv messages to ensure the internal state of totem is up to date with
regard to every piece of information we have available.

Even if that were to happen, it only happens in recovery of lost
messages, which is rare in modern LAN environments.  So rare that this bug
has been in the totem code for 8 years undetected...

Regards
-steve