[Openais] does openais need to consider the error happens in the process of receiving a mcast message

Steven Dake sdake at mvista.com
Tue Jul 26 12:19:01 PDT 2005


These out of memory problems are currently unsolved.  Dealing with out
of memory sucks in a distributed system.  The memory pool idea would let
the administrator of the cluster define maximum limits, so that a
processor can be kicked from the configuration if it exceeds its
allocation limits.  Unfortunately this is unimplemented, but it is
something we need to fix in the future.
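
To make the limit idea concrete, here is a minimal sketch of a per-service
byte budget.  It is not the interface in mempool.c; the names bounded_pool,
bounded_pool_init, bounded_pool_alloc and bounded_pool_free are hypothetical
and exist only for this illustration.  The over_limit flag stands in for
whatever would trigger kicking the processor from the configuration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical byte budget for one service; not the mempool.c API. */
struct bounded_pool {
    size_t limit;      /* administrator-configured maximum, in bytes */
    size_t used;       /* bytes currently handed out */
    bool over_limit;   /* set once a request would exceed the limit */
};

void bounded_pool_init(struct bounded_pool *p, size_t limit)
{
    assert(p != NULL);
    p->limit = limit;
    p->used = 0;
    p->over_limit = false;
}

/* Returns NULL either when the budget would be exceeded (over_limit is
 * then set) or when malloc itself fails (over_limit is left unchanged). */
void *bounded_pool_alloc(struct bounded_pool *p, size_t size)
{
    void *m;

    assert(p != NULL);
    /* used <= limit always holds, so this subtraction cannot underflow. */
    if (size > p->limit - p->used) {
        p->over_limit = true;
        return NULL;
    }
    m = malloc(size);
    if (m == NULL) {
        return NULL;
    }
    p->used += size;
    return m;
}

/* The caller passes back the size it originally asked for. */
void bounded_pool_free(struct bounded_pool *p, void *m, size_t size)
{
    assert(p != NULL && size <= p->used);
    free(m);
    p->used -= size;
}
```

A caller would check over_limit after a failed allocation to tell "this
service went past its configured bound" apart from a plain malloc failure.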

On the positive side, the protocol itself is correct in the face of out
of memory errors, except perhaps for the condition where a timer cannot
be allocated.
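
As a rough illustration of why a dropped packet is recoverable, the sketch
below shows a receiver that ignores a message when it cannot allocate a copy,
which leaves a hole in the sequence numbers that can later be listed for
retransmission.  This is not the openais protocol code: struct receiver,
receiver_on_mcast, receiver_missing, always_fail and the small fixed MAX_SEQ
(with no sequence wraparound) are all assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define MAX_SEQ 64  /* hypothetical bound; real sequence numbers wrap */

typedef void *(*alloc_fn)(size_t);

/* Hypothetical receiver state, for illustration only. */
struct receiver {
    unsigned int aru;         /* highest seq received with no holes below it */
    unsigned int high_seen;   /* highest seq known to exist */
    void *copy[MAX_SEQ + 1];  /* stored payloads, indexed by seq (1..MAX_SEQ) */
    alloc_fn alloc;           /* injectable so a failure can be simulated */
};

/* Stand-in for malloc returning NULL under memory pressure. */
void *always_fail(size_t n)
{
    (void)n;
    return NULL;
}

void receiver_init(struct receiver *r, alloc_fn alloc)
{
    assert(r != NULL);
    memset(r, 0, sizeof(*r));
    r->alloc = alloc;
}

/* Returns true if the message is stored, false if it was dropped. */
bool receiver_on_mcast(struct receiver *r, unsigned int seq,
                       const void *data, size_t len)
{
    void *p;

    if (seq == 0 || seq > MAX_SEQ || len == 0) {
        return false;
    }
    if (seq > r->high_seen) {
        r->high_seen = seq;   /* known to exist even if we cannot store it */
    }
    if (r->copy[seq] != NULL) {
        return true;          /* duplicate or retransmit we already hold */
    }
    p = r->alloc(len);
    if (p == NULL) {
        return false;         /* out of memory: ignore it, the hole remains */
    }
    memcpy(p, data, len);
    r->copy[seq] = p;
    while (r->aru < r->high_seen && r->copy[r->aru + 1] != NULL) {
        r->aru++;             /* advance past every contiguous message */
    }
    return true;
}

/* Writes up to max missing sequence numbers to out; returns the count. */
size_t receiver_missing(const struct receiver *r, unsigned int *out, size_t max)
{
    size_t n = 0;
    unsigned int s;

    for (s = r->aru + 1; s <= r->high_seen && n < max; s++) {
        if (r->copy[s] == NULL) {
            out[n++] = s;
        }
    }
    return n;
}
```

If message 2 is dropped because the allocation fails while 1 and 3 arrive,
aru stays at 1 and receiver_missing reports 2.  Once 2 is retransmitted and
stored, aru moves to 3.  Note that this only covers receiving the packet;
a failure inside a service handler after delivery is a separate problem.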

regards
-steve

On Tue, 2005-07-26 at 11:26 +0800, Li Huanghai wrote:
> 
> 
> >     I didn't explain my problem clearly.
> > For example,
> >     in evt.c, when evt_remote_evt is called after receiving the mcast message,
> > if make_local_event fails (because of malloc), the event will be lost.
> >     This may not be very serious. But in evt_remote_chan_op, if create_channel
> > fails, the cluster will be in an inconsistent state about the channel information
> > and can never recover.
> >
> >     Also in ckpt.c, there are some message_handler_req_exec_ckpt_<...> functions which need
> > to malloc memory; if malloc fails on one node, the cluster will be in an inconsistent state.
> >
> >
> >     Another is poll_timer_add: if it fails, can the cluster still work correctly?
> >
> >     The same goes for libais_send_response. If the send fails, the lib will wait forever
> > if the ais function didn't take a timeout parameter.
> >
> >     I've seen mempool; it is great and it works well in amf. But ckpt and event may need
> > a large amount of memory; is it reasonable to give them a configured bound? If not, and
> > mempool_realloc is used, there is still a chance that these problems, which make the cluster
> > inconsistent, will happen.
> > 
> > 
> >  Regards
> > 
> > 
> > 
> > 
> > ----- Original Message ----- 
> > From: "Steven Dake" <sdake at mvista.com>
> > To: "Mark Haverkamp" <markh at osdl.org>
> > Cc: "Li Huanghai" <hhli at mail.ustc.edu.cn>; "Openais List" <openais at lists.osdl.org>
> > Sent: Tuesday, July 26, 2005 1:30 AM
> > Subject: Re: [Openais] does openais need to consider the error happens in the process of receiving a mcast message
> > 
> > 
> > > On Mon, 2005-07-25 at 09:32 -0700, Mark Haverkamp wrote:
> > > > On Mon, 2005-07-25 at 13:01 +0800, Li Huanghai wrote:
> > > > > Hi,
> > > > >     I am puzzled by openais's exception handling.
> > > > > When a node sends a message to all nodes, it doesn't
> > > > > wait for the other nodes to report whether they
> > > > > handled the message correctly. That means once
> > > > > a node fails to handle the message, most commonly because of a malloc
> > > > > error, the other nodes won't find out and will consider it handled.
> > > > > Then the cluster is in an inconsistent state and the following
> > > > > operations will produce wrong results that the application considers
> > > > > correct. This is a big problem for high-availability
> > > > > software.
> > > > > 
> > > > >     How should this problem be handled? Can it be ignored? If not,
> > > > > how should we deal with it? Does it need a rollback policy to keep
> > > > > all nodes in a consistent state?
> > > > > 
> > > > 
> > > > The protocol keeps track of messages by sequence number.  If a message
> > > > can't be received for some reason, the protocol will notice that it has
> > > > a missing message and request that the missing message be retransmitted.
> > > > In a way the protocol does wait for the nodes' responses, because the token
> > > > carries the highest sequence number received with no sequence holes
> > > > below it, along with a list of message sequence
> > > > numbers that need to be retransmitted because someone hasn't received
> > > > them yet.
> > > > 
> > > 
> > > So Mark is correct; the protocol handles out of memory errors by
> > > "ignoring" the packet.  Then the packet is resent.  It is important to
> > > note that poll and other system calls may have problems in low memory
> > > situations, in which case the affected processor is likely to be bounced
> > > from the configuration.
> > > 
> > > The services themselves (such as checkpoint) are not very tolerant of
> > > out of memory errors.  I had planned to solve this problem through the
> > > use of memory pools, but it as yet remains unimplemented.
> > > 
> > > look at mempool.c/.h for more details
> > > 
> > > regards
> > > 
> > > > 
> > > > _______________________________________________
> > > > Openais mailing list
> > > > Openais at lists.osdl.org
> > > > https://lists.osdl.org/mailman/listinfo/openais
> > > 
> > >
> 



