[Openais] Problems with OpenAis version 0.81

Steven Dake sdake at redhat.com
Thu Sep 13 16:36:04 PDT 2007


On Fri, 2007-09-14 at 01:53 +0300, Cohen-Sason Daniel-BDC021 wrote:
> Hi.
> 
> Please see my comments below.
> 
>  
> 
> Daniel
> 
>  
> 
> -----Original Message-----
> From: Steven Dake [mailto:sdake at redhat.com] 
> Sent: Friday, September 14, 2007 1:23 AM
> To: Cohen-Sason Daniel-BDC021
> Cc: openais at lists.osdl.org
> Subject: RE: [Openais] Problems with OpenAis version 0.81
> 
>  
> 
> On Fri, 2007-09-14 at 00:51 +0300, Cohen-Sason Daniel-BDC021 wrote:
> 
> > Hi Steve,
> 
> > 
> 
> > Thanks again for your time,
> 
> > 
> 
> > Please see my comments below.
> 
> > 
> 
> > Daniel
> 
> > 
> 
> > On Thu, 2007-09-13 at 10:15 +0300, Cohen-Sason Daniel-BDC021 wrote:
> 
> > > Thanks Steve!
> 
> > > It seems to be working! (The amf-example1)
> 
> > >
> 
> > > Now we need to upgrade our code to support AMF_B.
> 
> > >
> 
> > > My question is:
> 
> > > Assume there are 2 components on each node. Both components should
> be
> 
> > > running and functioning in order for a node to be considered
> "good".
> 
> > > All 4 components (lets call them: A1, B1, A2, B2) can be active at
> the
> 
> > > same time, and when one of the components breaks, its
> correspondent
> 
> > node
> 
> > > should give up its SU.
> 
> > >
> 
> > > When we used AMF_A, we registered to each component from each
> node.
> 
> > The
> 
> > > component name was declared in the groups.conf file as a simple
> 
> > string.
> 
> > > I understand that in AMF_B, it's a bit different. The full "path"
> of a
> 
> > > component (including its SG, SU, APP, COMP) is the "name" of that
> 
> > > component - Is it correct? Should my code now register to the
> "full"
> 
> > > path?
> 
> > 
> 
> > No - register with just the component name identified in the
> 
> > configuration file, not the DN name.  When the CSI is assigned it
> will
> 
> > get a DN of the full sg/su/csi.
> 
> > According to the example, the AMF application should get the name of
> the
> 
> > component by running "saAmfComponentNameGet()" and use it with
> 
> > "saAmfComponentRegister()". So, how can an application register to 2
> 
> > components, one as its primary ownership and one as a standby? I've
> also
> 
> > noticed that saAmfComponentNameGet() returns
> 
> > "safComp=A,safSu=SERVICE_X_2,safSg=RAID,safApp=APP-1" which is the
> DN.
> 
> > So basically the registration to saAmfComponentRegister() is with
> the
> 
> > DN. Is it Correct?
> 
> > 
> 
>  
> 
> Let me explain with an example:
> 
>  
> 
> looking at the amf.conf we see:
> 
>              safComp = A {
> 
>                                         saAmfCompCategory=sa_aware
> 
>  
> 
>                                         saAmfCompCapability=x_active_or_y_standby
> 
>  
> 
> the code to register this component in testamf1.c is:
> 
>                 result = saAmfComponentRegister (handle, &compNameGlobal, NULL);
> 
>  
> 
> where compNameGlobal is set from the saAmfComponentNameGet api:
> 
>                 result = saAmfComponentNameGet (handle, &compNameGlobal);
> 
>  
> 
>  
> 
> the saAmfComponentNameGet api uses an environment variable passed in
> the
> 
> forked environment to the clc_cli_script program.
> 
>  
> 
> So essentially if you use the above apis you shouldn't have to be
> 
> concerned with what the component is named.  It will be registered
> based
> 
> upon what the config file instructs.
> 
>  
> 
> [Daniel] Ok, this is clear, but what if there's only one instance of
> my application on each node (the "List" application), and this
> instance manages 2 components? (Two lists) Those components are
> activated and deactivated according to the state it receives from the
> IPC...
> 
> In other words, how can one instantiation trigger registration of two
> components when the application is instantiated only once? Should my app be
> clever enough to understand the component names? Should it have 1
> handler and register twice?
> 

So I assume you want to instantiate this "list" component twice because
it needs to handle two workloads.

You have two options.

1) You can instantiate the component once with two SIs.  When the
component is instantiated the CSICallback will be called once for each
SI, giving you two separate workloads that the one component will manage.
Individual attributes (key/value pairs) can be passed to each separate
SI for the component.

2) You can list the component twice in the configuration file and
instantiate one SI into each component on each of the nodes.

I would recommend against having two component registers for the same
AMF component.  The purpose of the AMF is to ensure that write-once
components can handle any workload configured by the deployment team.
So it should be possible for one component to be written to handle a
certain number of workloads.
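
To make option 1 concrete, here is a minimal sketch in C of one component
keeping a table of workloads, updated once per CSI assignment.  Every name in
it (struct component, on_csi_set, workload_state, MAX_WORKLOADS) is
hypothetical and chosen for illustration; it deliberately does not include
saAmf.h or use the real callback signature.  In a real component the body of
on_csi_set would be driven from the CSI set callback registered with the AMF
library.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical types for illustration only; not the SA Forum AMF API. */
#define MAX_WORKLOADS 4   /* assumed limit on workloads per component */
#define NAME_LEN 64

enum ha_state { HA_NONE, HA_ACTIVE, HA_STANDBY };

struct workload {
    char csi_name[NAME_LEN];
    enum ha_state state;
};

struct component {
    struct workload workloads[MAX_WORKLOADS];
    int count;
};

/*
 * Record an HA state assignment for one CSI.  Called once per assigned
 * workload, so a single registered component ends up managing several.
 * Returns 0 on success, -1 if the table is full or the name is too long.
 */
static int on_csi_set(struct component *c, const char *csi_name,
                      enum ha_state state)
{
    int i;

    assert(c != NULL && csi_name != NULL);

    /* Known workload: only its HA state changes. */
    for (i = 0; i < c->count; i++) {
        if (strcmp(c->workloads[i].csi_name, csi_name) == 0) {
            c->workloads[i].state = state;
            return 0;
        }
    }

    /* New workload: add it if there is room. */
    if (c->count == MAX_WORKLOADS || strlen(csi_name) >= NAME_LEN)
        return -1;
    strcpy(c->workloads[c->count].csi_name, csi_name);
    c->workloads[c->count].state = state;
    c->count++;
    return 0;
}

/* Look up the current HA state of a workload; HA_NONE if unassigned. */
static enum ha_state workload_state(const struct component *c,
                                    const char *csi_name)
{
    int i;

    assert(c != NULL && csi_name != NULL);
    for (i = 0; i < c->count; i++) {
        if (strcmp(c->workloads[i].csi_name, csi_name) == 0)
            return c->workloads[i].state;
    }
    return HA_NONE;
}
```

With a zero-initialized struct component, calling on_csi_set once for "list1"
and once for "list2" leaves one component holding two workloads, and a later
call for "list2" only changes that workload's state.  The component registers
once and never needs to know how many workloads the deployment configures.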

I believe the SA Forum has intended for third parties to write SA Forum
AMF Components that are reusable by other projects.  They would include
in their component some functionality and limits on behavior (this
component could handle 4 workloads maximum on a 2 GHz machine, etc).  But
this work is not yet complete by the SA Forum.

>  
> 
>  
> 
> > >
> 
> > > My second question is: Does Aisexec is the one that *must*
> instantiate
> 
> > > our A and B services? This is want I understand from the provided
> 
> > > example... or maybe this is just an example?
> 
> > >
> 
> > 
> 
> > yes aisexec forks and execs the AMF applications and runs them.
> There
> 
> > is no other way for the AMF apps to be started in the latest AMF
> code.
> 
> > This is actually required of the AMF B specifications but wasn't
> defined
> 
> > too well in the A specifications.
> 
> > 
> 
> > 
> 
> > 
> 
> > So, does it mean that our applications will no longer be Linux
> services,
> 
> > and the ais will instantiate them on both nodes? What about
> re-invoking
> 
> > a service that was crashes? How does ais handle it? (if at all...)
> 
> > 
> 
>  
> 
> Yes openais's AMF (or any other AMF that is compliant with the
> 
> specification) will instantiate the components on the nodes for which
> 
> they are configured.
> 
>  
> 
> AMF B specs describe how a service is recovered.  A component in an SU
> is restarted a number of times.  If it is restarted a certain number
> of times within a timeout, the recovery algorithm escalates to
> restarting the entire SU.  If the entire SU fails a certain
> configurable number of times within a specific time period, the entire
> SU is failed over and any active CSIs assigned to the component are then
> assigned to the standby.
> 
>  
> 
> page 141 of the AMF B.02.01 specs.
> 
>  
> 
> [Daniel] This is great!
> 
>  
> 
> > 
> 
> > > Does the explanation "run the example on a cluster with 2
> nodes" (from
> 
> > > README.AMF) describe our system?
> 
> > >
> 
> > 
> 
> > mostly except the example is designed to run one CSI as active and
> one
> 
> > CSI as standby.
> 
> > 
> 
> > > Can you try to provide us a sample configuration for the example
> 
> > above?
> 
> > >
> 
> > 
> 
> > Well first the best way to come up with a config is to define what
> you
> 
> > want.
> 
> > 
> 
> > First it sounds like you want 2n (which is modeled in openais via
> the n
> 
> > +m model).  
> 
> > 
> 
> > Correct!
> 
> > 
> 
> > Second it sounds like you have one SG with two redundant SUs
> > on separate nodes.  You want both those SUs to have 1 active CSI
> > (component service instance) for each of the two components of the
> > SU with no standby CSI in the system?
> 
> > 
> 
> > Is that correct?
> 
> > 
> 
> > 
> 
> > 
> 
> > I think not. In AMF_A we had 2 groups. Each group with 2 SU. Each SU
> 
> > with 2 Components.
> 
> > 
> 
> > Are you familiar with AMF_A? I attached the groups.conf file we are
> 
> > using. I hope it will help you to understand our system.
> 
> > 
> 
> > In that file "hp1.motorola" and "hp2.motorola" are the two nodes,
> while
> 
> > "list1/list2" and "monitor1/monitor2" are our components (which are
> 
> > instances of LIST and MONITOR applications).
> 
> > 
> 
> > The optimal scenario is when "list1" and "monitor1" are running on
> hp1,
> 
> > while "list2" and "monitor2" are running on hp2.
> 
> > 
> 
> > If "list1" or "monitor1" reports a problem, or deactivated for some
> 
> > reason. hp2 takes control over "list1" and activate it.
> 
> > 
> 
> > This is true also for the opposite direction.
> 
> > 
> 
> >  
> 
>  
> 
> It appears to me you want active/active or a n+m model with n = 2 and
> m
> 
> = 0.  This can be set in the configuration file:
> 
>                         saAmfSGNumPrefActiveSUs=1
> 
> which specifies how many active SUs should be in the system by
> 
> preference:
> 
>  
> 
> and
> 
>                         saAmfSGNumPrefStandbySUs=1
> 
> which specifies how many standby SUs should be in the system by
> 
> preference:
> 
> RestartProb is the number of milliseconds in which the component or su
> 
> is restarted before escalating to the next escalation level:
> 
>  
> 
> For example:
> 
>                        saAmfSGRedundancyModel=nplusm
> 
>                         saAmfSGNumPrefActiveSUs=1
> 
>                         saAmfSGMaxActiveSIsperSUs=2
> 
>                         saAmfSGNumPrefStandbySUs=1
> 
>                         saAmfSGMaxStandbySIsperSUs=2
> 
>                         saAmfSGCompRestartProb=100000
> 
>                         saAmfSGCompRestartMax=1
> 
>                         saAmfSGSuRestartProb=20000
> 
>                         saAmfSGSuRestartMax=1
> 
>  
> 
> says
>
> I want 1 active SU, 1 standby SU, 2 SIs per SU in the active state, and 2
> SIs per SU in the standby state.  A component should be restarted 1 time
> (SGCompRestartMax); if it fails more than that maximum within the last
> 100000 milliseconds, recovery escalates to the next level.  This triggers a
> restart of the SU, which in turn escalates to failover if the service unit
> is restarted more than 1 time within 20000 milliseconds.
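
The escalation described above can be sketched as a small decision function
in C.  This is only an illustration under stated assumptions: the type and
function names are hypothetical, the fields merely mirror the
saAmfSGCompRestartMax/Prob and saAmfSGSuRestartMax/Prob settings, and the
exact way restarts are counted within the probation window is a guess, not
taken from the openais source.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names for illustration; this is not openais source code. */
enum recovery_action { RESTART_COMPONENT, RESTART_SU, FAILOVER_SU };

struct escalation_policy {
    unsigned comp_restart_max;          /* cf. saAmfSGCompRestartMax  */
    unsigned long comp_restart_prob_ms; /* cf. saAmfSGCompRestartProb */
    unsigned su_restart_max;            /* cf. saAmfSGSuRestartMax    */
    unsigned long su_restart_prob_ms;   /* cf. saAmfSGSuRestartProb   */
};

struct escalation_state {
    unsigned comp_restarts;             /* restarts in current window */
    unsigned long comp_window_start_ms;
    unsigned su_restarts;
    unsigned long su_window_start_ms;
};

/*
 * Try to spend one restart at a given level.  Returns 1 if the restart is
 * allowed, 0 if the budget for the current probation window is used up.
 * Assumption: restarts older than the probation time are forgotten.
 */
static int take_restart(unsigned *count, unsigned long *window_start_ms,
                        unsigned max, unsigned long prob_ms,
                        unsigned long now_ms)
{
    if (*count > 0 && now_ms - *window_start_ms > prob_ms)
        *count = 0;                     /* window expired */
    if (*count >= max)
        return 0;
    if (*count == 0)
        *window_start_ms = now_ms;      /* first restart opens the window */
    (*count)++;
    return 1;
}

/* Decide what to do when a component fails at time now_ms. */
static enum recovery_action
on_component_failure(const struct escalation_policy *p,
                     struct escalation_state *s, unsigned long now_ms)
{
    assert(p != NULL && s != NULL);

    if (take_restart(&s->comp_restarts, &s->comp_window_start_ms,
                     p->comp_restart_max, p->comp_restart_prob_ms, now_ms))
        return RESTART_COMPONENT;

    /* Component budget exhausted: escalate and start its count afresh. */
    s->comp_restarts = 0;
    if (take_restart(&s->su_restarts, &s->su_window_start_ms,
                     p->su_restart_max, p->su_restart_prob_ms, now_ms))
        return RESTART_SU;

    return FAILOVER_SU;
}
```

With the values from the example configuration (1, 100000, 1, 20000) and a
component that fails at 0, 1000, 2000 and 3000 milliseconds, this sketch
chooses a component restart, then an SU restart, then a component restart,
and finally a failover of the SU.
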
> 
>  
> 
> [Daniel] Thanks for this. Let me see if I understood the structure:
> 
>  
> 
>       Cluster
> 
>             APP
> 
>                   SG1
> 
>                         SU1
> 
>                               COMP-A
> 
>                               COMP-B
> 
>                         SU2
> 
>                               COMP-A
> 
>                               COMP-B
> 
>                   SG2
> 
>                         SU1
> 
>                               COMP-A
> 
>                               COMP-B
> 
>                         SU2
> 
>                               COMP-A
> 
>                               COMP-B
> 
>                   SI-WL1
> 
>                         WL1-1
> 
>                         WL1-2
> 
>                   SI-WL2
> 
>                         WL2-1
> 
>                         WL2-2
> 
>                   CS-TYPE = A
> 
>                   CS-TYPE = B
> 
>  
> 
> Is this correct?
> 
This seems correct although you may have to have unique name spaces for
the purposes of passing in attributes to the components if the
components A and B on the different nodes require different CSI
Attributes.  I can't recall what the current AMF does in this regard.

>  
> 
>  A failure is detected when the application disconnects from the IPC
> 
> connection, or the node fails, or a healthcheck does not respond in
> the
> 
> allocated time period for the healthcheck.
> 
>  
> 
> the safSi directive specifies the amf components that should be
> started
> 
> within the CSI.  the CSIAttr parameter specifies which attributes
> should
> 
> be passed into the activated component service instance.
> 
>  
> 
> As can be seen AMF B is quite a bit more complex than AMF A.  You may
> 
> have to change your app a bit since it appears monitor was used to
> 
> monitor if the application failed.
> 
>  
> 
> [Daniel] In our case, "Monitor" is just a monitor who monitors
> addresses and assigns them when needed; it does not monitor the "LIST"
> service... 
> 
I see

regards
-steve
>  
> 
> Regards
> 
> -steve
> 
>  
> 
> > 
> 
> > > Thank you very much.
> 
> > >
> 
> > > Daniel.
> 
> > >
> 
> > > P.S
> 
> > > Is there a document with detailed explanation on the AMF.CONF
> file? I
> 
> > > couldn't find it at the web.
> 
> > >
> 
> > 
> 
> > Take a look at the AMF information model in saiOverview.B0301.pdf.
> We
> 
> > pretty closely match that model section 5.5 page 76.
> 
> > 
> 
> > The AMF document itself explains what most of the variables do.
> 
> > 
> 
> > There is also the not so complete amf.conf man page.
> 
> > 
> 
> > Regards
> 
> > -steve
> 
> > 
> 
> > 
> 
> > > Thanks again, Daniel
> 
> > >
> 
> > >
> 
> > >
> 
> > > -----Original Message-----
> 
> > > From: Steven Dake [mailto:sdake at redhat.com]
> 
> > > Sent: Thursday, September 13, 2007 4:37 AM
> 
> > > To: Cohen-Sason Daniel-BDC021
> 
> > > Cc: openais at lists.osdl.org
> 
> > > Subject: RE: [Openais] Problems with OpenAis version 0.81
> 
> > >
> 
> > > few extra notes:
> 
> > >
> 
> > > make sure to change your hostname in the config file back to
> CENTOS
> 
> > >
> 
> > > Whatever the command "hostname" returns is what you want in that
> field
> 
> > >
> 
> > > Regards
> 
> > > -steve
> 
> > >
> 
> > > On Wed, 2007-09-12 at 18:29 -0700, Steven Dake wrote:
> 
> > > > Daniel,
> 
> > > >
> 
> > > > Try this amf.conf file and clc_cli_script.
> 
> > > >
> 
> > > > Essentially the healthcheck keys for the default configuration
> file
> 
> > > are
> 
> > > > invalid and also the path to clc_cli_script do not execute
> 
> > > > openais-instantiate.
> 
> > > >
> 
> > > > With these files 0.81 instantiates the components for me and
> works
> 
> > as
> 
> > > > expected.
> 
> > > >
> 
> > > > Regards
> 
> > > > -steve
> 
> > > >
> 
> > > > On Thu, 2007-09-13 at 00:50 +0300, Cohen-Sason Daniel-BDC021
> wrote:
> 
> > > > > Hi Steve.
> 
> > > > >
> 
> > > > > Thanks for the quick response!
> 
> > > > > I'm sorry, but I forgot to attach the log I generated.
> 
> > > > >
> 
> > > > > My hostname is CENTOS, and I set this name at the
> saAmfNodeClmNode
> 
> > > > > directive (Should it be at the safAmfNode?)
> 
> > > > >
> 
> > > > > I'm also logged in as root with root group.
> 
> > > > >
> 
> > > > > I verified and all the 3 files are executables with 755.
> 
> > > > >
> 
> > > > > I tried to change the timeouts to 5000 and it didn't help.
> 
> > > > >
> 
> > > > > Attached please find the fixed configuration files and the
> log.
> 
> > > > >
> 
> > > > > Hope this will be help you to find out the problem.
> 
> > > > >
> 
> > > > > Thanks,
> 
> > > > >
> 
> > > > > Daniel.
> 
> > > > >
> 
> > > > >
> 
> > > > > -----Original Message-----
> 
> > > > > From: Steven Dake [mailto:sdake at redhat.com]
> 
> > > > > Sent: Wednesday, September 12, 2007 9:14 PM
> 
> > > > > To: Cohen-Sason Daniel-BDC021
> 
> > > > > Cc: openais at lists.osdl.org
> 
> > > > > Subject: Re: [Openais] Problems with OpenAis version 0.81
> 
> > > > >
> 
> > > > > Without logs it is difficult to tell what went wrong.
> 
> > > > >
> 
> > > > > I would verify the following things:
> 
> > > > >
> 
> > > > > > You need a node that has name service resolution to CENTOS.  That is
> > > required
> 
> > > > > by the "safAmfNode = AMF1" directive.  If this doesn't match,
> 
> > > testamf1
> 
> > > > > will never be started.  Put some kind of log output in
> testamf1 to
> 
> > > > > verify it is actually started by AMF.  If it isn't started,
> this
> 
> > is
> 
> > > > > likely the cause of the problem.
> 
> > > > >
> 
> > > > > verify:
> 
> > > > > you have mode 755 clc_cli_script in /tmp/aisexample
> 
> > > > > you have mode 755 openais-instantiate in /tmp/aisexample
> 
> > > > > you have mode 755 testamf1 in /tmp/aisexample
> 
> > > > >
> 
> > > > > I have noticed sometimes the default time values shipped in
> 
> > amf.conf
> 
> > > do
> 
> > > > > not work very well with some applications resulting in false
> 
> > > positive
> 
> > > > > failure detections.  Try changing them as follows:
> 
> > > > >
> 
> > > > >                              saAmfCompDefaultClcCliTimeout =
> 5000
> 
> > > > >                              saAmfCompDefaultCallbackTimeOut =
> 
> > 5000
> 
> > >
> 
> > > > >                              saAmfHealthcheckPeriod = 5000
> 
> > > > >
> 
> > > > > please send the log file generated by AIS.
> 
> > > > >
> 
> > > > > Regards
> 
> > > > > -steve
> 
> > > > >
> 
> > > > > On Wed, 2007-09-12 at 08:45 +0300, Cohen-Sason Daniel-BDC021
> 
> > wrote:
> 
> > > > > > Hello
> 
> > > > > >
> 
> > > > > >
> 
> > > > > > 
> 
> > > > > >
> 
> > > > > > We, at Motorola, are trying to upgrade to Ver 0.81 from Ver
> 
> > 0.70.
> 
> > > > > >
> 
> > > > > > 
> 
> > > > > >
> 
> > > > > > The reason why we do it is because we encounter an unstable
> 
> > > behavior
> 
> > > > > > with 0.70, and hope that 0.81 will be better.
> 
> > > > > >
> 
> > > > > > 
> 
> > > > > >
> 
> > > > > > I first started by trying to run the "amfexample1", but for
> some
> 
> > > > > > reason, the readiness states of the components are always
> 
> > becoming
> 
> > > > > > OUT_OF_SERVICE (after I run: ./aisexec -f).
> 
> > > > > >
> 
> > > > > > 
> 
> > > > > >
> 
> > > > > > I followed the steps which are described in README.AMF.
> 
> > > > > >
> 
> > > > > > 
> 
> > > > > >
> 
> > > > > > I also noticed the bindnetaddr should be set, or the
> 
> > instantiation
> 
> > > > > > won't even start.
> 
> > > > > >
> 
> > > > > > 
> 
> > > > > >
> 
> > > > > > Attached please find the log and the configuration files
> 
> > > > > > from /tmp/aisexample folder.
> 
> > > > > >
> 
> > > > > > 
> 
> > > > > >
> 
> > > > > > I really hope you can help us.
> 
> > > > > >
> 
> > > > > > 
> 
> > > > > >
> 
> > > > > > Please advice,
> 
> > > > > >
> 
> > > > > > 
> 
> > > > > >
> 
> > > > > > Daniel
> 
> > > > > >
> 
> > > > > > ________________
> 
> > > > > > Daniel Cohen-Sason
> 
> > > > > > MCIL, Design Center, Public Safety.
> 
> > > > > > Multi-Net-Mobility (MNM).
> 
> > > > > > Office:  +972-3-5658548
> 
> > > > > > Private: +972-57-5658548 (8548)
> 
> > > > > >
> 
> > > > > > Daniel.Cohen at motorola.com
> 
> > > > > > "The significant problems we have cannot be solved at the
> same
> 
> > > level
> 
> > > > > > of thinking with which we created them." [Albert Einstein]
> 
> > > > > >
> 
> > > > > >
> 
> > > > > > 
> 
> > > > > >
> 
> > > > > >
> 
> > > > > > _______________________________________________
> 
> > > > > > Openais mailing list
> 
> > > > > > Openais at lists.linux-foundation.org
> 
> > > > > > https://lists.linux-foundation.org/mailman/listinfo/openais
> 
> > > > >
> 
> > > > _______________________________________________
> 
> > > > Openais mailing list
> 
> > > > Openais at lists.linux-foundation.org
> 
> > > > https://lists.linux-foundation.org/mailman/listinfo/openais
> 
> > >
> 
> > 
> 
>  
> 
> 


