[Openais] Problems with OpenAis version 0.81

Cohen-Sason Daniel-BDC021 Daniel.Cohen at motorola.com
Thu Sep 13 15:53:58 PDT 2007


Hi.

Please see my comments below.

 

Daniel

 

-----Original Message-----
From: Steven Dake [mailto:sdake at redhat.com] 
Sent: Friday, September 14, 2007 1:23 AM
To: Cohen-Sason Daniel-BDC021
Cc: openais at lists.osdl.org
Subject: RE: [Openais] Problems with OpenAis version 0.81

 

On Fri, 2007-09-14 at 00:51 +0300, Cohen-Sason Daniel-BDC021 wrote:

> Hi Steve,

> 

> Thanks again for your time,

> 

> Please see my comments below.

> 

> Daniel

> 

> On Thu, 2007-09-13 at 10:15 +0300, Cohen-Sason Daniel-BDC021 wrote:

> > Thanks Steve!

> > It seems to be working! (The amf-example1)

> >

> > Now we need to upgrade our code to support AMF_B.

> >

> > My question is:

> > Assume there are 2 components on each node. Both components should be

> > running and functioning in order for a node to be considered "good".

> > All 4 components (let's call them: A1, B1, A2, B2) can be active at the

> > same time, and when one of the components breaks, its corresponding

> > node should give up its SU.

> >

> > When we used AMF_A, we registered to each component from each node. The

> > component name was declared in the groups.conf file as a simple string.

> > I understand that in AMF_B, it's a bit different. The full "path" of a

> > component (including its SG, SU, APP, COMP) is the "name" of that

> > component - is that correct? Should my code now register to the "full"

> > path?

> 

> No - register with just the component name identified in the

> configuration file, not the DN name.  When the CSI is assigned it will

> get a DN of the full sg/su/csi.

> According to the example, the AMF application should get the name of the

> component by running "saAmfComponentNameGet()" and use it with

> "saAmfComponentRegister()". So, how can an application register to 2

> components, one as its primary ownership and one as a standby? I've also

> noticed that saAmfComponentNameGet() returns

> "safComp=A,safSu=SERVICE_X_2,safSg=RAID,safApp=APP-1" which is the DN.

> So basically the registration with saAmfComponentRegister() uses the

> DN. Is that correct?

> 

 

Let me explain with an example:

 

looking at the amf.conf we see:

             safComp = A {

                                        saAmfCompCategory=sa_aware

                                        saAmfCompCapability=x_active_or_y_standby

 

the code to register this component in testamf1.c is:

                result = saAmfComponentRegister (handle, &compNameGlobal, NULL);

 

where compNameGlobal is set from the saAmfComponentNameGet api:

                result = saAmfComponentNameGet (handle, &compNameGlobal);

 

 

the saAmfComponentNameGet api uses an environment variable passed in the

forked environment to the clc_cli_script program.

 

So essentially if you use the above apis you shouldn't have to be

concerned with what the component is named.  It will be registered based

upon what the config file instructs.
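
As a rough illustration of the DN format quoted earlier in this thread ("safComp=A,safSu=SERVICE_X_2,safSg=RAID,safApp=APP-1"), here is a minimal sketch in C that pulls one RDN value out of such a string, for example for logging which component a process was started as. The helper name dn_get_rdn is hypothetical and is not part of the AMF API. It assumes plain comma-separated key=value pairs with no escaped commas.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not an AMF API call): copy the value of RDN `key`
 * (e.g. "safComp") from `dn` into `out`.
 * Assumes comma-separated key=value pairs with no escaped commas.
 * Returns 0 on success, -1 if the key is absent or `out` is too small.
 */
int dn_get_rdn(const char *dn, const char *key, char *out, size_t out_len)
{
    assert(dn != NULL && key != NULL && out != NULL);

    size_t key_len = strlen(key);
    const char *p = dn;

    while (*p != '\0') {
        /* Find the end of the current "key=value" pair. */
        const char *end = strchr(p, ',');
        if (end == NULL)
            end = p + strlen(p);

        /* Match only when the whole key is followed directly by '='. */
        if ((size_t)(end - p) > key_len &&
            strncmp(p, key, key_len) == 0 && p[key_len] == '=') {
            size_t val_len = (size_t)(end - p) - key_len - 1;
            if (val_len + 1 > out_len)
                return -1;
            memcpy(out, p + key_len + 1, val_len);
            out[val_len] = '\0';
            return 0;
        }

        /* Skip the comma, if any, and move on to the next pair. */
        p = (*end == ',') ? end + 1 : end;
    }
    return -1;
}
```

With the DN above, asking for "safSu" would yield "SERVICE_X_2". Registration itself still uses the name exactly as returned by saAmfComponentNameGet; this helper is only for inspecting it.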

 

[Daniel] OK, this is clear, but what if there's only one instance of my
application on each node (the "List" application), and this instance
manages 2 components (two lists)? Those components are activated and
deactivated according to the state it receives from the IPC...

In other words, how can one instantiation trigger the registration of two
components when the application is instantiated only once? Should my app be
clever enough to understand the component names? Should it have 1
handler and register twice?

 

 

> >

> > My second question is: Is aisexec the one that *must* instantiate

> > our A and B services? This is what I understand from the provided

> > example... or maybe this is just an example?

> >

> 

> Yes, aisexec forks and execs the AMF applications and runs them.  There

> is no other way for the AMF apps to be started in the latest AMF code.

> This is actually required by the AMF B specifications but wasn't defined

> too well in the A specifications.

> 

> 

> 

> So, does it mean that our applications will no longer be Linux services,

> and that ais will instantiate them on both nodes? What about re-invoking

> a service that has crashed? How does ais handle it (if at all)?

> 

 

Yes openais's AMF (or any other AMF that is compliant with the

specification) will instantiate the components on the nodes for which

they are configured.

 

The AMF B specs describe how a service is recovered.  A component in an SU

is restarted a number of times.  If it is restarted more than a certain

number of times within a timeout, the recovery algorithm escalates to

restarting the entire SU.  If the entire SU fails more than a

configurable number of times within a specific time period, the entire

SU is failed over and any active CSIs assigned to its components are then

assigned to the standby.

 

page 141 of the AMF B.02.01 specs.

 

[Daniel] This is great!

 

> 

> > Does the explanation "run the example on a cluster with 2 nodes" (from

> > README.AMF) describe our system?

> >

> 

> mostly except the example is designed to run one CSI as active and one

> CSI as standby.

> 

> > Can you try to provide us a sample configuration for the example above?

> >

> 

> Well first the best way to come up with a config is to define what you

> want.

> 

> First it sounds like you want 2n (which is modeled in openais via the

> n+m model).

> 

> Correct!

> 

> Second it sounds like you have one SG with two redundant SUs

> on separate nodes.  You want both those SUs to have 1 active CSI

> (component service instance) for each of the two components of the

> SU with no standby CSI in the system?

> 

> Is that correct?

> 

> 

> 

> I think not. In AMF_A we had 2 groups. Each group with 2 SU. Each SU

> with 2 Components.

> 

> Are you familiar with AMF_A? I attached the groups.conf file we are

> using. I hope it will help you to understand our system.

> 

> In that file "hp1.motorola" and "hp2.motorola" are the two nodes, while

> "list1/list2" and "monitor1/monitor2" are our components (which are

> instances of LIST and MONITOR applications).

> 

> The optimal scenario is when "list1" and "monitor1" are running on hp1,

> while "list2" and "monitor2" are running on hp2.

> 

> If "list1" or "monitor1" reports a problem, or is deactivated for some

> reason, hp2 takes control of "list1" and activates it.

> 

> This is true also for the opposite direction.

> 

>  

 

It appears to me you want active/active or an n+m model with n = 2 and m

= 0.  This can be set in the configuration file:

                        saAmfSGNumPrefActiveSUs=1

which specifies how many active SUs should be in the system by

preference,

 

and

                        saAmfSGNumPrefStandbySUs=1

which specifies how many standby SUs should be in the system by

preference.

RestartProb is the time window, in milliseconds, within which restarts of

the component or SU are counted before escalating to the next escalation

level.

 

For example:

                        saAmfSGRedundancyModel=nplusm

                        saAmfSGNumPrefActiveSUs=1

                        saAmfSGMaxActiveSIsperSUs=2

                        saAmfSGNumPrefStandbySUs=1

                        saAmfSGMaxStandbySIsperSUs=2

                        saAmfSGCompRestartProb=100000

                        saAmfSGCompRestartMax=1

                        saAmfSGSuRestartProb=20000

                        saAmfSGSuRestartMax=1

 

says:

I want 1 active SU, 1 standby SU, 2 SIs per SU in the active state, and 2

SIs per SU in the standby state.  A component should be restarted 1 time

(SGCompRestartMax), and if it fails more than the maximum in the last

100000 milliseconds, recovery escalates to the next level.  This triggers a

restart of the SU, which will trigger an escalation to failover if the

service unit is restarted more than 1 time within 20000 milliseconds.

 

[Daniel] Thanks for this. Let me see if I understood the structure:

 

      Cluster

            APP

                  SG1

                        SU1

                              COMP-A

                              COMP-B

                        SU2

                              COMP-A

                              COMP-B

                  SG2

                        SU1

                              COMP-A

                              COMP-B

                        SU2

                              COMP-A

                              COMP-B

                  SI-WL1

                        WL1-1

                        WL1-2

                  SI-WL2

                        WL2-1

                        WL2-2

                  CS-TYPE = A

                  CS-TYPE = B

 

Is this correct?

 

 A failure is detected when the application disconnects from the IPC

connection, or the node fails, or a healthcheck does not respond in the

allocated time period for the healthcheck.

 

the safSi directive specifies the amf components that should be started

within the CSI.  the CSIAttr parameter specifies which attributes should

be passed into the activated component service instance.

 

As can be seen, AMF B is quite a bit more complex than AMF A.  You may

have to change your app a bit since it appears monitor was used to

monitor if the application failed.

 

[Daniel] In our case, "Monitor" is just a monitor that watches addresses
and assigns them when needed; it does not monitor the "LIST" service...

 

Regards

-steve

 

> 

> > Thank you very much.

> >

> > Daniel.

> >

> > P.S

> > Is there a document with a detailed explanation of the AMF.CONF file? I

> > couldn't find it on the web.

> >

> 

> Take a look at the AMF information model in saiOverview.B0301.pdf.  We

> pretty closely match that model (section 5.5, page 76).

> 

> The AMF document itself explains what most of the variables do.

> 

> There is also the not so complete amf.conf man page.

> 

> Regards

> -steve

> 

> 

> > Thanks again, Daniel

> >

> >

> >

> > -----Original Message-----

> > From: Steven Dake [mailto:sdake at redhat.com]

> > Sent: Thursday, September 13, 2007 4:37 AM

> > To: Cohen-Sason Daniel-BDC021

> > Cc: openais at lists.osdl.org

> > Subject: RE: [Openais] Problems with OpenAis version 0.81

> >

> > few extra notes:

> >

> > make sure to change your hostname in the config file back to CENTOS

> >

> > Whatever the command "hostname" returns is what you want in that field

> >

> > Regards

> > -steve

> >

> > On Wed, 2007-09-12 at 18:29 -0700, Steven Dake wrote:

> > > Daniel,

> > >

> > > Try this amf.conf file and clc_cli_script.

> > >

> > > Essentially the healthcheck keys for the default configuration file

> > > are invalid and also the path to clc_cli_script does not execute

> > > openais-instantiate.

> > >

> > > With these files 0.81 instantiates the components for me and works as

> > > expected.

> > >

> > > Regards

> > > -steve

> > >

> > > On Thu, 2007-09-13 at 00:50 +0300, Cohen-Sason Daniel-BDC021 wrote:

> > > > Hi Steve.

> > > >

> > > > Thanks for the quick response!

> > > > I'm sorry, but I forgot to attach the log I generated.

> > > >

> > > > My hostname is CENTOS, and I set this name at the saAmfNodeClmNode

> > > > directive (Should it be at the safAmfNode?)

> > > >

> > > > I'm also logged in as root with root group.

> > > >

> > > > I verified that all 3 files are executable with mode 755.

> > > >

> > > > I tried to change the timeouts to 5000 and it didn't help.

> > > >

> > > > Attached please find the fixed configuration files and the log.

> > > >

> > > > Hope this will help you find the problem.

> > > >

> > > > Thanks,

> > > >

> > > > Daniel.

> > > >

> > > >

> > > > -----Original Message-----

> > > > From: Steven Dake [mailto:sdake at redhat.com]

> > > > Sent: Wednesday, September 12, 2007 9:14 PM

> > > > To: Cohen-Sason Daniel-BDC021

> > > > Cc: openais at lists.osdl.org

> > > > Subject: Re: [Openais] Problems with OpenAis version 0.81

> > > >

> > > > Without logs it is difficult to tell what went wrong.

> > > >

> > > > I would verify the following things:

> > > >

> > > > You have a node that has name service resolution to CENTOS.  That is

> > > > required by the "safAmfNode = AMF1" directive.  If this doesn't match,

> > > > testamf1 will never be started.  Put some kind of log output in testamf1 to

> > > > verify it is actually started by AMF.  If it isn't started, this is

> > > > likely the cause of the problem.

> > > >

> > > > verify:

> > > > you have mode 755 clc_cli_script in /tmp/aisexample

> > > > you have mode 755 openais-instantiate in /tmp/aisexample

> > > > you have mode 755 testamf1 in /tmp/aisexample

> > > >

> > > > I have noticed sometimes the default time values shipped in amf.conf

> > > > do not work very well with some applications resulting in false positive

> > > > failure detections.  Try changing them as follows:

> > > >

> > > >                              saAmfCompDefaultClcCliTimeout = 5000

> > > >                              saAmfCompDefaultCallbackTimeOut = 5000

> > > >                              saAmfHealthcheckPeriod = 5000

> > > >

> > > > please send the log file generated by AIS.

> > > >

> > > > Regards

> > > > -steve

> > > >

> > > > On Wed, 2007-09-12 at 08:45 +0300, Cohen-Sason Daniel-BDC021 wrote:

> > > > > Hello

> > > > >

> > > > >

> > > > > 

> > > > >

> > > > > We, at Motorola, are trying to upgrade to Ver 0.81 from Ver 0.70.

> > > > >

> > > > > 

> > > > >

> > > > > The reason we are doing this is that we encountered unstable behavior

> > > > > with 0.70, and hope that 0.81 will be better.

> > > > >

> > > > > 

> > > > >

> > > > > I first started by trying to run the "amfexample1", but for some

> > > > > reason, the readiness states of the components always become

> > > > > OUT_OF_SERVICE (after I run: ./aisexec -f).

> > > > >

> > > > > 

> > > > >

> > > > > I followed the steps which are described in README.AMF.

> > > > >

> > > > > 

> > > > >

> > > > > I also noticed that bindnetaddr must be set, or the instantiation

> > > > > won't even start.

> > > > >

> > > > > 

> > > > >

> > > > > Attached please find the log and the configuration files

> > > > > from /tmp/aisexample folder.

> > > > >

> > > > > 

> > > > >

> > > > > I really hope you can help us.

> > > > >

> > > > > 

> > > > >

> > > > > Please advise,

> > > > >

> > > > > 

> > > > >

> > > > > Daniel

> > > > >

> > > > > ________________

> > > > > Daniel Cohen-Sason

> > > > > MCIL, Design Center, Public Safety.

> > > > > Multi-Net-Mobility (MNM).

> > > > > Office:  +972-3-5658548

> > > > > Private: +972-57-5658548 (8548)

> > > > >

> > > > > Daniel.Cohen at motorola.com

> > > > > "The significant problems we have cannot be solved at the same level

> > > > > of thinking with which we created them." [Albert Einstein]

> > > > >

> > > > >

> > > > > 

> > > > >

> > > > >

> > > > > _______________________________________________

> > > > > Openais mailing list

> > > > > Openais at lists.linux-foundation.org

> > > > > https://lists.linux-foundation.org/mailman/listinfo/openais

> > > >

> >

> 

 
