[cgl_discussion] Some Comments for App. Heartbeat Service

Zhuang, Louis louis.zhuang at intel.com
Sun Sep 29 18:01:49 PDT 2002


Dear Andrea,
	Adam and I did not read your new code because we just monitored the
CVS at source and ODSL. Today I found your new tar ball in homepage is
*more* up-to-date than CVS. We'll re-read your new implementation.

Louis Zhuang, SW Engineer, Intel Corporation.
My opinions are my own and NEVER the opinions of Intel Corporation.

-----Original Message-----
From: Brugger, Andrea L 
Sent: Monday, September 30, 2002 2:28 AM
To: Zhuang, Louis; cgl_discussion at osdl.org
Subject: RE: [cgl_discussion] Some Comments for App. Heartbeat Service

ok...I miss-read the daemon question, sorry about that.  I wasn't aware of
the daemon() call in Linux so thank you for pointing this out.  I'll update
the code to reflect this...

Andrea

> -----Original Message-----
> From: Brugger, Andrea L [mailto:andrea.l.brugger at intel.com]
> Sent: Sunday, September 29, 2002 11:07 AM
> To: Zhuang, Louis; cgl_discussion at osdl.org
> Subject: RE: [cgl_discussion] Some Comments for App. Heartbeat Service
>
>
> Most of the comments regarding the App. Heartbeat Monitor
> daemon are out of
> date with respect to the latest code on sourceforge (as I
> mentioned in my
> announcement on Friday).  Please see below for more details....
>
>
> > -----Original Message-----
> > From: Zhuang, Louis [mailto:louis.zhuang at intel.com]
> > Sent: Saturday, September 28, 2002 8:30 PM
> > To: cgl_discussion at osdl.org
> > Subject: [cgl_discussion] Some Comments for App. Heartbeat Service
> >
> >
> > Dear All,
> >         Following  are our comments for Application
> > Heartbeat. Wish to be
> > helpful.
> > With regards.
> > -Adam & Louis
> >
> >
> > 1. Comments for Design:
> >         a. Loose-Couple Design Might Be Better
> >         In current design, an application cannot have
> > heartbeat without
> > heartbeat monitor. Contrastively, human being has heartbeat
> without a
> > stethoscope. :) There is something unnatural. The application
> > heartbeat
> > framework should allow an application to own heartbeat even
> > though monitor
> > daemon is not running. The monitor should auto-probe existing
> > applications'
> > heartbeat and should auto-discover new added
> > heartbeat-enabled applications.
> > That is, in application's register phase, the application should not
> > establish a direct communication channel with the monitor.
> Instead, it
> > should left some information (such as it's id, recovery
> > script etc.) in some
> > special place (such as /var/hbspool/<app_name>). The monitor
> > can find these
> > heartbeat-enabled applications later automatically. There
> are several
> > advantages:
>
>
> In the "current" design, an application *can* have a heartbeat and
> no monitor.  The clients can start at any time without regard to the
> heartbeat monitor being started. Registrations are maintained in
> /var/apphbd.
>
>
> >                 i. Applications do not depends on the
> monitor. During
> > applications' life cycle, the monitor may crash without influencing
> > applications in current design. So why must the monitor be
> > there at the
> > start time of application?
>
>
> (no longer valid question)
>
>
> >                 ii. Monitor can probe application heartbeat
> > without any
> > persistence registration information. In current design,
> > monitor leaves a
> > persistence file in order to restore session information in the next
> > execution. But... if the monitor crashes accidentally, the
> > persistence file
> > may be not consistent at all. Restoring from such a file will
> > causes ANY
> > problems.
>
>
> There is no more nasty persistence file. So, this is
> not an issue.  If the monitor crashes, the registration info is still
> in /var/apphbd and untouched (not in a file maintained by
> the daemon). 
>
>
> >                 iii. World will be peace.:) Monitor and
> > monitee is not tight
> > couple inherently. Just think if all monitors (resource
> > monitor, heartbeat
> > monitor, performance monitor etc) work in this way, what a
> > big precondition
> > it is for running an all-enabled application!
> >
>
> I don't understand what this is referring to or what exatly you
> are trying to say... ???
>
>
> >         b. Design Might Be Friendly With Unix Script
> >         Inittab might not be good place for ANY daemon.
> > /etc/init.d/ might
> > be better. Spawning the monitor from crash is relatively
> > easy. If employing
> > loose couple design pervious, the monitor can be started any
> > time, so you
> > can put it with his monitees in any sequence.
>
> You can start the monitor anytime.  The desire for having it
> started by
> init was just so you can have it restarted if the process goes away. 
> We can certainly change this if we decide we have a better solution
> for restarting the daemon.
>
>
> >         c. Design Might Consider Security/Safety
> >         The monitor will execute 'recovery script' when
> > monitee is dead or
> > hard to response. It is a security hole. Any malice
> > application can make
> > monitor execute a script. There are some technically
> problems for the
> > monitor telling monitee's id in IPC. (See also James' comments
> > http://bugzilla.carrierlinux.org/show_bug.cgi?id=701). By
> the way, in
> > loose-couple design, monitor know who is the monitee instead
> > of just a guy
> > on the other side of IPC. So the problem is much easier.
>
> The bug is no longer valid against the current code as the
> implementation
> is done completely different.
> I hope that people will look at the *new* implementation and help
> determine if the new way is indeed better and that it addresses this
> security issue.  The reason we didn't use sockets, for example, is
> because we didn't want the client to have to reconnect to the monitor.
> The design was targeted for the client to be able to go its merry way
> *without* having to rely on the status of the monitor (i.e.
> has it started
> yet
> or has it crashed, and when can it re-register, etc.)  With
> IPC or sockets,
> the client must know when the monitor is not around and has
> to re-register
> if it goes away and comes back.  If someone knows of a way that we
> can still do this that we haven't looked into yet, please let me know.
>
> > 2. Comments for Implementation
> >         a. Daemon
> >         Steven's UNP is perfect. But... there is already a
> > daemon() function
> > in Unics. So... why copy&paste it from UNP?
>
>
> I thought I did cut&paste it from UNP?? what did I miss? Maybe I had
> an old book?  Please cut and paste it here and I will replace
> the version
> that I cut and pasted from the book. Thanks.
>
>
>
> >         b. One registration, One thread
> >         The monitor spawns a thread for one registration. It
> > is prone to
> > exhaust resources. About 200 registrations, in our testing,
> > the monitor
> > occupies >400MB memory and performance is not acceptable.
>
> There is still a thread for each client, but hopefully the performance
> is better.  I'm hoping that people can help out here with a solution
> to this issue rather than continually restating the problem. 
> Each thread
> just uses a select() system call so if it returns before receiving a
> heartbeat,
> then we determine if the client has really failed.  Can we do
> this and not
> use threads,
> yet still adhere to the timing conditions needed by the clients?? 
>
> On another note...the client can send a maximum
> period for which the daemon should wait for a heartbeat, but
> the client can
> send the heartbeat on a shorter interval and be ok.
>
>
> > 3. Trivial Comments
> >         a. Misuse function 'strtol'
> >         There is such a code segment in source.
> >
>
> This code is no longer in the project so we don't have to worry about
> this :)
>
> > --------------------------------------------------------------
> > -----------
> > char   endPtr[STRING_LENGTH] = {'\0'};
> > char   tempString[STRING_LENGTH];
> >          ...
> > n = strtol(tempString, (char **) &endPtr, 10);
> >         ...
> > if (endPtr[0] != '\0') {
> >         ...
> > }
> > --------------------------------------------------------------
> > --------------
> > It is INCORRECT usage. Following code segment demonstrate the
> > right usage.
> > --------------------------------------------------------------
> > --------------
> > --
> >    char    *endptr;
> >    char    a[]="10311977";
> >    int     n;
> > 
> >    n = strtol(a, &endptr, 10);
> >    if (a[0]!='\0' && (*endptr)=='\0') {
> >            //that's OK
> >            ...  
> >    }
> >    ...
> > --------------------------------------------------------------
> > --------------
> > ---
> >
> >
> >         b. Concurrent registertion might fail:
> >         AHM polls message queue every 3 seconds for new registration
> > request. In my test, if multiple registration requests are
> sent out at
> > almost the same time, registration will fail -- I am going to
> > submit bug for
> > this. Currently the sample provided by developer register multiple
> > applications every 5 seconds. But there is no document
> > mentioning this "3
> > seconds" constraint.
> >
>
> This and the rest of the email are no longer a valid concern. 
> Please see current documentation and code on
> sourceforge.  I will try to get these two in sync today, but I wanted
> to be very sure nothing messed up the build before I submitted it
> to CGL.  From the sourceforge homepage, you can download the tarball
> with the latest source and documentation (the documentation is also
> available
> via links from the home page as well.)
>
> Thanks Louis for your feedback, I look forward to more
> feedback with the
> latest code so we can start addressing new concerns...
>
> Andrea
>
>
> >         c. Return values:
> >         Most of the defects are related with return values,
> > for example:
> >                  i.
> >                  In appHeartbeatMonitor.c:
> > RegisterApplication(..) , this is
> > an API provided for user to register heartbeat monitor service
> > --------------------------------------------------------------
> > --------------
> > --------------------------------
> > {
> > ...
> >    if (NULL == *ap_regInfo || NULL == ap_clientInfo)
> >    {
> >       return 4;  
> >    }
> > ....
> >   data = GetInitialPing(p_regInfo->readDescriptor, &p_pongFifoName,
> > timeout);
> >
> >    if (data <= 0)
> >    {
> >       return 4;
> >    }
> > ..........
> > }
> > --------------------------------------------------------------
> > --------------
> > ----------------------------------
> >
> > These are two different error situations. So the functions
> > should return two
> > different error codes. However it returns '4' for both
> > situations. Besides,
> > since the return value is not defined as 'enum' or constant
> > value, it is not
> > convenient for user to judge the reason of failed registration.
> >
> > ii.
> > In appHeartbeatMonitor.c : Initialize(..) //This is also an
> > api provided to
> > user application
> > --------------------------------------------------------------
> > --------------
> > -----------------------------------
> > {
> >    p_regInfo->readDescriptor = CreateFifo(p_regInfo->readFifoName);
> >
> >    if (0 == p_regInfo->readDescriptor)
> >    {
> >       unlink(p_regInfo->readFifoName);
> >       free(p_regInfo);
> >       p_regInfo = NULL;
> >       return -1;
> >    }
> > }
> > 
> > --------------------------------------------------------------
> > --------------
> > ------------------------------------------
> >
> > However, CreateFifo() will never return '0'. It's return
> > value is either -1,
> > -2, or > 0. So the 'if' statement here is usless.
> >
> > 3.c
> > --------------------------------------------------------------
> > --------------
> > --------------------------------------------------
> > int ReadPing(void **ap_regInfo, struct timeval a_timeout)
> > {
> > ........
> >    if (NULL == *ap_regInfo )
> >    {
> >       return -EINVAL;  
> >    }
> > ...............
> > }
> >
> > int SendPong(void *ap_regInfo)
> > {
> > ....................
> >    if (NULL == ap_regInfo )
> >    {
> >       errno = EINVAL;
> >       return EINVAL;  
> >    }
> >    
> > ..............................
> >    /* if we already replied, don't resend */
> >    if (p_regInfo->lastData < 0)
> >    {
> >       errno = EINPROGRESS;
> >       return EINPROGRESS;
> >    }
> > 
> > ..............................
> >    if (result != 0)
> >    {
> >       errno = EPIPE;
> >       return EPIPE;
> >    }    
> > }
> >
> >
> > int RegisterApplication(..)
> > {
> > ...............   
> >  if (p_regInfo->writeDescriptor < 0 )
> >       {
> >          unlink(p_regInfo->writeFifoName);
> >          return errno;
> >       }
> > ...........
> > }
> > --------------------------------------------------------------
> > --------------
> > ----------------------------------
> > I am not sure the reason to use " return -EINVAL", "return
> > EINVAL", and
> > "return errno" in these different situations.
> >
> >         d. Possible memory leak
> > 
> > --------------------------------------------------------------
> > --------------
> > -----------------------------------------
> > In appHeartbeatMonitor.c RegisterApplication(..)
> > {
> >             char  *p_pongFifoName;
> > }
> >
> > p_pongFifoName the memory is allocated in
> >
> > int ReadInitialPing(int a_fifoDescriptor, char ** a_pongFifoName)
> > {
> >    if (bytes == sizeof(ping))
> >    {
> >       *a_pongFifoName = strdup(ping.fifoName);
> >    }
> >
> > }
> > 
> > --------------------------------------------------------------
> > --------------
> > ------------------------------------------
> >          p_pongFifoName is not free when RegisterApplication exits
> >
> >         e. Typo
> > In Main.cpp
> > 
> > --------------------------------------------------------------
> > --------------
> > ----------------------------------------------------
> > #ifdef DEBUG  
> >    ErrorHandler::Initialize("AppHeartbeatMonitor",
> >                              STDERR_LOGGING_ON,
> >                              LOG_DAEMON);
> > #else
> >    ErrorHandler::Initialize("AppHeartbeatMonitor",
> >                              STDERR_LOGGING_ON,  //
> STDERR_LOGGING_OFF
> >                              LOG_DAEMON);
> > #endif
> > --------------------------------------------------------------
> > --------------
> > ------------------------------------------------------
> > From the context, it seems the second Initialze should be
> > "STDERR_LOGGING_OFF" instead of "STDERR_LOGGING_ON".
> >
> > f. In some place the *notes* of functions are not consistent
> > with the code.
> > _______________________________________________
> > cgl_discussion mailing list
> > cgl_discussion at lists.osdl.org
> > http://lists.osdl.org/mailman/listinfo/cgl_discussion
> >
> _______________________________________________
> cgl_discussion mailing list
> cgl_discussion at lists.osdl.org
> http://lists.osdl.org/mailman/listinfo/cgl_discussion
>



More information about the cgl_discussion mailing list