[cgl_discussion] Some Comments for App. Heartbeat Service

Sun Sep 29 11:06:45 PDT 2002

Most of the comments regarding the App. Heartbeat Monitor daemon are out of
date with respect to the latest code on sourceforge (as I mentioned in my
announcement on Friday).  Please see below for more details....

> -----Original Message-----
> From: Zhuang, Louis [mailto:louis.zhuang at intel.com]
> Sent: Saturday, September 28, 2002 8:30 PM
> To: cgl_discussion at osdl.org
> Subject: [cgl_discussion] Some Comments for App. Heartbeat Service
> 
> 
> Dear All,
>         Following  are our comments for Application 
> Heartbeat. Wish to be
> helpful. 
> With regards. 
> -Adam & Louis
> 
> 
> 1. Comments for Design:
>         a. Loose-Couple Design Might Be Better
>         In current design, an application cannot have 
> heartbeat without
> heartbeat monitor. Contrastively, human being has heartbeat without a
> stethoscope. :) There is something unnatural. The application 
> heartbeat
> framework should allow an application to own heartbeat even 
> though monitor
> daemon is not running. The monitor should auto-probe existing 
> applications'
> heartbeat and should auto-discover new added 
> heartbeat-enabled applications.
> That is, in application's register phase, the application should not
> establish a direct communication channel with the monitor. Instead, it
> should left some information (such as it's id, recovery 
> script etc.) in some
> special place (such as /var/hbspool/<app_name>). The monitor 
> can find these
> heartbeat-enabled applications later automatically. There are several
> advantages:

In the "current" design, an application *can* have a heartbeat and
no monitor.  The clients can start at any time without regard to the 
heartbeat monitor being started. Registrations are maintained in
/var/apphbd.

>                 i. Applications do not depends on the monitor. During
> applications' life cycle, the monitor may crash without influencing
> applications in current design. So why must the monitor be 
> there at the
> start time of application?

(no longer valid question)

>                 ii. Monitor can probe application heartbeat 
> without any
> persistence registration information. In current design, 
> monitor leaves a
> persistence file in order to restore session information in the next
> execution. But... if the monitor crashes accidentally, the 
> persistence file
> may be not consistent at all. Restoring from such a file will 
> causes ANY
> problems.

There is no more nasty persistence file. So, this is 
not an issue.  If the monitor crashes, the registration info is still
in /var/apphbd and untouched (not in a file maintained by
the daemon).  

>                 iii. World will be peace.:) Monitor and 
> monitee is not tight
> couple inherently. Just think if all monitors (resource 
> monitor, heartbeat
> monitor, performance monitor etc) work in this way, what a 
> big precondition
> it is for running an all-enabled application!
> 

I don't understand what this is referring to or what exatly you 
are trying to say... ???

>         b. Design Might Be Friendly With Unix Script
>         Inittab might not be good place for ANY daemon. 
> /etc/init.d/ might
> be better. Spawning the monitor from crash is relatively 
> easy. If employing
> loose couple design pervious, the monitor can be started any 
> time, so you
> can put it with his monitees in any sequence.

You can start the monitor anytime.  The desire for having it started by 
init was just so you can have it restarted if the process goes away.  
We can certainly change this if we decide we have a better solution 
for restarting the daemon.

>         c. Design Might Consider Security/Safety
>         The monitor will execute 'recovery script' when 
> monitee is dead or
> hard to response. It is a security hole. Any malice 
> application can make
> monitor execute a script. There are some technically problems for the
> monitor telling monitee's id in IPC. (See also James' comments
> http://bugzilla.carrierlinux.org/show_bug.cgi?id=701). By the way, in
> loose-couple design, monitor know who is the monitee instead 
> of just a guy
> on the other side of IPC. So the problem is much easier.

The bug is no longer valid against the current code as the implementation 
is done completely different. 
I hope that people will look at the *new* implementation and help 
determine if the new way is indeed better and that it addresses this
security issue.  The reason we didn't use sockets, for example, is
because we didn't want the client to have to reconnect to the monitor.
The design was targeted for the client to be able to go its merry way
*without* having to rely on the status of the monitor (i.e. has it started
yet
or has it crashed, and when can it re-register, etc.)  With IPC or sockets, 
the client must know when the monitor is not around and has to re-register
if it goes away and comes back.  If someone knows of a way that we
can still do this that we haven't looked into yet, please let me know.

> 2. Comments for Implementation
>         a. Daemon
>         Steven's UNP is perfect. But... there is already a 
> daemon() function
> in Unics. So... why copy&paste it from UNP?

I thought I did cut&paste it from UNP?? what did I miss? Maybe I had
an old book?  Please cut and paste it here and I will replace the version 
that I cut and pasted from the book. Thanks.

>         b. One registration, One thread
>         The monitor spawns a thread for one registration. It 
> is prone to
> exhaust resources. About 200 registrations, in our testing, 
> the monitor
> occupies >400MB memory and performance is not acceptable.

There is still a thread for each client, but hopefully the performance
is better.  I'm hoping that people can help out here with a solution
to this issue rather than continually restating the problem.  Each thread
just uses a select() system call so if it returns before receiving a
heartbeat,
then we determine if the client has really failed.  Can we do this and not
use threads,
yet still adhere to the timing conditions needed by the clients??  

On another note...the client can send a maximum
period for which the daemon should wait for a heartbeat, but the client can
send the heartbeat on a shorter interval and be ok.

> 3. Trivial Comments
>         a. Misuse function 'strtol'
>         There is such a code segment in source.
> 

This code is no longer in the project so we don't have to worry about
this :)

> --------------------------------------------------------------
> -----------
> char   endPtr[STRING_LENGTH] = {'\0'};
> char   tempString[STRING_LENGTH];
>          ...
> n = strtol(tempString, (char **) &endPtr, 10);
>         ...
> if (endPtr[0] != '\0') {
>         ...
> }
> --------------------------------------------------------------
> --------------
> It is INCORRECT usage. Following code segment demonstrate the 
> right usage.
> --------------------------------------------------------------
> --------------
> --
>    char    *endptr;
>    char    a[]="10311977";
>    int     n;
>  
>    n = strtol(a, &endptr, 10);
>    if (a[0]!='\0' && (*endptr)=='\0') {
>            //that's OK
>            ...   
>    }
>    ...
> --------------------------------------------------------------
> --------------
> ---
> 
> 
>         b. Concurrent registertion might fail:
>         AHM polls message queue every 3 seconds for new registration
> request. In my test, if multiple registration requests are sent out at
> almost the same time, registration will fail -- I am going to 
> submit bug for
> this. Currently the sample provided by developer register multiple
> applications every 5 seconds. But there is no document 
> mentioning this "3
> seconds" constraint.
> 

This and the rest of the email are no longer a valid concern.  
Please see current documentation and code on
sourceforge.  I will try to get these two in sync today, but I wanted
to be very sure nothing messed up the build before I submitted it 
to CGL.  From the sourceforge homepage, you can download the tarball
with the latest source and documentation (the documentation is also
available
via links from the home page as well.)

Thanks Louis for your feedback, I look forward to more feedback with the
latest code so we can start addressing new concerns...

Andrea

>         c. Return values:
>         Most of the defects are related with return values, 
> for example:
>                  i.
>                  In appHeartbeatMonitor.c: 
> RegisterApplication(..) , this is
> an API provided for user to register heartbeat monitor service
> --------------------------------------------------------------
> --------------
> --------------------------------
> {
> ...
>    if (NULL == *ap_regInfo || NULL == ap_clientInfo)
>    {
>       return 4;   
>    }
> ....
>   data = GetInitialPing(p_regInfo->readDescriptor, &p_pongFifoName,
> timeout);
> 
>    if (data <= 0)
>    {
>       return 4;
>    }
> ..........
> }
> --------------------------------------------------------------
> --------------
> ----------------------------------
> 
> These are two different error situations. So the functions 
> should return two
> different error codes. However it returns '4' for both 
> situations. Besides,
> since the return value is not defined as 'enum' or constant 
> value, it is not
> convenient for user to judge the reason of failed registration.
> 
> ii.
> In appHeartbeatMonitor.c : Initialize(..) //This is also an 
> api provided to
> user application
> --------------------------------------------------------------
> --------------
> -----------------------------------
> {
>    p_regInfo->readDescriptor = CreateFifo(p_regInfo->readFifoName);
> 
>    if (0 == p_regInfo->readDescriptor)
>    {
>       unlink(p_regInfo->readFifoName);
>       free(p_regInfo);
>       p_regInfo = NULL;
>       return -1;
>    }
> }
>  
> --------------------------------------------------------------
> --------------
> ------------------------------------------
> 
> However, CreateFifo() will never return '0'. It's return 
> value is either -1,
> -2, or > 0. So the 'if' statement here is usless.
> 
> 3.c
> --------------------------------------------------------------
> --------------
> --------------------------------------------------
> int ReadPing(void **ap_regInfo, struct timeval a_timeout)
> {
> ........
>    if (NULL == *ap_regInfo )
>    {
>       return -EINVAL;   
>    }
> ...............
> }
> 
> int SendPong(void *ap_regInfo)
> {
> ....................
>    if (NULL == ap_regInfo )
>    {
>       errno = EINVAL;
>       return EINVAL;   
>    }
>     
> ..............................
>    /* if we already replied, don't resend */
>    if (p_regInfo->lastData < 0)
>    {
>       errno = EINPROGRESS;
>       return EINPROGRESS;
>    }
>  
> ..............................
>    if (result != 0)
>    {
>       errno = EPIPE;
>       return EPIPE;
>    }     
> }
> 
> 
> int RegisterApplication(..)
> {
> ...............    
>  if (p_regInfo->writeDescriptor < 0 )
>       {
>          unlink(p_regInfo->writeFifoName);
>          return errno;
>       }
> ...........
> }
> --------------------------------------------------------------
> --------------
> ----------------------------------
> I am not sure the reason to use " return -EINVAL", "return 
> EINVAL", and
> "return errno" in these different situations.
> 
>         d. Possible memory leak
>  
> --------------------------------------------------------------
> --------------
> -----------------------------------------
> In appHeartbeatMonitor.c RegisterApplication(..)
> {
>             char  *p_pongFifoName;
> }
> 
> p_pongFifoName the memory is allocated in
> 
> int ReadInitialPing(int a_fifoDescriptor, char ** a_pongFifoName)
> {
>    if (bytes == sizeof(ping))
>    {
>       *a_pongFifoName = strdup(ping.fifoName);
>    }
> 
> }
>  
> --------------------------------------------------------------
> --------------
> ------------------------------------------
>          p_pongFifoName is not free when RegisterApplication exits
> 
>         e. Typo
> In Main.cpp
>  
> --------------------------------------------------------------
> --------------
> ----------------------------------------------------
> #ifdef DEBUG   
>    ErrorHandler::Initialize("AppHeartbeatMonitor",
>                              STDERR_LOGGING_ON,
>                              LOG_DAEMON);
> #else
>    ErrorHandler::Initialize("AppHeartbeatMonitor",
>                              STDERR_LOGGING_ON,  // STDERR_LOGGING_OFF
>                              LOG_DAEMON);
> #endif
> --------------------------------------------------------------
> --------------
> ------------------------------------------------------
> From the context, it seems the second Initialze should be
> "STDERR_LOGGING_OFF" instead of "STDERR_LOGGING_ON".
> 
> f. In some place the *notes* of functions are not consistent 
> with the code.
> _______________________________________________
> cgl_discussion mailing list
> cgl_discussion at lists.osdl.org
> http://lists.osdl.org/mailman/listinfo/cgl_discussion
>