[cgl_discussion] Some Comments for App. Heartbeat Service

Zhuang, Louis louis.zhuang at intel.com
Sat Sep 28 20:29:53 PDT 2002


Dear All,
        Following  are our comments for Application Heartbeat. Wish to be
helpful. 
With regards. 
-Adam & Louis


1. Comments for Design:
        a. Loose-Couple Design Might Be Better
        In current design, an application cannot have heartbeat without
heartbeat monitor. Contrastively, human being has heartbeat without a
stethoscope. :) There is something unnatural. The application heartbeat
framework should allow an application to own heartbeat even though monitor
daemon is not running. The monitor should auto-probe existing applications'
heartbeat and should auto-discover new added heartbeat-enabled applications.
That is, in application's register phase, the application should not
establish a direct communication channel with the monitor. Instead, it
should left some information (such as it's id, recovery script etc.) in some
special place (such as /var/hbspool/<app_name>). The monitor can find these
heartbeat-enabled applications later automatically. There are several
advantages:
                i. Applications do not depends on the monitor. During
applications' life cycle, the monitor may crash without influencing
applications in current design. So why must the monitor be there at the
start time of application?
                ii. Monitor can probe application heartbeat without any
persistence registration information. In current design, monitor leaves a
persistence file in order to restore session information in the next
execution. But... if the monitor crashes accidentally, the persistence file
may be not consistent at all. Restoring from such a file will causes ANY
problems.
                iii. World will be peace.:) Monitor and monitee is not tight
couple inherently. Just think if all monitors (resource monitor, heartbeat
monitor, performance monitor etc) work in this way, what a big precondition
it is for running an all-enabled application!

        b. Design Might Be Friendly With Unix Script
        Inittab might not be good place for ANY daemon. /etc/init.d/ might
be better. Spawning the monitor from crash is relatively easy. If employing
loose couple design pervious, the monitor can be started any time, so you
can put it with his monitees in any sequence.
        c. Design Might Consider Security/Safety
        The monitor will execute 'recovery script' when monitee is dead or
hard to response. It is a security hole. Any malice application can make
monitor execute a script. There are some technically problems for the
monitor telling monitee's id in IPC. (See also James' comments
http://bugzilla.carrierlinux.org/show_bug.cgi?id=701). By the way, in
loose-couple design, monitor know who is the monitee instead of just a guy
on the other side of IPC. So the problem is much easier.
2. Comments for Implementation
        a. Daemon
        Steven's UNP is perfect. But... there is already a daemon() function
in Unics. So... why copy&paste it from UNP?
        b. One registration, One thread
        The monitor spawns a thread for one registration. It is prone to
exhaust resources. About 200 registrations, in our testing, the monitor
occupies >400MB memory and performance is not acceptable.
3. Trivial Comments
        a. Misuse function 'strtol'
        There is such a code segment in source.

-------------------------------------------------------------------------
char   endPtr[STRING_LENGTH] = {'\0'};
char   tempString[STRING_LENGTH];
         ...
n = strtol(tempString, (char **) &endPtr, 10);
        ...
if (endPtr[0] != '\0') {
        ...
}
----------------------------------------------------------------------------
It is INCORRECT usage. Following code segment demonstrate the right usage.
----------------------------------------------------------------------------
--
   char    *endptr;
   char    a[]="10311977";
   int     n;
 
   n = strtol(a, &endptr, 10);
   if (a[0]!='\0' && (*endptr)=='\0') {
           //that's OK
           ...   
   }
   ...
----------------------------------------------------------------------------
---


        b. Concurrent registertion might fail:
        AHM polls message queue every 3 seconds for new registration
request. In my test, if multiple registration requests are sent out at
almost the same time, registration will fail -- I am going to submit bug for
this. Currently the sample provided by developer register multiple
applications every 5 seconds. But there is no document mentioning this "3
seconds" constraint.

        c. Return values:
        Most of the defects are related with return values, for example:
                 i.
                 In appHeartbeatMonitor.c: RegisterApplication(..) , this is
an API provided for user to register heartbeat monitor service
----------------------------------------------------------------------------
--------------------------------
{
...
   if (NULL == *ap_regInfo || NULL == ap_clientInfo)
   {
      return 4;   
   }
....
  data = GetInitialPing(p_regInfo->readDescriptor, &p_pongFifoName,
timeout);

   if (data <= 0)
   {
      return 4;
   }
..........
}
----------------------------------------------------------------------------
----------------------------------

These are two different error situations. So the functions should return two
different error codes. However it returns '4' for both situations. Besides,
since the return value is not defined as 'enum' or constant value, it is not
convenient for user to judge the reason of failed registration.

ii.
In appHeartbeatMonitor.c : Initialize(..) //This is also an api provided to
user application
----------------------------------------------------------------------------
-----------------------------------
{
   p_regInfo->readDescriptor = CreateFifo(p_regInfo->readFifoName);

   if (0 == p_regInfo->readDescriptor)
   {
      unlink(p_regInfo->readFifoName);
      free(p_regInfo);
      p_regInfo = NULL;
      return -1;
   }
}
 
----------------------------------------------------------------------------
------------------------------------------

However, CreateFifo() will never return '0'. It's return value is either -1,
-2, or > 0. So the 'if' statement here is usless.

3.c
----------------------------------------------------------------------------
--------------------------------------------------
int ReadPing(void **ap_regInfo, struct timeval a_timeout)
{
........
   if (NULL == *ap_regInfo )
   {
      return -EINVAL;   
   }
...............
}

int SendPong(void *ap_regInfo)
{
....................
   if (NULL == ap_regInfo )
   {
      errno = EINVAL;
      return EINVAL;   
   }
    
..............................
   /* if we already replied, don't resend */
   if (p_regInfo->lastData < 0)
   {
      errno = EINPROGRESS;
      return EINPROGRESS;
   }
 
..............................
   if (result != 0)
   {
      errno = EPIPE;
      return EPIPE;
   }     
}


int RegisterApplication(..)
{
...............    
 if (p_regInfo->writeDescriptor < 0 )
      {
         unlink(p_regInfo->writeFifoName);
         return errno;
      }
...........
}
----------------------------------------------------------------------------
----------------------------------
I am not sure the reason to use " return -EINVAL", "return EINVAL", and
"return errno" in these different situations.

        d. Possible memory leak
 
----------------------------------------------------------------------------
-----------------------------------------
In appHeartbeatMonitor.c RegisterApplication(..)
{
            char  *p_pongFifoName;
}

p_pongFifoName the memory is allocated in

int ReadInitialPing(int a_fifoDescriptor, char ** a_pongFifoName)
{
   if (bytes == sizeof(ping))
   {
      *a_pongFifoName = strdup(ping.fifoName);
   }

}
 
----------------------------------------------------------------------------
------------------------------------------
         p_pongFifoName is not free when RegisterApplication exits

        e. Typo
In Main.cpp
 
----------------------------------------------------------------------------
----------------------------------------------------
#ifdef DEBUG   
   ErrorHandler::Initialize("AppHeartbeatMonitor",
                             STDERR_LOGGING_ON,
                             LOG_DAEMON);
#else
   ErrorHandler::Initialize("AppHeartbeatMonitor",
                             STDERR_LOGGING_ON,  // STDERR_LOGGING_OFF
                             LOG_DAEMON);
#endif
----------------------------------------------------------------------------
------------------------------------------------------
>From the context, it seems the second Initialze should be
"STDERR_LOGGING_OFF" instead of "STDERR_LOGGING_ON".

f. In some place the *notes* of functions are not consistent with the code.



More information about the cgl_discussion mailing list