[cgl_discussion] Request for review of App. Heartbeat Service

Brugger, Andrea L andrea.l.brugger at intel.com
Sun Sep 29 10:35:48 PDT 2002


Hi Louis, 

In my latest announcement regarding the Application Heartbeat Monitor
daemon, I referred to new documentation and the latest source code being
available on the sourceforge webpage
(http://appheartbeatmon.sourceforge.net).  

Many things have changed (for one, the HLD from the MS Word document no
longer exists and there is now html documentation available, which is
up-to-date).  Other things have changed too, in response to the bugzilla
bugs.  The reference to a ping/pong is no longer valid as there is just a
single heartbeat generated by the client which the daemon listens for.  

Andrea 

-----Original Message-----
From: Zhuang, Louis 
Sent: Friday, September 27, 2002 9:00 PM
To: Cress, Andrew R; cgl_discussion at osdl.org
Cc: Li, Adam; Brugger, Andrea L
Subject: RE: [cgl_discussion] Request for review of App. Heartbeat Service


Andrew,
	I'm not familiar with MCLX, but I guess the MCLX cluster heartbeat
and the CGLE application heartbeat are in different situations.
From some materials (written by you, among others :)), I guess the cluster
heartbeat daemon just pings the other monitored machines. Because receiving
a 'Ping' and responding with a 'Pong' is done in the kernel of the monitored
machine, the cluster heartbeat daemon will receive the 'Pong' quickly as
long as the network is fast enough. But the problem is, how can the cluster
heartbeat daemon make sure it sends out 100 'Ping's per second? I guess it
uses the RTC. If it finds that several timer interrupts (e.g. 3) have
occurred since it last read /dev/rtc, it will send out 3 'Ping's. In this
way, the cluster heartbeat daemon can send out 100 'Ping's per second even
though the interval between them is NOT precisely 10 ms.
In the application heartbeat, the situation is even worse. Everything is at
user level, so a heartbeat cycle needs to be scheduled at least twice (the
monitor sends 'ping' -> the app receives 'ping' and sends 'pong' -> the
monitor receives 'pong'). So... the interval will be no less than 20 ms,
optimistically (two scheduling passes at the 10 ms time slice). But... in a
high-load environment, it will be much longer than that.
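
[Illustration] The catch-up behaviour guessed at above can be sketched as
follows. This is only a model of the idea, not MCLX code: the function names
are hypothetical, and the word layout (interrupt-type flags in the low byte,
interrupt count in the remaining bytes) is the one described in the Linux
rtc(4) man page. The flag value used in the sample data is an assumption and
does not affect the result.

```python
# Sketch only: decode the word returned by a read of /dev/rtc and decide
# how many 'Ping's to send on each wakeup. All names are hypothetical.

def ticks_since_last_read(rtc_word):
    """Low byte holds interrupt-type flags; the remaining bytes hold the
    number of interrupts since the previous read (see rtc(4))."""
    return rtc_word >> 8


def pings_per_wakeup(rtc_words):
    """One batch of pings per wakeup, sized to the ticks that elapsed."""
    return [ticks_since_last_read(word) for word in rtc_words]


PERIODIC_FLAG = 0x40  # assumed flag value; only the upper bytes matter here

# Three wakeups: on time, two ticks late, on time again.
reads = [(1 << 8) | PERIODIC_FLAG,
         (3 << 8) | PERIODIC_FLAG,
         (1 << 8) | PERIODIC_FLAG]
batches = pings_per_wakeup(reads)
print(batches, sum(batches))
```

With a 100 Hz periodic interrupt, the five ticks above cover 50 ms and five
pings are sent, so the average rate holds even though the spacing between
individual pings is uneven.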

P.S. A typical application will embed the heartbeat response in its main
loop so that the monitor can make sure the application is not blocked. The
interval then means the maximum time that one pass of the main loop may
take. From this point of view, is it reasonable to set the heartbeat time to
10 ms?
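
[Illustration] A minimal sketch of that reading of the interval, assuming
the application emits one heartbeat per pass of its main loop. The function
name and the sample timestamps are made up for the example.

```python
# Sketch only: if one heartbeat is sent per main-loop pass, the gap between
# consecutive heartbeats is the duration of one pass. Any gap longer than
# the configured interval is a missed deadline.

def missed_deadlines(beat_times_ms, interval_ms):
    """Return (previous, next) timestamp pairs whose gap exceeds interval_ms."""
    pairs = zip(beat_times_ms, beat_times_ms[1:])
    return [(prev, nxt) for prev, nxt in pairs if nxt - prev > interval_ms]


# Hypothetical timestamps (ms) at which the main loop sent its heartbeat.
beats = [0, 8, 17, 40]
late = missed_deadlines(beats, interval_ms=10)
print(late)
```

Here the first two passes take 8 ms and 9 ms and meet a 10 ms interval; the
third takes 23 ms and would be reported as a missed heartbeat.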

Louis Zhuang, SW Engineer, Intel Corporation.
My opinions are my own and NEVER the opinions of Intel Corporation.

-----Original Message-----
From: Cress, Andrew R 
Sent: Friday, September 27, 2002 9:09 PM
To: Zhuang, Louis; cgl_discussion at osdl.org
Cc: Li, Adam; Brugger, Andrea L
Subject: RE: [cgl_discussion] Request for review of App. Heartbeat Service

I know that MCLX NetGuard has a 10 msec heartbeat timer for clustering, and
it runs in user-space, but it does require that the kernel have some kind of
high-res or real-time timers, which CGLE does have.  So, I think 10 msec is
doable, in CGLE.

Andy

-----Original Message-----
From: Zhuang, Louis [mailto:louis.zhuang at intel.com]
Sent: Friday, September 27, 2002 2:06 AM
To: cgl_discussion at osdl.org
Cc: Li, Adam; Brugger, Andrea L
Subject: RE: [cgl_discussion] Request for review of App. Heartbeat Service

Dear all,
I noticed this sentence in the requirements doc: "Application heartbeat time
shall be granular to at least 10 msec resolution (at least 100
heartbeats/second)." Given the scheduling time slice in Linux (1/100
second), that might be mission impossible. :)

Louis Zhuang, SW Engineer, Intel Corporation.
My opinions are my own and NEVER the opinions of Intel Corporation.

-----Original Message-----
From: Brugger, Andrea L [mailto:andrea.l.brugger at intel.com]
Sent: September 14, 2002 6:41
To: 'cgl_discussion at osdl.org'
Subject: [cgl_discussion] Request for review of App. Heartbeat Service



Project Review Request for Application Heartbeat Service
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

1. Quote the requirements from the requirements doc that your project is
expected to meet.

Requirement 3.3 Application Heartbeat Monitor --
"OSDL CGL shall provide an application heartbeat service that allows
applications to register to be monitored via standard APIs and allows a
registered application to be monitored by the service. This mechanism shall
use periodic synchronized events (heartbeats) between an application and the
monitor. If a registered application fails to provide a heartbeat, the
monitor shall report the event via the system event log.

Application heartbeat time shall be granular to at least 10 msec resolution
(at least 100 heartbeats/second).

The application heartbeat service shall be available to any process or
sub-process (thread) entity on the system. A process or thread may register
for multiple heartbeats. Each heartbeat request can have its own parameters
to specify heartbeat granularity.

This requirement does not specify a maximum number of concurrent heartbeat
registrants that the monitor can handle. However, if the monitor cannot
handle any additional registrants, the request will return a specific error
so the registrant will know this.

The application heartbeat service requires a registrant to specify a unique
identifier upon registration. If the given identifier is not unique, an
error will be returned. The registrant may use its PID or choose some other
system-unique value. The latter necessary if a single process wishes to
register for multiple heartbeats."
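
[Illustration] The registration rules quoted above can be modelled roughly
as follows. This is a sketch under stated assumptions, not the
AppHeartbeatMonitor API: the class, method, and exception names are
hypothetical, time is passed in explicitly as milliseconds, and the real
service would report overdue registrants to the system event log rather than
return them.

```python
# Sketch only: hypothetical names, not the project's actual interface.

class DuplicateIdError(Exception):
    """The identifier is already registered."""


class RegistryFullError(Exception):
    """The monitor cannot handle any additional registrants."""


class HeartbeatMonitor:
    def __init__(self, max_registrants):
        self.max_registrants = max_registrants
        self._entries = {}  # identifier -> [interval_ms, last_beat_ms]

    def register(self, ident, interval_ms, now_ms):
        if ident in self._entries:
            raise DuplicateIdError(ident)
        if len(self._entries) >= self.max_registrants:
            raise RegistryFullError(ident)
        self._entries[ident] = [interval_ms, now_ms]

    def beat(self, ident, now_ms):
        """Record a heartbeat received from a registrant."""
        self._entries[ident][1] = now_ms

    def overdue(self, now_ms):
        """Identifiers whose last heartbeat is older than their interval."""
        return [ident for ident, (interval, last) in self._entries.items()
                if now_ms - last > interval]


monitor = HeartbeatMonitor(max_registrants=2)
monitor.register("1234", interval_ms=10, now_ms=0)         # PID as identifier
monitor.register("1234:worker", interval_ms=50, now_ms=0)  # second heartbeat
monitor.beat("1234", now_ms=8)
print(monitor.overdue(now_ms=15))
print(monitor.overdue(now_ms=25))
```

The second registration shows why an identifier other than the PID is needed
when one process registers more than once, and each registration carries its
own interval.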

2. Explain how you think the project you have picked meets the above
requirements.

The project implements all of the above requirements. In addition, it
provides persistence of registration information in case the service goes
down for some reason.  If an application hangs or crashes while the service
is down, the service will detect the application failure when it is
restarted.
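
[Illustration] One way such persistence could work is sketched below. How
the project actually stores and restores its state is not described here, so
everything in the sketch is an assumption: the JSON file, the function
names, and the policy of giving every restored registrant one fresh interval
after restart (heartbeats sent while the service was down were never
recorded, so the old timestamps cannot be trusted).

```python
# Sketch only: hypothetical storage format and restart policy.
import json
import os
import tempfile


def save_registrations(path, intervals_ms):
    """Write {identifier: interval_ms} to disk so it survives a restart."""
    with open(path, "w") as f:
        json.dump(intervals_ms, f)


def restore_registrations(path, now_ms):
    """Reload registrations; each registrant gets one fresh interval."""
    with open(path) as f:
        intervals_ms = json.load(f)
    return {ident: {"interval_ms": interval, "last_beat_ms": now_ms}
            for ident, interval in intervals_ms.items()}


def overdue(entries, now_ms):
    """Identifiers with no heartbeat within their interval."""
    return [ident for ident, e in entries.items()
            if now_ms - e["last_beat_ms"] > e["interval_ms"]]


path = os.path.join(tempfile.mkdtemp(), "registrations.json")
save_registrations(path, {"1234": 10, "5678": 10})

# The service restarts at t=1000 ms; application 5678 died while it was down.
entries = restore_registrations(path, now_ms=1000)
entries["1234"]["last_beat_ms"] = 1008   # 1234 is alive and sends a heartbeat
failed = overdue(entries, now_ms=1015)
print(failed)
```

Only the registrant that stays silent after the restart is flagged; the one
that resumes sending heartbeats is not.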

3. Explain the design of your project or point to a document on the web
that explains the design.

The documentation is in the form of doxygen comments within the code and may
be extracted using "make docs" (you must have doxygen on your system).  You
can also go to http://sourceforge.net/projects/appheartbeatmon.

4. Pointer to your code/patch.

The latest code (revision 12) is available via anonymous CVS access as
explained on the sourceforge website.  You can also obtain a copy of the
code from the components/AppHeartbeatMonitor directory in CVS on
developer.osdl.org.



-=-=-=-=-=-=-=-=-=-=-=-=-=-
Andrea Brugger
Software Engineer
Intel Corporation -- Telecom Software Programs
503-677-5711

This email message does not represent or express the opinions of Intel
Corporation.
_______________________________________________
cgl_discussion mailing list
cgl_discussion at lists.osdl.org
http://lists.osdl.org/mailman/listinfo/cgl_discussion