[cgl_discussion] AEM analysis summary

Dave Olien dmo at osdl.org
Wed Apr 2 09:45:14 PST 2003

Here is a summary and analysis of the AEM implementation.
I appreciate any comments.

In the discussion below, the term "thread" isn't intended to imply
a POSIX thread.  Thread refers to an execution context,
created via clone(), that may or may not share attributes (memory,
file descriptors, etc) with another thread.  POSIX threads are implemented
as a library layered on top of clone().  Using POSIX threads isn't
required, and may be undesirable, but POSIX threads also shouldn't be
ruled out.


AEM implements an event delivery mechanism that meets many of OSDL's
requirements for such a mechanism.  However, the implementation appears
highly optimized for a specific set of applications, and the optimization
is done intrusively in the kernel.  It is more intrusive than is desirable
for OSDL, and intrusive enough that the chance of mainline acceptance is
effectively zero.  The AEM implementation will require substantial ongoing
maintenance to keep up with kernel revisions; the transition from
linux 2.4 to linux 2.5 alone will require substantial updates to the AEM
implementation.

There are also indications that the AEM implementation is still somewhat
of a prototype, as evidenced by some unused code in the kernel.

A less optimized implementation that accomplishes most of what AEM
implements could be implemented using user-mode libraries and poll/epoll.
This library could be supplemented with a kernel-mode loadable module
that uses only kernel interfaces that are normally exported by the kernel
so its implementation should remain stable over time, leading to a lower
maintenance burden.  If kernel enhancements are needed, they should be
general purpose in nature, with the hopes of having them become part of
the mainline kernel.

We are also interested in the SAForum requirements for event notification,
and are awaiting publication of their API for evaluation.

We are also interested in an API that includes a publish/subscribe model
for registering event types and registering to receive events.

It would be useful to have source code for a few simple applications
that use the AEM API.  This could include code for measuring performance,
code used for regression testing, etc.

Survey of AEM

AEM invents a number of abstractions.

	job: 	The action involved in notifying one or more processes.
		jobs make use of the kernel's tasklet context to do
		their work in a light-weight manner.

	event:	The act of notifying a process that a system event has occurred
		that is of interest to that process.  Notification may
		include delivery of (large amounts of) data associated with
		that system event.  The AEM kernel manipulates the
		process's user space to allocate memory for the data to
		be delivered.

		The event notification executes in a signal-handler-like
		context within the target process.

	capsule: Under some circumstances, the application may want event
		notification to occur in the context of a new thread,
		rather than to the application's "main thread".
		AEM implements what amounts to a variation of the clone
		system call that creates a new thread context.  The
		kernel maintains a pool of capsules, to make creation
		of a new capsule context more "light weight".

AEM adds 13 new kernel entry points.  These are:

	do_event()	this is really the entry for the event
			delivery code.  This is called on every
			exit point from the kernel, and also on
			every kernel IRQ  handler entry point.
	sys_capsule()	create "capsule threads".  This implements
			a "clone()-like" and an "exit()-like" function.

	sys_evreturn()	does the cleanup function on completion of
			an event delivery

	sys_evctl()	deals with setting attributes on jobs and
			on event delivery

			New socket-related system calls, for registering
			asynchronous delivery of socket-related events.

			A new system call to register for delivery of
			a timer event.


Analysis of implementation.

In many ways, AEM implements variations on functions already
implemented in the main kernel.  A maintenance consequence of this is
that whenever the kernel changes in any basic way, the AEM implementation
must change to track it.  Variations like these on already existing
kernel functionality will never be accepted into the mainline kernel in
this form.

For example:

1. capsules implement a variation on clone() and exit().

	Linux 2.5 did a lot of work on clone() to improve the
	implementation of threads.  The AEM implementation
	of capsules will require rework for linux 2.5.

2. event delivery in AEM is in many ways a variation on signal delivery.
	But it is a mechanism that is in addition to standard signals.
	So wherever there were previously tests for signal delivery, there
	is now also a test for event delivery.  This adds a little bit of
	code to every fast-path kernel exit.  But AEM has chosen to also
	add this test for event delivery on every signal handler's entry
	point.  This is a significant new addition to the fast path for
	signal delivery.

	In linux 2.5, the kernel entry/exit code has been re-implemented.
	The x86 sysenter/sysexit instructions are now the preferred
	method for entering and returning from the kernel. The AEM event
	delivery code will need to be integrated into the new 2.5 code.

3. Event Data delivery in AEM implements variations on kernel memory management

	AEM implements its own memory buddy-allocator.
	This is a special-case implementation of what the kernel already
	does in its buddy allocation schemes.

	The memory handled by AEM's buddy allocator will be
	mapped into the application's user space to contain event data.
	This implementation modifies the user's page tables.

	Linux 2.5 has made substantial changes to memory management.
	In particular, the 2.5 implementation of HIGHMEM, using x86 PAE
	and transient mapping of page tables may require significant
	change to this code in AEM.

4. Soft Realtime Scheduling Priority

	AEM boosts the scheduling priority of whatever thread
	context is the target of an event delivery, to ensure the
	event is handled quickly.

	Linux 2.5 has re-written the scheduler.  The priority boost
	code in AEM will need to be re-written.  The 2.5 O(1) scheduler
	requires that tasks be moved between priority queues, and possibly
	migrated between per-CPU run queues, when their priority changes.
	Really, AEM should probably call an exported kernel function such
	as set_user_nice().

5. Modifications to sockets and protocol stacks

	AEM modifies the kernel's sockets code to add event delivery
	at socket bind time, when data is available on a socket, etc.

	This also has required modification to code in the TCP protocol
	stack and also in other protocol stacks.  To some degree, this
	is analogous to changing the VFS interfaces for file systems.
	It requires modification to every protocol stack.

Analysis of Application Programming Model

The AEM event delivery mechanism is built on a variation of signal delivery.
But, it is a mechanism IN ADDITION to signals.  A consequence is that
applications using events will have many of the same application reentrancy
issues as signals.  For example, libc functions that are "async-signal"
safe often achieve that by blocking signals around critical code in the
library function.  This is important for library functions that may be
called from both an application's mainline code and from signal handlers
in that application.

AEM implements a system call to hold off delivery of events, similar
to blocking signals.  But none of the functions in libc have been
adapted to use that mechanism.  Hence, no libc functions are "async-event"
safe.  The workaround for this is for applications to block event
delivery around calls to functions (e.g. printf()) that are called from
both mainline code and from event handlers in that application.

The same is true for a user-level spinlock that is acquired both in an
event handler and in the mainline application.  Prior to acquiring the
lock, events should be blocked.  For user-mode spinlocks (futex), it's
unfortunate that a system call needs to be made; it defeats the purpose
of having user-mode spinlocks.

Proposed Library/Module Implementation

The O(1) scheduler is among the significant improvements in linux 2.5.
It allows a system to efficiently schedule from a large set of tasks, and to
efficiently prioritize among those tasks.  This is a major underpinning for
linux 2.5's claimed support for "tens of thousands of threads".

A proposed implementation of event notification should make use of this.
One way would be for the mainline application threads to have a pool of
"epoll" threads.  These epoll threads would share memory with the
mainline thread, and would do blocking calls on epoll() to wait for the
occurrence of events.  When an event occurs, the poll thread can collect
the event data, and execute a callback function in the shared-memory context
of the application's mainline thread.

Poll threads could run with an elevated scheduling priority to ensure
they service events quickly.  With privilege, poll threads could be given
user-mode realtime priorities (as opposed to kernel-mode realtime
priorities, which are reserved for kernel daemons that perform critical
work).
This style of callback avoids the re-entrancy issues that a signal-like
delivery has with library functions and mutexes.  It's still possible for
mutexes to have a-b, b-a style deadlocks, but the recursive mutex deadlock
is eliminated.  This eliminates the need to do a system call to block
event delivery prior to acquiring a mutex.  If the application is using
futex, any contention on a lock between the mainline application and an
event delivery will be efficiently resolved.  Whichever thread (mainline
or poll thread) encounters a futex that is already locked, will block
and release the processor until the lock becomes available.  This leaves
the thread that already holds the lock in a runnable state, and able to
proceed until that thread releases the lock.

It may be useful to implement a kernel module to implement some of the
requirements for event delivery.  The kernel module could implement
distribution of events among more than one mainline thread, prioritization
of events, etc.

Requirements Analysis

There are three sources for event delivery requirements:

The OSDL CGL Requirements for Low-Level Asynchronous Events, by Frederic Rossi

The OSDL Cluster Event Notification API, by Joe DiMartino

The OSDL Carrier Grade Linux Working Group Clustering Requirements and
SAForum Application Interface Specification Comparison, by Peter Badovinatz.

Among the requirements:

1. An Event publication and subscription interface, that
	supports event notification through callbacks or polling.

	Ideally, it should be possible to implement similar interfaces
	in both user and kernel space.

2. Event Topics
