[Openais] experimental totem zero copy branch created

Sat Aug 14 14:17:52 PDT 2010

I have created an experimental totem zero copy branch in the 
corosync/branches directory of our repository.  This message explains 
the purpose of this branch.

Most of the developers would like to see corosync go single-threaded, as 
mentioned before regarding the experimental cpgol branch.  The issue 
that prevents this is that IPC in general is slow (on the order of 
50-200k IPC operations per second) using standard solutions such as 
PF_LOCAL sockets or POSIX message queues.  In general, this is fast 
enough for nearly all client/server communication operations, except for 
the cpg_mcast_joined() C API.

During corosync development, we introduced the coroipc layer to allow 
multiple cores to participate in copying messages around to the 
single-threaded totem protocol implementation.  While this definitely 
can make use of more of a system's cpu/memory bandwidth, it introduced 
several other problems common to multithreading.

The cpgol experimental concept looked interesting, but I found the 
dependency chain and complexity introduced into clients to be too 
difficult to deal with.

Having had a good hard look at the performance of Totem, I found that 
corosync spends most of its time either inside memcpy() operations or 
in mutual exclusion, as multiple IPC threads contend for the 
single-threaded totem resource.

The exp-zc branch is intended to serve as an experiment in removing 
nearly all memory copy operations from the corosync core.

The basic concept is to create memory map blocks which are shared 
between individual clients and the corosync server.  Copies occur in the 
following conditions:
1. cpg_mcast_joined() copies message blocks into shared memory segments 
when sending a message
2. cpg_dispatch() may copy message blocks from shared memory segments 
when delivering a message larger than the MTU frame size (i.e. it needs 
to assemble a large packet).  If the packet is smaller than the frame 
MTU, it should deliver it without a copy
3. If totem is compacting messages into one network frame, it may memcpy 
multiple smaller buffers into a frame sized packet
4. When totem sends a message via UDP, sendmsg() executes a copy operation
5. When totem receives a message via UDP, recvmsg() executes a copy 
operation
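
To make the shared-block idea concrete, here is a rough sketch in C of 
what condition 1 could look like.  It is not code from the branch: the 
zc_ring layout, the slot sizes and every zc_* name are made up for 
illustration, and an anonymous shared mapping stands in for a segment 
that a real client and server would both map.  The client pays for one 
memcpy() into the shared slot; the server then reads the slot in place.

```c
#define _DEFAULT_SOURCE /* for MAP_ANONYMOUS on glibc */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>

/* Hypothetical layout, not corosync's: a small ring of fixed-size slots. */
#define ZC_SLOTS     8u    /* power of two, so head/tail may wrap safely */
#define ZC_SLOT_SIZE 1024u

struct zc_slot {
    uint32_t len;
    unsigned char data[ZC_SLOT_SIZE];
};

struct zc_ring {
    uint32_t head; /* next slot the client fills */
    uint32_t tail; /* next slot the server consumes */
    struct zc_slot slots[ZC_SLOTS];
};

/*
 * Map a region that client and server could share.  Anonymous here so
 * the sketch runs in one process; two real processes would map the same
 * file or shm object, and would need atomics or memory barriers on
 * head/tail, which this sketch leaves out.
 */
static struct zc_ring *zc_ring_create(void)
{
    void *p = mmap(NULL, sizeof(struct zc_ring), PROT_READ | PROT_WRITE,
                   MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;
    memset(p, 0, sizeof(struct zc_ring));
    return p;
}

/* Client side: the single copy into shared memory (condition 1).
 * Returns 0 on success, -1 if the message is too large or the ring is full. */
static int zc_ring_put(struct zc_ring *r, const void *msg, uint32_t len)
{
    if (len > ZC_SLOT_SIZE || r->head - r->tail == ZC_SLOTS)
        return -1;
    struct zc_slot *s = &r->slots[r->head % ZC_SLOTS];
    memcpy(s->data, msg, len);
    s->len = len;
    r->head++;
    return 0;
}

/* Server side: return a pointer into the shared mapping, no copy made.
 * Returns NULL when nothing is queued. */
static const struct zc_slot *zc_ring_peek(const struct zc_ring *r)
{
    if (r->head == r->tail)
        return NULL;
    return &r->slots[r->tail % ZC_SLOTS];
}

/* Server side: release the oldest slot once it has been sent. */
static void zc_ring_pop(struct zc_ring *r)
{
    if (r->head != r->tail)
        r->tail++;
}
```

In this sketch the server would call zc_ring_peek() when it is ready to 
send and hand the returned data pointer to sendmsg(), so the only 
remaining copy on the send path is the one the kernel makes (condition 
4).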

In the case of 1 above, totem will execute case 4 without copying the 
full message contents over IPC.
In the case of 2 above, totem will execute case 5 without copying the 
full message contents over IPC.
Since the token serves as the event for new messages to be transmitted, 
we no longer rely on an IPC event (eliminating the bottleneck in the 
kernel/user socket operation).
The origination of messages does not introduce any extra copy in the 
clients (it remains the same as is presently done).
The delivery of messages may introduce extra copying for messages longer 
than the network frame size.
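
The delivery rule can be sketched the same way.  Again this is only an 
illustration with made-up names (struct frag, zc_deliver), assuming a 
message arrives as one or more fragments that each fit in a network 
frame: a single-fragment message is handed back in place, while a 
multi-fragment message is assembled into a caller-supplied buffer, 
which is the extra copy mentioned above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical view of one frame-sized piece sitting in shared memory. */
struct frag {
    const unsigned char *data;
    size_t len;
};

/*
 * Pick the buffer to hand to the application.
 * - One fragment: return it in place, *copied = 0.
 * - Several fragments: assemble them into scratch, *copied = 1.
 * Returns NULL if there are no fragments or scratch is too small.
 */
static const unsigned char *zc_deliver(const struct frag *frags, size_t nfrags,
                                       unsigned char *scratch,
                                       size_t scratch_len,
                                       size_t *out_len, int *copied)
{
    if (nfrags == 0)
        return NULL;

    if (nfrags == 1) {
        *out_len = frags[0].len;
        *copied = 0;
        return frags[0].data;
    }

    size_t off = 0;
    for (size_t i = 0; i < nfrags; i++) {
        if (frags[i].len > scratch_len - off)
            return NULL; /* assembled message would not fit */
        memcpy(scratch + off, frags[i].data, frags[i].len);
        off += frags[i].len;
    }
    *out_len = off;
    *copied = 1;
    return scratch;
}
```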

Since this branch is experimental, a lot of what will be committed to it 
will be hacky to begin with.  I need to get started on some basic ideas 
before making a final determination on this method.

Regards
-steve