[Openais] experimental totem zero copy branch created
Steven Dake
sdake at redhat.com
Sat Aug 14 14:17:52 PDT 2010
I have created an experimental totem zero copy branch in the
corosync/branches directory of our repository. This message explains
the purpose of this branch.
Most of the developers would like to see corosync become
single-threaded, as mentioned before regarding the experimental cpgol
branch. The issue that prevents this is that IPC in general is slow (on
the order of 50-200k IPC operations per second) using standard solutions
such as PF_LOCAL sockets or POSIX message queues. In general, this is
fast enough for nearly all client/server communication operations,
except for the cpg_mcast_joined() C API.
During corosync development, we introduced the coroipc layer to allow
multiple cores to participate in copying messages around to the
single-threaded totem protocol implementation. While this definitely
lets corosync use more of a system's CPU and memory bandwidth, it
introduced several other problems common to multithreading.
The cpgol experimental concept looked interesting, but I found the
dependency chain and complexity introduced into clients to be too
difficult to deal with.
Having taken a good hard look at the performance of Totem, I found that
corosync spends most of its time either inside memcpy() operations or in
mutual exclusion, where multiple IPC threads contend for the
single-threaded totem resource.
The exp-zc branch is intended to serve as an experiment in removing
nearly all memory copy operations from the corosync core.
The basic concept is to create memory map blocks which are shared
between individual clients and the corosync server. Copies occur in the
following conditions:
1. cpg_mcast_joined() copies message blocks into shared memory segments
when sending a message
2. cpg_dispatch() may copy message blocks from shared memory segments
when delivering a message larger than the MTU frame size (i.e. it needs
to assemble a large packet). If the packet is smaller than the frame
MTU, it should deliver it without a copy
3. If totem is compacting messages into one network frame, it may memcpy
multiple smaller buffers into a frame sized packet
4. When totem sends a message via UDP, sendmsg() executes a copy operation
5. When totem receives a message via UDP, recvmsg() executes a copy
operation
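
As a rough illustration of case 3, here is a minimal sketch of packing
several small messages into one frame-sized buffer. The names
(struct frame, frame_pack), the FRAME_SIZE value and the 2-byte length
prefix are all assumptions made for this example; they are not taken
from the totem code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define FRAME_SIZE 1500 /* hypothetical frame payload size */

/* Hypothetical frame under construction. */
struct frame {
    size_t used;                    /* bytes of buf already filled */
    unsigned char buf[FRAME_SIZE];
};

/*
 * Append one small message to the frame, preceded by a 2-byte length
 * prefix in host byte order (an assumed framing, for illustration).
 * This memcpy of the payload is the copy described in case 3.
 * Returns 0 on success, -1 if the message does not fit.
 */
int frame_pack(struct frame *f, const void *msg, uint16_t len)
{
    assert(f->used <= FRAME_SIZE);
    if (sizeof len + (size_t)len > FRAME_SIZE - f->used)
        return -1;
    memcpy(f->buf + f->used, &len, sizeof len);
    memcpy(f->buf + f->used + sizeof len, msg, len);
    f->used += sizeof len + len;
    return 0;
}
```

A caller would keep packing until frame_pack() returns -1 and then
hand the first f->used bytes of buf to the network layer.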
In the case of 1 above, totem will execute case 4 without copying the
full message contents over ipc.
In the case of 2 above, totem will execute case 5 without copying the
full message contents over ipc.
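
To make the shared block idea more concrete, here is a minimal sketch
of a memory-mapped segment with fixed-size slots. Everything in it
(struct zc_slot, struct zc_segment, the zc_* functions, the slot sizes)
is hypothetical and not the layout used in the branch. It uses an
anonymous shared mapping for brevity and has no locking; a real client
and server would need a named mapping and some form of synchronization.

```c
#define _DEFAULT_SOURCE /* exposes MAP_ANONYMOUS on glibc */
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

#define ZC_SLOT_SIZE  1024 /* hypothetical payload bytes per slot */
#define ZC_SLOT_COUNT 8    /* hypothetical number of slots */

/* Hypothetical layout of one block shared by a client and the server. */
struct zc_slot {
    size_t len;                       /* 0 means the slot is free */
    unsigned char data[ZC_SLOT_SIZE];
};

struct zc_segment {
    struct zc_slot slots[ZC_SLOT_COUNT];
};

/* Map a zero-filled segment; returns NULL on failure. */
struct zc_segment *zc_segment_create(void)
{
    void *p = mmap(NULL, sizeof(struct zc_segment), PROT_READ | PROT_WRITE,
                   MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* Sender side (case 1): the one copy into shared memory.
 * Returns the slot index used, or -1 if nothing fits. */
int zc_put(struct zc_segment *seg, const void *msg, size_t len)
{
    if (len == 0 || len > ZC_SLOT_SIZE)
        return -1;
    for (int i = 0; i < ZC_SLOT_COUNT; i++) {
        if (seg->slots[i].len == 0) {
            memcpy(seg->slots[i].data, msg, len);
            seg->slots[i].len = len;
            return i;
        }
    }
    return -1;
}

/* Peer side: read the payload in place. The returned pointer refers
 * to the mapping itself, so no copy is made. NULL if the slot is free. */
const void *zc_peek(const struct zc_segment *seg, int idx, size_t *len)
{
    assert(idx >= 0 && idx < ZC_SLOT_COUNT);
    *len = seg->slots[idx].len;
    return *len ? seg->slots[idx].data : NULL;
}

/* Mark a slot as free again once the peer is done with it. */
void zc_release(struct zc_segment *seg, int idx)
{
    assert(idx >= 0 && idx < ZC_SLOT_COUNT);
    seg->slots[idx].len = 0;
}
```

The point of the sketch is only that zc_put() is the sole memcpy on the
sending path, and that whoever consumes the slot can pass the pointer
from zc_peek() straight to something like sendmsg().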
Since the token serves as the event for new messages to be transmitted,
we no longer rely on an IPC event (eliminating the bottleneck in the
kernel/user socket operation).
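
A minimal sketch of what "the token is the event" could look like,
assuming a single-producer/single-consumer ring per client placed in
the shared mapping. The names (struct zc_ring, ring_publish,
ring_on_token), the ring size and the per-token budget are assumptions
for illustration only, not the branch's actual design.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define RING_SLOTS 16 /* hypothetical; must be a power of two */

/* Hypothetical per-client queue living in the shared mapping. */
struct zc_ring {
    _Atomic unsigned head;      /* next slot the client will fill */
    _Atomic unsigned tail;      /* next slot the server will send */
    size_t msg_len[RING_SLOTS]; /* length of the message in each slot */
};

void ring_init(struct zc_ring *r)
{
    atomic_init(&r->head, 0);
    atomic_init(&r->tail, 0);
}

/* Client side: publish a message. No socket write and no wakeup;
 * the server notices it the next time it holds the token.
 * Returns 0 on success, -1 if the ring is full. */
int ring_publish(struct zc_ring *r, size_t len)
{
    unsigned head = atomic_load_explicit(&r->head, memory_order_relaxed);
    unsigned tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head - tail == RING_SLOTS)
        return -1;
    r->msg_len[head % RING_SLOTS] = len;
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return 0;
}

/* Server side: called when the token arrives. Hands at most `budget`
 * pending messages to `transmit` and returns how many were sent. */
unsigned ring_on_token(struct zc_ring *r, unsigned budget,
                       void (*transmit)(size_t len, void *ctx), void *ctx)
{
    unsigned tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    unsigned head = atomic_load_explicit(&r->head, memory_order_acquire);
    unsigned sent = 0;
    while (tail != head && sent < budget) {
        transmit(r->msg_len[tail % RING_SLOTS], ctx);
        tail++;
        sent++;
    }
    atomic_store_explicit(&r->tail, tail, memory_order_release);
    return sent;
}

/* Example transmit callback: just adds up the bytes it was given. */
void count_bytes(size_t len, void *ctx)
{
    *(size_t *)ctx += len;
}
```

In this sketch the only thing that wakes the sending path is token
arrival, so the client never has to issue a system call to tell the
server that a message is waiting.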
The origination of messages does not introduce any extra copy in the
clients (it remains the same as is presently done).
The delivery of messages may introduce extra copying for messages longer
than the network frame size.
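
The delivery rule can be sketched as follows: a message that arrived in
one frame is handed over by pointer, while a message split across
frames is assembled into a scratch buffer. The names (struct frag,
zc_deliver) and the calling convention are hypothetical and only meant
to show where the extra copy comes from.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical description of one received fragment, pointing into a
 * shared memory segment. */
struct frag {
    const void *data;
    size_t len;
};

/*
 * Return a pointer to the complete message.
 * - One fragment: return it directly, *copied = 0 (no memcpy).
 * - Several fragments: assemble them in `scratch`, *copied = 1.
 * Returns NULL if there are no fragments or `scratch` is too small.
 */
const void *zc_deliver(const struct frag *frags, size_t nfrags,
                       void *scratch, size_t scratch_len,
                       size_t *msg_len, int *copied)
{
    if (nfrags == 0)
        return NULL;
    if (nfrags == 1) {
        *msg_len = frags[0].len;
        *copied = 0;
        return frags[0].data;
    }
    size_t off = 0;
    for (size_t i = 0; i < nfrags; i++) {
        assert(off <= scratch_len);
        if (frags[i].len > scratch_len - off)
            return NULL;
        memcpy((char *)scratch + off, frags[i].data, frags[i].len);
        off += frags[i].len;
    }
    *msg_len = off;
    *copied = 1;
    return scratch;
}
```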
Since this branch is experimental, a lot of what will be committed to it
will be hacky to begin with. I need to get started on some basic ideas
before making a final determination on this method.
Regards
-steve