jon.maloy at ericsson.com
Thu Feb 13 14:58:19 PST 2003
Andy Pfiffer wrote:
On Thu, 2003-02-13 at 11:13, Rod Van Meter wrote:
First off, this looks like very useful functionality. I'm happy to
see it. And it comes with documentation, too!
I've made a first pass through the PDF file that describes what it is
and how it does it.
Cold-medicine-induced rambling follows: ;^)
The first thing that strikes me is that it is similar in routing and
neighbor discovery to several distributed memory message-passing systems
developed during the mid-'80s. They were characterized as systems
composed of nodes connected only by point-to-point networks, and all
routing was performed by store-and-forward of messages by the nodes
within the system. Several platforms were built this way, including
those based upon INMOS Transputers, the early nCUBE, the Intel iPSC 1,
and a few other early hypercubes.
You are right. The protocol is focused on intra-cluster communication
with full connectivity, and when configuring otherwise one has to be
aware of the limitations. It is not described in the document yet, but
the way inter-cluster links are set up now means that each processor has at
least two links to any neighbouring cluster. Hence there will be at most one
forwarding step at TIPC level per message. Even this can be avoided if the
nodes carrying the most traffic can be identified: one can manually set up
links between a pair of processors.
Having said this, the inter-cluster links have not been used within our
products, so we do not really know their limitations. Until recently
we have recommended using TCP for inter-cluster communication.
On the other hand, "full connectivity" may mean setting up a
distributed cluster using an IP-based protocol (UDP, SCTP, TCP...) and
one or more routers, without TIPC being aware of this.
If I understand the document correctly, the "Hello" mechanism and
routing table discovery/maintenance used by TIPC can have scaling
complications on very large systems (100's to 1000's of communicating
agents) when configured with insufficient internal connectivity. A
spanning tree-based algorithm for maintaining the routing tables looks
like the ideal solution to apply here (nearly all of the needed adjacency
information is present), rather than hardware-based broadcast on a
subnet or a software-based "replicast."
I cannot see that the "hello" mechanism is a limitation; it is only a
multicast sent out over a limited period of time, using (now) a
backoff algorithm to determine the frequency.
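To illustrate the point about the hello traffic being bounded, here is a minimal sketch of the kind of backoff described above. The function name and constants are my own illustrations, not taken from the TIPC source: the idea is only that hellos are multicast frequently at first and ever more rarely afterwards, so discovery traffic stays limited over time.

```python
def hello_intervals(initial=0.5, ceiling=60.0, factor=2.0, count=10):
    """Yield successive delays (seconds) between hello multicasts.

    The delay grows geometrically up to a ceiling, so the total
    number of hellos sent per unit time shrinks instead of staying
    constant. All parameter values here are illustrative.
    """
    delay = initial
    for _ in range(count):
        yield delay
        delay = min(delay * factor, ceiling)

intervals = list(hello_intervals())
# Delays grow: 0.5, 1.0, 2.0, ... until capped at the ceiling,
# so steady-state discovery load is one hello per `ceiling` seconds.
```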
A potentially bigger problem is the background supervision of links when
we have hundreds of them in each processor. I have made some estimates
on this, and with modern processor speeds, memory sizes and bandwidth
available we should be able to handle clusters with ~1000 nodes without
any significant background load. (Remember that while the total number of
links grows as (nodes^2) for the whole cluster, the number of links to
maintain *per node* still only grows at a linear rate.) This would have been
a problem in the '90s, but as long as processor speed keeps evolving (a lot)
faster than cluster sizes do, this does not pose any serious problem.
Moore's law is on our side here.
As for keeping the naming and routing tables up to date, there are certainly
better ways of doing this, but the current scheme has served us well so far
with the cluster sizes we use.
Maybe something for the TODO list...
The "zone" abstraction is also similar to techniques developed for
buffer management and flow-control in the high-performance
message-passing present on systems like the Intel iPSC2 and the Intel
Paragon. In those systems, all-to-all communication needed to be
supported, but the O(N^2) time and space requirements rapidly became
prohibitive with 100's of nodes.
As already said, this was an insurmountable problem some years ago, but
not with today's processors and switches. 1000 nodes means 1998 links
to maintain per node, meaning one supervision timer expiring every 0.5 ms
with today's supervision rate. And in most cases the timer will wake up
to do nothing,
given the way the protocol works.
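A hedged sketch of why most of those timers "wake up to do nothing": if any traffic has arrived on a link since the last check, the link is evidently alive and no probe needs to be sent. The class, field, and method names below are illustrative, not TIPC's actual ones:

```python
import time

class Link:
    """Toy model of a supervised link: probe only when idle."""

    def __init__(self):
        self.last_rx = time.monotonic()  # time of last received message
        self.probes_sent = 0

    def on_receive(self):
        """Any incoming traffic counts as proof the peer is alive."""
        self.last_rx = time.monotonic()

    def supervise(self, tolerance=1.0):
        """Called when the per-link timer expires."""
        if time.monotonic() - self.last_rx < tolerance:
            return               # recent traffic: wake up, do nothing
        self.probes_sent += 1    # idle link: send an explicit probe

link = Link()
link.on_receive()
link.supervise()                 # traffic just arrived, so no probe
assert link.probes_sent == 0
```

On a busy cluster most links carry payload traffic anyway, so the common case in `supervise` is the early return, which keeps the background load low even with ~2000 timers per node.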
Certainly a challenge, but not impossible, and we are talking about a
scale which I don't think is very relevant right now. The day we see such
clusters, I am
certain that we will have the processing power to deal with it as well.
Internally, NX message passing
maintained an LRU of "nearest logical neighbors", and transparently
handled the attach/detach dynamically between one node and a set of
other nodes. TIPC appears to be similar, at least in its description
of that kind of behavior.
I'm curious as to the behavior of the protocol in some of the strange
boundary conditions, as in the case where the reroute counter of a
message has expired and the system is attempting to return it to the
sender: what happens if all routes to the original sender are cut, or if
the sender has been removed?
It will be dropped. What else is there to do...