[cgl_discussion] TIPC

Jon Maloy jon.maloy at ericsson.com
Thu Feb 13 14:58:19 PST 2003


See below.

/Jon

Andy Pfiffer wrote:

> On Thu, 2003-02-13 at 11:13, Rod Van Meter wrote:
>
>> First off, this looks like very useful functionality.  I'm happy to
>> see it.  And it comes with documentation, too!
>
> I've made a first pass through the PDF file that describes what it is
> and how it does it.
>
> Cold-medicine-induced rambling follows: ;^)
>
> The first thing that strikes me is that it is similar in routing and
> neighbor discovery to several distributed-memory message-passing
> systems developed during the mid-'80s.  They were characterized as
> systems composed of nodes connected only by point-to-point networks,
> and all routing was performed by store-and-forward of messages by the
> nodes within the system.  Several platforms were built this way,
> including those based upon INMOS Transputers, the early nCUBE, the
> Intel iPSC 1, and a few other early hypercubes.

You are right. The protocol is focused on intra-cluster communication
with full connectivity, and when configuring otherwise one has to be
aware of the limitations. It is not described in the document yet, but
the way inter-cluster links are set up now means that each processor
has at least two links to any neighbouring cluster. Hence there will be
at most one routing step at TIPC level per message. Even this can be
avoided if the nodes exchanging the heaviest traffic can be identified:
one can manually set up links between a given pair of processors.
Having said this, the inter-cluster links have not been used within our
own products, so we do not really know their limitations. Until
recently we have recommended the use of TCP for inter-cluster
communication.

On the other hand, "full connectivity" may mean setting up a
geographically distributed cluster using an IP-based protocol (UDP,
SCTP, TCP...), passing one or more routers without TIPC being aware of
it.
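
To make the "at most one routing step" point concrete, here is a toy
sketch (not TIPC source; the structure and node numbers are invented
for illustration) of the next-hop decision under the assumption that
every node owns at least one link into the neighbouring cluster, so the
only possible relay is from that link's endpoint to the final
destination inside that cluster:

/* Illustration only -- not TIPC code.  With a direct link into the
 * neighbouring cluster, a message either lands on its destination
 * immediately or is relayed exactly once inside that cluster. */
#include <stdio.h>

struct inter_cluster_link {
    int remote_node;            /* node where our link terminates */
};

/* Forwarding steps needed to reach 'dest' in the neighbouring cluster. */
static int forwarding_steps(const struct inter_cluster_link *link, int dest)
{
    if (link->remote_node == dest)
        return 0;   /* link (e.g. a manually configured one) ends at dest */
    return 1;       /* one relay inside the destination cluster, never more */
}

int main(void)
{
    struct inter_cluster_link l = { .remote_node = 7 };

    printf("to node 7: %d step(s)\n", forwarding_steps(&l, 7));  /* 0 */
    printf("to node 9: %d step(s)\n", forwarding_steps(&l, 9));  /* 1 */
    return 0;
}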






> If I understand the document correctly, the "Hello" mechanism and
> routing table discovery/maintenance used by TIPC can have scaling
> complications on very large systems (100's to 1000's of communicating
> agents) when configured with insufficient internal connectivity.  A
> spanning tree-based algorithm for maintaining the routing tables looks
> like the ideal solution (nearly all of the needed adjacency
> information is present) to apply here, rather than hardware-based
> broadcast on a subnet or software-based "replicast."

I cannot see that the "hello" mechanism is a limitation; it is only a
broadcast/multicast sent out over a limited period of time, using (now)
an exponential backoff algorithm to determine the frequency.
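
As an illustration of the kind of backoff schedule meant here (the
initial interval and the cap below are assumed values for the example,
not parameters taken from the TIPC documentation):

/* Illustration only.  Discovery "hello" messages are sent with an
 * exponentially growing interval, so the broadcast load tails off
 * quickly after a node comes up.  Interval values are assumed. */
#include <stdio.h>

#define DISC_INIT_MS  125      /* assumed first interval          */
#define DISC_MAX_MS   32000    /* assumed upper bound on interval */

int main(void)
{
    unsigned interval = DISC_INIT_MS;
    unsigned elapsed = 0;

    for (int i = 0; i < 10; i++) {
        printf("hello #%d sent at t=%u ms, next one in %u ms\n",
               i, elapsed, interval);
        elapsed += interval;
        if (interval < DISC_MAX_MS)
            interval *= 2;          /* exponential backoff */
        if (interval > DISC_MAX_MS)
            interval = DISC_MAX_MS;
    }
    return 0;
}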

A potentially bigger problem is the background supervision of links
when we have hundreds of them in each processor. I have made some
calculations on this, and with modern processor speeds, memory amounts
and bandwidth available we should be able to handle clusters with
~1000 nodes without any significant background load. (Remember that
while the total number of links grows as nodes^2 for the whole cluster,
the number of links to maintain *per node* still only grows linearly.)
This would have been a problem in the '90s, but as long as processor
speed keeps evolving (a lot) faster than cluster sizes it does not pose
any serious problem. Moore's law is still valid.
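
A small sketch of that scaling argument, assuming for illustration two
links per node pair (a redundancy assumption, consistent with the
1998-links-per-node figure quoted further down):

/* Illustration only.  Cluster-wide link count grows quadratically,
 * but the number each node has to supervise grows linearly. */
#include <stdio.h>

int main(void)
{
    const int links_per_pair = 2;               /* assumed dual links */
    const long sizes[] = { 50, 100, 500, 1000 };

    for (int i = 0; i < 4; i++) {
        long n = sizes[i];
        long per_node = links_per_pair * (n - 1);   /* linear    */
        long total = per_node * n / 2;              /* quadratic */
        printf("N=%4ld: %5ld links per node, %7ld in the whole cluster\n",
               n, per_node, total);
    }
    return 0;
}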

About keeping naming and routing tables up to date, there are certainly
better ways of doing this, but it has served us well so far, with
clusters of ~50 processors. Maybe something for the TODO list...






> The "zone" abstraction is also similar to techniques developed for
> buffer management and flow-control in the high-performance
> message-passing present on systems like the Intel iPSC2 and the Intel
> Paragon.  In those systems, all-to-all communication needed to be
> supported, but the O(N^2) time and space requirements rapidly became
> prohibitive with 100's of nodes.

As already said, this would have been an insurmountable problem some
years ago, but not with today's processors and switches. 1000 nodes
means 1998 links to maintain per node, meaning one supervision timer
expiring every 0.5 ms at today's supervision rate. And in most cases
the timer will wake up only to find there is nothing to do, given the
way the protocol works. Certainly a challenge, but not impossible, and
we are talking about an extreme case which I don't think is very
relevant right now. The day we see such clusters, I am certain that we
will have the processors to deal with them as well.
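
For what it is worth, a back-of-the-envelope check of those numbers,
assuming (purely for illustration, not a documented TIPC parameter) a
per-link supervision interval of roughly one second, and two links per
peer node as the 1998 figure suggests:

/* Illustration only.  2*(N-1) supervised links and an assumed ~1 s
 * per-link interval give one timer expiry roughly every 0.5 ms for
 * N = 1000, matching the estimate above. */
#include <stdio.h>

int main(void)
{
    const double supervision_interval_ms = 1000.0;  /* assumed, per link */
    const long nodes = 1000;
    const long links_per_node = 2 * (nodes - 1);    /* 1998 */

    printf("%ld links per node -> one timer expiry every %.2f ms\n",
           links_per_node, supervision_interval_ms / links_per_node);
    return 0;
}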


> Internally, NX message passing maintained an LRU of "nearest logical
> neighbors", and transparently handled the attach/detach dynamically
> between one node and a set of other nodes.  TIPC appears, at least
> from the description, to exhibit that kind of behavior.
>
> I'm curious as to the behavior of the protocol in some of the strange
> boundary conditions, as in the case where the reroute counter of a
> message has expired and the system is attempting to return it to the
> sender: what happens if all routes to the original sender are cut, or
> if the sender has been removed?

It will be dropped. What else is there to do...






> Andy

