[cgl_discussion] TIPC

Thu Feb 13 16:56:17 PST 2003

/Jon

Corey Minyard wrote:

> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> Jon Maloy wrote:
>
> |>
> |>If I understand the document correctly, the "Hello" mechanism and
> |>routing table discovery/maintenance used by TIPC can have scaling
> |>complications on very large systems (100's to 1000's of communicating
> |>agents) when configured with insufficient internal connectivity.  A
> |>spanning tree-based algorithm for maintaining the routing tables looks
> |>like the ideal solution (nearly all of the needed adjacency information
> |>is present) to apply here, rather than hardware-based broadcast on a
> |>subnet or software-based "replicast."
> |>
> | I cannot see that the "hello" mechanism is a limitation; it is only a
> | broadcast/multicast sent out over a limited period of time, using
> | (now) an exponential backoff algorithm to determine frequency.
>
> This depends on the discover time you require.  But worst-case 
> discover times are probably in the 10's of seconds, so it's still not 
> too bad.  But if you have another way to do it, why take the overhead 
> for it? 

Exactly. The Hello protocol is only one, optional, way of neighbour
detection, and it can be switched off. Any way of telling TIPC about the
existence of other nodes will do: configuration, hardcoding or something
else. There is a generic call (which can even be made remotely via a TIPC
message) towards the core for adding links to a node, and TIPC doesn't
care how the caller obtained this information.
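
As a rough illustration of the two points above (a Hello broadcast whose
frequency follows an exponential backoff, and a core that accepts links
regardless of how they were discovered), here is a minimal Python sketch.
The names hello_intervals, Core and add_link, and all numeric parameters,
are hypothetical; they are not TIPC's actual interface or defaults.

```python
def hello_intervals(initial_ms=125, factor=2, max_ms=16000, period_ms=60000):
    """Yield delays between Hello broadcasts, backing off exponentially.

    All numbers are illustrative assumptions, not TIPC defaults.
    """
    elapsed = 0
    delay = initial_ms
    # Stop once the next broadcast would fall outside the limited period.
    while elapsed + delay <= period_ms:
        yield delay
        elapsed += delay
        delay = min(delay * factor, max_ms)


class Core:
    """Hypothetical stand-in for the core: it only records links."""

    def __init__(self):
        self.links = set()

    def add_link(self, node_id):
        # The core does not care how the caller learned about node_id.
        self.links.add(node_id)


core = Core()
core.add_link("node-a")  # e.g. learned from a Hello broadcast
core.add_link("node-b")  # e.g. read from static configuration
```

Whether a node was found by broadcast or by configuration, the same call
is used, which is why the Hello protocol can be switched off without
affecting the rest.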

>
>
> |
> |
> | A potentially bigger problem is the background supervision of links
> | when we have hundreds of them in each processor. I have made some
> | calculations on this, and with modern processor speed, memory amounts
> | and bandwidth available we should be able to handle clusters with
> | ~1000 nodes without any significant background load. (Remember that
> | while the total number of links grows as (nodes^2) for the whole
> | cluster, the number of links to maintain *per node* still only grows
> | at a linear rate.) This would have been a problem in the 90s, but
> | as long as processor speed keeps evolving (a lot) faster than cluster
> | sizes this does not pose any serious problem. Moore's law is still
> | valid.
>
> This depends on your required failure detection times.  If you need 
> 500ms failover times, you are probably going to need 125ms resends.  
> That corresponds to 8000 messages/second 

First, when there is traffic on the links no supervision messages are
exchanged at all, since the timer will detect that the link is active
and just go back to sleep.
Second, 8000 60-byte messages/s module-to-module is fully possible if the
bandwidth is there (see below). Furthermore, this will only happen when
all the links are idling, i.e. when the processor probably does not have
much else to do.
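
To make the figures in this exchange concrete, the arithmetic can be
sketched as below. The function name supervision_load is made up for this
note, and the model is a simplifying assumption: one supervision message
sent per link per resend interval on a fully idle, fully connected
cluster (a similar number would be received).

```python
def supervision_load(nodes, resend_ms, msg_bytes):
    """Estimate idle-link supervision load for one node (rough model)."""
    links_per_node = nodes - 1              # grows linearly with cluster size
    links_total = nodes * (nodes - 1) // 2  # grows as nodes^2 for the cluster
    probes_per_link_per_s = 1000 / resend_ms
    msgs_per_node_per_s = links_per_node * probes_per_link_per_s
    mbit_per_s = msgs_per_node_per_s * msg_bytes * 8 / 1e6
    return links_per_node, links_total, msgs_per_node_per_s, mbit_per_s


per_node, total, msgs, mbit = supervision_load(nodes=1000, resend_ms=125,
                                               msg_bytes=60)
print(per_node, total, msgs, round(mbit, 2))
```

For 1000 nodes, 125 ms resends and 60-byte messages this gives 999 links
per node, 499500 links in total, about 8000 messages/s per node and a
little under 4 Mbit/s sent per node.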

> you have to receive and send (probably more, since due to timing and 
> latency both sides might decide to transmit the keepalive on the link 
> at the same time). 

No. One party will take control; the other one will only become a
passive responder, since it will detect the incoming probes from the
peer.
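
The behaviour described in the last two replies (the timer goes back to
sleep on an active link, and only one side ends up probing) can be
sketched as a toy model. This is not TIPC's actual state machine; the
class name, the message names and the max_missed limit (chosen so that
4 x 125 ms gives roughly 500 ms) are all assumptions for illustration.

```python
class LinkSupervisor:
    """Toy model of one link endpoint; not TIPC's actual state machine."""

    def __init__(self, send, max_missed=4):
        self.send = send              # callback used to transmit a message
        self.max_missed = max_missed  # assumed: 4 x 125 ms = 500 ms
        self.active = False           # traffic or probe seen since last tick?
        self.missed = 0               # consecutive silent intervals
        self.up = True

    def on_receive(self, msg):
        # Any incoming message proves that the peer is alive.
        self.missed = 0
        if msg == "probe":
            self.send("probe_reply")  # passive responder role
            self.active = True        # peer is probing, so we stay quiet
        elif msg != "probe_reply":
            self.active = True        # ordinary traffic keeps the link active

    def on_timer(self):
        if self.active:
            # Link was active: no supervision message is needed.
            self.active = False
            return
        self.missed += 1
        if self.missed > self.max_missed:
            self.up = False           # declare the link failed
        else:
            self.send("probe")        # idle link: take the prober role


sent = []
link = LinkSupervisor(sent.append)
link.on_receive("probe")  # peer probed first, so this side only replies
link.on_timer()           # timer sees activity and sends nothing
```

In this model, whichever side's timer fires first on an idle link becomes
the prober, and the incoming probes keep the other side's timer quiet.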

> It's hard for me to imagine that even 1000 msg/sec is trivial to 
> handle on even a modern CPU. 

We have had no problem exchanging 40,000 msg/s on a 700 MHz Pentium III,
and that was process-to-process.
I have great faith in what we can do with 5 GHz processors in a couple
of years, while still having plenty of processor capacity left.
Anyway, aren't we discussing a rather hypothetical scenario now? I think
clusters in the range of 50-100 processors will be a big enough
challenge for the near future. There are other (non-communication
related) problems to solve before we can pass beyond that limit.

>
>
> I was one of the system architects on a system designed to scale into 
> the 10,000 node range.  Really, if you require short failure detection 
> times, the whole strategy of doing point-to-point links with timers 
> falls apart at around 100 nodes without hardware support.
>
> Techniques do exist for building large systems like this and using 
> very little CPU for overhead. 

TIPC does not in any way preclude that the information about node
failure can come from somewhere else, e.g. a network processor. It would
be easy to add a call in the interface for this, and then disable or
slow down the background timer.
However, I have a hard time imagining that node supervision via Ethernet
can be done without exchanging probe/heartbeat messages, or that this
can be done much more efficiently than in TIPC.
With dedicated hardware it becomes a different matter, of course...

>
>
> Another, somewhat related question.  How do you handle system 
> partitions?  This was the nastiest problem we had to deal with 
> (especially since we were multi-site distributed). 

For now we cannot (logically) partition clusters, since the subnetwork
concept is not fully implemented. It is however possible to
physically/geographically partition clusters if we let the inter-site
links go over UDP or TCP. The condition is that we still configure the
cluster as a full-connectivity network. Of course, one must as always
understand the bandwidth constraints in such cases. Again, this has
never been tried in real products, so this is an unknown factor for now.

The way we have partitioned our systems is to always configure each site
as a separate zone (cluster), because we had other constraints making
this most practical. Inter-zone links can then be set up via UDP, but
the "location transparency" stops at the cluster edge.

>
>
> - -Corey
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1.0.6 (GNU/Linux)
> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
>
> iD8DBQE+TC3ymUvlb4BhfF4RAhF/AJ4sBYWMhjE+DaQutGYzMlyluTqHRACfUckV
> FuSIqrPSBW1BMih3I8QYTSU=
> =gsaP
> -----END PGP SIGNATURE-----
>