cminyard at mvista.com
Thu Feb 13 15:44:53 PST 2003
Jon Maloy wrote:
|>If I understand the document correctly, the "Hello" mechanism and
|>routing table discovery/maintenance used by TIPC can have scaling
|>complications on very large systems (100's to 1000's of communicating
|>agents) when configured with insufficient internal connectivity. A
|>spanning tree-based algorithm for maintaining the routing tables looks
|>like the ideal solution (nearly all of the needed adjacency information
|>is present) to apply here, rather than hardware-based broadcast on a
|>subnet or software-based "replicast."
| I cannot see that the "hello" mechanism is a limitation; it is only a
| multicast sent out over a limited period of time, using (now) a
| backoff algorithm to determine the frequency.
This depends on the discovery time you require. But worst-case discovery
times are probably in the tens of seconds, so it's still not too bad.
Still, if you have another way to do it, why take the overhead?
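To make the "tens of seconds" worst case concrete, here is a minimal sketch of
an exponential-backoff hello schedule. The parameters (100 ms initial interval,
doubling, capped at 8 s) are illustrative assumptions, not TIPC's actual values;
the point is only that the worst-case discovery delay equals the largest
backoff interval reached.

```python
# Hedged sketch (assumed parameters, not TIPC's): worst-case discovery
# delay for a node that comes up just after a hello multicast was sent.

def hello_schedule(initial_ms=100, factor=2, cap_ms=8000, horizon_ms=30000):
    """Times (ms) at which hello multicasts go out, with exponential backoff."""
    t, interval = 0, initial_ms
    times = []
    while t <= horizon_ms:
        times.append(t)
        t += interval
        interval = min(interval * factor, cap_ms)
    return times

times = hello_schedule()
# A node appearing right after a hello is only discovered at the next one,
# so the worst-case delay is the largest gap between consecutive hellos.
gaps = [b - a for a, b in zip(times, times[1:])]
print(times)
print(f"worst-case discovery delay ~ {max(gaps)} ms")
```

With these assumed parameters the worst case settles at the 8-second cap,
which is consistent with discovery in the tens of seconds once a few misses
are allowed for.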
| A potentially bigger problem is the background supervision of links when
| we have hundreds of them in each processor. I have made some calculations
| on this, and with modern processor speed, memory amounts and bandwidth
| available we should be able to handle clusters with ~1000 nodes without
| any significant background load. (Remember that while the total number of
| links grows as O(nodes^2) for the whole cluster, the number of links to
| maintain *per node* still only grows linearly.) This would have been a
| problem in the '90s, but as long as processor speed keeps evolving (a lot)
| faster than cluster sizes, this does not pose any serious problem.
| Moore's law is on our side here.
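The quadratic-total versus linear-per-node claim above is easy to check with a
quick full-mesh count (a sketch of the arithmetic, not of any TIPC data
structure):

```python
# Scaling check: in a full mesh of n nodes, total links grow as O(n^2),
# but each node only maintains one link per peer, i.e. O(n).

def link_counts(n):
    total = n * (n - 1) // 2      # one link per unordered node pair
    per_node = n - 1              # links each node must supervise
    return total, per_node

for n in (10, 100, 1000):
    total, per_node = link_counts(n)
    print(f"{n:5d} nodes: {total:8d} links total, {per_node:4d} per node")
```

At 1000 nodes that is ~500,000 links cluster-wide but only 999 per node, which
is the asymmetry the quoted argument rests on.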
This depends on your required failure detection times. If you need
500ms failover times, you are probably going to need 125ms resends.
With ~1000 links per node, that corresponds to 8000 messages/second you
have to send, and as many to receive (probably more, since due to timing
and latency both sides might decide to transmit the keepalive on the
link at the same time). It's hard for me to imagine that even 1000
msg/sec is trivial to handle on even a modern processor.
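The 8000 msg/sec figure falls out of simple arithmetic; here is that
back-of-envelope calculation as a sketch, using the numbers assumed above
(1000 nodes, 500 ms detection, resends at a quarter of the detection
interval):

```python
# Hedged back-of-envelope: per-node keepalive send rate for point-to-point
# link supervision in a full mesh. All parameters are the assumed figures
# from the discussion, not measured values.

def keepalive_rate(nodes, detect_ms, probes_per_interval=4):
    """Messages/second each node must send (and roughly also receive) to
    detect a dead link within detect_ms, resending every
    detect_ms / probes_per_interval milliseconds on every link."""
    links_per_node = nodes - 1
    resend_s = (detect_ms / probes_per_interval) / 1000.0
    return links_per_node / resend_s

rate = keepalive_rate(nodes=1000, detect_ms=500)   # 125 ms resend period
print(f"~{rate:.0f} keepalives/second sent per node")
```

With 999 links probed every 125 ms this comes to just under 8000 sends per
second per node, before counting the matching receives or duplicate probes
from the far end.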
I was one of the system architects on a system designed to scale into
the 10,000 node range. Really, if you require short failure detection
times, the whole strategy of doing point-to-point links with timers
falls apart at around 100 nodes without hardware support.
Techniques do exist for building large systems like this and using very
little CPU for overhead.
Another, somewhat related question: how do you handle system partitions?
This was the nastiest problem we had to deal with (especially since we
were multi-site distributed).