jon.maloy at ericsson.com
Thu Feb 13 16:56:17 PST 2003
Corey Minyard wrote:
> Jon Maloy wrote:
> |>If I understand the document correctly, the "Hello" mechanism and
> |>routing table discovery/maintenance used by TIPC can have scaling
> |>complications on very large systems (100's to 1000's of communicating
> |>agents) when configured with insufficient internal connectivity. A
> |>spanning tree-based algorithm for maintaining the routing tables looks
> |>like the ideal solution (nearly all of the needed adjacency information
> |>is present) to apply here, rather than hardware-based broadcast on a
> |>subnet or software-based "replicast."
> | I cannot see that the "hello" mechanism is a limitation; it is only
> | a broadcast/multicast sent out over a limited period of time, using
> | (now) a backoff algorithm to determine frequency.
> This depends on the discovery time you require. But worst-case
> discovery times are probably in the 10's of seconds, so it's still not
> too bad. But if you have another way to do it, why take the overhead
> for it?
Exactly. The Hello protocol is only one, optional, way of neighbour
detection, and it can be switched off. Any way of telling TIPC about the
existence of neighbours will do: configuration, hardcoding or other. There
is a generic call (which can even be made remotely via a TIPC message)
towards the core for adding links to a node, and TIPC doesn't care about
how the caller obtained this information.
> | A potentially bigger problem is the background supervision of links when
> | we have hundreds of them in each processor. I have made some calculations
> | on this, and with modern processor speed, memory amounts and bandwidth
> | available we should be able to handle clusters with ~1000 nodes without
> | any significant background load. (Remember that while the total number of
> | links grows as (nodes^2) for the whole cluster, the number of links to
> | supervise *per node* still only grows at a linear rate.) This would have
> | been a problem in the '90s, but as long as processor speed keeps evolving
> | (a lot) faster than cluster sizes this does not pose any serious problem.
> | Moore's law is still valid.
> This depends on your required failure detection times. If you need
> 500ms failover times, you are probably going to need 125ms resends.
> That corresponds to 8000 messages/second
First, when there is traffic on the links no supervision messages are
exchanged at all, since the timer will detect that the link is active and
just go back to sleep. Second, 8000 60-byte msg/s module-to-module is fully
possible if the bandwidth is there (see below). Furthermore, this will only
happen when all the links are idle, and then the processor probably does
not have much else to do.
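As a back-of-the-envelope sanity check of the numbers being discussed (the
~1000-node cluster, the 125 ms probe interval and the 60-byte probe size
all come from this thread; none of this is TIPC code):

```python
# Rough check of the worst-case supervision load discussed above.
# Assumed figures: ~1000 nodes, one link per peer, 125 ms probe interval,
# 60-byte probe messages (values taken from the discussion, not from TIPC).

nodes = 1000
links_per_node = nodes - 1               # full mesh: one link to every peer
total_links = nodes * (nodes - 1) // 2   # grows as nodes^2 cluster-wide

probe_interval_s = 0.125                 # 125 ms resends for ~500 ms failover
probes_per_s = links_per_node / probe_interval_s  # per-node send rate

msg_size_bytes = 60
bandwidth_mbit = probes_per_s * msg_size_bytes * 8 / 1e6

print(f"links per node:      {links_per_node}")          # linear in nodes
print(f"total links:         {total_links}")             # quadratic
print(f"probes sent per sec: {probes_per_s:.0f}")        # ~8000
print(f"probe bandwidth:     {bandwidth_mbit:.2f} Mbit/s")
```

So even in the all-links-idle worst case, the probe traffic itself stays
below 4 Mbit/s per node; the per-message processing cost is the real
question, not the bandwidth.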
> you have to receive and send (probably more, since due to timing and
> latency both sides might decide to transmit the keepalive on the link
> at the same time).
No. One party will take control, and the other one will only become a
passive responder, since it will detect the incoming probes from the peer.
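To illustrate the point (a deliberately simplified sketch, not TIPC's
actual link state machine): on each timer tick an endpoint probes only if
it has neither seen traffic nor received a probe from its peer during the
last interval, so one side naturally ends up passive.

```python
# Hedged sketch of the probe-suppression idea described above: not TIPC
# code, just an illustration of why only one side probes actively.

class LinkEndpoint:
    def __init__(self, name):
        self.name = name
        self.traffic_seen = False   # payload traffic since last tick
        self.probe_seen = False     # probe received since last tick
        self.probes_sent = 0

    def on_traffic(self):
        self.traffic_seen = True

    def on_probe(self):
        self.probe_seen = True      # peer is probing; we can stay passive

    def tick(self, peer):
        # Probe only if the link looked idle AND the peer is not already
        # probing us; otherwise stay quiet and merely respond.
        if not self.traffic_seen and not self.probe_seen:
            self.probes_sent += 1
            peer.on_probe()
        self.traffic_seen = False
        self.probe_seen = False

a, b = LinkEndpoint("A"), LinkEndpoint("B")
for _ in range(10):   # A's timer happens to fire first in each interval
    a.tick(b)
    b.tick(a)
print(a.probes_sent, b.probes_sent)
```

Whichever endpoint's timer fires first keeps probing; the other sees the
incoming probes and never initiates its own, so the link is supervised at
half the naive message cost.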
> It's hard for me to imagine that even 1000 msg/sec is trivial to
> handle on even a modern CPU.
We have had no problem with exchanging 40,000 msg/s on a 700 MHz Pentium
III, and that was with plenty of processor capacity left. I have great
faith in what we can do with 5 GHz processors in a couple of years.
Anyway, aren't we discussing a rather hypothetical scenario now? I think
clusters in the range of 50-100 processors will be a big enough challenge
for the near future. There are other (non-communication related) problems
to solve before we can pass beyond that.
> I was one of the system architects on a system designed to scale into
> the 10,000 node range. Really, if you require short failure detection
> times, the whole strategy of doing point-to-point links with timers
> falls apart at around 100 nodes without hardware support.
> Techniques do exist for building large systems like this and using
> very little CPU for overhead.
TIPC does not in any way preclude that the information about node failure
can come from somewhere else, e.g. a network processor. It would be easy
to add a call in the interface for this, and then disable or slow down
the background timer.
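A sketch of what such a hook could look like (all names here are
hypothetical; no such call exists in TIPC today): an out-of-band failure
detector reports node state directly, and the timer-based probing is
switched off or kept only as a slow safety net.

```python
# Hypothetical illustration only: an external failure detector (e.g. a
# network processor) feeds node-up/node-down events into the supervisor,
# so background probing can be disabled. Not TIPC's actual interface.

class NodeSupervisor:
    def __init__(self):
        self.alive = {}              # node address -> up/down
        self.probing_enabled = True  # default: timer-based probing

    def external_node_event(self, node, up):
        """Entry point for an out-of-band failure detector."""
        self.alive[node] = up

    def use_external_detector(self):
        # With a trusted external source, background probing can be
        # switched off (or merely slowed down as a safety net).
        self.probing_enabled = False

sup = NodeSupervisor()
sup.use_external_detector()
sup.external_node_event("<1.1.3>", up=False)
print(sup.probing_enabled, sup.alive)
```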
However, I have a hard time imagining that node supervision via ethernet
can be done without exchanging probe/heartbeat messages, or that this can
be done much more efficiently than TIPC does it.
With dedicated hardware it becomes a different matter, of course...
> Another, somewhat related question. How do you handle system
> partitions? This was the nastiest problem we had to deal with
> (especially since we were multi-site distributed).
As for now we cannot (logically) partition clusters, since the subnetwork
concept is not fully implemented. It is however possible to
physically/geographically distribute a cluster if we let the inter-site
links go over UDP or TCP. The condition is that we still configure the
cluster as a full-connectivity network. Of course, one must as always
understand the bandwidth constraints in such cases. Again, this has never
been tried in real products, so this is an unknown factor for now.
The way we have partitioned our systems is to always configure each site
as a separate zone (cluster), because we had other constraints making this
most practical. Inter-site links can then be set up via UDP, but the
"location transparency" stops at the zone border.
> -Corey