[Lightning-dev] Improving Lightning Network Pathfinding Latency by Path Splicing and Other Real-Time Strategy Game Techniques

ZmnSCPxj ZmnSCPxj at protonmail.com
Thu Aug 1 01:35:21 UTC 2019


Introduction
============

I found out recently (mid-2019) that mainnet Lightning nodes take an inordinate amount of time to find a route between themselves and an arbitrary payee node.
Typical quotes suggested that commodity hardware would take 2 seconds to find a route, then take a few hundred milliseconds for the actual payment attempt.
With the help of Rene Pickhardt I was able to confirm that indeed, much of payment latency actually arises from the pathfinding algorithm and not the actual payment-over-routes.

This is concerning, of course, since we would like the public Lightning Network to grow a few more orders of magnitude.
Given that the best pathfinding search algorithms will be O(n log n) on the size of the network, we need to consider how to speed up the finding of routes.

`permuteroute` and Faster Next Pathfinding Attempts
===================================================

As I was collaborating with Rene, JIT-Routing was activated in my core processing hardware.

As I was contemplating this problem, I considered, that JIT-Routing would (ignoring fees) effectively "reroute" the original route around the failing channel.

In particular, JIT-Routing is advantageous for these reasons:

1.  There is no need to report back the failure to the original payer.
2.  The node performing JIT-Routing has accurate information about its channel balances and which of its outgoing channels would be most effective to route through instead of that indicated by the original payer.
    It also knows of non-published channels it has.
3.  Searching for a circular rebalancing route could be done much quicker since the JIT-Routing node could restrict itself to looking only in its friend-of-friend network, and simply fail if it could not find a circular rebalancing route quickly in the reduced search space.

The first two advantages cannot be emulated by the original payer.

However, I realized that the third advantage *could* be emulated by the original payer.
This is advantageous as the payer node can implement emulating this, without having to wait for the rest of the network to actually implement JIT-Routing as well.

Basically:

1.  On payment failure, the original payer is informed of which node reported the failure and which channel failed.
2.  The original payer looks for a *short* route from the reporting node to the next node (or any node after the reporting node in the original path).
    The original payer restricts its search to within 3 hops, roughly equivalent to searching the friend-of-friend network, thus able to quickly get a route or a failure indication.
    (we also need to get a set of known-bad channels other than the most-recently failing one, so that we do not bother to scan those channels)
3.  The original payer then replaces the failing channel with the new route found.
4.  As the new route might "backtrack" the original payer checks for duplicated nodes and removes the unneeded hops.

The payer is now in possession of a variation of the original route, one which avoids the known-failing channel.
It has a prefix that we know was recently reliable (the error got reported back to us, after all).
Finally, the payer did not take much time to find this alternate route, improving payment latency.

I dubbed this new algorithm `permuteroute`.

In a `pay` implementation, we would do:

1.  For the *first* routing attempt, use `getroute`.
2.  Use the current route for `sendpay`.
3.  If the pay attempt fails, use `permuteroute`.
    If returned route is too expensive or it fails, fall back to `getroute`.
4.  GOTO 2.

The `permuteroute` is fast due to not having to scan a large number of channels.
Ideally it would use Dijkstra with a restricted number of hops, but this is not strictly necessary and a simple flood-fill can work about as well.

JIT-Routing is still strictly superior to `permuteroute`, as the node reporting the failure has strictly more information about local conditions.
(Yes, it could return with a "suggested other channel", but that requires a spec change and it would be better to focus a spec change on feeless rebalancing to support JIT-Routing instead, especially since JIT-Routing can potentially effectively create a "local multipath payment" by rebalancing from multiple other channels, especially if rebalancing can be made feeless.)
However, JIT-Routing benefits start being noticeable only once a majority of nodes have supported it, whereas `permuteroute` gets immediate benefits to the node that upgrades to utilize it.

In particular, `permuteroute` would be helpful to support JIT-Routing nodes while there are still few JIT-Routing nodes.
A particular part of `permuteroute` is that it reuses the known-successful prefix of an existing route, replacing only channels from the failed portion and later.
If only one JIT-Routing node existed, it would suffer somewhat since while it would increase payment success rates, it would still lose (due to rebalancing fees) if a later channel in the hop fails and the node with that channel does not itself do JIT-Routing.
With `permuteroute` thie JIT-Routing node and its outgoing channel would end up being in the known-successful prefix of a `permuteroute` from the same payer, thus not wasting its "investment" in rebalancing in favor of its outgoing channel.

An unfinished pull request for `permuteroute` on C-Lightning exists: https://github.com/ElementsProject/lightning/pull/2890

Lightning Network Pathfinding Problem Statement
===============================================

Let us mildly digress.

Pathfinding on Lightning has these requirements:

1.  We are looking for a "good enough" path, not the most optimal path.
    Going for the optimal path is good, but users are more interested in payment success than purely fee optimization.
2.  The map we are doing the pathfinding in is dynamically changing and we might not have access to the latest state.
3.  We need to look for paths in a large map in a very short time.

Real-Time Strategy Game Pathfinding Problem Statement
=====================================================

Rene also mentioned that what LN generally calls "routefinding" is also called "pathfinding" elsewhere.
This triggered me to consider real-time strategy games and pathfinding algorithms used in them.

In particular, the above requirements we have, have analogies to the problem of real-time strategy games.

1.  Players want their units to start moving as soon as commanded; they have a myriad other things to manage and do not want to have to micromanage certain movements.
    For example, moving through known-safe locations or scouting into unknown places often does not need to be optimal: they just need to be *done* in a manner that is not thoroughly stupid.
    If players need a specific path to be followed, they will micromanage their units, but non-micromanaged units are expected to perform their actions in the background, and non-optimal paths for non-micromanaged units are often acceptable as long as it is not ***too*** sub-optimal.
2.  Good real-time strategy games have a dynamically-changing world: trees are cut down for wood, walls are erected by opponents, structures are placed for various optimal results, mobile units deploy into immobile modes, and so on.
    Good real-time strategy games will also not let the units of a player know the ***current*** real state of the game world, as that leaks strategic information about where walls and defense structures have been built, where resources have been harvested, what chokepoints are blocked by immobilized units with increased combat effectiveness, etc.
    Instead, units are given fog-of-war-covered maps, with obsolete information, and thus may find their pre-solved paths blocked, or may find that shortcuts now exist which did not exist before.
    Even so, units must be able to correct their current paths once they discover this fact, without too much latency (as they might now be subject to attack by the immobile units siege-moded against them, or defense structures behind walls, etc.).
3.  As a real-time game, commands must be reacted to in a time-frame that is below typical human notice.
    While real-time games need to traverse simpler, smaller, and more regular maps than LN routemaps, they also need to get results for possibly dozens of different units in a few dozen milliseconds, whereas LN pathfinding would probably be well-satisfied by getting one route after a few hundred milliseconds.

Thus, I started looking for techniques and shortcuts used in real-time strategy games.

A\* (A-star) is the preferred algorithm in the "geographic" maps usually used in real-time strategy games.
In many cases we can replace it "seamlessly" with Dijkstra in the LN case: a constant `h(n)` function makes A\* devolve to Dijkstra, and if we admit that we have no decent heuristic "distance" function, we can consider ourselves as "really" using A\* with a constant `h(n)` (which happens to make the algorithm behave as Dijkstra).

But strictly speaking, using A\* (and its "degenerate" form Dijkstra) is overrated: http://bitsquid.blogspot.com/2010/10/is-overrated.html

Path Splicing
=============

One technique I found was Path Splicing: http://theory.stanford.edu/~amitp/GameProgramming/MovingObstacles.html#path-splicing

I realized that this technique was exactly what my `permuteroute` algorithm did: instead of solving for a completely new path, it would attempt to find a short route to "heal" a recently-discovered blockage of the original path.
The technique even recommends keeping the search space small; if the path needed to heal the original path becomes too long, it would be better to fall back to a "full" pathfinding.

As a corollary to this, JIT-Routing is itself an implementation of a form of Path Splicing, except with the potential to form a "local multipath payment" when a payment is routed through the node.

Emboldened, I began reading more and considering other ways to improve pathfinding latency, as pioneered by real-time strategy games.

Digression: Parametrizing `pay`
===============================

C-Lightning `pay` command (whose first "major version" was created by me, though the current implementation is vastly different from my work, but keeps most of its features and interface) is parametrized with two tweakable arguments:

1.  `maxfeepercent` - The maximum fee, in percentage of the payment value, that the user is willing to pay for payment success.
2.  `maxdelay` - The maximum number of blocks that the user is willing to wait in case of long-term failure of direct peer.

This is relevant since costs in Lightning are not a single "cost of travelling over this terrain" as in real-time strategy games.
Instead, costs in Lightning involve two things: the fee charged by the node, and the number of blocks in the difference of the incoming versus outgoing HTLCs.

In particular, internally to C-Lightning, `getroute` has a `riskfactor` parameter, which is intended to balance between these two costs (fees vs blocks).
This parameter is expressed as "annual cost of your funds being stuck".

I personally cannot understand what this "annual cost of your funds being stuck" means exactly in relation to my own money and risk-aversion, and have just used whatever value is the default for `getroute`.
I can understand the `maxfeepercent` and `maxdelay` better, and I imagine most users would also understand these better (though I have not talked to any users at all).
The documentation for the `getroute` command even contains a nice table about how the various settings for `riskfactor` affect your estimation of the cost of risk depending on how large the payment is and how many blocks the hop node charges.
I still cannot understand it even so, and my own expectation is that most users will also fail to grasp this intuitively.
On the other hand, `maxfeepercent` and `maxdelay` require no tables to explain.

In particular, `pay` always compares the total delay and the total fee to the given `maxfeepercent` and `maxdelay`.
In theory, if only one is breached but not the other, `pay` could tweak the `riskfactor` in order to change the balance.

The difficulty here is that in many of the succeeding sections, we will find that we will need to use some precomputed data, for which this balance of time vs. fee needs to be selected somehow.
Thus we want to use some reasonable default for this balancing parameter.
If the user wants to diverge from this default balancing, they would suffer as they would need to fall back on not using the precomputed data, reducing the speed of their payments.

This should be kept in mind for the succeeding sections: algorithms using precomputed data must by necessity be less flexible (it is restricted to the domain for which the data was precomputed for), and the price of flexibility will be slowness of a more general algorithm.

The possibility is that we are in fact being more flexible than absolutely necessary, and some shortcuts may be acceptable as a trade-off for faster pathfinding.

Map Preprocessing
=================

Often, real-time strategy game maps can be preprocessed.

For example, wide-open areas often take A\* a lot of time exploring this space.
However, in principle such wide-open areas should not be a pathfinding problem: any arbitrary path through those areas is generally fine and we should really have A\* focus on the fiddly bits between mountain passes and going through rough terrain and avoiding cul-de-sacs.
This is an example where finding the "optimum" path can take a back seat to finding "any acceptably-good" path: A\* searches along every good paths through plains in case there is some special path that is really optimal, even though usually any straight-line path through the plains is going to work well enough that the player will not feel he or she has to micromanage the unit path.

Map preprocessing can help by reweighing the costs in wide-open areas, making A\* go through a small number of acceptably-good paths rather than exploring all possible ways to walk a wide-open field (and eating up processing time in exploring all of them).

Now, we might not have the advantage of having a "geographical" map that can provide some reasonable heuristic distance function `h(n)` for A\*.
But we do have one advantage over real-time strategy games:

* Every payment that requires us to find a path, always starts from one node: our own node.

This is in stark contrast to real-time strategy games, where arbitrary units arise at various locations, and the pathfinding system must be capable of handling multiple possible start points.

In particular, Dijkstra has a variant where it does *not* look for the shortest path between two nodes, but instead finds the shortest path tree from one node to all other nodes on the map.
(This is arguably the "true" form of Dijkstra.)

This "true" variant of Dijkstra is a good thing to use, as we know that every payment sourced by our node requires a path starting from our node.

In C-Lightning in particular, every node object has a `struct`, named `dijkstra`, containing information used by the Dijkstra algorithm, size of 16 bytes, which is sufficient to derive a path from the destination to the source.
We can instantiate another copy of this `struct` to every node, let us call this `dijkstra_cached`, and run the Dijkstra algorithm for all nodes to the source using this alternate copy.

Then we can provide a new `getroutequick` command which simply looks up the destination node, then builds the route back to the specific source (our own node).
If route-building fails because the channels have been closed, it can simply fail (and the client has to fall back to the slow O(n log n) `getroute`, which uses the existing `dijkstra` field instead of `dijkstra_cached`).

A `pay` algorithm would then do:

1.  For the *first* payment attempt, use `getroutequick`.
    If the returned route is too expensive or `getroutequick` fails, fall back to `getroute`.
2.  Use the current route for `sendpay`.
3.  If the pay attempt fails, use `permuteroute`.
    If returned route is too expensive or it fails, fall back to `getroute`.
4.  GOTO 2.

Thus, the slow `getroute` only exists in the "fallback" case, and we have effectively optimized payment to arbitrary nodes.
Payment attempts can start to be sent out almost immediately.

***Now of course, the `dijkstra_cached` map will become obsolete as channels are created and destroyed***.

Periodically, we can start a background process to perform a new Dijkstra run to update the cached data.
We can implement this by a "fast" timer that performs some fraction of the Dijkstra run, then lets normal operation for some time before grabbing the CPU again to update the Dijkstra.
This lets us process normal requests (updates of channel, `getroute` and `permuteroute` requests, etc.) while refreshing the cached shortest path tree.
(other implementations might consider using separate threads, or some concept of low-priority thread: the C-lightning `gossipd` is single-threaded so we can avoid many of the issues with thread synchronization)

Of course, that means that `getroutequick` cannot work while we are performing this refresh.
To fix this, we turn to another game implementation technique: double-buffering.

We use a `dijkstra_cached[2]` array, then fill in *one* entry during the refresh while `getroutequick` calls follows the *other* entry.
Then when the refresh is completed, we "flip" the buffer that `getroutequick` uses (a single variable somewhere in our map object).
Then on the next refresh, we use the `dijkstra_cached` entry that is not being used for `getroutequick`, then flip again.

Thus the additional storage is 32 bytes per node, but potentially speeding up the first route request in a `pay` algorithm.

Of note, is that this is not intended to replace `getroute` --- instead this is a faster but more limited alternative.
We should still retain the generic `getroute` as a fallback in case the easy-to-find path exceeds `maxfeepercent` or `maxdelay`.

A\* on Lightning
----------------

As mentioned above, A\* is the typical algorithm in use by real-time strategy and other games.
Its primary difference from Dijkstra, is the existence of the `h(n)` heuristic function, which accepts an arbitrary node `n` and returns the expected cost of going from that node to the destination.

A\* has the same complexity O(n log n) as Dijkstra, but in practice finds paths faster than Dijkstra can since A\* focuses on exploring nodes that seem likely to reach the destination faster.
Runs of A\* vs Dijkstra on the same map require fewer nodes to be visited by A\* before reaching the destination, resulting in faster runs of A\* than Dijkstra.

However A\* use is not possible under Lightning as we cannot provide any estimate of the cost from a node to a particular other node, which is needed for the heuristic function `h(n)`.

That is, unless we already have the costs pre-computed and cached somewhere... such as the `getroutequick` Dijkstra run.

The `dijkstra` structure is basically just the total cost of arriving at that node from the source node in the Dijkstra run.
For the `dijkstra_cached` double-buffered structures, these are the total cost of arriving at that node from the singular source node we have, our own self node.

An A\* is now possible by starting at the payee node with the goal node being our own (payer) node.
The `dijkstra_cached` structures contain the cost of reaching the node from our own node, and serves as our `h(n)` for the A\*.
Assuming the channel topology then does not change, the `h(n)` is now an admissible heuristic, and in fact is an exact heuristic and A\* will run very quickly.
http://theory.stanford.edu/~amitp/GameProgramming/Heuristics.html#exact-heuristics

Now, as noted previously, this cached data may become stale and no longer exact.
There are two cases:

1.  A new channel reduces the actual cost of going to a node from our own node.
    This makes the cached data an inadmissible heuristic, as the cost of reaching that node from our own node is now lower than the cached cost.
2.  A deleted channel removes the shortest path to a node from our own node.
    The cached data remains an admissible heuristic and A\* can find an optimal route quickly with help of the existing cached `h(n)`.

The drawback of the first case may be acceptable in most cases: our tradeoff is faster payment attempts, not most-optimal-ever paths.
The second case is precisely the case that would cause the above `getroutequick` proposal to fail, so A\* still succeeding here is a bonus.

Thus, we can instead change the `getroutequick` algorithm as so:

1.  Locate the payee node.
2.  Attempt to generate a route from the payee by looking up the shortest-path tree to our own node.
3.  If the above fails because a channel has been closed, instead use A\* starting on the node we were unable to continue the quick path on, with the precomputed total costs used for `h(b)`.
    Splice from the generated partial route to the A\* route result, then fix up the path by removing loops (as in the `permuteroute` algorithm).

First using the `dijkstra_cached` data to form a route without A\* is faster since A\* will need to set up an OPEN set.
In addition, in case multiple candidates with same `h(n)` occur, A\* will insist on putting both in the OPEN set, when in principle we can just take either path (as both paths are equal in cost).

We expect `getroutequick` to still be fast in the best case where channels have not been removed and the shortest-path tree is still valid for the payment node.
If channels have been removed, then it falls back to almost-as-fast A\*, using the same cost data to approximate the distance to the source.
Thus, the same `pay` algorithm will work, and has an improved chance of being able to quickly find its first route.

Direct Channels
---------------

A user might notice that our node is taking long routes to a payee.
This user might then decide to manually open a direct channel to that payee in hope of reducing their fee and risk costs.

However, because we only periodically refresh the `dijkstra_cached` data, our initial attempts at paying will ignore the direct route.
This will disappoint the user and cause him or her to file an issue in our bug database, which is generally considered bad.

A simple way to fix this is to provide a `getroutedirect` that only scans for direct channels from our own node to the destination.
Then we simply prioritize using such a direct route first regardless of the state of our pathfinding acceleration structure.

As the user can only control its own node (typically), we generally can assume that the user will consider it acceptable if channels between nodes the user does not control are ignored in our pathfinding acceleration.
However since the user is in direct control of our own node, he or she expects that the node can find its own direct channels to the payee easily.
We can consider this equivalent to the user micromanaging our payments, and since the user can only control our own node, it can only do so by making direct channels.

Our `pay` algorithm then becomes:

1.  Use `getroutedirect`.
2.  If that fails, use `getroutequick`.
3.  If that fails, use `getroute`.
4.  Use the current route for `sendpay`.
5.  If `sendpay` fails, use `permuteroute`.
6.  If that fails, use `getroute`.
7.  GOTO 4.

Alternately, we can simply add this direct-channel behavior to `getroutequick`, having it try three sub-algorithms: direct channel finding, shortest-path tree traversal, and finally A\*.

Skip Links
==========

Note that `getroute` remains a bottleneck.
It is the fallback when `getroutequick` or `permuteroute` cannot find a path or returns a too-expensive path.

Another technique from the real-time strategy game cookbook is: https://theory.stanford.edu/~amitp/GameProgramming/MapRepresentations.html#skip-links

This is yet another "preprocessing" technique.
We have a continuously-running background task that simply randomly explores the map.

It starts at some randomly-selected node, then follows one least-cost channel to another node (avoiding already-visited nodes).
After a few such hops, it creates a "skip link" from the first node to the last node.
This is an extra link containing the total cost of the path from the first node to the last node, as well as the exact path itself.

The stored total cost of the path is discounted mildly; typical guidelines are 1% reduction of the cost.
The Dijkstra algorithm immediately puts the other end of the skip link to the unvisited set, pointing through the skip link.

The discount exists to "break ties".
A "tie" occurs when the cost on two possible paths is equal.
This makes the discount bias towards the preexisting skip link.

Skip links need to store the individual channels passed through.
In particular, if we know that one of the component channels of the skip link is currently failing, we should ignore the entire skip link.

With skip links, there is now a tiny possibility of non-optimal pathfinding, which should still be fine.
(we could implement a `getrouteslow` that ignores skip links.)
In particular we might reach a destination by first going through a skip link that goes through that node, then a "back out" hop from the end of the skip link to the destination node.
This can be fixed by a postprocessing step (similar to that in `permuteroute`) to shortcut such loops in the route.

Hierarchical Maps and Rough/Smooth Pathfinding
==============================================

Another preprocessing idea for maps is to create a high-level "rough" map and a low-level "smooth" map.
Then, we can start with a "rough" path and then incrementally promote this to a "smooth" path with all the details as our game unit / payment approaches the next location / node in the rough path.

See: https://theory.stanford.edu/~amitp/GameProgramming/MapRepresentations.html#hierarchical

Now this might not make sense in the current Lightning Network.

However, do note that one incoming improvement we are planning to introduce is "trampoline routing", where we have a "rough" path of various (not necessarily adjacent) nodes embedded in a "smooth" path of directly-adjacent nodes.

So we can build a "rough" map from the normal detailed map as follows:

1.  Randomly select some small fraction of nodes with high liquidity in channels, and (if we can get this information) high uptime.
    We might want to randomly eliminate some of these for no reason from the set of nodes to consider.
    We shall call these the "high-level" nodes.
2.  Use the smooth map to determine the costs of the shortest paths between these nodes.
    Use a limited-hops Dijkstra to limit the rough links between nodes to some number of hops on the smooth map.
    Set the cost of the links in the "rough" map based on the cost of the smooth path.
3.  Find the largest "island" of interconnected high-levels nodes and use that as the high-level "rough" map.

This is a preprocessing stage very much like what we would do for `getroutequick`, and double-bufferring while refreshing this high-level map is likely to be useful as in the `getroutequick` case.

Then, when attempting to pay to an arbitrary node:

1.  Find a route from the destination node to *any* of the nodes in the high-level map; use the normal low-level "smooth" map.
    Dijkstra can also work with multiple candidate destinations, exiting once any of those candidates is reached.
    (we can also add the source node as a candidate, and with such a direct route can exit at this point without a trampoline route)
    Keep track of the "destination-side high-level node" and discard the actual route.
2.  Find a route from the source node to *any* of the nodes in the high-level map; use the normal low-level "smooth" map.
    Keep track of the "source-side high-level node" but retain the actual route.
3.  On the high-level "rough" map, find a route from source-side high-level node to the destination-side high-level node.
    Append the destination node to this to form the trampoline part of the route.

This provides a normal route to a high-level "rough" node, then a short trampoline route from that node to the destination.

Step #2 above can be cached: we can run a Dijkstra starting at our own node running on the "smooth" map, that terminates once it has reached some small number of high-level nodes (say the 5 nearest high-level nodes), then just round-robining among those nodes per invocation of route finding.
This caching can be done in the same refresh as the one that generated the high-level rough map in the first place.

Step #3 is expected to be very fast since it looks at the rough map, which has a tiny fraction of the nodes in the smooth map, and Dijkstra is O(n log n) on number of nodes.

This leaves step #1 to be the potentially slow part of this algorithm.
This is mitigated by the fact that this stage has multiple candidate destinations, and terminates as soon as it reaches *any* candidate.
Thus we expect Dijkstra to only have to explore a small amount of nodes.
The more distributed the high-level nodes are, the more likely that we can reach any high-level node from the destination in a small number of hops.

As channels open and close, the trampoline route remains potentially valid: as a trampoline node opens one hop of the trampoline route it can derive the "smooth" path from itself to the next trampoline node.
This thus implements the Rough/Smooth pathfinding idea.

If trampoline routes can be concatenated without permission from the original source of the trampoline route, then a trampoline node may itself utilize this algorithm when sending to the next trampoline node.

A privacy-preserving implementation would prefer to periodically "refresh" the high-level map (possibly changing the nodes that are put in the high-level map to increase the chances that it will not use a high-level node repeatedly).
Again, this will be similar to the `getroutequick` refresh, we double-buffer so that we can use the existing high-level rough map while generating the next high-level rough map.

Tr\*sted Rough Maps
-------------------

Centralized tr\*sted servers might exist that provide rough maps to nodes with extremely limited space (and cannot store the smooth, detailed channel-level map at all).
The server selects a small fraction of the available nodes, preferring high-uptime high-connectivity nodes, and generates a rough map as above.

Then, for each node, it determines the nearest high-level node.
Each high-level node has some fixed-size probabilistic set representation, such as a Bloom filter or a Goulomb-coded set.
When it determines for a node which high-level node it is nearest to, it adds that node to the Bloom filter / Goulomb-coded set for that high-level node.

A payer node that has no knowledge of the detailed channel map beyond its own channels can generate a rough trampoline route as follows:

1.  It searches the payee in each high-level node in the rough map, using the Bloom filter / Goulomb-coded set.
    Looking up in a Bloom filter would be O(1), and we would have to scan the entire map, so complexity is O(n) effectively.
2.  It searches for the nearest high-level node to itself, again using the Bloom filter / Goulomb-coded set.
3.  It uses the rough map to route from the high-level node near itself to the high-level node near the payee.
    Then it appends the payee to this route.

Knowing only its own channels, it can simply deliver the above generated trampoline route to any adjacent node that claims to support trampoline routing.

This provides some measure of privacy: the payer does not need to reveal any data to the server, other than the fact that it wants a rough map.
If the map is small enough, it could cache for multiple payment attempts.
Finally, it could also get multiple such maps from different servers, and score each map based on how successful payment ends up being when using that map.

These servers are tr\*sted to indicate high-uptime nodes with good liquidity and connectivity to the network, and to not lie about the actual network topology.
They are also tr\*sted to not generate maps that have themselves in the center of the map and have every other high-level node only have edges to their central node.
As such, users of this system must consider how much trust they can put in these servers.

The payer node will need to periodically get updates of the rough map as the topology changes.
A tr\*sted server might use the pay-for-data protocol to require payment for providing this map.
A newly deployed node will need to be started with a tr\*sted (and probably obsolete) rough map, however.
Alternately a protocol delivering a proof-of-payment can be executed onchain, or the server can publish its node pubkey and public contact point so that the node can start its first channel with the server and pay for the rough map over a direct channel.

Self-Serving
------------

Servers providing rough maps like the above can be written with free and open source software.
Thus, one might run a limited-smooth-map node on the same hardware as a server generating rough maps for the limited-smooth-map node.

One might wonder though, whether it might be better to just run a node that can keep the entire smooth map and not use a rough map at all.

The difference here is that the server does not need to perform any pathfinding in a tight schedule.
All its smooth-map pathfinding is done to simply determine which high-level node every node on the smooth map is near to, and thiis not need to be fast in order to serve outgoing payments quickly.
This means the server can potentially store the smooth routemap in a database in slow persistent storage rather than in high-speed memory.

Indeed, since the rough map is likely to still be mostly valid even as the network changes topology, the server can generate new rough maps while the hardware is idle, then coordinate with the limited-smooth-map node on every update of a fresh rough map.

This removes tr\*st issues, in much the same way that running Electrum Personal Server on a fullnode one controls removes the tr\*st issue in Electrum and leaves it as a high-quality wallet interface.

This also provides an approach towards scaling up our nodes to handle routemaps with 100 billion channels.
We can conceptually split our routemap between a "total global routemap" stored strictly on disk, and a "myopic local routemap" that is kept in memory.
The server stores the entire detailed global routemap on-disk.
The node software queries the server for two things:

1.  Nodes and channels within N hops of the node.
    This is the myopic local routemap.
2.  A rough routemap of the rest of the network as described above.

The node keeps both of the above in-memory, but they should be small compared to the total global routemap stored on-disk.
Keeping the map in-memory lets the node quickly find routes.

1.  If the destination is in the local routemap already, then the node does not need to do trampoline routing.
2.  If the destination is not in the local routemap, then the node uses the rough routemap to find a reasonable trampoline route, then finds a path in the local routemap to the first node in the trampoline.

Our bandwidth requirements still remain (we still need to gossip the entire routemap containing 100 billion channels).
Similarly, our on-disk space requirement still remains.
However it does greatly reduce our memory requirement, making it possible to still fit Lightning nodes on single-board-computer-level devices (admittedly, those that are connected to large permanent storage and fast Internet connections, but let me work on one problem at a time mkay?).

Myopic Trampoline Nodes
-----------------------

I have argued this before, but I think that nodes that do not have a complete smooth-level routemap should still be allowed to advertise themselves as trampoline nodes.

As the network grows to 100 billion channels, fewer and fewer nodes will have the ability to find detailed paths from themselves to any arbitrary next-trampoline-node.
It is important that the risk of running trampoline nodes be shared widely, and we should strongly prefer the "trampoline advertising feature bit" to be a proxy for "not too lazy to update the software" rather than "rich enough to afford really good hardware".

However, this implies that some trampoline nodes will fall back to themselves using a rough map to reach some nodes.
This implies that they need to prepend additional nodes to the trampoline route, without permission from whoever created the original trampoline route.

Flocking
========

A common issue in real-time strategy games is that a player, fighting for an important objective, needs to immediately send a large group of units to a location.
If all the units were to individually request for paths from the pathfinding system, then the units would lag and start moving one by one, as the pathfinding system gets overloaded and gives the paths one by one.
This is particularly bad behavior if the player is currently in a high-pressure situation where the objective might be lost soon if the commanded units are unable to move immediately to the objective.

Worse, often the pathfinding system will make the units follow almost the exact same path, meaning that slower units can prevent faster units from going around them, and they will all travel in single file, which is usually a strategic problem when their line is intercepted by enemy units.

The usual solution for this is, when multiple units are selected and commanded to move to a single location, is:

1.  Select one unit at random as leader.
2.  Have only the leader request a path from the pathfinding system.
3.  Have the other units "flock" around that leader: http://www.red3d.com/cwr/boids/

Flocking simply means that units follow some rules:

1.  Do not get too near other flock members.
2.  Do not get too far other flock members.
3.  Face in the same direction as other flock members (if units in the game world have this sense).
4.  Leaders ignore the above rules, but are considered flock members by the other flock members.

This reduces the pathfinding load to only one unit, the leader unit, and the rest of the units simply follow the designated leader until the leader reaches the objective location, or is killed or given another command (in which case another leader is chosen).
The individual units can request shorter paths from the pathfinding system to any point near the leader, or if it is "near enough" to the leader it can simply stop where it is.
As they have been following the leader, other units should be "near" to the indicated location when the ledaer has reached the destination, so the paths they request should be short and it should not take too long for the pathfinding system to get those paths.

Further, this often encourages the grouped units to move as a "blob" on multiple nearby paths rather than in single file, which is generally better for strategic defense when they are intercepted by enemies.

Now of course in Lightning Network we do not have anything like this problem, right?

Right?

Now consider the Feature Either Called AMP or Multi-Part Payments or Bass Amplifier.
In this case, we have multiple payments arising from a source, all of which are commanded to then converge at a singular destination.
We would like:

1.  Not to spend two entire seconds for each alternate path.
2.  Have the various split payments travel in different paths, not in "single file" where they mostly travel the same path except they start at different channels but almost immediately converge to travelling over the same channels.

Flocking Over Lightning
-----------------------

Suppose our node is asked to provide a large payment to some destination.
Our node then computes a path (possibly using any of the previous shortcuts to speed up this initial pathfinding).

However, the direct channel where this route starts from has too little liquidity on our side.
So our node also selects another channel from which to make up the difference in liquidity, using AMP/Multi-part/Bass Amplifier.

This is a "split" payment, and we then use a flocking system, with the first payment as the leader.
Our rules are:

1.  At each hop of the leader route, we should seek any route that gets us to within one node away from the leader.
2.  Channels going to nodes that the leader uses on its route (or which have been used by the other split payments in the flock) have a much higher cost.
3.  Channels directly connected to the source cannot be used at all (as that defeats the entire point of doing multipath).

We simulate the leader following its path one hop at a time.
The leader starts at the source, then takes one hop to the next node.

The split payment starts at the other node of the channel we selected to make up the difference in payments.
We then use a multi-destination Dijkstra that terminates as soon as it finds any node that is within one hop of where the leader currently is.
Channels used by the leader, and those already used by other splits, are penalized in cost so that we avoid reusing them, but still can pass through them if they are the only options available.
This initial Dijkstra can potentially take a long time, but hopefully it should be easy to cut short since it is a multiple-destination Dijkstra and achieving any destination terminates the algorithm.

For subsequent hops of the leader, we use a multi-destination limited-hops Dijkstra.
As the split payment starts one node away from the previous node of the leader, in the worst case it can simply just jump to the previous node of the leader if it cannot find an alternate path close by.
Jumping to this is penalized by rule 2 above so other paths are preferred, but this still allows the "flock" to pass through chokepoints in the network.

When we simulate the leader reaching the destination node, then we use a multi-destination limited-hops Dijkstra to the destination node.
This lets every split payment reach the leader.

Thus, the payments form a "blob" around the leader, instead of almost always following the same channels.
This provides better reliability of the entire payment attempt.

Conclusion
==========

Lightning Network does not need to pioneer any new research into pathfinding.
There already exists an entire field of programming that has similar issues (dynamic map, incomplete information, large maps with low acceptable latency of pathfinding, quickly-found "okay" paths are better than slowly-found "best" paths).
That is real-time strategy game programming, and such games have existed in some form or another for the past 4 decades.
Shortcuts, techniques, and optimizations used in real-time strategy games deserve closer examination, as potential points of inspiration for similar techniques for Lightning.

This writeup is a result of a quick review of pathfinding techniques used in real-time strategy games.
There are potentially many more techniques whose ideas might be possible to adapt in finding Lightning Network routes in large public networks in short amounts of time with acceptable fees and locktimes.



More information about the Lightning-dev mailing list