[Lightning-dev] On Path Privacy

Thu Dec 26 03:03:13 UTC 2019

Privacy Loss Models
===================

Shortest Path Heuristic
-----------------------

First, let us presuppose a network connected as below:

        B --- C
       /       \
      /         \
     A           D
      \         /
       \       /
        F --- E

Now, let us suppose that some evil surveillor insinuates itself into the network:

        B --- C
       / \   / \
      /   \ /   \
     A --- S --- D
      \   / \   /
       \ /   \ /
        F --- E

The strategy of the surveillor is thus the below:

* It makes itself more central to the network by creating large numbers of channels, especially to distant nodes.
* It lowers its channel fees relative to the other nodes, making its channels attractive to pathfinders that consider channel fees.
* It uses the "shortest path heuristic".
  That is:
  * Suppose in the given example, S receives a forward via the B-S channel, to the S-D channel.
  * Then it knows the payer is B and the payee is D.

More precisely, the "shortest path heuristic" simply means performing the below precomputation:

* Take the entire routemap.
* Create a mapping of the tuple of an incoming channel and an outgoing channel (of the surveillor node), to a set of pairs of payers and payees.
* For each possible payer in all nodes:
  * For each each possible payee in all nodes:
    * Take the shortest path from payer to payee.
    * If the surveillor node is in that path, determine the incoming channel and outgoing channel, then add the payer and payee to the corresponding entry in the mapping.

Then, whenever the surveillor node has to perform a forward, it only needs to look up the incoming channel and outgoing channel in the mapping.
This results in a set of possible payer/payee pairs.

This analysis becomes more powerful when the surveillor has acquired forwarding logs from multiple nodes on the network.
If a payment is determined to have passed through multiple nodes, whose forwarding logs the surveillor has acquired, then the surveillor can simply perform the above precomputation for each node, then take the intersection of the sets in each mapping entry.

Much of the privacy loss is effectively due to forwarding nodes being able to create "reverse" routing tables.
That is, instead of a routing table that inputs a source and destination and outputs a (set of) likely direction(s) to send packets to, a reverse routing table inputs a direction from sent and to sent and outputs a (set of) likely source(s)/destination(s).

Unremovable Forwarding Data
---------------------------

As we know, switching Lightning from HTLCs to PTLCs (i.e. from payment hash + preimage to payment point + scalar) enables many new features, including the so-called "path decorrelation".
The theory is that by changing the payment point at each hop, multiple coordinating surveillor nodes will be unable to determine if the same payment passed through them.

However, this is not sufficient, as the data below cannot be removed or hidden:

* The sidereal time at which `update_add_htlc` messages are sent
  * This can be used to determine if two forwards are possibly on the same payment path.
  * Worse, the sidereal time between the outgoing `update_add_htlc` and the `update_fail_htlc` /  `update_fulfill_htlc` response can be measured to help characterize the distance, in nodes, to the payee.
  * We cannot change or mask this time, unless we are willing to add random, potentially-large latencies to forwards, which we do not want to do since the entire point of Lightning Network is fast payments.
* The value being transmitted cannot be hidden from the forwarding node.
  * While we know how to send homomorphically encrypted values that can be summed up, the forwarding node has to know the value of its own channels, and would know how much a forward changes its own channel values in order to determine if it has enough capacity on a channel to actually offer the outgoing HTLC.
* The timelock of the payment cannot be hidden from the forwarding node.
  * The forwarding node is liable for the timelock of any HTLCs it offers, and it has to know about the exact timelock in order to force a channel onchain when any HTLCs pending on it have a timelock that is expiring very soon.

The above data helps two cooperating surveilling nodes to determine if two forwards on their nodes belong to the same logical payment, even if path decorrelation masks the payment point along the entire path:

* The two nodes can determine:
  * How long a shortest-path payment takes to reach from one to the other.
  * How much fees the shortest path between them costs.
  * How much CLTV-delta the shortest path between them costs.

By deriving the above data, two nodes can have a very good guess of which forwards passing through them are likely to belong to the same payment, which helps them triangulate better the payer and payee, especially given the shortest path heuristic.
Thus, even *with* path decorrelation, two cooperating surveilling nodes can still correlate forwards across them (they just will not be able to execute a wormhole attack).

Now, of note is that the the timelock is particularly leaky in terms of privacy.
The timelock gives an upper bound on the distance of the forwarding node to the ultimate payee.
In combination with another privacy loss, this can allow a forwarding node to precisely identify the ultimate payee of a payment.

Both value and timelock can be randomized somewhat by use of "shadow routes", as implemented in C-Lightning.
However, the current C-Lightning implementation only inserts shadow routes at the end of the route; it should also insert shadow routes at random points in the route, and mildly overpay fees along the way (not just give extra value+cltv-delta to the last node), to reduce the ability of multiple cooperating surveillors to determine if their forwards are the same payment.

REDACTED SECTION
----------------

REDACTED TEXT.

Non-Solutions to Privacy Loss
=============================

Unpublished Channels Delenda Est
--------------------------------

Unpublished channels are becoming popular for some reason, and even section 4.1 of [this paper](https://arxiv.org/pdf/1909.02717) gives it as a solution for improving privacy without sacrificing utility.

Unfortunately for that paper, this is wrong, and unpublished channels must be destroyed.

Unpublished channels do not improve privacy, but instead move the privacy risks from shared to concentrated on specific nodes, which become targets for surveillors.
Further, because it moves risk to fewer high-value targets (a symptom of centralization), it violates [Risk Sharing Principle](https://github.com/libbitcoin/libbitcoin-system/wiki/Risk-Sharing-Principle).

As I noted [here](https://lists.linuxfoundation.org/pipermail/lightning-dev/2019-October/002241.html), the Axiom of Terminus means that I can have the heuristic:

* If I receive a forwarding request:
  * If the incoming channel is unpublished, the node on the other end is the payer.
  * If the outgoing channel is unpublished, the node on the other end is the payee.

This gives very precise information about who the payer and payee is, and is not fixable by e.g. shadow routing or [REDACTED TEXT].

Now, of course the above heuristic is only doable by primarily-forwarding nodes with a mix of unpublished and published channels, and only to nodes that use the unpublished channels.
Otherwise, independent third parties cannot even know about nodes that *only* use unpublished channels, so that should be good, right?

Unfortunately:

* Payers have pretty good privacy otherwise: the (unremovable) timelock is a hint to the distance to the payee, not payer, and the payer could always claim that it received the payment via other published channels.
  * Thus, ***the fact that with unpublished channels some nodes can now identify the payer precisely is a major privacy loss for payers***.
* Those nodes with a mix of unpublished channels and published channels become major targets for surveillance.
  * They may be tempted by the dark side to defray their operating costs by actively selling the payment information they have acquired.
  * Their backups and persistent storage can be hacked and copies of forwarding logs acquired, even if they are strongly committed to privacy for their clients.
  * Competitors who *do* sell their payment information can outcompete them, undercutting their fees (which also has the effect that payments are more likely to be routed to them due to the lower fees, increasing their ability to surveill further).
* The existence of unpublished channels is still attested on the blockchain, and onchain analysis can give hints on which unpublished users ended up making channels with specific published nodes.
  * Lightning is practically the only use of 2-of-2 P2WSH, since most HODL usages will be either 1-of-1 or 2-of-3 ("never take two chronometers to sea: take one or three").
  * Since current Bitcoin requires revealing a 2-of-2 script in P2WSH, even mutual closes of unpublished channels are visible publicly on the blockchain anyway, and subsequent use of that money can be traceable to specific servers.
    * Taproot fixes this.
    * CoinJoin fixes this.

If all channels were published, then this spreads out the information that forwarding nodes can keep.
Surveillors capable of stealing forwarding logs of nodes will need to gather the forwarding nodes of large numbers of nodes.
This is risk sharing: privacy-leaking information is spread out and no single node is a particularly juicy target for acquisition of control.

Note that nodes can publish themselves without publishing their IP addresses; thus they can at least hide their locations at the IP layer while revealing their locations at the Lightning layer.
Nodes can also use Tor, at the cost of increased latency with the rest of the network, which behooves us to design our protocol to minimize bandwidth and turnaroud.

Refusal to Log
--------------

Now, a forwarding node operator may not necessarily be an evil surveillor, and may even have a desire to be (seen as) good.
Such a node operator might, as protection against hacks, and protection against evil surveillors who might hack their node and snoop their data, refuse to maintain a log of forwards.

Let us presuppose that there exists some magic spell, that ensures that a node operator is in fact performing (or not performing) some promised act (or disavowed act).
I will now show that, even if such a spell could exist, a node operator cannot promise to not keep a log of forwards, at least with the current HTLCs being sent via Poon-Dryja channels.

First:

* Poon-Dryja requires transaction revocation.
* Transaction revocation requires that nodes be able to recognize old state and revoke them.
* Revoking a transaction means spending it using a special revocation branch of a special modified script.
* Special scripts in Bitcoin must be encoded in P2SH/P2WSH.
* To spend a P2SH/P2WSH, we need to show the whole encoded script, even those branches that are not taken.
* Thus, to spend a revocable HTLC output, a node has to store the hash and timelock, even if it is taking the revocation path.
  * While HTLCs currently used have yet another transaction (HTLC-timeout or HTLC-success) to actually claim the funds, both the HTLC and the claim transactions have to be revocable, else an attacker might publish an old state that has all channel funds in HTLCs.
    If the victim is unable to publish the HTLC script in order to spend (and revoke) the plain HTLC output without an HTLC-timeout or HTLC-success transaction being published, then the attacker can simply refuse to publish the HTLC-timeout/HTLC-success and the victim still loses access to the funds (though the attacker will still not be able to claim it, the attacker might just want to make the victim lose funds).
    Thus, every node must be able to publish the HTLC script in order to revoke it (even without the attacker publishing HTLC-success/HTCL-timeout), which means it has to keep every old state hash and timelock.

Then, secondly:

* The hashes used in forwarding have to be the same in both the incoming HTLC and the outgoing HTLC.

Thus:

* Nodes have to keep all hashes and timelocks for all HTLCs of all older channel states of all channels, otherwise they are insecure.
* Third parties acquiring the above data from nodes can correlate forwards by checking for HTLCs that exist in two channels.
  * Indeed, because HTLC scripts are different based on whether it is the local or the remote side that offered them, third parties can determine as well which HTLC is incoming and which it outgoing.
  * Further, the stored timelock gives an upper bound on the time an HTLC was created, giving hints on timing of a payment as well.
* Therefore all current secure nodes are required to store data that is enough to derive a log of forwards.

Of course, this is not so dire.
In the (close?) future, improvements to base layer will allow nodes to elide enough information from its stored data, so that a log of forwards can no longer be derived:

* Schnorr means we can use pointlocked timelocked contracts with payment decorrelation, thus allowing breaking of the link between incoming and outgoing PTLCs even in the data stored by nodes.
  * Once an outgoing PTLC has been claimed and we have determined the incoming PTLC, we can delete the delta between the two contracts from persistent storage, thus preventing from linking incoming from outgoing PTLCs.
* Taproot means we can put the revocation branch of revocable HTLCs in a separate tapleaf.
  * Thus, nodes can store only the Merkle path to the revocation script, and thus no longer store the hash and timelock for all HTLCs on all older states.
* `SIGHASH_NOINPUT` means we can implement Decker-Russell-Osuntokun.
  * Thus, honest nodes no longer need to store data from previous states of the channel.

Against the above, however, I must remind:

* We still need to create a magic spell that allows third parties to actually believe, without tr\*st, the promises of node operators that they will not store a plain log of forwards of their node, and that they will not modify the available open-source software to do so and run their locally-modified versions.
* Even if the above magic spell were created, at ***some*** point in the past the node knew an incoming HTLC/PTLC and its link to an outgoing HTLC/PTLC.
  * Lower layers beneath the node software may betray the above magic spell nevertheless:
    * "Deleted" data might be implemented by a database layer by simply marking some record as unused, and then reuse of that record might take a long time, meaning that old data may still exist in "deleted" records.
    * The filesystem layer might implement recovery logs or other techniques where "overwritten" data is kept, as a simple function of reliability, and that data may remain validly readable outside the filesystem (i.e. deleted data might not be overwritten).
    * The spinning rust / floating gate lower layers may, even with data overwritten, retain transient inductive / capacitive charge that can be read by specialized equipment, or may not overwrite data from sectors / blocks marked bad and then remapped elsewhere.
  * In short, we cannot "refuse to log": we have to write the link between an outgoing HTLC/PTLC to an incoming HTLC/PTLC somewhere persistent *first* (in case our node gets randomly shut down while a forward is not yet claimed), wait for the outgoing HTLC/PTLC to be claimed, and *then* delete it later.
    And "delete" is not necessarily an operation that is possible in the presence of attackers that want to retain your data.

Possible Solutions
==================

Avoiding Shortest Path
----------------------

The shortest path heuristic is key in analyzing payments on the Lightning network.
Thus, breaking this heuristic is important.

* One way is to `permuteroute` the shortest-path.
  * For reference, `permuteroute` is described here: https://lists.linuxfoundation.org/pipermail/lightning-dev/2019-August/002095.html
    * Briefly: It is an algorithm where you input an existing route, plus a "breakage" (a channel or node that is failing) on that route.
      The algorithm then makes a short-range pathfinding to heal the break, or fails if it takes too long for it to find a way to heal the break.
  * Start with a shortest path via normal shortest-path pathfinding algorithms, then randomly select one intermediate node on that path and treat it as a "breakage" with `permuteroute`.
    * If that succeeds, try it a few more times until it fails or you are satisfied with the additional path length, then just use the latest randomized path.
* Add random tweaks to your channel traversal costs.
  * This is done currently by the C-Lightning route randomization feature, but note that it is currently set to up to a +/-5% tweak.
    * [This paper](https://arxiv.org/pdf/1909.06890) evaluates the C-Lightning route randomization as well.
      It suggest adding random tweaks to the costs of entire routes instead, though it may not be easy for normal pathfinding algorithms to implement this.
* Rusty Russell Random Route Ralgorithm.
  * Alluded to in [SLP134](https://stephanlivera.com/episode/134/).
  * In a private email, Rusty recounted the random routing routine; roughly reworded:
    * Start with a Dijkstra starting from the payer node and completing over the entire routemap.
      * Store the distance from each node to the payer.
      * This step can be precomputed, rather than performed each time we need a path.
    * Then the algorithm proper:
      * Start at the destination. payee node.
      * Toss a biased coin (bias controlled by how much you want to overpay fees):
        * If it is heads, pick a channel at random and extend the route to it.
        * If it is tails, find which channel goes to the node with lowest-distance-to-payer and extend the route to it.
      * Keep doing the above until we reach the payer node.
    * Then apply shadow routing afterwards.
  * Note that the above algorithm is roughly similar to the Greedy Best First Search with precomputed distance heuristic (a.k.a. `getroutequick`, which I presented in LNConf 2019).
    * The only difference is that there is a random coin toss between selecting a random channel and going towards the source node, where `getroutequick` just goes towards the source node.
    * We can approximate the Rusty Russell Random Routing Ralgorithm by applying a random tweak to the distance heuristic in a Greedy Best First Search!
  * The algorithm as described has a chance of creating small loops, which should be avoided.
    * Small loops that curl (i.e. pass multiple times) through a surveillor node are equivalent to a payment passing through multiple surveillor-controlled nodes, and can be identified even post-payment-decorrelation as being the same payment.
      * Thus you end up paying for routing through the loop, but do not gain much privacy from that extra payment because now you are giving more accurate data to a specific node (where the payment curled through).
    * A proper Greedy Best First Search avoids loops due to marking nodes as they are visited, so I think basing the algorithm around Greedy Best First Search plus random tweak on the heuristic is better.
* Usage discounting.
  * When a base pathfinder returns a path, increment a "usedness" counter on each node on that path.
  * On future pathfinding calls, consider those nodes to have higher than normal cost, according to their usedness.
    * Of note is that pathfinding algorithms do NOT actually need a single numeric cost.
      * Instead, they need a mathematical group and group operation, with an additional "comparison" operation to determine which of two group elements is smaller.
      * Integer costs are a group with a comparison operation.
      * Instead of integer cost, we can have a pair of `(usedness, fee)`.
        * Add each tuple element separately, e.g. `(usedness_a, fee_a) + (usedness_b, fee_b) = (usedness_a + usedness_b, fee_a + fee_b)`.
        * To compare, e.g. `(usedness_a, fee_a) < (usedness_b, fee_b)`, we first compare by `usedness`, then if they are equal, compare `fee`.
  * This gives good diversity over multiple calls of the pathfinder, which is useful for:
    * Multipath, since we want each candidate path to have different nodes as much as possible, without completely discounting those nodes in case of bottlenecks in the network.
    * Ensures your payment data is distributed to multiple nodes rather than reusing the same nodes again and again, improving Risk Sharing since your data is spread out over more nodes.
  * The `usedness` needs to persist across node restarts.

Direct Path Exemption
---------------------

As mentioned above, avoiding the shortest path breaks the shortest path heuristic, adding uncertainty to analyses and thus improving privacy.

However, we should note an important exception to avoiding the shortest path:

* If the shortest path is a single hop, directly from the payer to the payee without intervening nodes, then we should prefer that path rather than avoid it.

The reason why we have this exception is that ***not telling any other nodes gives us the best privacy***.
This is achievable only by having direct channels with a payee, thus it ***does not scale***, since it would not be practical to have direct channels between every payer and payee.
However, if we find that the payment is to a direct peer, taking this exception should be a massive, low-hanging privacy win.

This brings up the following observation:

* In order to avoid the shortest path, we may need to add more nodes to the path we use.
  * By adding more nodes to the path we use, we increase the number of nodes that know our payment, increasing our chances of informing some surveillor.

Extended Shadow Routing
-----------------------

Another apropos observation is that having more nodes on a path reduces the reliability of payments along the path.
Every node on that path is a possible failure point in forwarding.

Worse, paths that are longer in terms of hops are more likely to have the below failure mode:

* A forwarding node is able to accept and irrevocably forward an HTLC.
* Before the outgoing HTLC can be claimed offchain, the forwarding node shuts down.
* The payment is now "stuck" until the incoming HTLC of the node reaches timeout.

Basically, if a forwarding node fails between the time it forwards the HTLC and the time the next node is able to claim it, payments get stuck for durations that can be counted in entire blocks.
As path length increases, this failure mode becomes more likely:

* There are more possible nodes to fail.
* The time between when a node is able to forward and the next node is able to claim becomes longer, making it more likely that an individual node fails during this time.

Of course, taking the shortest path in terms of node hops is precisely what reverse routing tables can break.
This is the core of the observation that privacy and usability are apparently at odds with one another.

Reverse routing tables cannot precisely identify the payer and payee.
However, the timelock, as well as the time it takes for an HTLC to be forwarded and then claimed, can give hints as to the distance between the forwarding node and the payee.
We cannot do anything about the sidereal time at which events occur, but we *can* do something about timelocks, i.e. shadow routing as already implemented in C-Lightning.

A secondary concern is that, after path decorrelation is implemented, the *delta* in value, timelock, and sidereal time between two cooperating surveilling nodes can be used to determine if two forwards on those nodes belong to the same overall payment.
Shadow routing does not help this, since it only adds timelock and value tweaks to the *end* of the path, and any *delta* between two surveilling nodes remains the same.

Avoiding the shortest path has a chance that the sub-path between the two surveilling nodes is not the delta of the shortest path between them, which helps remove payment correlation in a post-decorrelation world.
However, it does have the drawback that increasing the actual path length tends to increase the number of nodes along the path, reducing reliability.

By inserting shadow routes in the middle of the path instead of just the payee end, we can gain some of the above advantage of removing payment correlation in a post-decorrelation world, without having to incur the reduced reliability of actually increasing the number of hops along a path.
This is only useful in a post-decorrelation world, as use of HTLCs means that two cooperating surveilling nodes can determine if two forwards between them belong to the same overall payment.

Inserting such shadow routes basically amounts to overpaying fees and giving more timelock slack to forwarding nodes.

Multipath Value Obfuscation
---------------------------

An observation we can have is that, the difference between a fixed known value and a random value will be indistinguishable from a random value.
The same observation underlies adaptor signatures and payment decorrelation.

Thus, one way to obfuscate the unremovable data of the payment value is to always start with payments split into two subpayments.

* Take a random value from 1 to the total payment minus 1 and use it on one subpayment.
* Take the difference between the total payment amount and the above random value and use it as the other subpayment.

Both numbers will look like random values, and only together can they be used to identify the actual value.

The value obfuscation only works to obfuscate the final payee value.
Without "avoiding shortest path" or "extended shadow route", the value of the payment can still be used by two cooperating surveilling nodes to identify if two forwards belong to the same (sub)payment, even in a post-decorrelation world.

With HTLCs, two cooperating surveilling nodes that happen to be on separate subpayments can still identify the total payment and thereby have a guess on what is being bought.