[Lightning-dev] Fat Errors

Joost Jager joost.jager at gmail.com
Wed Oct 19 11:12:26 UTC 2022


Hi list,

I wanted to get back to a long-standing issue in Lightning: gaps in error
attribution. I've posted about this before back in 2019 [1].

Error attribution is important to properly penalize nodes after a payment
failure occurs. The goal of the penalty is to give the next attempt a
better chance at succeeding. In the happy failure flow, the sender is able
to determine the origin of the failure and penalizes a single node or pair
of nodes.

Unfortunately it is possible for nodes on the route to hide themselves. If
they return random data as the failure message, the sender won't know where
the failure happened. Some senders then penalize all nodes that were part
of the route [4][5]. This may exclude perfectly reliable nodes from being
used for future payments. Other senders penalize no nodes at all [6][7],
which allows the offending node to keep the disruption going.

A special case of this is a final node sending back random data. Senders
that penalize all nodes will keep looking for alternative routes. But
because each alternative route still ends with that same final node, the
sender will ultimately penalize all of its peers and possibly a lot of the
rest of the network too.

I can think of various reasons for exploiting this weakness. One is just
plain grievance for whatever reason. Another one is to attract more traffic
by getting competing routing nodes penalized. Or the goal could be to
sufficiently mess up reputation tracking of a specific sender node to make
it hard for that node to make further payments.

Related to this are delays in the path. A node can delay propagating back a
failure message and the sender won't be able to determine which node did
it.

The link at the top of this post [1] describes a way to address both
unreadable failure messages and delays by letting each node on the
route append a timestamp and hmac to the failure message. The great
challenge is to do this in such a way that nodes don’t learn their position
in the path.

I'm revisiting this idea, and have prototyped various ways to implement it.
In the remainder of this post, I will describe the variant that I thought
works best (so far).

# Failure message format

The basic idea of the new format is to let each node (not just the error
source) commit to the failure message when it passes it back by adding an
hmac. The sender verifies all hmacs upon receipt of the failure message.
This makes it impossible for any of the nodes to modify the failure message
without revealing that they might have played a part in the modification.
It won’t be possible for the sender to pinpoint an exact node, because
either end of a communication channel may have modified the message.
Pinpointing a pair of nodes however is good enough, and is commonly done
for regular onion failures too.

On the highest level, the new failure message consists of three parts:

`message` (var len) | `payloads` (fixed len) | `hmacs` (fixed len)

* `message` is the standard onion failure message as described in [2], but
without the hmac. The hmac is now part of `hmacs` and doesn't need to be
repeated.

* `payloads` is a fixed length array that contains space for each node
(`hop_payload`) on the route to add data to return to the sender. Ideally
the contents and size of `hop_payload` are signaled so that future
extensions don’t require all nodes to upgrade. For now, we’ll assume the
following 9-byte format:

  `is_final` (1 byte) | `duration` (8 bytes)

  `is_final` indicates whether this node is the failure source. The sender
uses `is_final` to determine when to stop the decryption/verification
process.

  `duration` is the time in milliseconds that the node held the htlc. By
observing the series of reported durations, the sender is able to pinpoint
a delay down to a pair of nodes.

  The `hop_payload` is repeated 27 times (the maximum route length).

  Every hop shifts `payloads` 9 bytes to the right and puts its own
`hop_payload` in the 9 left-most bytes.

* `hmacs` is a fixed length array where nodes add their hmacs as the
failure message travels back to the sender.

  To keep things simple, I'll describe the format as if the maximum route
length was only three hops (instead of 27):

  `hmac_0_2` | `hmac_0_1` | `hmac_0_0` | `hmac_1_1` | `hmac_1_0` | `hmac_2_0`

  Because nodes don't know their position in the path, it's unclear to them
what part of the failure message they are supposed to include in the hmac.
They can't just include everything, because if part of that data is deleted
later (to keep the message size fixed) it opens up the possibility for
nodes to blame others.

  The solution here is to provide hmacs for all possible positions. The
last node that updated `hmacs` added `hmac_0_2`, `hmac_0_1` and `hmac_0_0`
to the block. Each hmac corresponds to a presumed position in the path,
where `hmac_0_2` is for the longest path (2 downstream hops) and `hmac_0_0`
for the shortest (node is the error source).

  `hmac_x_y` is the hmac added by node x (counted from the node that is
currently handling the failure message) assuming that this node is y hops
away from the final node.
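
To make this concrete, here is a minimal Go sketch of the overall layout as I understand it. All names are purely illustrative and not part of any spec; the exact encoding of `hop_payload` in particular is still open, as noted above.

```go
package faterror

const (
	maxHops        = 27
	hopPayloadSize = 9 // 1 byte is_final + 8 bytes duration
	hmacSize       = 32

	// Triangular hmac layout: the most recent hop keeps maxHops hmacs
	// (one per possible position), the next hop maxHops-1, and so on.
	numHmacs = maxHops * (maxHops + 1) / 2 // 378
)

// hopPayload is the fixed-size record each hop prepends to `payloads`.
type hopPayload struct {
	isFinal  bool   // set only by the failure source
	duration uint64 // htlc hold time in milliseconds
}

// fatError is the new failure message: the legacy failure `message` (now
// without its own hmac), followed by the fixed-size payload and hmac blocks.
type fatError struct {
	message  []byte
	payloads [maxHops]hopPayload
	hmacs    [numHmacs][hmacSize]byte
}
```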

Before a hop adds its hmacs, it first deletes some of the previous hmacs.
This keeps the failure message at a fixed length. The removed hmacs are the
ones that cannot be useful anymore. If node 0 adds itself, the former node
0 (now node 1) cannot be at the first position anymore. The former node 1
(now node 2) cannot be at the second position anymore. The former node 2
cannot be the source of the error anymore and isn’t represented in the
failure message any longer. The corresponding hmacs (the now non-existent
`hmac_0_2`, `hmac_1_1` and `hmac_2_0`) are deleted by node 0.

Deleting the useless data cuts the number of hmacs (and with it roughly the
total failure message size) in half.

The delete operation transforms the fields above to:

<empty> | <empty> | <empty> | `hmac_0_1` | `hmac_0_0` | `hmac_1_0`
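
To illustrate the delete-and-shift step, here is a sketch continuing the hypothetical types from the layout sketch above. It only moves data around; what actually goes into each hmac is described next.

```go
// shiftPayloads pushes the existing hop payloads one slot (9 bytes) to the
// right, dropping the last one, and puts the current hop's payload in front.
func shiftPayloads(payloads *[maxHops]hopPayload, own hopPayload) {
	copy(payloads[1:], payloads[:maxHops-1])
	payloads[0] = own
}

// groupStart returns the offset of hop g's hmac group in the flat `hmacs`
// array. Group g holds maxHops-g entries, ordered from the longest assumed
// path down to "hop g is the error source".
func groupStart(g int) int {
	return g*maxHops - g*(g-1)/2
}

// shiftHmacs relabels the existing hmacs to account for every previous hop
// now being one position further away. In each old group the first entry
// (its longest-path assumption) can no longer be true and is dropped; the
// last old group disappears entirely. Group 0 becomes free for the current
// hop's own hmacs.
func shiftHmacs(hmacs *[numHmacs][hmacSize]byte) {
	// Handle the rightmost group first so data is copied out of a region
	// before it is overwritten by the group to its left.
	for g := maxHops - 2; g >= 0; g-- {
		src := groupStart(g) + 1 // skip the now-impossible longest-path entry
		dst := groupStart(g + 1)
		copy(hmacs[dst:dst+maxHops-g-1], hmacs[src:src+maxHops-g-1])
	}
}
```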

The exact data that is included in each hmac is:
  * `message`
  * the node’s own `hop_payload` and a set of downstream `hop_payload`s,
depending on assumed position
  * a set of downstream node hmacs, depending on assumed position

For example `hmac_0_1` is based on:

`message` | `hop_payload[0]` | `hop_payload[1]` | `hmac_1_0`

If the node that is currently handling the failure message is one hop away
from the final node, it needs to cover its own `hop_payload[0]`, the final
node's `hop_payload[1]` and the final node's hmac `hmac_1_0`.

A longer path is committed to in `hmac_0_2`:

`message` | `hop_payload[0]` | `hop_payload[1]` | `hop_payload[2]` |
`hmac_1_1` | `hmac_2_0`

The current node is two hops away from the final node. It needs to cover
its own `hop_payload[0]` as well as `hop_payload[1]` and `hop_payload[2]`
for the next and final hops. Additionally it covers the next hop's `hmac_1_1`
and the final hop's `hmac_2_0`, which correspond to the positions of those nodes
in the path that is assumed for `hmac_0_2`.
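
A sketch of how a hop could compute its own group of hmacs under these rules, in the same hypothetical package as above. The key is a placeholder for whatever is eventually derived from the per-hop shared secret; that derivation is not specified here.

```go
import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/binary"
)

// encodePayload serializes a hop payload into the assumed 9-byte format.
func encodePayload(p hopPayload) []byte {
	var buf [hopPayloadSize]byte
	if p.isFinal {
		buf[0] = 1
	}
	binary.BigEndian.PutUint64(buf[1:], p.duration)
	return buf[:]
}

// computeOwnHmacs computes the maxHops hmacs the current hop places in group
// 0, one for every assumed distance y to the final node. It must run after
// shiftPayloads/shiftHmacs, so that payloads[0] is the hop's own payload and
// groups 1..maxHops-1 hold the downstream hmacs at their new positions.
func computeOwnHmacs(key, message []byte,
	payloads *[maxHops]hopPayload,
	hmacs *[numHmacs][hmacSize]byte) [][hmacSize]byte {

	own := make([][hmacSize]byte, maxHops)
	for y := 0; y < maxHops; y++ {
		mac := hmac.New(sha256.New, key)
		mac.Write(message)

		// The hop's own payload plus those of the y assumed downstream hops.
		for i := 0; i <= y; i++ {
			mac.Write(encodePayload(payloads[i]))
		}

		// For each assumed downstream hop i, the hmac it made for the
		// position it would occupy in this particular path (y-i hops from
		// the final node).
		for i := 1; i <= y; i++ {
			mac.Write(hmacs[groupStart(i)+maxHops-1-y][:])
		}

		// Group entries are ordered longest path first, so the hmac for
		// distance y lands at index maxHops-1-y.
		copy(own[maxHops-1-y][:], mac.Sum(nil))
	}
	return own
}
```

A hop would call `shiftPayloads` and `shiftHmacs` first and then `copy(hmacs[:maxHops], computeOwnHmacs(...))` to fill group 0.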

With this information, the sender is able to verify the longest chain of
hmacs until it encounters a `hop_payload` with `is_final` set.

If any of the nodes messes with any byte in the failure message, the sender
is always able to determine a pair of nodes that the offending node is part
of. This statement can be verified through reasoning, but to be sure I also
tested it with code. I’ve simulated a malicious node that modifies a byte
of the failure message at index x and observed the error source as
determined by the sender. For every x, the sender reports the same correct
pair.
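
For completeness, a sketch of the corresponding sender-side check, again using the hypothetical helpers above. Per-hop obfuscation of the returned message (if that part of the legacy scheme is kept) is assumed to have been stripped already, and `keys[x]` is the hmac key the sender shares with hop x.

```go
// verifyFatError returns the index of the reported error source and
// failedHop == -1 if every hmac on that chain verifies, or the index of the
// first hop whose hmac fails, meaning that hop or the link towards it
// modified the failure message.
func verifyFatError(keys [][]byte, message []byte,
	payloads *[maxHops]hopPayload,
	hmacs *[numHmacs][hmacSize]byte) (source, failedHop int) {

	// Find the hop that claims to be the failure source.
	e := -1
	for x := range payloads {
		if payloads[x].isFinal {
			e = x
			break
		}
	}
	if e == -1 || e >= len(keys) {
		return -1, 0 // no plausible source reported: blame the first pair
	}

	// On a path of e+1 hops, every relevant hmac sits at the same offset
	// within its group: the entry for "e-x hops away from the final node".
	offset := maxHops - 1 - e
	for x := 0; x <= e; x++ {
		mac := hmac.New(sha256.New, keys[x])
		mac.Write(message)
		for i := x; i <= e; i++ {
			mac.Write(encodePayload(payloads[i]))
		}
		for i := x + 1; i <= e; i++ {
			mac.Write(hmacs[groupStart(i)+offset][:])
		}
		if !hmac.Equal(mac.Sum(nil), hmacs[groupStart(x)+offset][:]) {
			return -1, x
		}
	}
	return e, -1
}
```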

# Size

The obvious downside of the scheme above is the size. Given a maximum of 27
hops, the `hmacs` block contains 27+26+25+...+1=378 hmacs of 32 bytes each.
This makes for a total size of 12 KB.

It could be the case though that it is not possible to devise a more
compact scheme that also preserves the existing privacy guarantees. I know
that smart people have spent time on this problem, but nonetheless no
better solution has come up in the past years. A proof of its non-existence
would be interesting for sure.

I personally think the size increase is justified to fix this vulnerability
in Lightning. Also, if failures become rarer going forward, the size of the
failure message becomes less relevant to the overall operation of the network.

Another option is to reduce the maximum number of hops. It is questionable
whether 27 hops are really needed in practice, and such long routes also
contribute to latency and capital lock-up. If, for example, the new failure
message could only be used with routes up to 10 hops, the total number of
hmacs would drop from 378 to 55. This makes for a total message size of
about 2 KB.
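
Both figures follow directly from the triangular layout: n*(n+1)/2 hmacs of 32 bytes plus n payloads of 9 bytes for a maximum route length of n hops. A quick check, reusing the constants from the sketches above:

```go
// fatErrorOverhead returns the number of bytes that the `payloads` and
// `hmacs` blocks add for a maximum route length of n hops.
func fatErrorOverhead(n int) int {
	return n*hopPayloadSize + n*(n+1)/2*hmacSize
}

// fatErrorOverhead(27) = 27*9 + 378*32 = 12339 bytes, roughly 12 KB.
// fatErrorOverhead(10) = 10*9 +  55*32 =  1850 bytes, roughly  2 KB.
```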

# Signaling

For backwards compatibility nodes need to know what algorithm they should
run to generate or transform the failure message. This can be signaled by
the sender via a tlv onion field. A failure message format signaling
mechanism is also discussed in the context of long failure messages [3].
The failure message described in this post could be just another version.

Additionally, intermediate nodes need to advertise their capability to
transform the new format through a feature bit.

# Delayed successes

It’s not just failures that can be delayed. Successes can too. In that
case, there is no failure message to improve. It could be an option to add
the same `payloads` and `hmacs` blocks to the `update_fulfill_htlc` message.

[1] https://lists.linuxfoundation.org/pipermail/lightning-dev/2019-June/002015.html
[2] https://github.com/lightning/bolts/blob/master/04-onion-routing.md#returning-errors
[3] https://github.com/lightning/bolts/pull/1021
[4] https://github.com/lightningnetwork/lnd/blob/4fbd608b734f348d7e79fbfc7feaecc5c6c33a90/routing/result_interpretation.go#L419
[5] https://github.com/ACINQ/eclair/blob/a0433aa0c027c9be618c5afe18e7f91642a7f372/eclair-core/src/main/scala/fr/acinq/eclair/payment/PaymentEvents.scala#L221
[6] https://github.com/ElementsProject/lightning/blob/62bfed9a8df8731be44ba4e86afb08a5d28a4442/plugins/libplugin-pay.c#L1461
[7] https://github.com/lightningdevkit/rust-lightning/blob/e61f3a238a70cbac87209e223b7c396108a49b97/lightning-invoice/src/payment.rs#L682