[Lightning-dev] Improve Lightning payment reliability through better error attribution

Sat Jun 15 02:53:16 UTC 2019

Good morning Joost,

> Yes that is accurate, although using the time difference between receiving the `update_add_htlc` and sending back the `update_fail_htlc` would work too. It would then include the node's processing time.

It would not work safely.
A node can only propagate an `update_fail_htlc` if the downstream `update_fail_htlc` has been irrevocably committed by `revoke_and_ack`.
See BOLT spec about this.

Suppose we have a route A -> B -> C.
C sends `update_fail_htlc` immediately, but dallies on `revoke_and_ack`.
B cannot send `update_fail_htlc` to A yet, because C can still drop the previous B-C channel state onchain (it is not yet revoked, that is what the `revoke_and_ack` will later do).
If B send `update_fail_htlc` to A as soon as it receives `update_fail_htlc` from C, A can use the new A-B channel state onchain, while at the same time C drops the previous B-C channel state onchain.
the new A-B channel state returns the HTLC to A, while the previous B-C channel state has the HTLC still claimable by C, causing B to lose funds.

For `update_fulfill_htlc` B can immediately propagate to A (without waiting for `update_and_ack` from C) since C is already claiming the money.

Since, B cannot report the `update_fail_htlc` immediately, its timer should still be running.
Suppose we counted only up to `update_fail_htlc` and not on the `revoke_and_ack`.
If C sends `update_fail_htlc` immediately, then the `update_add_htlc`->`update_fail_htlc` time reported by B would be fast.
But if C then does not send `revoke_and_ack`, B cannot safely propagate `update_fail_htlc` to A, so the time reported by A will be slow.
This sudden transition of time from A to B will be blamed on A and B, while C is unpunished.

That is why, for failures, we ***must*** wait for `revoke_and_ack`.
The node must report the time when it can safely propagate the error report upstream, not the time it receives the error report.
For payment fulfillment, `update_fulfill_htlc` is fine without waiting for `revoke_and_ack` since it is always reported immediately upstream anyway.

See my discussion about "fast forwards": https://lists.linuxfoundation.org/pipermail/lightning-dev/2019-April/001986.html

> I think we could indeed do more with the information that we currently have and gather some more by probing. But in the end we would still be sampling a noisy signal. More scenarios to take into account, less accurate results and probably more non-ideal payment attempts. Failed, slow or stuck payments degrade the user experience of lightning, while "fat errors" arguably don't impact the user in a noticeable way.

Fat errors just give you more information when a problem happens for a "real" payment.
But the problem still occurs on the "real" payment and user experience is still degraded.

Background probing gives you the same information **before** problems happen for "real" payments.

Regards,
ZmnSCPxj