[Lightning-dev] Quick analysis of channel_update data

Christian Decker decker.christian at gmail.com
Tue Jan 8 15:15:04 UTC 2019


Rusty Russell <rusty at rustcorp.com.au> writes:
>> But only 18 000 pairs of channel updates carry actual fee and/or HTLC
>> value change. 85% of the time, we just queried information that we
>> already had!
>
> Note that this can happen in two legitimate cases:
> 1. The weekly refresh of channel_update.
> 2. A node updated too fast (A->B->A) and the ->A update caught up with the
>    ->B update.
>  
> Fortunately, this seems fairly easy to handle: discard the newer
> duplicate (unless > 1 week old).  For future more advanced
> reconstruction schemes (eg. INV or minisketch), we could remember the
> latest timestamp of the duplicate, so we can avoid requesting it again.

Unfortunately this assumes that you have a single update partner; it
still results in flaps, and it might even leave some channels stuck in
an outdated state.

Assume that we have a network in which a node D receives the updates
from a node A through two or more separate paths:

A --- B --- D
 \--- C ---/

And let's assume that some channel of A (c_A) is flapping (not the ones
to B and C). A will send out two updates: one disables c_A and the other
re-enables it; apart from that they are identical (timestamp and
signature differ as well, of course). B's flush interval is long enough
that it sees both updates before flushing, hence both updates get
dropped and, as far as B is concerned, nothing changed (D doesn't get
told anything by B). C's flush interval triggers before it has seen the
re-enable, so D gets the disabling update first, followed by the
enabling update once C's flush interval triggers again. Worse, if the
A-C connection gets severed between the two updates, C and D have now
learned that the channel is disabled and will never get the re-enabling
update, since B has dropped it altogether. If B now gets told by D about
the disable, it'll also go "ok, I'll disable it as well", leaving the
entire network believing that the channel is disabled.
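
To make the sequence concrete, here's a minimal sketch of that failure
mode (the Update/Node classes, field names and timestamps are made up
for illustration, not taken from any actual implementation): per-peer
batching keeps only the newest staged update per channel, and
content-identical updates are dropped at flush time, so B swallows the
re-enable while C forwards only the disable.

from dataclasses import dataclass

@dataclass(frozen=True)
class Update:
    scid: str        # short channel id of the flapping channel c_A
    timestamp: int
    disabled: bool   # the only policy field modelled here

    def content(self):
        return (self.scid, self.disabled)  # everything except timestamp/signature

class Node:
    def __init__(self, name):
        self.name = name
        self.announced = {}  # scid -> last update we accepted and forwarded
        self.staged = {}     # scid -> newest update waiting for the next flush

    def receive(self, upd):
        cur = self.staged.get(upd.scid) or self.announced.get(upd.scid)
        if cur and cur.timestamp >= upd.timestamp:
            return                      # stale, ignore
        self.staged[upd.scid] = upd     # a newer update replaces an older staged one

    def flush(self, peers):
        pending, self.staged = self.staged, {}
        for scid, upd in pending.items():
            prev = self.announced.get(scid)
            if prev and prev.content() == upd.content():
                continue                # only timestamp/signature changed: drop as duplicate
            self.announced[scid] = upd
            for p in peers:
                p.receive(upd)

enable0  = Update("c_A", timestamp=100, disabled=False)
disable  = Update("c_A", timestamp=200, disabled=True)
reenable = Update("c_A", timestamp=201, disabled=False)

B, C, D = Node("B"), Node("C"), Node("D")
for n in (B, C, D):
    n.announced["c_A"] = enable0            # everyone starts out believing c_A is enabled

B.receive(disable); B.receive(reenable)     # both land within B's flush interval
C.receive(disable)                          # C's flush triggers before the re-enable ...
C.flush(peers=[D])                          # ... so D learns only about the disable,
                                            # and A-C is severed before the re-enable arrives
B.flush(peers=[D])                          # the re-enable looks like a duplicate: dropped
D.flush(peers=[B, C])                       # D relays the disable back to B
B.flush(peers=[D]); C.flush(peers=[D])

for n in (B, C, D):
    print(n.name, "believes c_A is disabled:", n.announced["c_A"].disabled)
# prints True for all three, even though A's last channel_update re-enabled c_A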

This is really hard to debug, since A has sent a re-enabling
channel_update, but everybody is stuck in the old state.

Locally updating the timestamp and signature for otherwise identical
updates, while not broadcasting them if those were the only changes,
would at least prevent the last issue of a dropped newer state being
overridden by an earlier one, but it'd still leave C and D in an
inconsistent state until we have some sort of passive sync that compares
routing tables and fixes these issues.
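
In terms of the toy model above, that mitigation only changes the flush
step (again just a sketch): B then remembers the re-enable's newer
timestamp and rejects the disable that D relays back, but C and D still
believe c_A is disabled.

def flush_with_local_refresh(node, peers):
    pending, node.staged = node.staged, {}
    for scid, upd in pending.items():
        prev = node.announced.get(scid)
        if prev and prev.content() == upd.content():
            # Content-identical: keep the newer timestamp/signature locally so the
            # older disable that D relays back is rejected as stale, but still
            # don't rebroadcast anything.
            node.announced[scid] = upd
            continue
        node.announced[scid] = upd
        for p in peers:
            p.receive(upd)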

>> Adding a basic checksum (4 bytes for example) that covers fees and
>> HTLC min/max value to our channel range queries would be a significant
>> improvement and I will add this the open BOLT 1.1 proposal to extend
>> queries with timestamps.
>>
>> I also think that such a checksum could also be used
>> - in “inventory” based gossip messages
>> - in set reconciliation schemes: we could reconcile [channel id |
>> timestamp | checksum] first
>
> I think this is overkill?

I think all the bolted-on things are pretty much overkill at this point.
It is unlikely that we will get any consistency in our views of the
routing table, but that's actually not needed to route, and we should
consider this a best-effort gossip protocol anyway. If the routing
protocol is too chatty, we should make efforts towards local policies at
the senders of the updates to reduce the number of flapping updates, not
build in-network deduplication. Maybe something like "eager-disable" and
"lazy-enable" is what we should go for, in which disables are sent right
away, and enables are put on an exponential backoff timeout (after all,
what use are flappy nodes for routing?).
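
A sender-side policy along those lines might look roughly like the
following sketch (the class, method names and default timings are all
made up, none of this is specified anywhere): disables are flagged for
immediate broadcast, while re-enables are held back by a delay that
doubles while the channel keeps flapping and resets once it has been
quiet for a while.

import time

class EnableBackoff:
    """Per-channel re-enable throttle; all constants are arbitrary examples."""

    def __init__(self, base=60.0, cap=6 * 3600.0, window=3600.0):
        self.base = base          # first re-enable delay, in seconds
        self.cap = cap            # maximum re-enable delay
        self.window = window      # a disable within this window counts as a flap
        self.delay = base
        self.last_disable = 0.0
        self.enable_due = None    # when a pending re-enable may be broadcast

    def on_disable(self, now=None):
        now = time.time() if now is None else now
        if now - self.last_disable < self.window:
            self.delay = min(self.delay * 2, self.cap)  # still flapping: back off harder
        else:
            self.delay = self.base                      # quiet for a while: reset
        self.last_disable = now
        self.enable_due = None    # a pending re-enable is superseded by the disable
        return True               # eager-disable: broadcast the disabling update right away

    def on_enable(self, now=None):
        now = time.time() if now is None else now
        self.enable_due = now + self.delay
        return False              # lazy-enable: hold the enabling update back for now

    def ready_to_enable(self, now=None):
        now = time.time() if now is None else now
        return self.enable_due is not None and now >= self.enable_due

The exact constants obviously don't matter; the point is that a flapping
channel pays for its flapping with its own advertised availability
instead of with everybody else's gossip bandwidth.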

Cheers,
Christian

