[Bitcoin-development] On bitcoin testing

Gregory Maxwell gmaxwell at gmail.com
Wed Oct 10 00:03:52 UTC 2012

On Tue, Oct 9, 2012 at 7:12 PM, Jeff Garzik <jgarzik at exmulti.com> wrote:
> * Data-driven tests: If possible, write software-neutral, data-driven
> tests.  This enables clients other than the reference one (Satoshi
> client) to be tested.  Embed tests in testnet3 chain, if possible.

The mention of testnet3 here reminds me to make a point:  Confirmation
bias is a common problem for software testing— people often over test
the success cases and under-test the failure cases.  This is certainly
the case in Bitcoin: For example, testnet3+the packaged tests test all
the branches inside the interior script evaluation engine _except_ the
rejection cases.

For us failure cases can be harder to package up (e.g. can't be placed
in testnet) but Matt's node-simulation based tester provides a good
example of how to create a data driven test set that tests both
failure cases and dynamic behavior (e.g. reorgs).

Testing of failure cases is absolutely critical for testing of
implementation compatibility: The existence of a difference in what
gets rejected in a widely deployed alternative node could result in an
utterly devastating network split.

Generally every test of something which must succeeded should be
matched by a test of something that must fail. Personally, I like to
test the boundary cases— e.g. if something has an allowed range of
[0-8], I'll test -1,0,8,9 at a minimum. Though reasoning trumps rules
of thumb.

Confirmation bias is another reason why it's important to have a more
diverse collection of testers than the core developers.  People who
work closely with the software have strong expectations of how the
software should work and are less likely to test crazy corner cases
because they "know" the outcome, sometimes erroneously.

To reinforce Jeff's list of different approaches: I've long found that
each mechanism of software testing has diminishing returns the more of
it you apply. So you're best off using as many different approaches a
little rather than spending all your resources going as deep as
possible with any one approach.

There are also some kind of testing which are synergistic: Almost all
testing is enhanced enormously by combining it with valgrind because
it substantially lowers the threshold of issue detection substantially
(e.g. detecting bogus memory accesses which are _currently_ causing a
crash for you but could). If I could only test one of "with valgrind"
or "without" I'd test with every time.  Sadly valgrind doesn't exist
on windows and it's rather slow. Dr. Memory
(http://code.google.com/p/drmemory/) may be an alternative on Windows,
and there is work to port ASAN to GCC so it may be possible to mingw
ASAN builds in not too long.

I've also found that any highly automatable testing (coded data
driven, unit, and fuzz testing) combines well with diverse
compilation, e.g. building on as many system types and architectures—
including production irrelevant ones— as possible in the hopes that
some system specific quark make a bug easier to detect.

More information about the bitcoin-dev mailing list