[bitcoin-dev] Libre/Open blockchain / cryptographic ASICs
ZmnSCPxj at protonmail.com
Thu Feb 11 08:20:54 UTC 2021
Good morning Luke,
> > (to be fair, there were tools to force you to improve coverage by injecting faults to your RTL, e.g. it would virtually flip an `&&` to an `||` and if none of your tests signaled an error it would complain that your test coverage sucked.)
It should be possible for a tool to be developed to parse a Verilog RTL design, then generate a new version of it with one change.
Then you could add some automation to run a set of testcases around mutated variants of the design.
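As a rough illustration of the "one change per variant" idea, here is a minimal Python sketch. It is not a real Verilog parser: it just finds `&&` and `||` with a regular expression (so it would also mutate operators inside comments or strings), and the function name `mutants` and the swap table are made up for this example.

```python
import re
from typing import Iterator, Tuple

# Hypothetical swap table; a real tool would mutate many more constructs.
SWAPS = {"&&": "||", "||": "&&"}
_OPERATOR = re.compile(r"&&|\|\|")


def mutants(rtl: str) -> Iterator[Tuple[int, str]]:
    """Yield (offset, mutated_source) pairs, each with exactly one operator swapped."""
    for match in _OPERATOR.finditer(rtl):
        swapped = SWAPS[match.group(0)]
        yield match.start(), rtl[:match.start()] + swapped + rtl[match.end():]


if __name__ == "__main__":
    source = "assign ok = (a && b) || c;"
    for offset, text in mutants(source):
        print(offset, text)
```

Each yielded variant could then be written to its own file under a different module name and fed to the same set of testcases.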
For example, it could create a "wrapper" module that connects to an unmutated differently-named version of the design, and various mutated versions, wire all their inputs together, then compare outputs.
If the testcase could trigger an output of a mutated version to be different from the reference version, then we would consider that mutation covered by that testcase.
Possibly that could be done with Verilog-2001 file writing code in the wrapper module to dump out which mutations were covered, then a summary program could just read in the generated file.
Or Verilog plugins could be used as well (Icarus supports this, that is how it implements all `$` functions).
A drawback is that just because an output is different does not mean the testcase actually ***checks*** that output.
If the testcase does not detect the diverging output, then it is still not properly covering that mutation.
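The difference between "the output diverged" and "the testcase noticed" can be sketched in Python, with plain functions standing in for the reference and mutated designs. The names `classify`, `Design` and `Check` and the three result labels are assumptions made for this sketch, not part of any existing tool.

```python
from typing import Callable, Iterable, Tuple

Design = Callable[[int, int], int]        # stand-in for a DUT: (a, b) -> output
Check = Callable[[int, int, int], bool]   # (a, b, output) -> True if the testcase accepts it


def classify(reference: Design, mutant: Design,
             vectors: Iterable[Tuple[int, int]], check: Check) -> str:
    """Return 'detected', 'diverged-only' or 'not-covered' for one mutant."""
    diverged = False
    for a, b in vectors:
        out = mutant(a, b)
        if out != reference(a, b):
            diverged = True
            if not check(a, b, out):
                return "detected"          # the testcase actually failed on the mutant
    return "diverged-only" if diverged else "not-covered"


if __name__ == "__main__":
    ref = lambda a, b: a & b
    mut = lambda a, b: a | b               # the `&&` -> `||` style mutation
    print(classify(ref, mut, [(0, 1)], lambda a, b, out: True))
```

A wrapper that only compares outputs would report both "detected" and "diverged-only" as covered, which is the drawback described above.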
The point of this is to check coverage of the tests.
Not sure how well this works with formal validation.
> > Synthesis in particular is a black box and each vendor keeps their particular implementations and tricks secret.
> sigh. i think that's partly because they have to insert diodes, and buffers, and generally mess with the netlist.
> i was stunned to learn that in a 28nm ASIC, 50% of it is repeater-buffers!
Well, that surprises me as well.
On the other hand, smaller technologies consistently have lower raw output current driving capability due to the smaller size, and as trace width goes down and frequency goes up they stop acting like ideal 0-impedance traces and start acting more like transmission lines.
So I suppose at some point something like that would occur and I should not actually be surprised.
(Maybe I am more surprised that it reached that level at that technology size, I would have thought 33% at 7nm.)
In the modules where we were doing manual netlist+layout, we used inverting buffers instead (slightly smaller than non-inverting buffers; in most technologies a non-inverting buffer is just an inverter followed by an inverting buffer). This was an advantage of manual design, since it looks like synthesis tools are not willing to invert the contents of intermediate flip-flops even if using an inverting buffer rather than a non-inverting one could give a theoretical speed+size advantage (it looks like synthesis optimization starts at the output of flip-flops and ends at their input, so a manual designer could achieve slightly better performance if they were willing to invert an intermediate flip-flop).
Another was that inverting latches were smaller in the technology we were using than non-inverting latches, so it was perfectly natural for us to use an inverting latch and an inverting buffer on those parts where we needed higher fan-out (it was equivalent to a "custom" latch that had higher-than-normal driving capability).
Scan chain test generation was impossible though, as those require flip-flops, not latches.
Fortunately this was "just" deserialization of high-frequency low-width data with no transformation of the data (that was done after the deserialization, at lower clock speeds but higher data width, in pure RTL so flip-flops), so it was judged acceptable that it would not be covered by scan chain, since scan chain is primarily for testing combinational logic between flip-flops.
So we just had flip-flops at the input, and flip-flops at the output, and forced all latches to pass-through mode, during scan mode.
We just needed to have enough coverage to uncover stuck-at faults (which was still a pain, since additional test vectors slow down manufacturing so we had to reduce the test vectors to the minimum possible) in non-scan-mode testing.
Man, making ASICs was tough.
> plus, they make an awful lot of money, it is good business.
> > Pointing some funding at the open-source Icarus Verilog might also fit, as it lost its ability to do synthesis more than a decade ago due to inability to maintain.
> ah i didn't know it could do synthesis at all! i thought it was simulation only.
Icarus was the only open-source synthesis tool I could find back then, and it dropped synthesis capability fairly early due to maintenance burden (I never managed to get the old version with synthesis compiled and never managed actual synthesis on it, so my knowledge of it is theoretical).
There is an argument that open-source software is not truly open-source unless it can be compiled by open-source compilers or executed by open-source interpreters.
Similarly, I think open-source hardware RTL designs are not truly open-source if there are no open-source synthesis tools that can synthesize it to netlist and then lay it out.
Icarus can interpret most Verilog RTL designs, though.
However, at the time I left, I had already mandated that new code should use `always_comb` and `always_ff` (previously I had mandated that new code should use `always @*` for combinational logic) and was encouraging my subordinates to use loops and `generate`.
Icarus did not support `always_comb` and `always_ff` at the time (though it worked perfectly fine with loops and `generate`).
In addition, at the time, we (actually just me in that company haha) were dabbling in object-oriented testing methodologies (which Icarus has no plans on ever implementing, which is understandable since it is a massive increase in complexity, it is much much harder than the scheduling shenanigans of `always_comb` and the "just treat it as `always`" of `always_ff`).
(Particularly, you need object-oriented testbenches since SystemVerilog includes a fuzz-testing framework to randomize fields of objects according to certain engineer-provided constraints, and then you would use those object fields to derive the test vectors your test framework would feed into the DUT, this was a massive increase in code coverage for a largish up-front cost but once you built the test framework you could just dump various constraints on your test specification objects, I actually caught a few bugs that we would not have otherwise found with our previous checklist-based testing methodology.)
(Unfortunately it turned out that it required a more expensive license and I ended up hogging the only one we had of that more expensive license (which, if I remember correctly, was the same license needed for formal verification of netlist<->RTL equivalence) for this, which killed enthusiasm for this technique, sigh, this is another argument for getting open-source hardware design tools developed; not much sense in having open-source RTL for a crypto device if you have to pay through the nose for a license just to synthesize it, never mind the manufacturing cost.)
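To give a flavour of constraint-driven randomization without a SystemVerilog license, here is a minimal Python sketch. It uses naive rejection sampling, which is not how a SystemVerilog constraint solver works, and the `Packet` class, its fields and the single constraint are all invented for illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class Packet:
    """Hypothetical test specification object."""
    length: int   # payload length in bytes
    lanes: int    # number of serial lanes the payload is split across


def randomize(rng: random.Random, max_tries: int = 1000) -> Packet:
    """Draw random fields until the engineer-provided constraint holds."""
    for _ in range(max_tries):
        candidate = Packet(length=rng.randint(1, 64), lanes=rng.choice([1, 2, 4]))
        # Constraint (assumed for this sketch): payload divides evenly over the lanes.
        if candidate.length % candidate.lanes == 0:
            return candidate
    raise RuntimeError("constraints could not be satisfied")


if __name__ == "__main__":
    rng = random.Random(2021)   # fixed seed so a failing test can be replayed
    for _ in range(3):
        print(randomize(rng))
```

The fields of each returned object would then be turned into test vectors for the DUT by the test framework.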
Another point to ponder is test modes.
In mass production you **need** test modes.
There will always be some number of manufacturing defects because even the cleanest of cleanrooms *will* have a tiny amount of contaminants (what can go wrong will go wrong).
Test modes are run in manufacturing to filter out chips with failing circuitry due to contamination.
Now, a typical way of implementing test modes is to have a special command sent over, say, the "normal" serial port interface of a chip, which then enters various test modes to allow, say, scan chain testing.
Of course, scan chain testing is done by pushing test vectors into all flip-flops, and then the test is validated by pulsing global clock once (often the test mode forces all flip-flops on the same clock), then pulling data from all flip-flops to verify that all the circuitry works as designed.
The "pulling data from all flip-flops" is of course just another way of saying that all mass-produced chips have a way of letting ***anyone*** exfiltrate data from their flip-flops via test modes.
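A toy Python model makes the exfiltration point concrete. It assumes a single scan chain in which each shift clock moves every bit one flip-flop along and the last flip-flop drives the scan-out pin; the function name `scan_shift` is made up for this sketch.

```python
from typing import List, Tuple


def scan_shift(chain: List[int], scan_in: List[int]) -> Tuple[List[int], List[int]]:
    """Shift len(scan_in) bits through the chain.

    Returns (new chain contents, bits observed on the scan-out pin).
    """
    chain = list(chain)
    scan_out = []
    for bit in scan_in:
        scan_out.append(chain[-1])     # last flip-flop is visible on scan-out
        chain = [bit] + chain[:-1]     # every flip-flop takes its neighbour's value
    return chain, scan_out


if __name__ == "__main__":
    secret = [1, 0, 1, 1, 0, 0, 1, 0]  # whatever the flip-flops held in normal mode
    _, leaked = scan_shift(secret, [0] * len(secret))
    print(leaked == secret[::-1])
```

Shifting in a full chain's worth of new test data necessarily shifts out the previous contents, which is why entering scan mode after normal operation is dangerous.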
Thus, for a secure environment, you need to ensure that test modes cannot be entered after the device enters normal operation.
For example, you might have a dedicated pad which is normally pulled-down, but if at reset it is pulled up, the device enters test mode.
If at reset the pad is pulled down, the device is in normal mode and even if the pad is pulled up afterwards the device will not enter test mode.
This ensures that only reset data can be read from the device, without possibility of exfiltration of sensitive (key material or midstate) data.
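The "sample the pad only at reset" rule can be written down as a very small behavioural model in Python. The class name `ResetSampledPad` and its methods are hypothetical and only meant to pin down the intended behaviour, not to describe any particular chip.

```python
class ResetSampledPad:
    """Behavioural model: the test-mode pad is only looked at while reset is asserted."""

    def __init__(self) -> None:
        self.test_mode = False

    def reset(self, pad_high: bool) -> None:
        # The pad level is captured once, at reset.
        self.test_mode = bool(pad_high)

    def clock(self, pad_high: bool) -> bool:
        # During normal operation the pad level is deliberately ignored.
        return self.test_mode


if __name__ == "__main__":
    chip = ResetSampledPad()
    chip.reset(pad_high=False)          # normal boot
    print(chip.clock(pad_high=True))    # pulling the pad up later changes nothing
```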
The pad should also not be exposed as a package pinout except perhaps on DS and ES packages, and the pulldown resistor has to be on-chip.
As an additional precaution, we can also create a small secure memory built from flip-flops that are kept out of the scan chain (256 addressable octets would probably be more than enough).
It is possible to exempt flip-flops from scan chain generation (usually by explicitly instantiating flip-flops in a separate module and telling post-synthesis tools to exempt the module from scan chain synthesis).
This gives an extra layer of protection against test mode accessing sensitive data; even if we manage to screw up test mode and it is possible to force reset on the test mode circuit without resetting the rest of the design, sensitive data is still out of the scan chain.
Of course, since they are not on scan, it is possible they have undetectable manufacturing defects, so you would need to use some kind of ECC, or better triple-redundancy best-of-three, to protect against manufacturing defects on the non-scan flip-flops.
Fortunately non-scan flip-flops are often a good bit smaller than scan flip-flops, so the redundancy is not so onerous.
Since the ECC / best-of-three circuit itself would need to be tested, you would multiplex their inputs, in normal mode they get inputs from the non-scan-chain flip-flops, in test mode they get inputs from separate scan-chain flip-flops, so that the ECC / best-of-three circuit is testable at scan mode.
You would also need a separate test of the secure memory, this time running in normal mode with a special test program in the CPU, just in case.
Finally, you would explicitly lay them out "distributed" around the chip, since manufacturing defects tend to correlate in space (they are usually from dust, and dust particles can be large relative to cell size), you do not want all three of the best-of-three to have manufacturing defects.
For example, you could have a 256 x 8 non-scan-chain flip-flop module, instantiate three of those, and explicitly place them in corners of the digital area, then use a best-of-three circuit to resolve the "correct" value.
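A best-of-three circuit is just a bitwise majority function, which can be sketched in Python as follows (the function name `best_of_three` and the example values are made up for this illustration).

```python
def best_of_three(a: int, b: int, c: int) -> int:
    """Bitwise majority vote: each output bit is 1 if at least two inputs have it set."""
    return (a & b) | (a & c) | (b & c)


if __name__ == "__main__":
    stored = 0xC3
    defective = stored & ~0x80   # one copy has bit 7 stuck at 0
    print(hex(best_of_three(stored, defective, stored)))
```

The vote only masks a defect if at most one copy is bad at a given bit position: `best_of_three(0xC3, 0x43, 0x43)` returns the wrong value `0x43`, which is why the three copies should be placed far apart.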
The test mode circuit itself could ensure that the device enters test mode if and only if the secure memory contains all 0 data after the test mode circuit is reset.
For example, the 256 x 8 non-scan-chain flip-flop module could have a large OR circuit that ORs all the flip-flops, then outputs a single bit, `in_use`, that is the OR of all the flip-flop contents.
Then the test mode circuit gets the `in_use` outputs of the three secure flip-flop modules, and if at reset any of them are `1` then it will refuse to enter test mode even if the test mode pad is pulled high.
This ensures that even if an attacker is somehow able to reset *only* the test mode circuit (this is basic engineering, always assume something will go wrong), if the secure memory has any non-0 data (we presume it resets to 0), the device will still not enter test mode.
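The gating rule can be modelled in a few lines of Python, treating each secure memory copy as a list of octets. The function names `in_use` and `may_enter_test_mode` are assumptions for this sketch; in hardware the first would be the large OR tree and the second a small piece of logic evaluated at reset.

```python
from typing import Sequence


def in_use(memory: Sequence[int]) -> bool:
    """OR-reduction over one secure memory copy: True if any bit is set."""
    return any(word != 0 for word in memory)


def may_enter_test_mode(pad_high: bool, copies: Sequence[Sequence[int]]) -> bool:
    """Test mode is allowed only if the pad is high and every copy is still all-zero."""
    return pad_high and not any(in_use(copy) for copy in copies)


if __name__ == "__main__":
    blank = [0] * 256
    used = list(blank)
    used[17] = 0x01                      # a single non-zero octet is enough
    print(may_enter_test_mode(True, [blank, used, blank]))
```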
Of course, if the secure memory itself is accessible from the CPU, then it remains possible that a CPU program is reading from the secure area and keeping raw data in CPU registers, from which a test mode might be able to extract it if the device is somehow forced into test mode even after normal mode.
You could redesign your implementations of field multiplication and SHA midstate computation so that they directly read from the secure memory and write to the secure memory without using any flip-flops along the way, and have only the cryptographic circuit have access to the secure memory.
That way there is reduced possibility of intermediate flip-flops (that are part of the scan chain) outside the secure memory holding sensitive key material or midstate data.
You would need to use a custom bus with separate read and write addresses, and non-pipelined unbuffered access, and since you want to distribute your secure memory physically distant, that translates to wide and long buses (it might be better to use 64 x 32 or 32 x 64 addressable memories, to increase what the cryptographic circuit has access to per clock cycle) screwing with your layout, and probably having to run the secure memory + crypto circuit at a ***much*** slower clock domain (but more secure is a good tradeoff for slowness).
Of course, that is a major design headache (the crypto circuit has to act mostly as a reduced-functionality processor), so you might just want to have the CPU directly access the secure memory and in early boot poke a `0x01` in some part of the memory, in the hope that the `in_use` flag in the previous paragraph is enough to suppress test modes from exfiltrating CPU registers.
Do note that with enough power-cycles and ESD noise you can put digital circuitry into really weird and unexpected states (seen it happen, though fairly hard to replicate, we had an ESD gun you could point at a chip to make it go into weird states), so being extra paranoid about test modes is important.
What can go wrong will go wrong!
In particular with "`TESTMODE_PAD` is only checked at reset" you would have to store `TESTMODE` in a non-scan flip-flop, and with enough targeted ESD that flip-flop can be jostled, setting `TESTMODE` even after normal operation.
You might instead want to use, say, a byte pattern instead of a single bit to represent `TESTMODE`, so the `TESTMODE` register has to have a specific value such as `0xA5`, so that targeted ESD has to be very lucky in order to force your device into test mode.
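How "lucky" the upset has to be can be estimated with a little arithmetic, sketched here in Python. The figures assume either independent single-bit flips or a register that lands on a uniformly random byte; real ESD upsets need not follow either model, and the helper `hamming` is written just for this example.

```python
def hamming(a: int, b: int) -> int:
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    # A single-bit TESTMODE flag is one bit flip away from test mode.
    print(hamming(0b0, 0b1))
    # From the all-zero reset value, four specific bits must flip to reach 0xA5.
    print(hamming(0x00, 0xA5))
    # From the complement 0x5A, all eight bits must flip.
    print(hamming(0x5A, 0xA5))
    # If an upset left a uniformly random byte, it would hit 0xA5 with probability 1/256.
    print(1 / 256)
```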
For example, since you need to check the `TESTMODE` pad at reset anyway, you could do something like this:
    input CLK, RESET_N, TESTMODE_PAD, IN_USE0, IN_USE1, IN_USE2;
    output reg TESTMODE;

    // Any secure memory copy holding non-zero data blocks test mode.
    wire in_use = IN_USE0 || IN_USE1 || IN_USE2;

    // 8'hA5 = test mode, 8'h00 = just reset, anything else = normal mode.
    reg [7:0] testmode_ff;
    wire [7:0] next_testmode_ff =
        (testmode_ff == 8'hA5 || testmode_ff == 8'h00) ?
            ((TESTMODE_PAD && !in_use) ? 8'hA5 :
             /*otherwise*/              8'h5A) :
        /*otherwise*/ testmode_ff ;
    always_ff @(posedge CLK, negedge RESET_N) begin
        if (!RESET_N) testmode_ff <= 8'h00;
        else          testmode_ff <= next_testmode_ff;
    end

    // Registered output, so TESTMODE never glitches.
    wire next_TESTMODE = (testmode_ff == 8'hA5);
    always_ff @(posedge CLK, negedge RESET_N) begin
        if (!RESET_N) TESTMODE <= 1'b0;
        else          TESTMODE <= next_TESTMODE;
    end
Do note that the `TESTMODE` is a flip-flop, since you do ***not*** want glitches on the `TESTMODE` signal line, it would be horribly unsafe to output it from combinational circuitry directly, please do not do that.
Of course that flip-flop can instead be the target of ESD gunnery, but since you need many clock pulses to read the scan chain, it should with good probability also get set to `0` on the next clock pulse and leave test mode (and probably crash the device as well until full reset, but this "fails safe" since at least sensitive data cannot be extracted).
`TESTMODE` has no feedback, thus cannot be stuck in a state loop.
`testmode_ff` *can* be stuck in a state loop, but that is deliberate, as it would "fail safe": if it gets a value other than `0xA5`, it would not enter test mode (and if it enters `0xA5` it can easily leave test mode by either `TESTMODE_PAD` going low or `in_use` going high).
(Sure, an attacker can try targeted ESD at the `TESTMODE` flip-flop repeatedly, but this risks also flipping other scan flip-flops that contain the data that is being extracted, so this might be sufficient protection in practice.)
If you are really going to open-source the hardware design then the layout is also open and attackers can probably target specific chip area for ESD pulse to try a flip-flop upset, so you need to be extra careful.
Note as well that even closed-source "secure" elements can be reverse-engineered (I used to do this in the IC design job as a junior engineer, it was the sort of shitty brain-numbing work forced on new hires), so security-by-obscurity does have a limit as well, it should be possible to try to figure out the testmode circuitry on "secure" elements and try to get targeted ESD upsets at flip-flops on the testmode circuit.
Test mode design is something of an arcane art, especially if you are trying to build a security device, on the one hand you need to ensure you deliver devices without manufacturing defects, on the other hand you need to ensure that the test mode is not entered inadvertently by strange conditions.
In general, because test modes are such a pain to deal with securely, and are an absolute necessity for mass production, you should assume that any "secure" chip can be broken by physical access and shooting short-range ESD pulses at it to try to get it into some test mode, unless it is openly designed to prevent test mode from persisting after entering normal mode, as above.
(No idea how that ESD gun thing worked or what it was formally called, we just called it the ESD gun, it was an amusing toy, you point it at the DUT and pull the trigger and suddenly it would switch modes, this of course was a bad thing since you want to make sure that as much as possible such upsets do not cause the chip to enter an irrecoverable mode but an amusing thing to do still, we even had small amounts of flash memory containing register settings that we would load into the settings registers periodically at the end of each display frame to protect against this kind of ESD gun thing since the flip-flops backing the settings registers were vulnerable to it and we needed a way to preserve the settings of the customer for the IC, the expected effect would be to cause the display to flicker.)