Yesterday - 11th November, 2020 - a consensus issue was (deliberately) triggered on the Ethereum network. Unlike the usual way these play out, however, this consensus issue was not between different clients, but between different versions of the same client, namely Geth.
Geth v1.9.7 (released 7th November, 2019) broke the EIP-211 implementation, whereby a memory area was shallow-copied, allowing it to be overwritten out of bounds. The bug was reported by John Youngseok Yang on the 15th July, 2020 and was silently fixed and shipped 5 days later in Geth v1.9.17 (20th July, 2020). This fix brought Geth back into consensus with Besu, Nethermind and OpenEthereum (and the Ethereum specification itself); however it broke consensus with earlier Geth releases.
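To make the "shallow copy" failure mode concrete, here is a minimal toy sketch in Python. It is not Geth's code and does not model the EVM: the function names and the four-byte buffer are invented for illustration. It only shows how a returned view that aliases a memory buffer silently changes when that buffer is written later, while a real copy does not.

```python
# Toy model only: a "memory" buffer and two ways of handing back a region of it.
# All names here are hypothetical and chosen for illustration.

def return_data_shallow(memory: bytearray, offset: int, size: int) -> memoryview:
    """Buggy pattern: the result is a view that shares storage with `memory`."""
    return memoryview(memory)[offset:offset + size]


def return_data_copied(memory: bytearray, offset: int, size: int) -> bytes:
    """Safe pattern: the result is an independent snapshot of the region."""
    return bytes(memory[offset:offset + size])


memory = bytearray(b"\x01\x02\x03\x04")
shallow = return_data_shallow(memory, 0, 4)
copied = return_data_copied(memory, 0, 4)

# A later, unrelated write to the same memory region.
memory[0] = 0xFF

shallow_after = bytes(shallow)  # the view now reflects the later write
copied_after = copied           # the snapshot still holds the original bytes
```

Two implementations that differ only in this respect return different data for the same input, which is all it takes for nodes to disagree on the resulting state.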
Unfortunately, not all node operators were running recent releases, and yesterday morning a transaction managed to trigger the consensus issue, splitting old Geth releases off from the rest of the network. This became a larger issue because Infura was one of the affected parties, taking its client base down with it.
There was some backlash on Twitter, revolving around two main themes:
- Why did the Geth team unilaterally make a "consensus upgrade"?
- Why did the Geth team ship this fix silently instead of warning operators?
Both questions are valid, but as always the answers are more nuanced than what fits into a Twitter thread.
Q: Why did the Geth team unilaterally make a "consensus upgrade"?
The Geth team did indeed change the consensus implementation in the v1.9.17 release; however, the team did not create any new rules that the Ethereum community didn't know about or agree to. The rules were defined in EIP-211, and agreed to by the community when the network forked over to Byzantium 3 years ago.
If you don't consider accidentally introducing a bug a "consensus upgrade", then you should also not consider fixing the said bug a few months later a "consensus upgrade".
Q: Why did the Geth team ship this fix silently instead of warning operators?
This is a bit of a grey area and requires a case-by-case discussion. We all agree that transparency is king and that we should strive as much as possible towards it, but it's also important to look at all the details before heads start rolling.
Ethereum's consensus code is relatively stable, so the probability of things breaking gets smaller as time passes. However, users also expect us to constantly make things faster, which inherently leads to the occasional introduction of new bugs. Fixing these bugs is not hard - in this case, it was 1 line of code - but shipping the fixes raises some interesting questions.
In the classical software world, once a security fix is created, a platform operator can update all their nodes, or a software vendor can push out the update to all their clients. This minimizes the time window in which an attacker who learns about the bug can abuse it. (E.g. the OpenSSL Heartbleed bug was patched by pretty much all the internet giants in their local infrastructure before anyone even told the public about it.)
In the case of Ethereum, it takes a lot of time (weeks, months) to get node operators to update even to a scheduled hard fork. Highlighting that a release contains important consensus or DoS fixes always runs the risk of someone trying to beat updaters to the punch and taking the network down. Security via obscurity is definitely not something to aim for, but delaying a potential attack long enough for most node operators to become immune may be worth the temporary "hit" to transparency.
In this particular instance, the consensus bug was dormant in the code for over 1 year. The probability of someone accidentally triggering it after all that time is tiny. By contrast, the probability of someone maliciously triggering it once it is highlighted as a security issue is not insignificant. The Geth team made the conscious decision not to mention it, hoping that people would eventually upgrade to versions that contain the fix and that the issue would gradually be ejected from the network.
You might object that "it obviously didn't work".
We'd argue that it actually did work: most nodes had indeed updated and were not affected. From a network health perspective, the strategy worked as intended and the Ethereum network survived without meaningful issues. Certain projects using Infura were impacted, but at the end of the day, the primary goal of the Geth team is the health of the Ethereum network as a whole; individual pieces of it are only secondary.
The decision whether or not to publish details about a serious bug boils down to what the fallout would be in both cases and picking the one where the damage is smaller. Over the past years we've fixed a number of consensus issues that were never published, and helped fix a number of such issues in Nethermind, Besu and Parity, some of which were also never published. In all these cases, avoiding the limelight allowed the network to evolve seamlessly without running the risk of attacks and without keeping node operators in a constant state of emergency requiring immediate updates.
We definitely don't condone doing this for all bugs - and we ourselves have published a number of releases where we emphasized their urgency - but at certain times it's better to remain silent, as other projects such as Monero, ZCash and Bitcoin have shown too.
This particular silent consensus fix took an unexpected turn with yesterday's network split, but in retrospect we still believe it was the right call. We understand that we cannot expect operators to always immediately update to the latest releases, and we appreciate the understanding that our vulnerability management and release structure has nuances unique to a blockchain ecosystem.
Why wasn't this issue fixed by (1) making the code safe but backwards compatible (e.g. eliminating any crash bugs and no longer mining or relaying triggering transactions), and (2) prohibiting triggering transactions in consensus? Then later, if desirable and after announcing it, (3) making the behaviour what you prefer?
For example: in Bitcoin we discovered an extreme corner case in OpenSSL's parsing of BER signatures (used by essentially all nodes) where Windows and 32-bit bitcoin-qt nodes could be split from all other nodes. Without disclosing the vulnerability, we shipped code where nodes would not relay or mine transactions which used anything but the most efficient signature encoding; then we proposed, and the network adopted, a consensus change where only the most efficient signature encoding was permitted in blocks. This completely resolved the issue and removed the danger, even for outdated software. After that, the issue was announced.
The issue could have been fixed with a hot cut, as was done here, but that would have introduced yet another group of nodes which could potentially have been split. First, do no harm.
When protecting users' security has required that some bugs be fixed quietly, by making a well-disclosed minor change that obviated the bug as a side effect, we usually attempted to resolve an unrelated, safe-to-disclose issue in the same or next version to speed up upgrades.
This general procedure of "make it safe and compatible", "make the consensus-dangerous operation prohibited from consensus", "disclose it and make it right" has been used successfully many times in Bitcoin going back to 2010 without introducing a consensus split. Only once in Bitcoin's history has a fix contributed to a consensus split: the removal of the implicit BDB locks limit in 0.8 in 2013. That happened because the fix was an inadvertent one, for an issue that was unknown at the time, and came about as a side effect of a general improvement.
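The difference between the first two steps of that procedure can be sketched as a toy model. The Python below is an illustration under stated assumptions, not Bitcoin or Geth code: `Stage`, `Tx`, the `triggers_bug` flag and both functions are hypothetical names, and a real node would detect the dangerous form by inspecting the transaction rather than reading a flag.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import List


class Stage(IntEnum):
    LEGACY = 0     # before any fix: the dangerous form is relayed and accepted
    POLICY = 1     # step 1: upgraded nodes stop relaying/mining it, still accept it in blocks
    CONSENSUS = 2  # step 2: blocks containing it are rejected outright


@dataclass(frozen=True)
class Tx:
    txid: str
    triggers_bug: bool  # hypothetical flag standing in for a real validity check


def relay_or_mine(tx: Tx, stage: Stage) -> bool:
    """Local policy: would this node forward the transaction or put it in a block?"""
    return not (tx.triggers_bug and stage >= Stage.POLICY)


def block_is_valid(txs: List[Tx], stage: Stage) -> bool:
    """Consensus rule: would this node accept a block containing these transactions?"""
    if stage >= Stage.CONSENSUS:
        return not any(tx.triggers_bug for tx in txs)
    return True


normal = Tx("a", triggers_bug=False)
risky = Tx("b", triggers_bug=True)
```

In the `POLICY` stage, `relay_or_mine(risky, Stage.POLICY)` is false while `block_is_valid([risky], Stage.POLICY)` is still true, so upgraded nodes stay compatible with older ones and the dangerous form simply becomes harder to get into a block. Only the `CONSENSUS` stage changes which blocks are accepted, and in the procedure described above that step is taken as a proposed, network-adopted change.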
It's true that you can't control the inadvertent introduction of consensus changes (except by never making mistakes), but you can control the introduction of intentional consensus changes that fix them and make sure that they don't make the issue worse. The idea that the choice is between not publicising a vulnerability and making the fix in a safe(r) way is a false dichotomy: it's usually possible to fix a vulnerability safely without calling attention to it.
Even in the Bitcoin example given in the write-up, the existence of a lesser, uninteresting-to-exploit version of the problem was explicitly disclosed at the time of the fix, contrary to what the misleading click-bait headline implies, and that disclosure made it extremely clear that the fix was critical for everyone. Once enough parties had upgraded to effectively eliminate any risk of a consensus fault, the complete details were published.