Skip to content

Instantly share code, notes, and snippets.

@karalabe
Created November 11, 2020 21:31
Show Gist options
  • Save karalabe/e1891c8a99fdc16c4e60d9713c35401f to your computer and use it in GitHub Desktop.
Save karalabe/e1891c8a99fdc16c4e60d9713c35401f to your computer and use it in GitHub Desktop.
Geth v1.9.17 Post Mortem

Geth v1.9.17 Post Mortem

Yesterday - 11th November, 2020 - a consensus issue was (deliberately) triggered on the Ethereum network. Opposed to the usual way these play out however, this consensus issue was not between different clients, rather between different versions of the same client, namely Geth.

Geth v1.9.7 (released 7th November, 2019) broke the EIP-211 implementation, whereby a memory area was shallow-copied, allowing it to be overwritten out of bounds. The bug was reported by John Youngseok Yang on the 15th July, 2020 and was silently fixed and shipped 5 days later in Geth v1.9.17 (20th July, 2020). This fix brought Geth back into consensus with Besu, Nethermind and OpenEthereum (and the Ethereum specification itself); however it broke consensus with earlier Geth releases.

Unfortunately not all node operators were running recent releases and yesterday morning a transaction managed to trigger the consensus issue, splitting old Geth releases off from the rest of the network. This became a larger issue as Infura was one of the affected parties, hence taking with them their client base.

There was some backlash on Twitter, revolving around two main themes:

  • Why did the Geth team unilaterally make a "consensus upgrade"?
  • Why did the Geth team ship this fix silently instead of warning operators?

Both questions are valid, but as always the answers are more nuanced than what fits into a Twitter thread.

Q: Why did the Geth team unilaterally make a "consensus upgrade"?

The Geth team indeed changed the consensus implementation in the v1.9.17 release, however the team did not create any new rules that the Ethereum community didn't know about or agree to. The rules were defined in EIP-211, and agreed to by the community when the network forked over to Byzantium 3 years ago.

If you don't consider accidentally introducing a bug a "consensus upgrade", then you should also not consider fixing the said bug a few months later a "consensus upgrade".

Q: Why did the Geth team ship this fix silently instead of warning operators?

This is a bit of a grey area and requires a case-by-case discussion. We all agree that transparency is king and that we should strive as much as possible towards it, but it's also important to look at all the details before heads start rolling.

Ethereum's consensus code is relatively stable, so the probability of things breaking get smaller as time passes. However, users also expect us to constantly make things faster, which inherently leads to the occasional introduction of new bugs. Fixing these bugs is not hard - in this case, it was 1 line of code - but shipping the fixes raises some interesting questions.

In the classical software world, once a security fix is created, a platform operator can update all their nodes, or a software vendor can push out the update to all their clients. This minimizes the time window in which an attacker who learns about the bug can abuse it. (Eg. The OpenSSL Heartbleed bug was patched by pretty much all the internet giants in their local infrastructure before anyone even told the public about it).

In the case of Ethereum, it takes a lot of time (weeks, months) to get node operators to update even to a scheduled hard fork. Highlighting that a release contains important consensus or DoS fixes always runs the risk of someone trying to beat updaters to the punch line and taking the network down. Security via obscurity is definitely not something to aim for, but delaying a potential attack by enough to get most node operators immune may be worth the temporary "hit" to transparency.

In this particular instance, the consensus bug was dormant in the code for over 1 year. The probability after all that time for someone to accidentally trigger it is tiny. Opposed to that, the probability of someone maliciously triggering it if highlighted as a security issue is not insignificant. The Geth team made the conscious decision not to mention it, hoping that people eventually upgrade to versions that contain the fix and the issue is gradually ejected from the network.

You might object that "it obviously didn't work".

We'd argue that it actually did work: most nodes have indeed updated and were not affected. From a network health perspective, the strategy worked as intended and the Ethereum network survived without meaningful issues. Certain projects using Infura were impacted, but at the end of the day, the primary goal of the Geth team is the health of the Ethereum network as a whole, individual pieces of it are only secondary.

The decision whether or not to publish details about a serious bug boils down to what the fallout would be in both cases and picking the one where the damage is smaller. Over the past years we've fixed a number of consensus issues never published and helped fix a number of such issues in Nethermind, Besu and Parity, part of them never published. In all these cases, avoiding the limelight allowed the network to seamlessly evolve without running the risk of attacks and without keeping node operators in a constant state of emergency that they need to do immediate updates.

We definitely don't condone doing this for all bugs - and we ourselves published a number of releases where we emphasized their emergency - but at certain times, it's better to remain silent as shown by other projects too such as Monero, ZCash and Bitcoin.

This particular silent consensus fix took an unexpected turn with yesterday's network split, but retrospectively we still believe it was the right call. We understand that we cannot expect operators to always immediately update to the latest releases and appreciate the understanding that our vulnerability management and release structure has nuances unique to a blockchain ecosystem.

@dankrad
Copy link

dankrad commented Nov 11, 2020

I largely agree with the strategy, but not completely with the execution. Here are two questions that should be asked before we forget about this incident.

  • The fix has been released almost 4 months ago. If we truly wished that everyone updates, one of the releases (and not the initial one) could have been marked as critical, say 1-2 weeks down the line. Everyone would have upgraded, leading to a much smaller exploitability window, without revealing the bug
  • I think we cannot have our cake and eat it: We cannot tell people "It's fine to run old versions and may be good for diversity reasons" and at the same time "We want to be able to push stealth consensus fixes". Maybe we can have a more official line that you should run a version from the last n months?

@gmaxwell
Copy link

gmaxwell commented Nov 11, 2020

Why wasn't this issue fixed by (1) Making the code safe but backwards compatible (e.g. eliminating any crash bugs an no longer mining or relaying triggering txn), and (2) prohibiting triggering transactions from the consensus. Then later, if desirable, after announcing it (3) making the behaviour how you prefer it?

For example: In Bitcoin we discovered a extreme corner case in OpenSSL's parsing of BER signatures-- used by essentially all nodes-- where windows and 32-bit bitcoin-qt nodes could be split from all other nodes. Without disclosing the vulnerability we shipped code where nodes would not relay or mine transactions which used anything but the most efficient signature encoding, then we proposed and the network adopted a consensus change where only the most efficient signature encoding was permitted in blocks. This completely resolved the issue and removed the danger, even for outdated software. After that, the issue was announced.

The issue could have been hot cut fixed as was done here, but that would have introduced yet another group of nodes which could have been potentially split. First do no harm.

When protecting user's security has required that some bugs be fixed quietly by making a well disclosed minor change that obviated the bug as a side effect, we usually attempted to resolve an unrelated safe-to-disclose issue in the same or next version to speed up upgrades.

This general procedure of "make it safe and compatible", "make the consensus dangerous operation prohibited from consensus", "disclose it and make it right" has been used successfully many times in Bitcoin going back to 2010 without introducing a consensus split. Only once in Bitcoin's history has a fix contributed to a consensus split-- and that was the removal of the implicit BDB locks limit in 0.8 in 2013, which happened because the fix was inadvertent fix of an issue that was unknown at the time which happened as a side effect of a general improvement.

It's true that you cant control the inadvertent introduction of consensus changes (except by never making mistakes) but you can control the introduction of intentional consensus changes that fix them and make sure that they don't make the issue worse. The idea that the choice is between not publicising a vulnerability or making the fix in a safe(r) way is a false dichotomy: It's usually possible to fix a vulnerability safely without calling attention to it.

Even in the example given in the write-up for Bitcoin the existence of a lesser, uninteresting to exploit, version of the problem was explicitly disclosed at the time of the fix, in contrast to the implication of the misleading click-bait headline-- which made it extremely clear that the fix was critical to everyone. Once enough parties were upgraded to effectively eliminate any risk of a consensus fault the complete details were published.

@vbuterin
Copy link

(2) prohibiting triggering transactions from the consensus

It's not possible to safely tell if a transaction is triggering ahead of time. See this post back when this was first discussed in the context of the proposed DAO soft fork in 2016: https://hackingdistributed.com/2016/06/28/ethereum-soft-fork-dos-vector/

@gmaxwell
Copy link

gmaxwell commented Nov 11, 2020

So the vulnerabilities resulting from effectively undecidable transactions are still unresolved almost 5 years later? Even at time decision is sufficient for most bugs, -- don't mine them. Handling that without creating other problems requires various concessions and mechanisms. E.g. it's why mempooled transactions have a (long) time limit in Bitcoin and why transactions dropped in a reorg are put in a size limited queue and batch processed later.

It's always possible for miners to ignore transactions-- after all, they might just not have received them. Right now if that happens ethereum nodes waste a tremendous amount of cpu resources continually reprocessing omitted transactions, but that's an implementation flaw not too dissimilar to Bitcoin's old behaviour of reorgs taking quadratic time to process due to reprocessed transactions at every block in the reorg.

My point remains: the invocation of 'bitcoin does this too' is substantially incorrect. I don't know about the other mentioned comparables, many altcoins are run rather recklessly because their primary purpose is to dump premined coins on suckers and thoughtful engineering is just an unnecessary cost for that application.

If you don't want to be corrected on where Bitcoin is structurally and culturally resistant to the vulnerabilities you suffer, please don't allege that it's equally exposed.

@vbuterin
Copy link

So the vulnerabilities resulting from effectively undecidable transactions are still unresolved almost 5 years later?

I think it's wrong to call it a vulnerability. It's certainly a property that Ethereum has, and it certainly has its downsides (like the inability to soft fork away bugs), but it also has important upsides: making it more difficult for a 51% coalition to censor specific applications (or even a smaller coalition via feather forking) is a significant win. Complicated tradeoffs!

@ethscriptkiddy
Copy link

ethscriptkiddy commented Nov 12, 2020

Anyone has tried to find the TX who raised the consensus issue?

AFAIK in block 11234783 the following TXIDs make use of RETURNDATASIZE/RETURNDATACOPY (taken from an archive geth node > 1.9.17):

0x0ce5c9410b46bd2f3b79ac2b1531c6a8d7eb3403f4c53ada78778826d72111fa
0x11eb250033f1c636d729f7d2f122e4ab69207a9ee9198f8fb67be403f356396e
0x1564b14a2a1794dfc1481871eb2f947cddf68edc32ac8149e1ddd8b7672f3379
0x1d223737955223dc15c64ff289aeab136b9e79efb107df1cf1556a5628b587df
0x1f3a871dcf8970a423eb75858bd9687e366610e4c7414348bb1f98626b34e6a7
0x24c0a44cc18a5fc5212b5b8902b5c32ea825012dde6f3b9295bef7651792e3e1
0x34bbcf680e41cd56adcaca3a4621416d8d82f64757d570511d4b829af8287568
0x35becd234873ffd75a2e14992a1e37c09355ea7e6eec201582a1c8e744f932c3
0x3864a6abd4b1c85542901659e05a30ac4c73e37e80639014356311ccb2230451
0x44c5705a780e8918cbfec6d7fcacd9ccba78768941defd3bd16d4a405574cf5c
0x4d9add1a740fb0816fe13d7a0e640d429b9557982979187db8d514c271e0a1eb
0x54428147dd04c2315e3db4eed4be948577a8902286d5ab75006f23c282b3bd26
0x55b5cd071a2a2a23ea170e1440d840ca082efede3dd323ec10d874ff7acc359c
0x57f7f9ec3cd92a908ac05edcb372bf6bb984fec6010a360eab76613fbf3bb23f
0x5ba3b26139d557899c6f662503181786691098fe502427564d27c4e5ac08412f
0x5e35b148f5fedf6c934ef46c6c4e64515a3a5f39f161550302b28cfd4b2f231a
0x6359f7d465d1271becb9beae950508e37957dfe41951c9ef5b6ccbd5e7760c73
0x64c53728bcadd4ee16d9d3363d1d2c0205b31febea7a058882cc7a7775887073
0x65d0ad82dfe42e5f8504285cabc105f20167b8554cd2369ad75545d335b05d83
0x6f6c9615292d67f32f907a1eb662ff1eab8147621d850efbdf00397a3901d169
0x7ebc8b1fce06074599f6197268855efc9f0b6f698f5629810783cc00c8b8c816
0x8141c7aa04a901db58d2e94d06015abe6fc7acaaa44a5fd4069eb1856deaa9d2
0x8af5ccc7f8f8af0aab976365dfe9bd4523500de9b319530a4ce5a94707d59c84
0x8d386ba4d5cca2cf905946e9e8f80f9af814887a13a969bf98509389c0f8482e
0x918e24d9bf6df3469fe18baf6f901ff4b2b2521449cf444d5e2168d693803c17
0xa1c591bf2bdf78115add5d9595ccb509bd5a33bc1b900217002cbf07ff99600f
0xbae7a2e4da0f3a5e20d2de97d0a842bd3540487ff027b08e9eaeaf91755b1e5a
0xbc222adbd2aacdb89ac01cd27e02db43c61c6186007cf1393defd17e3119a2d1
0xc77fa616c3afe917f7c2abc6e37858d53e483d52d590c8c7d9e7691f6735a744
0xd05bad82304e13b88caa70e6d8b86b9f7526043ce8e253485c29aaa5fb487e1f
0xd75c80d4acca76e21b33a7b33eccc2a895b601e2b2fef967ff1dbc2aed173f8b
0xde29db7e80b60d3d60480f2139cc0d8bd581f1a48bd54c32916000fa61c0a776
0xe123f4f8a48beb329de6c668e828edade20c81313448bbbdaf4c9c584faf4225
0xf380ba0e2e4f2ed4acf5f604cf9fdda40debf6edd96ac0c04a3b09742a28e154
0xfc5cc4bcb885956dad8ae8a1fcb0f7ded796535ff95c99337510bdea64a60718

If not enough nodes have upgraded to 1.9.17 (which is doubtful) it would be fun to try to break the consensus again. This would be a significant win!

edit: it was 0x57f7f9ec3cd92a908ac05edcb372bf6bb984fec6010a360eab76613fbf3bb23f (https://blog.ethereum.org/2020/11/12/geth_security_release/)

@kaykurokawa
Copy link

It seems like there was never even a plan to disclose the consensus bug/fix (or any of the "number of consensus issues") within a certain timeline. Sure, it makes sense not to disclose it for a week up to perhaps a month and wait till you can get sufficient important nodes to update. However, it seems negligent to just ignore the disclosure and hope no one notices the issue.

@chejazi
Copy link

chejazi commented Nov 12, 2020

Over the past years we've fixed a number of consensus issues never published

It's troubling to think that there may be undisclosed, potentially consensus-breaking issues running in other versions of Geth. Are there a non-trivial number of nodes online facing similarly undisclosed issues as we speak? We don't know!

avoiding the limelight allowed the network to seamlessly evolve without running the risk of attacks

Just because it didn't happen on other occasions doesn't mean the risk doesn't exist. A consensus failure is an attack on the (health of the) network. The risk is always there, as shown by the events that just unfolded.

@rockstardev
Copy link

So you're saying that all we need to do in order to split the Ethereum network:

  1. Follow "consensus fixes force pushed to production" you guys do
  2. Give it a bit time considering you leave old vulnerable versions without any notice for 5+ months
  3. We send in transaction that exploits the problem, and then dump double spends on those that haven't upgraded

I'm in awe of your l33t engineering 👌

@alexkroeger
Copy link

alexkroeger commented Nov 12, 2020

Appreciate the transparency here. If an update fixes a consensus bug, however, it seems worth it to reach out quietly to major node providers (Infura, Alchemy) to make sure they upgrade in a timely manner.

If Infura was aware of the implications of the fix, it's hard to imagine they would wait a year to update their Geth version.

@daira
Copy link

daira commented Nov 12, 2020

Speaking from the perspective of a Zcash dev: one reason we're less vulnerable to problems like this is that any given zcashd node version will shut down approximately 16 weeks after its release, according to block height. Everyone in the ecosystem knows this and can plan upgrades around it. (We don't yet have multiple consensus-compatible node implementions, but we will do fairly soon.)

In fact we have had a small number of accidentally introduced consensus changes in zcashd that never triggered a fork because all of the affected node versions reached "End-of-Service halt" without that happening. Importantly, we knew exactly when the affected nodes would reach End-of-Service halt, and could make decisions about how to handle a given bug on that basis.

We also rely on this 16-week period to coordinate network upgrade timing (for example, zcashd 3.1.0 and earlier nodes that do not support the upcoming Canopy upgrade will EoS-halt before that upgrade). Our network upgrades are bilateral hard forks.

Please take this just as a suggestion and an explanation of what Zcash and zcashd does. I'm aware that it would be a major policy change for Ethereum node implementations, and has disadvantages as well as advantages.

@zack-bitcoin
Copy link

A soft fork doesnt trigger dos like vitalik is saying.
Only a poorly implemented soft fork has that flaw.

If a tx appears that exploits the bug, the miners should save a copy of that tx, and gossip it to other miners. So the miners have a list of accounts that have attempted to trigger the bug.

Until the hard update fix activates, miners should censor any tx from any account that exploited the bug.
So each account can only make a bad tx once, and we can punish the bad accounts as a part of the hard update to recover.

@daira
Copy link

daira commented Nov 12, 2020

It seems to me that the approach @gmaxwell is suggesting wouldn't have worked in this case, and indeed wouldn't be expected to work in most situations of accidental consensus incompatibility in a coin with multiple consensus implementations.

In particular, adding bug-for-bug compatibility code to other node implementations to account for Geth's bug would have been absolutely the wrong approach, and it would only have delayed solving the problem, given that operators of those other nodes would also have needed to upgrade.

In response to @zack-bitcoin's comment, punishing whoever triggered the fork is beside the point, since no-one is alleging malfeasance.

FWIW, I think this bug was reasonably well-handled given the constraints, although I agree with @alexkroeger that giving infrastructure providers a heads up might have helped.

@perlboy
Copy link

perlboy commented Nov 12, 2020

No one else is going to ask why pretty important node operators were running releases >3 months old? Any self respecting and "old world" accredited IT shop would classify this as a classic patch management failure and most (all?) commercial vendors would point to this lag time in their defence...

@t-vila
Copy link

t-vila commented Nov 12, 2020

No one else is going to ask why pretty important node operators were running releases >3 months old? Any self respecting and "old world" accredited IT shop would classify this as a classic patch management failure and most (all?) commercial vendors would point to this lag time in their defence...

see Infura's explanation around that https://blog.infura.io/infura-mainnet-outage-post-mortem-2020-11-11/?utm_source=social&utm_medium=twitter

@mcelrath
Copy link

So the vulnerabilities resulting from effectively undecidable transactions are still unresolved almost 5 years later?

I think it's wrong to call it a vulnerability. It's certainly a property that Ethereum has, and it certainly has its downsides (like the inability to soft fork away bugs), but it also has important upsides: making it more difficult for a 51% coalition to censor specific applications (or even a smaller coalition via feather forking) is a significant win. Complicated tradeoffs!

A 51% coalition can censor applications or transactions, full stop. There is fundamentally no way around this security property. Bringing that up is a misdirection, there is no trade-off here. Whether to call state dependence a "property" of Ethereum or calling it "degraded upgrade safety" is in the eye of the beholder, but it has nothing to do with 51% attacks.

@JavierGonzalez
Copy link

Neutralizing a minority split with empty blocks + reorg would reduce the impact on users to zero.

If the split was 30 blocks and the block reward is 1,255 USD, the total cost would be 37,650 USD? (sorry if this is not correct, I have not studied ETH in depth).

Is it too high a cost in return for protecting users and the ETH project reputation?

@holiman
Copy link

holiman commented Nov 13, 2020

A 51% coalition can censor applications or transactions, full stop. There is fundamentally no way around this security property. Bringing that up is a misdirection, there is no trade-off here

Yes, they can censor all transactions, or they can whitelist a set of senders. What is difficult is to censor an application. As in, censoring transactions that somewhere along the execution invokes a certain contract. Because then they have to execute it in order to figure it out, and the path the execution takes may depend on the state, so two txs executed out of order may not act the same way as the same two txs mined in a different order.

Right now, tx validity can be determined without executing the transaction. So they do execute them today, but they know that they can put it into a block and get paid for it. If they want to censor, they have to execute e.g. a 10M gas transaction, and possibly have to censor it, and thus not get paid for it. Which is exploitable.

So saying "making it more difficult for a 51% coalition to censor specific applications" is a very accurate description.

@mcelrath
Copy link

A 51% coalition can censor applications or transactions, full stop. There is fundamentally no way around this security property. Bringing that up is a misdirection, there is no trade-off here

Yes, they can censor all transactions, or they can whitelist a set of senders. What is difficult is to censor an application. As in, censoring transactions that somewhere along the execution invokes a certain contract. Because then they have to execute it in order to figure it out, and the path the execution takes may depend on the state, so two txs executed out of order may not act the same way as the same two txs mined in a different order.

Right now, tx validity can be determined without executing the transaction. So they do execute them today, but they know that they can put it into a block and get paid for it. If they want to censor, they have to execute e.g. a 10M gas transaction, and possibly have to censor it, and thus not get paid for it. Which is exploitable.

So saying "making it more difficult for a 51% coalition to censor specific applications" is a very accurate description.

So someone who wants to censor an application has to execute the contract, and see what it calls. No big deal. The "more difficult" you claim is a tiny amount of CPU usage that, if you're determined to censor, you'd be happy to eat the cost for. Just because that's not how it works today does not make this a valid security argument.

So again, this "51%" argument is totally irrelevant.

@kurtseifried
Copy link

This was a clearly triggerable DoS in the software, as such it clearly meets the definition for getting a CVE identifier. Is there a CVE request in the works for this issue? Thanks.

@hooji
Copy link

hooji commented Nov 15, 2020

Perhaps new client releases which include protocol changes should include a "sunset" target block number by which all prior versions are required to be upgraded. This could be implemented by contract on the blockchain itself. Legacy clients could read the value to learn of required updates and the remaining time available before the upgrade becomes "required".

@JesseHerring33
Copy link

I need to learn more about it or hire people to help me with management skills and organization with my addresses. I know there is 20 addeess that belongs to me and I need to connect the rest at etherscan, etherscan has most of them but with the community of GitHub I'm going to give large donations for all the help over the past years! You all are so awesome and God bless you! We all go through our own Demons and I promise I have my own and they seem a little bit more aggressive than others but it's not my judgment; ) I accidentally sent transactions and I have had network problems and everything else that could interfere with my life but you all are my Family and I prey for you and just know it gets so much better than now, of course we will have problems bringing Heaven to earth. There is a lot going on and I need to be in a position where I can talk about it and let people know that God is good

@daira
Copy link

daira commented Apr 5, 2021

I suggest locking this bug and deleting the spam.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment