There’s a recurring discussion around Ethereum’s need for 1.6 terabytes of disk space (and that number is growing fast). The implication is that it's no longer practical to run fully-validating Ethereum nodes on consumer hardware and, therefore, the project is doomed.
Fortunately, the disk bloat anxiety is a non-issue, but the topic is nuanced, which makes it easy for misleading, sensationalist headlines to spread. Afri Schoedon, a Parity developer, explained the disk usage in great detail in November 2017, as have others in blogs and on social media, yet the confusion persists.
This post is another attempt to dispel the Ethereum disk bloat anxiety.
At a high level, the fact that Ethereum can be configured to use 1.6TB does not mean it needs 1.6TB to be a fully validating node.
For context, as of October 2018, Bitcoin (Core), with the transaction index enabled, uses 162GB of disk space: ./bitcoind -txindex=1
Geth, with the --syncmode=full CLI flag, uses 113GB of disk. This is a "full node". It validates everything as it downloads starting from the genesis block via the P2P network.
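For reference, running a full node from genesis looks roughly like this (a sketch; it assumes a geth binary on your PATH and uses the default data directory):

```shell
# Full sync: download and validate every block starting from genesis,
# pruning old state along the way (~113GB as of October 2018).
geth --syncmode=full
```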
So if the 113GB node is a "full node", where does the 1.6TB figure come from? It comes from disabling the pruning feature in the Geth or Parity Ethereum implementations. In Geth, you disable pruning with --gcmode=archive. And what is the pruning feature? Pruning garbage-collects stale data from the Ethereum database (which consists of many trie data structures). More concretely, it deletes old revisions of data (e.g. Alice's account balance changed from 5 to 6, so the record of it being 5 can be deleted). Pruning is not mutually exclusive with full validation.
So, why would someone disable the pruning feature? They wouldn't, unless they need access to the state of things from the past (data analytics or some dapps).
But pruning was not always possible for Ethereum. In early implementations Ethereum node users just suffered through unnecessary disk bloat until Vitalik wrote a blog post about it in June 2015 and then Geth and Parity eventually implemented pruning (Geth in Feb 2018).
Those early nodes, with no pruning, kept all previous states of the network. Such nodes are now called "archive nodes": they maintain a separate state trie for each block (6.6 million blocks currently). See the "initial state pruning" section of Vitalik's post.
Since validating transactions only requires knowing the state of the world at the block being validated, maintaining historical state beyond that block is not necessary for validation. Despite this, non-archive Geth nodes still keep the state of the last few hundred blocks in memory, plus periodic checkpoints of state on disk. Why? In case the chain reorganizes and state needs to be looked up for a particular account at an earlier block height. The historical state of the last ~1024 blocks is considered enough to tolerate most reorgs.
Can specific old state, from before those ~1024 retained tries, be recovered without maintaining a 1.6TB archive node? Yes, there are 3 options:
- request specific state from P2P network (fast, not perfectly secure)
- compute the state by processing the blockchain from the last checkpoint state
- if you have no checkpoints, then compute the state by processing the blockchain starting at genesis block up until the block at which you want the state (slow, but perfectly secure)
What does this look like for a developer building dapps?
With a 1.6TB Ethereum node you can do this:
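For example, you can ask for an account's balance as of a block far in the past via JSON-RPC (a sketch: the address and block number below are hypothetical, and the node is assumed to expose its RPC endpoint on localhost:8545):

```shell
# eth_getBalance at a historical block -- only an archive node retains
# the state tries needed to answer this for arbitrarily old blocks.
ADDR=0x1111111111111111111111111111111111111111  # hypothetical account
BLOCK=0x1e8480                                   # block 2,000,000
PAYLOAD='{"jsonrpc":"2.0","method":"eth_getBalance","params":["'"$ADDR"'","'"$BLOCK"'"],"id":1}'
curl -s -X POST -H 'Content-Type: application/json' \
  --data "$PAYLOAD" http://localhost:8545 || echo "(no node listening on localhost:8545)"
```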
With a pruned node, that previous command would fail unless you specified a block height within ~1024 blocks of the current block.
There is an argument that some dapps (like voting dapps) need all historical state to be functional, and since dapps are a keystone feature of Ethereum, the 1.6TB full-state issue is a real problem. But even if we assume a material number of dapps need access to historical state, the issue is currently being addressed (potentially reducing the 1.6TB to 250GB).
There is an argument that since Bitcoin's 162GB includes its full historical state, any comparison with Ethereum should use the 1.6TB figure. This is reasonable, but making the comparison in the first place is a bit sensational since it: 1) ignores the nuance that the full historical state is rarely necessary, and 2) treats as permanent what is technical debt being actively addressed.
Where does this leave us? Ethereum's current disk requirement to run a fully validating node is not 1.6TB, it is 113GB. If you choose to run in archive mode for analytics, research, or a specific dapp, there are disk space and disk latency constraints, but they are being addressed.
There are 4 ways to configure Geth:
- Light client: no validation; requests state from the P2P network on demand to check balances and verify transactions (350MB)
- Fast node: does not validate data prior to the initial sync point, validates everything after that, prunes old state in memory, writes checkpoints of state to disk (Geth 113GB)
- Full node: validates everything, prunes old state in memory, writes checkpoints of state to disk (Geth 113GB)
- Archive node: validates everything, no pruning (Geth 1.6TB)
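Concretely, the four configurations map to CLI flags roughly like this (a sketch of the 2018-era Geth flags; check `geth --help` for your version):

```shell
geth --syncmode=light                  # light client (~350MB)
geth --syncmode=fast                   # fast node (~113GB); full validation once caught up
geth --syncmode=full                   # full node (~113GB)
geth --syncmode=full --gcmode=archive  # archive node (~1.6TB)
```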
Monitor Geth's non-archive size here.
All in all, Ethereum is easier to run than you think. Fully-validating Ethereum nodes need 113GB of disk, or 1.6TB if all old state is included. Fortunately, all old state is rarely needed, but there is work being done to bring it from 1.6TB to 250GB.