<  Blog

How to Extract On-Chain Bitcoin Data

Realtime on-chain data can be difficult to obtain, as evident by the endless number of blockchain explorers, and data providers. As a consequence, answering simple questions, eg: “what number of Bitcoin wallets have more than 100 BTC?” is challenging.
March 30, 2022
in
Tutorials

Realtime on-chain data can be difficult to obtain, as evident by the endless number of blockchain explorers, and data providers. As a consequence, answering simple questions, eg: “what number of Bitcoin wallets have more than 100 BTC?” is challenging. Services such as Glassnode provide these on-chain metrics - but not without a steep fee. So what if you want to calculate these metrics for yourself or your business? In this blog post we will define the path you can take to calculate bitcoin data through a walkthrough of an example metric: What is the number of Bitcoin wallets with more than ‘X’ BTC?

Before we delve into how to extract on-chain data, it’s worth taking a moment to understand how transactions function on the blockchain. Bitcoin is designed to work as a transactional-ledger, as opposed to the more common account/balance-ledger. Bitcoin has no accounts in the familiar sense; instead it keeps track of a list of transactions that have not been spent yet, known as UTXO (unspent transaction output). The balance belonging to each address is the sum of the values of all UTXO belonging to that address. This concept is better illustrated with an example.

How Blockchain Transactions Function

For simplicity's sake, we can imagine a basic example of a blockchain with only two participants, Alice and Bob. Imagine now, that right at the beginning, there is only Alice. As the first and only user, she mines (builds) the genesis (first) block. The genesis block consists of only one transaction, referred to as the coinbase transaction. This coinbase transaction consists of two parts, an input and an output. The input is empty, but the output is the mining reward which we will take as 50 BTC.

Since Alice is the miner she gets to construct the coinbase transaction. She writes the output with a value of 50 BTC to herself, so only her private key can unlock it. The UTXO list at this point only consists of the genesis block's coinbase transaction, and is only spendable by Alice.

Chugging along, Alice mines 9 more blocks and accumulates 500 BTC. Eventually desiring more users on her blockchain, she promises her friend Bob 75 BTC if he joins. Bob joins the network and since Alice is the only current user, he downloads the blockchain data from her. As promised, Alice pays Bob 75 BTC for joining.

Now, at this point, Alice has accumulated 500 BTC in the form of 10 UTXO with a value of 50 BTC each. For Alice to pay Bob, Bob must first provide Alice his public key, which she’ll use to send the payment to. With Bob's public key, she can construct a transaction that pays Bob his 75 BTC. To do this, she looks into her UTXO list and gets 2 UTXO with a value of 50 BTC each, which will form the input of the transaction. And then having "unlocked" the input, she will construct two outputs, one to pay Bob’s public key with a value of 75 BTC, as well as the change with a value of 25 BTC, which she will send back to her own address.

After this transaction, two UTXO have been spent. From these two spent UTXO, two new UTXO are generated: one to pay Bob and the other for Alice to send the change back to herself. Now Alice's UTXO list consists of 8 UTXO with a value of 50 BTC, and 1 UTXO with a value of 25 BTC, for a total of 425 BTC. Meanwhile, Bob has one UTXO for a total of 75 BTC.

How to Calculate Wallet Balances

Now that we have an understanding of how Bitcoin transactions work, we know that to calculate the balances of all the wallets, we need the most recent list of UTXO. Thankfully, the Bitcoin Node keeps track of this. As a first step, we will need to run Bitcoin Node on our machine, which currently stands at a whopping ~452 GB. The UTXO list is found in the ~/.bitcoin/chainstate/ directory as a bunch of LevelDB files. We will need a way to access the LevelDB as well - Plyvel is a great Python package for this.

Next, before trying to read the files, it’s a good idea to stop the bitcoin node (bitoinc-cli stop) and copy the chainstate file somewhere, to avoid accidentally corrupting it. We can access the chainstate database by running:

import plyvel
def read_chainstate():
    db = plyvel.DB('~/.bitcoin/chainstate', compression=None)

    for key, value in db.iterator(prefix=b'C'):
        tx_key = key[1:]
        print(f"key: {tx_key}"
        print(f"value: {value}")

    db.close()

A single entry will look something like this:

"key":"43000006b4e26afc5d904f239930611606a97e730727b40d1d82d4f3f1438cf2a101",
"value":"71a9e87d62de25953e189f706bcf59263f15de1bf6c893bda9b045"

Every entry in LevelDB is obfuscated to avoid being flagged by antivirus software. Details including how to deobfuscate entries can be found in the source code. In addition, the Bitcoin developer network has an excellent article delving into more details. Finding the UTXO list is a matter of iterating through every UTXO, deobfuscating, and saving into a dataframe or a CSV file for further analysis. A single deobfuscated entry might look like:

id, height, vout, value (BTC),
326e5ce33fa7c5a02fda343594038f88f9666b120e7c33968f83dea3847ce502, 729440, 0, 0.00190000, \
address
3Ayb58kiMBnjKjU2r6xgh8HMTZ9VbnwfRe

Finally, we find the balances of all the wallets with non-zero BTC by grouping by the address and summing all the values!

You can use the Shakudo Platform to automate this project, as well as for other blockchain and web3 solutions. Shakudo combines a fleet of open source tools and modern data frameworks into an intuitive, end-to-end data project platform. Try it for yourself by booking a demo with us.

Mohamed is a Machine Learning Engineer with experience in backend development. Mohamed graduated from the University of Waterloo with a Master's degree in physics.