In my last [post](https://yifei-huang.medium.com/why-every-data-scientist-should-pay-attention-to-crypto-39b4c25ff319), I argued that with an increasing number of consumer crypto applications gaining traction, the growing trove of publicly available crypto data will fundamentally change how the next generation of products compete and operate. Effectively leveraging this data asset will be crucial to realizing that future, and data science will play a key role. In this post, I want to provide more background on crypto data – what it represents in the context of crypto applications, what it looks like, and how to work with it [1]. If you haven’t read my last post, please consider doing so, as it provides useful context for this one.
Web 2.0 -> Web 3.0
Before diving into the data, it is useful to contextualize what a decentralized crypto (sometimes referred to as web 3.0) application looks like and how it differs from a web 2.0 application. In a traditional web 2.0 application, users interact with the application frontend via a browser. The frontend translates user requests into queries to the backend APIs. The backend performs the necessary computation to fulfill the requests and persists the relevant data to storage.

In a web 3.0 application, the frontend works in more or less the same way, with the added requirement of a software wallet that uniquely identifies the user to the blockchain network. The backend servers, however, are replaced by a decentralized blockchain network that functions like a distributed virtual machine. Smart contracts, written in high-level languages like Solidity, live on this virtual machine and serve as the backend APIs. With the help of a node on the blockchain network (typically accessed through a node-as-a-service provider), the frontend broadcasts user requests in the form of transactions to these smart contracts, which execute the invoked logic on the virtual machine. Upon completion, the transaction details and state changes are persisted in the blockchain ledger in a cryptographically verifiable manner.
In an admittedly simplified view, one can think of the blockchain virtual machine as the backend server, smart contracts as the backend APIs, and the blockchain ledger as the storage [2]. However, there are two key differences in the web 3.0 architecture that are worth emphasizing:
- Unlike traditional backend APIs, smart contracts are decentralized and publicly accessible. Anyone on the network can see the code and build new smart contracts or frontends on top of them. No single entity controls access to the smart contracts the way Facebook does with its APIs. This is sometimes referred to as the composability of smart contracts.
- Every single backend API call (i.e. transaction) is published on the blockchain ledger in a way that is verifiable and effectively immutable. Each transaction record contains detailed metadata on the specific request and the resulting state changes. This level of transparency represents a paradigm shift from web 2.0 applications, and is what makes crypto data such a compelling opportunity.
The anatomy of a transaction
Transactions are central to how crypto applications function and to the data they create, so it is important to define precisely how they work. In the broader context of blockchain networks, transactions are atomic units of activity that change the state of the blockchain virtual machine. There are three distinct types of transactions:
- Transfer of value in the form of the base currency from one externally owned account (EOA) to another, e.g. Emily sends Bob 3 ETH on the Ethereum network
- Creation of a smart contract by an EOA, e.g. Emily commits code to an address on the blockchain, creating a smart contract that enables users to exchange ETH for BTC
- Call to a smart contract by an EOA, e.g. Bob calls Emily’s smart contract to exchange 15 ETH for 1 BTC
All transactions must be initiated by an externally owned account (EOA), which is a unique blockchain address, controlled by a private key. This typically represents a human user, but can also sometimes be a bot. Smart contracts, once created, are also just accounts with a unique blockchain address. The only difference between a smart contract account and an EOA is that smart contract accounts are controlled by the contract code, rather than a private key.
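To make the three transaction types concrete, here is a minimal Python sketch that labels a transaction record by its fields. The `classify_transaction` helper and the sample records are hypothetical and exist only for illustration. Note that reliably telling a plain transfer from a contract call requires checking whether the target address holds code; the empty-input test below is only a heuristic.

```python
from typing import Optional


def classify_transaction(to_address: Optional[str], input_data: str) -> str:
    """Heuristically label a transaction as one of the three types described above."""
    # A contract-creation transaction has no recipient: the address of the
    # new contract is assigned by the network when the code is deployed.
    if to_address is None:
        return "contract_creation"
    # A plain transfer of the base currency normally carries no call data.
    if input_data in ("", "0x"):
        return "value_transfer"
    # Anything else carries encoded call data aimed at a smart contract.
    return "contract_call"


# Hypothetical addresses and call data, loosely modeled on the Emily and Bob examples.
BOB = "0x" + "ab" * 20
EMILYS_CONTRACT = "0x" + "cd" * 20

examples = [
    (BOB, "0x"),                      # Emily sends Bob ETH
    (None, "0x6080604052"),           # Emily deploys a contract
    (EMILYS_CONTRACT, "0x12345678"),  # Bob calls Emily's contract
]
for to_address, input_data in examples:
    print(classify_transaction(to_address, input_data))
```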
One notable feature of transactions is that they require payments to the network (or to miner nodes, to be more precise) called gas fees. A useful analogy for gas on a blockchain network is actual gasoline. Just as gasoline is needed to power vehicles, gas is required to run code on a blockchain network. Gasoline quantity is measured in volume units like liters, and the price per unit is measured in a fiat currency like dollars. Ethereum gas is measured in units that are also called gas, and the price per unit is measured in wei, where 1 wei is 10^-18 of an ether (ETH). When initiating a transaction, the EOA must specify the amount of gas it is willing to pay to the network for executing the transaction. If the specified amount is insufficient, the transaction fails and all staged state changes are reverted. Gas fees serve two primary functions:
- Incentivize miner nodes to participate and run code on the network. Gas prices fluctuate based on the supply of and demand for computational capacity, similar to surge pricing in ride sharing
- Disincentivize bad actors who may want to spam the network
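The gas arithmetic can be sketched in a few lines of Python. The function names and the 50 gwei price are illustrative assumptions; the only fixed facts used are that 1 ETH is 10^18 wei and that a simple ETH transfer consumes 21,000 units of gas.

```python
from decimal import Decimal

WEI_PER_ETH = 10**18   # 1 ETH = 10^18 wei
WEI_PER_GWEI = 10**9   # gas prices are commonly quoted in gwei


def transaction_fee_wei(gas_used: int, gas_price_wei: int) -> int:
    """Fee paid to the network: units of gas consumed times the price per unit."""
    return gas_used * gas_price_wei


def wei_to_eth(amount_wei: int) -> Decimal:
    """Convert wei to ETH with Decimal to avoid floating-point rounding."""
    return Decimal(amount_wei) / Decimal(WEI_PER_ETH)


# Illustrative values: 21,000 gas at an assumed price of 50 gwei per unit.
fee = transaction_fee_wei(gas_used=21_000, gas_price_wei=50 * WEI_PER_GWEI)
print(fee)              # 1050000000000000
print(wei_to_eth(fee))  # 0.00105
```

Keeping amounts as integers in wei and converting only for display mirrors how the values are stored on chain and sidesteps rounding surprises.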
When a user makes a request in a crypto application, what happens under the hood is:
- The EOA associated with the user initiates a transaction that specifies the target smart contract address, the target function, the arguments for that function, the transaction payment (if any), and the gas fee that it is willing to pay
- The transaction is broadcast to the network and picked up by a willing miner who executes the specified function in the target smart contract
- If execution is successful, the smart contract emits events that mark the completion of certain milestones. The resulting event data structure is called logs.
- The target smart contract may initiate internal transactions (additional calls) to other smart contracts. These internal transactions create data structures called traces, and may also emit additional log events during their respective executions.

To make this more concrete, let us take a closer look at an example of a transaction to purchase a Bored Ape NFT on the Opensea exchange smart contract.

In this transaction:
- The buyer EOA initiates the transaction with a call to the `atomicMatch_` function in the Opensea Exchange smart contract
- The exchange contract verifies that the order bid matches the ask of the sellers, then emits the `OrderMatched` event signifying that the order is confirmed
- The exchange contract initiates an internal transaction to the Bored Ape NFT contract to transfer the NFT from the seller to the buyer, which in turn emits an `Approval` and a `Transfer` event upon completion
- The exchange contract then initiates another internal transaction to transfer the funds, paid by the buyer EOA when initiating the original transaction, to the seller EOA
At the completion of this sequence, the transaction, the traces from internal transactions, and the logs from events are all persisted to the blockchain ledger. As this example hopefully makes clear, the data exhaust from transactions provides very granular details about the inner workings of crypto applications and the economic activity they facilitate.
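One way to picture how a transaction, its traces, and its logs relate is as a small tree of calls. The sketch below models the purchase described above in Python. The `Call` class and `walk` helper are hypothetical, the internal function names are placeholders (the narrative does not name them), and the ordering follows the narrative rather than the actual log order on chain.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Call:
    """One function call: the top-level transaction or an internal transaction (trace)."""
    target: str                                                 # account being called
    function: str                                               # function invoked on the target
    events: List[str] = field(default_factory=list)             # logs emitted by this call
    internal_calls: List["Call"] = field(default_factory=list)  # traces spawned by this call


def walk(call: Call, depth: int = 0) -> Iterator[str]:
    """Yield an indented, depth-first description of calls and the events they emit."""
    yield f"{'  ' * depth}call {call.target}.{call.function}"
    for event in call.events:
        yield f"{'  ' * (depth + 1)}log {event}"
    for child in call.internal_calls:
        yield from walk(child, depth + 1)


purchase = Call(
    target="OpenseaExchange",
    function="atomicMatch_",
    events=["OrderMatched"],
    internal_calls=[
        # Placeholder function names for the two internal transactions.
        Call(target="BoredApeNFT", function="transfer_nft", events=["Approval", "Transfer"]),
        Call(target="SellerEOA", function="transfer_funds"),
    ],
)

for line in walk(purchase):
    print(line)
```

The top-level `Call` corresponds to a row of transaction data, each nested `Call` to a trace, and each entry in `events` to a log.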
Data Structure
Now that we understand the data elements that are created by crypto applications, and what they represent in reality, let us take a look at what this data looks like. The transaction and trace data structures contain details of the smart contract function call, in particular:
- `hash`: unique ID of the transaction
- `from_address`: the initiating EOA address
- `to_address`: the target smart contract address
- `input`: hexadecimal-encoded representation of the target function and the arguments for that function
- `value`: the transaction value or payment
Example transaction
hash: 0xfdf4e500eeefa5b12d773fb74d55c4bbfc92a4297cddc8f85b937978a3fc6477
nonce: 232
transaction_index: 19
from_address: 0xfc7396fc573e916dc0d7203b0f087ffc46882c17
to_address: 0x7be8076f4ea4a4ad08075c2508e481d6c946d12b
value: 0E-9
gas: 74902
gas_price: 52545827339
input: 0xa8a41c700000000000000000000000007be8076f4ea4a4ad08075c2508e481d6c946d12b000000000000000000000000fc7396fc573e916dc0d7203b0f087ffc46882c170...
receipt_cumulative_gas_used: 1242267
receipt_gas_used: 74902
receipt_contract_address: None
receipt_root: None
receipt_status: 1
block_timestamp: 2021-08-10 04:18:55
block_number: 12995203
block_hash: 0x7ca2ff7158d7a40997a5230e39f8d96ad17cf59ced6b27a3288653f9c94ce7a3
max_fee_per_gas: None
max_priority_fee_per_gas: None
transaction_type: None
receipt_effective_gas_price: 52545827339
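Even without the contract's ABI, the `input` field of the example transaction can be partially unpacked: the first 4 bytes are the function selector, and the rest is a sequence of 32-byte words. The Python sketch below does this for the visible prefix of the example input (the remainder is elided above and is left out here). The helper names are hypothetical, and reading the words as addresses is an assumption based on their zero padding.

```python
from typing import List, Tuple


def split_call_data(input_hex: str) -> Tuple[str, List[str]]:
    """Split call data into the 4-byte function selector and complete 32-byte words."""
    body = input_hex[2:] if input_hex.startswith("0x") else input_hex
    selector, args = body[:8], body[8:]  # 4 bytes = 8 hex characters
    # Keep only complete 32-byte (64 hex character) words; a trailing partial word is dropped.
    words = [args[i:i + 64] for i in range(0, len(args) - 63, 64)]
    return "0x" + selector, words


def word_to_address(word: str) -> str:
    """Interpret a 32-byte word as a left-padded 20-byte address (an assumption)."""
    return "0x" + word[-40:]


# Visible prefix of the example input: selector plus two zero-padded words.
input_prefix = (
    "0xa8a41c70"
    + "0" * 24 + "7be8076f4ea4a4ad08075c2508e481d6c946d12b"
    + "0" * 24 + "fc7396fc573e916dc0d7203b0f087ffc46882c17"
)

selector, words = split_call_data(input_prefix)
print(selector)  # 0xa8a41c70
print([word_to_address(w) for w in words])
```

The two recovered values match the `to_address` and `from_address` of the example transaction. Mapping the selector back to a function name and giving the arguments their proper types still requires the contract's ABI.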
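The log data structure contains details of the events that were emitted during the execution of the smart contract function, in particular:
- `transaction_hash`: ID of the transaction that the event was a part of
- `address`: address of the smart contract that emitted the event
- `topics`: identifies the emitted event; the first topic is the hash of the event signature, and any further topics hold the event's indexed arguments
- `data`: the remaining (non-indexed) event arguments, hexadecimal encoded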
Example log
log_index: 260
transaction_hash: 0x2ac3648d5a0a7c1dd58685fabb5c5602add36f1555b1001cb900ea0410ab23db
transaction_index: 131
address: 0xff64cb7ba5717a10dabc4be3a41acd2c2f95ee22
data: 0x000000000000000000000000000000000000000000054e0ee097e3dbdbde51b2000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000011d65da6b52d881dd
topics: [
'0xd78ad95fa46c994b6551d0da85fc275fe613ce37657fb8d5e3d130840159d822',
'0x00000000000000000000000003f7724180aa6b939894b5ca4314783b0b36b329',
'0x000000000000000000000000e83df6e24de6d5d263f78ad281143f184a6c95eb'
]
block_timestamp: 2021-08-04 06:40:42
block_number: 12957113
block_hash: 0x3ea06d9b495dffb2804a5f62ee0182949afac9f82d6ba922567fa3a95efc3d86
The astute reader will notice that many fields are long hexadecimal strings that are not human friendly. To parse out the encoded information, we need to decode these fields using the smart contract's application binary interface (ABI). I will discuss this in more detail in the next post.
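As a small taste of what decoding involves, the `topics` of the example log can be pulled apart by hand. For a standard event, the first topic is the Keccak-256 hash of the event signature and the remaining topics are indexed arguments, each stored as a 32-byte word. The Python sketch below assumes the two indexed arguments are addresses, which their zero padding suggests but which only the ABI can confirm; the `summarize_log_topics` helper is hypothetical.

```python
from typing import Dict, List


def summarize_log_topics(topics: List[str]) -> Dict[str, object]:
    """Separate the event signature hash (topic 0) from the indexed arguments."""
    signature_hash, indexed = topics[0], topics[1:]
    return {
        "event_signature_hash": signature_hash,
        # Assumption: each indexed argument is an address, stored as a 32-byte
        # word with the 20-byte address in the low-order bytes.
        "indexed_addresses": ["0x" + topic[-40:] for topic in indexed],
    }


# Topics copied from the example log above.
example_topics = [
    "0xd78ad95fa46c994b6551d0da85fc275fe613ce37657fb8d5e3d130840159d822",
    "0x00000000000000000000000003f7724180aa6b939894b5ca4314783b0b36b329",
    "0x000000000000000000000000e83df6e24de6d5d263f78ad281143f184a6c95eb",
]

summary = summarize_log_topics(example_topics)
print(summary["event_signature_hash"])
print(summary["indexed_addresses"])
```

Turning the signature hash into an event name, and the `data` field into typed values, is where the ABI comes in.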
Tools for accessing and working with crypto data
Now that we have a good understanding of what crypto data represents and what it looks like, how do we actually access and work with it? Luckily there is an array of great tools to help us with just that.
Block explorers
Block explorers are great resources for examining individual transaction details on a given blockchain. The builders of block explorers have extracted and indexed the entire blockchain ledger, and created fast web interfaces to help users easily look up any transaction. See the screenshot below for an example. All major blockchains have explorers – prominent examples include Etherscan, Polygonscan, BscScan, and Solana Beach.

While block explorers are great for interrogating individual records within the blockchain ledger, they are not well suited to answering questions that require aggregation or transformation of the data. For example, if you wanted to know how many NFTs were sold through the Opensea exchange in the last 3 months, it would be very difficult to answer that with just block explorers. For that, you need direct access to the data.
Getting the data
One way to directly access the data is to query the blockchain yourself. There are various open source utility packages available in Python and JavaScript to make this process easier. For example:
- For the Ethereum blockchain and EVM-compatible chains like Polygon and BSC, you can use the Web3 package.
- For Solana, you can use the Solana.py package.
With these utility libraries, you will be able to programmatically interact with the blockchain of interest, to query for data, submit transactions, and even deploy smart contracts.
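Here is a minimal sketch of querying a transaction with the Python Web3 package (web3.py). It assumes a recent web3.py release, where the methods are named `get_transaction` and `get_transaction_receipt`, and an HTTPS endpoint from a node provider. The `connect` and `fetch_transaction_summary` helpers and the `NODE_URL` placeholder are hypothetical.

```python
from typing import Any, Dict


def connect(node_url: str) -> Any:
    """Create a web3.py client for an HTTP node endpoint (requires `pip install web3`)."""
    from web3 import Web3  # imported lazily so the helper below has no hard dependency
    return Web3(Web3.HTTPProvider(node_url))


def fetch_transaction_summary(w3: Any, tx_hash: str) -> Dict[str, Any]:
    """Fetch a transaction and its receipt, returning a few of the fields discussed above."""
    tx = w3.eth.get_transaction(tx_hash)
    receipt = w3.eth.get_transaction_receipt(tx_hash)
    return {
        "from": tx["from"],
        "to": tx["to"],
        "value_wei": tx["value"],
        "gas_used": receipt["gasUsed"],
        "status": receipt.get("status"),   # 1 = success, 0 = failure
        "log_count": len(receipt["logs"]),
    }


# Example usage (needs the web3 package and a real endpoint; NODE_URL is a placeholder):
#   w3 = connect(NODE_URL)
#   fetch_transaction_summary(
#       w3, "0xfdf4e500eeefa5b12d773fb74d55c4bbfc92a4297cddc8f85b937978a3fc6477")
```

Passing the client in as an argument keeps the lookup logic separate from the connection details, which makes it easy to point the same code at a different chain or provider.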
There are also open source projects that have packaged the above building blocks into full ETL pipelines to help you download all the granular data into your own environment. Furthermore, the owners of these projects have published many of the raw datasets as public datasets on Google Cloud, which offers a relatively easy-to-use SQL interface for querying the data.
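As a rough sketch of what querying the public dataset might look like from Python, the example below counts successful transactions per day sent to a given address. It assumes the data is exposed through BigQuery under the table name shown, that the `google-cloud-bigquery` client is installed, and that Google Cloud credentials are configured; the helper names are hypothetical. The column names follow the example transaction shown earlier.

```python
from typing import Any, List

# Assumption: the public Ethereum transactions table in BigQuery.
TRANSACTIONS_TABLE = "bigquery-public-data.crypto_ethereum.transactions"


def daily_transaction_count_sql(table: str = TRANSACTIONS_TABLE) -> str:
    """Build a query counting successful transactions per day sent to one address."""
    return f"""
        SELECT DATE(block_timestamp) AS day, COUNT(*) AS tx_count
        FROM `{table}`
        WHERE to_address = @to_address
          AND receipt_status = 1
          AND block_timestamp >= TIMESTAMP(@start_date)
        GROUP BY day
        ORDER BY day
    """


def run_daily_transaction_count(to_address: str, start_date: str) -> List[Any]:
    """Run the query with the BigQuery client (requires `pip install google-cloud-bigquery`)."""
    from google.cloud import bigquery  # lazy import: only needed when actually querying

    client = bigquery.Client()  # uses your default Google Cloud project and credentials
    job_config = bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter("to_address", "STRING", to_address),
            bigquery.ScalarQueryParameter("start_date", "STRING", start_date),
        ]
    )
    query_job = client.query(daily_transaction_count_sql(), job_config=job_config)
    return list(query_job.result())


# Example usage, with the to_address of the example transaction:
#   rows = run_daily_transaction_count(
#       "0x7be8076f4ea4a4ad08075c2508e481d6c946d12b", "2021-08-01")
```

Query parameters are used instead of string formatting for the address and date so that the values are passed safely. Filtering on `block_timestamp` also limits how much data the query scans.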

Last, but not least, Dune Analytics is another great resource for accessing and analyzing blockchain data. It has both raw and decoded data for Ethereum, Polygon, Optimism and BSC as of the writing of this post. This sets it apart from the public dataset on Google Cloud, because the decoding makes the hexadecimal-encoded data fields human readable. It offers a Postgres interface for querying the datasets and a simple point-and-click interface for creating dashboards on top of the query results. The community of users on Dune is also quite active and has generated an extensive library of example queries and dashboards to learn from. Here are a couple of example analyses that I have created on Dune
Key takeaways
- Crypto data is the exhaust from the web 3.0 application architecture
- It contains full history of all the "backend API calls" in a crypto application, in the form of transactions, logs and traces
- These data structures contain granular details of the user request and application state changes
- There are a variety of freely available tools to help us access and analyze this treasure trove of data
Hopefully this was a useful discussion and I have helped you gain better intuition about what crypto data is, what it looks like and how to work with it. In my next post, I will provide a tutorial on how to decode crypto data and make it more human readable, as a precursor to deeper investigations of popular crypto applications like Opensea and Uniswap. Be sure to hit the email icon to subscribe if you would like to be notified when that post is published.
Thank you for reading and feel free to reach out if you have questions or comments. _Twitter | Linkedin_
[1] This discussion uses the Ethereum blockchain as the primary reference architecture. Specifics may vary for other blockchains like Solana, but many of the concepts will generally carry over.
[2] This is a simplified illustration of a web 3.0 application that glosses over some implementation details. For a more thorough review, the reader is encouraged to study this excellent deep dive.