It would be hard to argue that a data engineering culture is deeply ingrained in the Web 3.0 developer community, and not every developer can readily say what indexing means in a Web 3.0 context. I'd like to clarify this topic and talk about The Graph, which has become the de facto industry standard for DApp developers to access data on the blockchain.

What are indexes in Web 3.0? The ultimate guide. (No prior knowledge required)

Let's start with the index.

An index in a database is a data structure that sorts and organizes the data so that search queries can be executed efficiently. With indexes on its tables, the database server can find and retrieve the rows that match a query's criteria without scanning every row. This improves database performance and reduces the time required to retrieve information.
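To make the idea concrete, here is a minimal Python sketch of an index. The table, its columns, and the function names are made up for illustration; a real database uses far more sophisticated structures (B-trees, for example), so treat this only as a picture of the principle.

```python
from collections import defaultdict

# Hypothetical table: each row is (transaction_id, sender, amount).
rows = [
    (1, "alice", 50),
    (2, "bob", 20),
    (3, "alice", 75),
    (4, "carol", 10),
]

def find_by_sender_scan(table, sender):
    """Without an index: check every row, so the cost grows with the table."""
    return [row for row in table if row[1] == sender]

def build_sender_index(table):
    """Build the index once: sender -> positions of the matching rows."""
    index = defaultdict(list)
    for position, row in enumerate(table):
        index[row[1]].append(position)
    return index

def find_by_sender_indexed(table, index, sender):
    """With an index: jump straight to the matching rows."""
    return [table[position] for position in index.get(sender, [])]

sender_index = build_sender_index(rows)
```

Both lookups return the same rows; the indexed one simply avoids touching rows that cannot match.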

But what about indexes in the blockchain? The most popular blockchain architecture is EVM (Ethereum Virtual Machine).

The Ethereum Virtual Machine (EVM) is a runtime environment that executes smart contracts on the Ethereum blockchain. It is a computer program that runs on every node of the Ethereum network. It is responsible for executing the code of smart contracts and also provides security features such as sandboxing and gas usage control. The EVM ensures that all participants on the Ethereum network can execute smart contracts in a consistent and secure manner.

As you may know, data on a blockchain is stored in blocks, which contain transactions. You may also know that there are two types of accounts:

Externally owned account - described by any ordinary wallet address
Contract account - described by the address of any deployed smart contract

If you send some Ether from your account to another externally owned account, there is no intermediate step. However, if you send some Ether to a smart contract address with a payload, you actually run a method of the smart contract, which may in turn create "internal" transactions.

Well, if any transactions can be found on the blockchain, why not convert all the data into a constantly updated large database that can be queried in SQL-like format?

The problem is that you can only access the data of the smart contract if you have the "key" to decipher it. Without this "key", smart contract data on the blockchain is effectively a mess. This key is called ABI (Application Binary Interface).

ABI (Application Binary Interface) is a standard that defines how smart contracts communicate with the outside world, including other smart contracts and user interfaces. It defines the data structure, function signature, and parameter types of the smart contract to achieve correct and efficient communication between the contract and the user.

Every on-chain smart contract has an ABI. The problem is that you may not have the ABI of the smart contract you are interested in. An ABI file is actually a JSON file that contains the names of the functions and variables of a smart contract, like an interface for communication. Sometimes you can find it:

On Etherscan (if the smart contract has been verified)
On GitHub (if the developer open-sourced the project)
In the standard itself, if the smart contract implements a standard type such as ERC-20, ERC-721, etc.

Of course, if you are a developer of smart contracts, you have the ABI because it is generated at compile time.
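As a rough illustration of what such a JSON file contains, the sketch below parses a one-entry ABI fragment describing the ERC-721 Transfer event and rebuilds the event's canonical signature string. The helper name event_signatures is my own, and the sketch ignores tuple (struct) parameters, which a complete implementation would have to expand.

```python
import json

# A fragment of an ABI as it appears in the JSON file: one entry
# describing the ERC-721 Transfer event.
abi_json = """
[
  {
    "type": "event",
    "name": "Transfer",
    "anonymous": false,
    "inputs": [
      {"name": "from", "type": "address", "indexed": true},
      {"name": "to", "type": "address", "indexed": true},
      {"name": "tokenId", "type": "uint256", "indexed": true}
    ]
  }
]
"""

def event_signatures(abi):
    """Return the canonical signature string of every event in the ABI.

    Simplification: tuple (struct) parameters are not expanded here.
    """
    signatures = []
    for entry in abi:
        if entry["type"] == "event":
            arg_types = ",".join(arg["type"] for arg in entry["inputs"])
            signatures.append(f'{entry["name"]}({arg_types})')
    return signatures

abi = json.loads(abi_json)
signatures = event_signatures(abi)
```

The Keccak-256 hash of a signature string like this is what identifies the event in the blockchain's logs, which is why the raw data is unreadable without the ABI.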

What is it from a developer's point of view

But let's not stop at the concept of ABI. What if we look at this topic from the perspective of smart contract developers? What is a smart contract? The answer is much simpler than you think. For those familiar with object-oriented programming, here's a simple explanation:

A smart contract in the developer's code is a class with fields and methods (for EVM-compatible chains, smart contracts are usually written in Solidity). Once deployed on-chain, a smart contract becomes an object of this class, so users can call its methods and change its internal fields.
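The class analogy can be sketched in plain Python. This is only a toy stand-in, not how the EVM works: the class name, the addresses, and the events list (which plays the role of the chain's log) are all invented for illustration.

```python
class NftCollection:
    """Toy stand-in for an ERC-721-like contract: fields plus methods."""

    def __init__(self):
        self.owners = {}  # token_id -> owner address (the contract's state)
        self.events = []  # stand-in for the blockchain's event log

    def mint(self, to, token_id):
        self.owners[token_id] = to
        self.events.append(("Transfer", None, to, token_id))

    def transfer(self, sender, to, token_id):
        # A state-changing call: check ownership, update state, emit an event.
        if self.owners.get(token_id) != sender:
            raise PermissionError("sender does not own this token")
        self.owners[token_id] = to
        self.events.append(("Transfer", sender, to, token_id))

# "Deploying" the contract corresponds to creating an object of the class.
collection = NftCollection()
collection.mint("0xAlice", 1)
collection.transfer("0xAlice", "0xBob", 1)
```

Note that the object only remembers the current owner; the history of who owned the token earlier survives only in the emitted events.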

It is worth emphasizing that any method call that changes the state of a smart contract implies a transaction, usually accompanied by an event that the developer emits from the code. Here is an example of a function from an ERC-721 smart contract (the usual standard for non-fungible token collections, such as BoredApeYachtClub) that emits an event when ownership of an NFT is transferred.

/**
     * @dev Transfers `tokenId` from `from` to `to`.
     *  As opposed to {transferFrom}, this imposes no restrictions on msg.sender.
     *
     * Requirements:
     *
     * - `to` cannot be the zero address.
     * - `tokenId` token must be owned by `from`.
     *
     * Emits a {Transfer} event.
     */
    function _transfer(address from, address to, uint256 tokenId) internal virtual {
        address owner = ownerOf(tokenId);
        if (owner != from) {
            revert ERC721IncorrectOwner(from, tokenId, owner);
        }
        if (to == address(0)) {
            revert ERC721InvalidReceiver(address(0));
        }

        _beforeTokenTransfer(from, to, tokenId, 1);

        // Check that tokenId was not transferred by `_beforeTokenTransfer` hook
        owner = ownerOf(tokenId);
        if (owner != from) {
            revert ERC721IncorrectOwner(from, tokenId, owner);
        }

        // Clear approvals from the previous owner
        delete _tokenApprovals[tokenId];

        // Decrease balance with checked arithmetic, because an `ownerOf` override may
        // invalidate the assumption that `_balances[from] >= 1`.
        _balances[from] -= 1;

        unchecked {
            // `_balances[to]` could overflow in the conditions described in `_mint`. That would require
            // all 2**256 token ids to be minted, which in practice is impossible.
            _balances[to] += 1;
        }

        _owners[tokenId] = to;
      	
        emit Transfer(from, to, tokenId);

        _afterTokenTransfer(from, to, tokenId, 1);
    }

Here is how it works. To transfer an NFT from one address to another, the _transfer function is called with both addresses and the ID of the NFT. It is declared internal, so users reach it through the contract's public functions such as transferFrom. In the code, you can see that some checks are performed and then the balances of both users are changed. But the important thing is that near the end of the function there is the line

emit Transfer(from, to, tokenId);

This means that these three values are "emitted" to the outside and can be found in the blockchain's logs. Keeping the required historical data this way is much more efficient, because storing it directly in the smart contract's state is too expensive.
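Here is a hedged sketch of how such a log entry can be read back. It assumes the log has the shape returned by the standard eth_getLogs JSON-RPC call (hex strings in topics, blockNumber, transactionHash) and that all three Transfer parameters are indexed, as in ERC-721. The sample log itself is made up.

```python
# Keccak-256 hash of "Transfer(address,address,uint256)"; it is the first
# topic of every Transfer log.
TRANSFER_TOPIC = "0xddf252ad1be2c89b69c2b068fc378daa952ba7f163c4a11628f55a4df523b3ef"

def topic_to_address(topic):
    """An address is the last 20 bytes (40 hex characters) of a 32-byte topic."""
    return "0x" + topic[-40:]

def decode_erc721_transfer(log):
    """Decode a raw log dict; return None if it is not an ERC-721 Transfer."""
    topics = log["topics"]
    if len(topics) != 4 or topics[0] != TRANSFER_TOPIC:
        return None
    return {
        "from": topic_to_address(topics[1]),
        "to": topic_to_address(topics[2]),
        "tokenId": int(topics[3], 16),
        "blockNumber": int(log["blockNumber"], 16),
        "transactionHash": log["transactionHash"],
    }

# Hypothetical log entry (made-up addresses and transaction hash).
sample_log = {
    "topics": [
        TRANSFER_TOPIC,
        "0x" + "00" * 12 + "11" * 20,
        "0x" + "00" * 12 + "22" * 20,
        "0x" + format(42, "064x"),
    ],
    "blockNumber": "0x10",
    "transactionHash": "0x" + "ab" * 32,
}
decoded = decode_erc721_transfer(sample_log)
```

Every decoded record carries the block number and transaction hash along with the event values, which is exactly the information an indexer needs to store.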

Now we have defined all the concepts needed to show what an index is.

Any smart contract (as an object of a certain class) is constantly invoked by users and other smart contracts, and changes its state (while emitting events) during its lifetime. We can therefore define indexing as the process of collecting smart contract data (any of its internal variables, not just those explicitly emitted) throughout the contract's lifetime and saving it together with the transaction ID (hash) and block number, so that any detail can be found later.

It is worth noting that if the smart contract does not explicitly store this data (and we know that doing so is very expensive), there is no direct way to look up the first transaction of wallet "A" with token "B", the largest transaction in smart contract "C", or anything similar.

That's why we need indexes. The simple things we can do in a SQL database become impossible on a blockchain, because there is no index.

In other words, "index" here is synonymous with smart contract data collection, because in Web 3.0 no index means no data access.
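A small sketch shows why collected data makes such questions trivial. The events below are hypothetical, already-decoded Transfer records; the index is just a dictionary from wallet address to that wallet's history.

```python
from collections import defaultdict

# Hypothetical decoded Transfer events, each saved with its block number
# and transaction hash.
events = [
    {"block": 100, "tx": "0xaa", "from": "0xA", "to": "0xB", "tokenId": 1},
    {"block": 105, "tx": "0xbb", "from": "0xB", "to": "0xC", "tokenId": 1},
    {"block": 110, "tx": "0xcc", "from": "0xA", "to": "0xC", "tokenId": 2},
]

def index_by_wallet(decoded_events):
    """Map every wallet to the events it took part in, as sender or receiver."""
    index = defaultdict(list)
    for event in decoded_events:
        index[event["from"]].append(event)
        if event["to"] != event["from"]:
            index[event["to"]].append(event)
    return index

def first_transaction(index, wallet):
    """Answer "what was the first transaction of this wallet?" from the index."""
    history = index.get(wallet)
    if not history:
        return None
    return min(history, key=lambda event: event["block"])

wallet_index = index_by_wallet(events)
```

With the index in place, first_transaction(wallet_index, "0xC") is a dictionary lookup; without it, the only option is to replay the chain.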

How did developers index in the past? They started from scratch:

They wrote high-performance code in fast programming languages like Go and Rust.
They built a database to store the data.
They set up an API to make the data accessible from the application.
They launched an archive blockchain node.
In the first stage, they traversed the entire blockchain to find all transactions related to a particular smart contract.
They handled these transactions by storing new entities and updating existing entities in the database.
When they reached the head of the chain, they had to switch to a more complex mode to process new transactions, because every new block (or even a sequence of blocks) can be rejected due to a chain reorganization.
If the chain was reorganized, they had to go back to the fork block and recalculate everything up to the new chain head.
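The last two steps are the trickiest, so here is a toy simulation of them. Everything in it is an assumption made for illustration: blocks are plain dictionaries, the chain is passed in as a list instead of being fetched from a node, and the "database" is an in-memory list. A production indexer is considerably more involved.

```python
class TinyIndexer:
    """Keeps per-block results so that a reorganization can be rolled back."""

    def __init__(self):
        self.blocks = []  # (number, block_hash, events) for every indexed block

    def head_hash(self):
        return self.blocks[-1][1] if self.blocks else None

    def apply(self, block):
        """Index a block; return False if it does not extend the current head."""
        if self.blocks and block["parentHash"] != self.head_hash():
            return False
        self.blocks.append((block["number"], block["hash"], block["events"]))
        return True

    def rollback_to(self, fork_number):
        """Drop everything indexed after the fork block."""
        self.blocks = [b for b in self.blocks if b[0] <= fork_number]

    def all_events(self):
        return [event for _, _, events in self.blocks for event in events]

def sync(indexer, canonical_chain):
    """Bring the indexer in line with the chain the node currently reports."""
    for block in canonical_chain:
        indexed = {number: block_hash for number, block_hash, _ in indexer.blocks}
        if indexed.get(block["number"]) == block["hash"]:
            continue  # already indexed and still canonical
        if block["number"] in indexed:
            # Same height, different hash: a reorg. Go back to the fork block.
            indexer.rollback_to(block["number"] - 1)
        if not indexer.apply(block):
            raise RuntimeError("block does not link to the indexed head")

def make_block(number, block_hash, parent_hash, events):
    return {"number": number, "hash": block_hash,
            "parentHash": parent_hash, "events": events}

old_chain = [
    make_block(1, "h1", "h0", ["mint #1"]),
    make_block(2, "h2", "h1", ["transfer #1"]),
    make_block(3, "h3a", "h2", ["transfer #2 (orphaned)"]),
]
# Block 3 is replaced by a different block, and the chain grows further.
new_chain = old_chain[:2] + [
    make_block(3, "h3b", "h2", ["mint #2"]),
    make_block(4, "h4b", "h3b", ["transfer #2"]),
]

indexer = TinyIndexer()
sync(indexer, old_chain)
events_before_reorg = indexer.all_events()
sync(indexer, new_chain)
events_after_reorg = indexer.all_events()
```

After the second sync, the event from the orphaned block is gone and the events from the new branch are in place, which is the consistency guarantee a hand-written indexer has to maintain on its own.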

As you can see, this is not only hard to develop but also hard to maintain in real time, as each node failure can require extra steps to restore data consistency. That's why The Graph came along. It is built on a simple idea: developers and end users need easy access to smart contract data without all this hassle.

The Graph project defines a paradigm called "subgraphs". To extract smart contract data from the blockchain, you need to describe three things:

General parameters, such as what blockchain to use, what smart contract address to index, what events to process, and which block to start with. These variables are defined in so-called "manifest" files.
How data is stored. What tables should be created in the database to hold the data in the smart contract? The answer can be found in the "schema" file.
How data is collected. What variables should be saved from the event, what accompanying data should be collected (such as transaction hashes, block numbers, results of other method calls, etc.), and how should they be put into the schema we define?

These three things are elegantly defined in the following three files:

subgraph.yaml - manifest file
schema.graphql - schema description
mapping.ts - AssemblyScript file
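To show how the three parts fit together, here is a Python analogy. It is deliberately not the real file formats (the manifest is YAML, the schema is GraphQL, and the mapping is AssemblyScript); the dictionary keys, the entity fields, and the handler name are all hypothetical.

```python
from dataclasses import dataclass

# 1. "Manifest": what to index. All values here are placeholders.
MANIFEST = {
    "network": "mainnet",
    "address": "0x" + "00" * 20,  # placeholder contract address
    "start_block": 0,
    "event_handlers": {"Transfer": "handle_transfer"},
}

# 2. "Schema": the shape of the entity that gets stored.
@dataclass
class TransferEntity:
    id: str
    sender: str
    receiver: str
    token_id: int
    block_number: int

# 3. "Mapping": how one event becomes one stored entity.
def handle_transfer(event, store):
    entity = TransferEntity(
        id=f'{event["transactionHash"]}-{event["logIndex"]}',
        sender=event["from"],
        receiver=event["to"],
        token_id=event["tokenId"],
        block_number=event["blockNumber"],
    )
    store[entity.id] = entity

HANDLERS = {"handle_transfer": handle_transfer}

def dispatch(event, store):
    """Route an event to the handler named in the manifest, if there is one."""
    handler_name = MANIFEST["event_handlers"].get(event["name"])
    if handler_name is not None:
        HANDLERS[handler_name](event, store)

store = {}
dispatch({"name": "Transfer", "transactionHash": "0xabc", "logIndex": 0,
          "from": "0xA", "to": "0xB", "tokenId": 7, "blockNumber": 123}, store)
dispatch({"name": "Approval"}, store)  # no handler registered, so it is ignored
```

The point of the split is that the developer writes only these three descriptions; traversing the chain, handling reorganizations, and serving queries are left to the graph node.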

Thanks to this standard, it is very easy to describe the entire index by following any of these tutorials:

How to access the Tornado Cash data easily using The Graph's subgraphs (https://medium.com/@balakhonoff_47314/how-to-access-the-tornado-cash-data-easily-using-the-graphs-subgraphs-a70a7e21449d)
How to access transactions of PEPE coin using The Graph subgraphs and ChatGPT (https://medium.com/@balakhonoff_47314/tutorial-how-to-access-transactions-of-pepe-pepe-coin-using-the-graph-subgraphs-and-chatgpt-5cb4349fbf9e)
A beginner's guide to getting started with The Graph (https://docs.chainstack.com/docs/subgraphs-tutorial-a-beginners-guide-to-getting-started-with-the-graph)
Working with subgraph schemas (https://docs.chainstack.com/docs/subgraphs-tutorial-working-with-schemas)
How to access real-time smart contract data from Python code, using Lido as an example (https://medium.com/@balakhonoff_47314/how-to-access-real-time-smart-contract-data-from-python-code-using-lido-as-an-example-38738ff077c5)

More tutorials can be found here: https://docs.chainstack.com/docs/chainstack-subgraphs-tutorials

What does it look like?

With this setup, The Graph takes care of the indexing work, but you still need to run a graph node (The Graph's open-source software). This is where another paradigm shift comes in.

Developers used to run their own blockchain nodes; most no longer do, and leave this hassle to blockchain node providers. Hosted subgraph services are the same kind of architectural simplification: the provider runs the graph node, and developers ("users" here) only describe and query their subgraphs.

In this case, users (or developers) don't need to run their own indexers or graph nodes, yet they keep control over all the algorithms and avoid vendor lock-in, because different providers use the same The Graph description format (Chainstack is fully compatible with The Graph, but it's worth checking this with your own Web 3.0 infrastructure provider). This is a big deal because it helps developers speed up development and reduce operational maintenance costs.

But the cool thing about this paradigm is that any time developers want to make their applications truly decentralized, they can seamlessly migrate to The Graph decentralized network using the same subgraph.

What I left out of the story above:

As you may have noticed, The Graph uses GraphQL instead of a REST API. It lets users make flexible queries against any of the tables they create and easily combine and filter them. ChatGPT can also help write GraphQL queries, as shown in the PEPE coin tutorial listed above.
The Graph has its own hosted service with many ready-made subgraphs. It's free, but unfortunately it doesn't meet production requirements (reliability, SLA, support) and syncs much slower than paid solutions, although it can still be used for development. A tutorial on how to use these ready-made subgraphs in Python can be found here (https://medium.com/@balakhonoff_47314/how-to-access-real-time-smart-contract-data-from-python-code-using-lido-as-an-example-38738ff077c5).
If you plan to use subgraphs in production, I highly recommend an established Web 3.0 infrastructure provider such as Chainstack for cost efficiency as well as reliability and speed.
If you get stuck with subgraph development, do not hesitate to ask questions in the Telegram chat "Subgraph Experience Sharing" (https://t.me/+HHP9q2gWFGNlNGYy).
In addition, if you have any questions about smart contract development, please join the "Solidity development" chat (https://t.me/dev_solidity).
Support the author and find more Web3 development tutorials by subscribing on Medium (https://medium.com/@balakhonoff_47314) or Twitter (https://twitter.com/balakhonoff).
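Coming back to the GraphQL point above, here is a minimal sketch of what a query against a subgraph could look like from Python, using only the standard library. The endpoint URL is a placeholder, and the transfers field with its sub-fields assumes a subgraph whose schema defines a Transfer entity with those names; adjust both to your own subgraph.

```python
import json
import urllib.request

# Hypothetical query: `transfers` and its fields assume that the subgraph's
# schema defines a `Transfer` entity with these field names.
QUERY = """
query LatestTransfers($count: Int!) {
  transfers(first: $count, orderBy: blockNumber, orderDirection: desc) {
    id
    from
    to
    tokenId
  }
}
"""

def build_graphql_request(endpoint, query, variables):
    """Wrap a GraphQL query in the JSON POST request the endpoint expects."""
    payload = json.dumps({"query": query, "variables": variables}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Placeholder URL; replace it with your subgraph's query endpoint.
request = build_graphql_request(
    "https://example.com/subgraphs/name/demo", QUERY, {"count": 5}
)
# To actually send it: urllib.request.urlopen(request), then JSON-decode the body.
```

The request is built but not sent here, so the sketch runs without network access. Unlike a REST API with one URL per resource, every query goes to the same endpoint and the query text decides which fields come back.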
