laitimes

In-depth investigation: Why do the new public chains suffer frequent outages?

Since the start of January, public chains and Layer 2 networks including Solana, Harmony and Arbitrum have stopped producing blocks and gone down, while the Ethereum sidechain Polygon has been severely congested, with users reporting that for long periods they could not submit transactions or withdraw funds.

Most of these chains promote "high performance" as their main selling point, yet they all went on "strike" within a similar period. Solana, Arbitrum, BSC and Fantom had already shown similar problems many times before.

The collective outages of the new public chains reflect a widespread and far-reaching infrastructure crisis. Chain Catcher tried to reconstruct the crisis by interviewing the Harmony team, which was directly involved, as well as professionals from domestic public chains such as Conflux, and to clarify the issues that deserve attention and careful thought.

Author | Richard Lee

Editor | Gong Tsuen-woo

01

Why is "public chain downtime" worth paying attention to?

Web 3.0 is known for combining the openness of Web 1.0 with the economic benefits of Web 2.0, and is the crypto community's collective name for the next wave of the Internet. This old term has become a buzzword again, not only because of the legitimacy it lends to the crypto economy, but also because it symbolizes the mass adoption of blockchain and crypto technology.

The public chain sector saw explosive growth in 2021, and the emergence of Solana was one of the reasons: it claims a throughput of tens of thousands of transactions per second (TPS) and aims to give users a faster and cheaper on-chain experience. Prominent figures and institutions such as SBF and Bank of America see Solana as a "gateway" to large-scale crypto adoption.

On-chain applications are expected to reach an even wider audience in the future, and because public chains are the lowest layer of infrastructure, their security and stability are crucial. The new public chains represented by Solana plan to challenge Ethereum and become the first stop for many new users entering the crypto industry, yet they have run into embarrassing problems such as downtime, which shows that their shortcomings are gradually being exposed in the course of rapid development.

If outages that paralyze a public chain network for hours cannot be resolved in time, they are bound to leave new mainstream users with a bad experience and a bad impression, and will become a major bottleneck for the large-scale development of the crypto economy.

After all, if a public chain, as a decentralized network maintained by distributed nodes, still goes down and stalls as often as a platform running on centralized servers, how can it win over the mainstream population?

02

Out-of-control traffic: the root cause of the new public chains' "shutdowns"

"DDoS attack" is one of the terms project teams use most often to explain degraded network performance. The full name is "distributed denial-of-service attack": traffic from many sources is used to push the load beyond what the system can process, so that real users cannot obtain the network services or resources they need in time. Attackers typically do this by sending a network more traffic than its network interface can handle, or by sending an application more requests than it can manage.

According to Halborn, a blockchain white hat hacking group, traditional DDoS attacks typically hit a fixed single point of failure in a system, such as a web server; when it fails, visitors can no longer reach the website it hosts. Resistance to DDoS attacks is therefore often one of the main selling points of blockchain technology: no single node in a blockchain network is essential, and one node going offline does not bring down the entire network.

However, this does not mean that blockchains are immune to DDoS. Halborn pointed out that attackers can send a large number of spam transactions that flood the entire blockchain network, squeezing the capacity and block space available to legitimate users. In practice, the so-called "attack" is usually not a premeditated attack at all, but real users running bot scripts to gain an unfair edge during a popular project's IDO, GameFi transactions, or volatile market conditions.

So can this problem be solved by continuously increasing the memory capacity of node servers? The answer is no. The reason is a feature common to most blockchain networks: they have a fixed capacity and create blocks at regular intervals with a specific size limit, and when nodes package a block, any transaction that does not fit in the current block is held in the "memory pool" (mempool) to wait for a later block.

This fundamental property means that all public chain networks face a common problem: under special circumstances, they may be hit by a flood of transaction requests.

How networks handle this problem, and whether their responses work, are important indicators of how the major networks have performed recently.
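The fixed-capacity argument can be made concrete with a small simulation. This is a minimal sketch rather than any chain's real block-production logic: the function name `simulate_backlog`, the per-block arrival counts and the capacity of 100 transactions per block are all illustrative assumptions.

```python
def simulate_backlog(arrivals_per_block, block_capacity):
    """Return the mempool size left over after each block is produced.

    arrivals_per_block: number of new transactions arriving before each block
    block_capacity: fixed maximum number of transactions per block (assumed)
    """
    backlog = 0
    history = []
    for arrivals in arrivals_per_block:
        backlog += arrivals                      # new transactions enter the mempool
        included = min(backlog, block_capacity)  # a block can only hold so many
        backlog -= included                      # the rest wait for a later block
        history.append(backlog)
    return history


# Hypothetical traffic: two quiet blocks, a two-block flood, then quiet again.
history = simulate_backlog([50, 80, 400, 400, 60, 60], block_capacity=100)
print(history)  # [0, 0, 300, 600, 560, 520]
```

In this toy model the backlog drains only at the rate of spare block capacity (40 transactions per block here), so extra node memory lets the queue grow longer but does not clear it any faster.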

Solana users are probably the most familiar with "transaction floods". On September 14 last year, the entire Solana network was down for 17 hours and no on-chain services could be used. The official post-mortem said that during the IDO of Grape Protocol, a decentralized social networking protocol, on the Raydium platform, many users sent large numbers of transactions through scripts, which caused a memory overflow and crashed validator nodes; in the end the network could not reach consensus and went offline (that is, it could not produce new blocks).

According to announcements from Solana Status, the congestion that has persisted on the Solana network since early December last year is also related to the problems exposed by the "9.14" outage. Solana Status is a Twitter account operated by the Solana Foundation that publishes network performance announcements.

According to an analysis by blockchain company Laine, the market has recently been highly volatile, and many leveraged positions in DeFi projects have hit their liquidation thresholds. Whoever performs a DeFi liquidation receives a reward, and anyone can act as a liquidator. This has created a market in which many people compete for the bounty, many of them using self-developed automated programs (commonly known as "bots"); to make sure they "win", these bots send dozens, if not hundreds, of identical transaction requests.

"We see close to 2 million transactions (transactions or other types of requests) arriving at the same node every second, with more than 90 percent of them being exact duplicates," Solana co-founder Anatoly Yakovenko said at a Twitter Space event in the early hours of January 27.

Regarding the cause of the downtime, Hu Zhiwei, president of the Boundary Intelligence Research Institute, told Chain Catcher that because Solana also passes consensus messages between validator nodes as a special type of transaction message, the massive message backlog prevented consensus messages from being delivered, so consensus could not proceed normally.

Structural composition of Solana TPS Source: Solana Beach

"At the same time, some of Solana's features were targeted and contributed to the downtime. For example, the write locks used for processing transactions concurrently were taken on many important addresses, so transactions were executed sequentially rather than concurrently, which greatly reduced message-processing capacity; and nodes kept retaining information about possible forks in order to resolve them, which led to memory overflows," Hu Zhiwei said.

Wu Ming, CTO of the well-known domestic public chain Conflux, told Chain Catcher that when the Solana network carries too many transactions, the forwarding (broadcast) delay of blocks increases and the ledger becomes prone to forks; when forking is severe, the pressure on the consensus algorithm rises, and if this is not handled well, the system will eventually collapse completely.

"A very important problem here is that nodes should not forward low-cost junk transactions without restraint, and Solana has not done a good job of flow control in this respect," Wu Ming said.
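The write-lock effect Hu Zhiwei describes can be illustrated with a toy scheduler. This is a sketch under simplifying assumptions, not Solana's runtime: `schedule_batches` is a hypothetical helper that models only write locks, ignores read locks and transaction ordering, and greedily groups transactions that touch disjoint accounts into batches that could run in parallel.

```python
def schedule_batches(txs):
    """Greedily group transactions into batches with no write-lock conflicts.

    txs: list of (tx_id, set_of_accounts_written)
    Returns a list of batches; transactions inside one batch could run
    concurrently, while the batches themselves run one after another.
    """
    batches = []  # each entry: (accounts locked by this batch, tx ids in it)
    for tx_id, writes in txs:
        for locked, ids in batches:
            if locked.isdisjoint(writes):   # no account is written twice
                locked |= writes            # take the write locks (in place)
                ids.append(tx_id)
                break
        else:
            batches.append((set(writes), [tx_id]))
    return [ids for _, ids in batches]


# Transactions writing to different accounts fit in one parallel batch.
independent = [("t1", {"alice"}), ("t2", {"bob"}), ("t3", {"carol"})]
# Transactions all writing to one hot account are forced into sequence.
hot = [("t1", {"pool"}), ("t2", {"pool"}), ("t3", {"pool"})]

print(schedule_batches(independent))  # [['t1', 't2', 't3']]
print(schedule_batches(hot))          # [['t1'], ['t2'], ['t3']]
```

When every transaction write-locks the same popular address, the number of batches equals the number of transactions, so the benefit of concurrent execution disappears.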

Anatoly Yakovenko also acknowledged the issue at the aforementioned Twitter Space event. The main problem, he said, was that in the original program design the "duplicate transaction check" ran after signature verification, so every duplicate had to have its signature verified before it could be identified as a junk transaction. In addition, before the node client upgrade, Solana's code for de-duplication and network redundancy ran very slowly, taking hundreds of microseconds.

To prevent bot trading from disrupting the network again during the next major market move, Anatoly Yakovenko said that "actual flow control" will be introduced in version 1.9 of the Solana mainnet beta.

Another popular public chain, Harmony, faces a similar problem. On January 15, the Harmony network was disrupted for several hours, and the team raised the base gas fee to 30 Gwei to raise the threshold for sending spam transactions.

The post-mortem released by the Harmony community shows that the network's leader node received a large amount of spam traffic, and that older versions of the validator client handled high traffic poorly; this combination of external and internal factors led to the outage.

Harmony CTO Rongjian Lan told Chain Catcher that peer-to-peer (p2p) packets were sent repeatedly, congesting the p2p network so that normal consensus messages could not get through and the network could not reach consensus. The internal cause was a latent bug in the parameters of Harmony's p2p network.

"The new Web3 infrastructure needs better traffic monitoring and traffic restriction mechanisms to prevent network abuse," Rongjian Lan said. He added that after optimizing the parameters of the p2p network protocol layer, Harmony will carry out long-term system improvements at the consensus, network and RPC layers.
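The ordering problem can be sketched by counting how many signature checks each pipeline performs. This is an illustration under stated assumptions, not Solana's code: `verify_signature` is a hypothetical stand-in for an expensive cryptographic check, and duplicates are detected by hashing the whole packet with SHA-256 (so a packet with a forged signature hashes differently and is still verified on its own).

```python
import hashlib


def verify_signature(packet: bytes) -> bool:
    """Hypothetical stand-in for an expensive signature verification."""
    return not packet.startswith(b"bad")


def verify_then_dedup(packets):
    """Original ordering: every packet, duplicate or not, is verified first."""
    seen, accepted, verifications = set(), [], 0
    for packet in packets:
        verifications += 1
        if not verify_signature(packet):
            continue
        digest = hashlib.sha256(packet).digest()
        if digest not in seen:
            seen.add(digest)
            accepted.append(packet)
    return accepted, verifications


def dedup_then_verify(packets):
    """Reordered: drop byte-identical duplicates cheaply, then verify the rest."""
    seen, accepted, verifications = set(), [], 0
    for packet in packets:
        digest = hashlib.sha256(packet).digest()
        if digest in seen:
            continue
        seen.add(digest)
        verifications += 1
        if verify_signature(packet):
            accepted.append(packet)
    return accepted, verifications


# Hypothetical traffic: 90% of packets are copies of the same transaction.
packets = [b"tx-a"] * 9 + [b"tx-b"]
slow = verify_then_dedup(packets)
fast = dedup_then_verify(packets)
print(slow[1], fast[1])  # 10 2
```

Both orderings accept the same two transactions, but with 90 percent duplicates the verify-first pipeline performs five times as many signature checks in this example.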

In addition, the Ethereum Layer 2 scaling network Arbitrum One suffered network outages on September 14 last year and January 9 this year. According to the official announcements, however, these were not directly related to out-of-control traffic, but mainly to the high degree of centralization that the network deliberately maintains while it is still in its testing stage.

Reportedly, Arbitrum One's first outage was due to a bug in its Sequencer, and the most recent one was due to a hardware failure of the main Sequencer node; the backup Sequencer failed to take over in time, resulting in a network "strike" lasting several hours.

"While we usually have redundancy that allows a backup sequencer to take over seamlessly, this failed to take effect because of a software upgrade that was in progress. As a result, the Sequencer stopped processing new transactions," Offchain Labs said.

The Sequencer is a full node operated by Offchain Labs, the Arbitrum development team. It has the privilege of controlling the ordering of transactions in the Inbox, so that users' transaction results are determined immediately.

Offchain Labs said in the announcement that the strongest guarantees will arrive once Arbitrum is fully decentralized.

03

Is raising the cost of "doing evil" the ultimate solution? What is the future of public chain stability?

In fact, given the right incentives, writing scripts and running bots to cheat has long been natural behavior for Internet users, and as on-chain interactions increase, "transaction floods" and "bot" problems will inevitably enter the blockchain space.

The Polygon network also drew complaints about its performance during the same period. At the beginning of January, because the P2E game Sunflower Farmers became popular on Polygon, players sent a large number of transaction requests. For a short period the game's smart contracts accounted for as much as 41.8% of gas consumption on the entire Polygon network, leaving other kinds of transactions on Polygon temporarily stalled; the network was heavily congested, and the average gas price rose nearly sevenfold within a few days.

Polygon average Gas price trend in the past three months Source: Polygonscan

Polygon has long been plagued by "transaction floods", and network congestion occurs from time to time. In October last year, Polygon raised the minimum gas price in node clients 30-fold (from 1 Gwei to 30 Gwei) to cope with the mass of junk transactions.

This response is consistent with Harmony's emergency measure. However, while raising the base gas price increases the cost of running bots, it also affects the experience of ordinary users.

Commenting on this habitual move by project teams, Wu Ming told Chain Catcher that raising the base gas price is certainly effective as a means of "flow control", and that in essence it lowers the throughput demanded of the system.

But he also pointed out that doing better requires working on the system itself to raise the maximum throughput it can support, which involves improvements to consensus algorithms, network forwarding algorithms, storage and execution, and so on.
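Both sides of this trade-off can be shown with a small calculation. This is a sketch with assumed figures: the helper names `admit` and `spam_cost_wei`, the batch of 100,000 spam transactions and the sample mempool are hypothetical, and 21,000 gas is the cost of a plain transfer on EVM-compatible chains.

```python
GWEI = 10**9           # 1 Gwei in wei
TRANSFER_GAS = 21_000  # gas used by a plain transfer on EVM-compatible chains


def admit(txs, min_gas_price_wei):
    """Keep only transactions that bid at least the node's minimum gas price."""
    return [tx for tx in txs if tx["gas_price"] >= min_gas_price_wei]


def spam_cost_wei(n_txs, gas_per_tx, gas_price_wei):
    """Total fee, in wei, that a sender pays to push n_txs through."""
    return n_txs * gas_per_tx * gas_price_wei


# Cost of 100,000 spam transfers before and after the 1 -> 30 Gwei change.
before = spam_cost_wei(100_000, TRANSFER_GAS, 1 * GWEI)
after = spam_cost_wei(100_000, TRANSFER_GAS, 30 * GWEI)
print(before // 10**18, after // 10**18)  # 2 63 (whole native tokens)

# The same floor also rejects ordinary users who bid below it.
mempool = [
    {"sender": "bot", "gas_price": 1 * GWEI},
    {"sender": "casual_user", "gas_price": 5 * GWEI},
    {"sender": "urgent_user", "gas_price": 40 * GWEI},
]
print([tx["sender"] for tx in admit(mempool, 30 * GWEI)])  # ['urgent_user']
```

The floor multiplies the spammer's bill by 30, but it applies equally to every sender, which is why the measure also shows up in the user experience.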

Solana co-founder Anatoly Yakovenko disclosed that the "flow control" improvement involves introducing a new protocol mechanism. He said the upgrade will introduce a stake-weighted quality-of-service (QoS) flow control mechanism built on the QUIC protocol, which he said Google had spent five to six years developing. With this protocol, Solana can impose rate limits on senders.

Deciding how to allocate bandwidth between different blocks is the development team's most important task: validators must take in message flows from the rest of the network and apply quality-of-service prioritization and congestion control according to the weight of each message's source.
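One way to picture stake-weighted flow control is a per-sender packet budget proportional to stake. This is a rough sketch only: it is not Solana's implementation and does not involve QUIC, and the names `stake_weighted_budgets` and `admit_packets`, the stake figures and the capacity of 10 packets per interval are all assumptions made for illustration.

```python
def stake_weighted_budgets(stakes, total_capacity):
    """Split a node's per-interval packet capacity in proportion to stake."""
    total_stake = sum(stakes.values())
    return {
        sender: total_capacity * stake // total_stake
        for sender, stake in stakes.items()
    }


def admit_packets(packets, budgets):
    """Admit packets in arrival order until each sender's budget is used up."""
    used = {}
    admitted = []
    for sender, payload in packets:
        if used.get(sender, 0) < budgets.get(sender, 0):  # unknown senders get 0
            used[sender] = used.get(sender, 0) + 1
            admitted.append((sender, payload))
    return admitted


# Hypothetical stakes and a capacity of 10 packets per interval.
budgets = stake_weighted_budgets({"validator_a": 70, "validator_b": 30}, 10)
print(budgets)  # {'validator_a': 7, 'validator_b': 3}

# validator_b floods the node; validator_a sends a normal amount.
packets = [("validator_b", i) for i in range(100)]
packets += [("validator_a", i) for i in range(5)]
packets += [("unstaked_bot", 0)]
admitted = admit_packets(packets, budgets)
print(len(admitted))  # 8
```

In this model a flooding sender can only consume its own share, so the other sender's five packets still get through even though they arrive after a hundred packets of spam.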

Anatoly Yakovenko said on Twitter that the above "flow control" measures will be rolled out in the next 4-5 weeks.

Hu Zhiwei said that, against traffic attacks, public chains can also adopt network traffic protection for validators, such as sentinel nodes (note: through a series of mechanisms, these can switch from master to standby when the master node fails, providing node failover). For higher TPS, besides optimizing the chain itself, one can also consider scaling out through cross-chain technology combined with application-specific chains.

This is also the approach BSC is exploring. BSC recently acknowledged in its annual summary that its operating mechanism faces many challenges, including "network congestion and node operators face difficulties in managing their full nodes to synchronize with the latest blocks", which led to several brief outages over the past year.

BSC said this is because its large block size means validator nodes need more storage space and time to synchronize blocks. In 2022 it will move toward multi-chain and cross-chain development, launching BSC Application Sidechains (BAS) and BSC Partition Chains (BPC) to reduce the amount of data stored on the main chain.

BSC's technology plan for the year Source: BSC Blog

Can technological improvements and greater decentralization ensure that public chain networks run stably?

On this question, some netizens have borrowed the blockchain scalability "impossible triangle" to describe a similar dilemma for "transaction quality": among spam resistance, censorship resistance and low fees, achieving any two means the third cannot be achieved.

Whether this holds will remain unknown until the project teams mentioned above implement their improvements.

In any case, the outages offer a lesson: public chains, as underlying infrastructure, will remain in their early stages for a long time to come. They still face many tests of network stability and ecosystem maturity, and in particular need to take more measures to handle special situations such as transaction surges, so that the experience of ordinary users is not harmed.

(Loners Liu and Hunter He also contributed to this article.)