
System architecture patterns in ultra-high concurrency scenarios

Author: Senior Internet Architect
This article is not a one-size-fits-all prescription; the right approach depends on many factors, such as the nature of the business, the hosting infrastructure (cloud or on-premises), cost, and more. It simply collects some general lessons learned.

What does high concurrency mean?

High concurrency means that the system handles a large number of requests or tasks within the same period of time, including large volumes of data reads, data writes, and computing tasks.

Characteristics: instantaneous, massive request volume. The system must handle a very large number of requests in a short window, possibly thousands or even millions of concurrent requests within milliseconds.

To cope with high concurrency, the system needs sound architecture design, performance optimization, and resource management so that it can handle a large number of concurrent requests and still provide stable, efficient service.

High concurrency means a storm is coming, and it is a serious test of our experience.

Metrics for measuring concurrency

In 2016, "Double 11" Alipay peaked at 120,000 payments per second.

In 2017, WeChat red envelopes peaked at 760,000 sent and received per second during the Spring Festival.

These figures all describe the degree of concurrency.

The degree of concurrency is generally described by the following metrics:

  1. Throughput: Throughput refers to the number of tasks the system can handle per unit of time. This is an important measure of a system's concurrent processing capacity. In general, the higher the throughput, the greater the system's concurrency capacity.
  2. Response Time: Response time refers to the time it takes for the system to respond to a request from the time it is received. In high-concurrency environments, response times can increase, so keeping them low is key to improving the user experience.
  3. Concurrent Users: The number of concurrent users refers to the number of users that are being served by the system at the same time. This is another important metric to measure the concurrent processing capacity of a system.
  4. Concurrent Requests: The number of concurrent requests is the number of requests that the system is processing at the same time. This is an important performance metric for a web server or application server.

Problems and challenges

Almost every system eventually faces this storm: as traffic grows, all kinds of technical problems appear, and if they haven't yet, your concurrency may simply not be high enough. Under ultra-high concurrency, systems face a set of common problems, which from the perspective of the system as a whole show up as follows:

Performance issues, such as access timeouts, rising CPU load, frequent GC, deadlocks, and large volumes of data to store.

Reduced system availability, which can lead to large numbers of 5xx errors, increased database pressure, and even crashes.

Data consistency problems.

These problems compound with normal business complexity, and once security, cost, and other concerns are added, the complexity of the system grows geometrically. To solve such complex problems comprehensively, we must step out of the code, think holistically, and analyze and solve them at the architecture level.

Below, we look at how to address the complexity of high-concurrency systems from a global perspective, using established patterns and macro-level architectural thinking.

How to improve performance

Performance problems are a composite issue and only a symptom; they usually appear together with memory pressure, CPU overload, insufficient network bandwidth, and so on.

Therefore, from an architectural point of view, the general solutions can be summarized in three points:

  1. Divide and conquer: split into distributed clusters, plan capacity reasonably or support elastic scaling, and avoid overload caused by insufficient resources at a single point
  2. Resource reuse, such as caching and pooling
  3. Deferred processing, such as asynchronous execution

There are also optimization techniques at the implementation level, such as merging requests to reduce the number of I/Os, I/O multiplexing, multithreading, and index optimization.

Performance optimization principles

  • Be problem-oriented: don't optimize prematurely, to avoid adding complexity to the system and wasting R&D effort
  • Follow the 80/20 rule: grasp the main contradiction and optimize the main performance bottlenecks first
  • Back optimization with data: always know how much your optimization has reduced response time and increased throughput. Use the average, the extremes (maximum/minimum), and quantile values as the statistics to track (a small percentile sketch follows this list)
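
As a sketch of that last point, the snippet below computes the average, maximum, and a percentile from a set of latency samples. The function name and sample values are illustrative only.

```go
// Minimal latency statistics for optimization work: average, maximum, quantile.
package main

import (
	"fmt"
	"sort"
	"time"
)

// percentile returns the p-th percentile (0 < p <= 100) of sorted samples.
func percentile(sorted []time.Duration, p float64) time.Duration {
	if len(sorted) == 0 {
		return 0
	}
	idx := int(float64(len(sorted)-1) * p / 100.0)
	return sorted[idx]
}

func main() {
	samples := []time.Duration{
		12 * time.Millisecond, 15 * time.Millisecond, 9 * time.Millisecond,
		240 * time.Millisecond, 14 * time.Millisecond, 11 * time.Millisecond,
	}
	sort.Slice(samples, func(i, j int) bool { return samples[i] < samples[j] })

	var sum time.Duration
	for _, s := range samples {
		sum += s
	}
	fmt.Println("avg:", sum/time.Duration(len(samples)))
	fmt.Println("max:", samples[len(samples)-1])
	fmt.Println("p95:", percentile(samples, 95)) // tail latency often matters more than the average
}
```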

Divide and conquer

The purpose of divide and conquer is twofold: human reasoning capacity is limited, and so is the performance of a single machine. Even though single-machine performance has grown from a handful of operations per second at the birth of the first computer to hundreds of millions today, clustering is still inevitable for modern computing.

Whether at the micro level (splitting work across processes and threads) or at the macro level (from monolith to SOA to microservices), the same idea is being followed.


At the macro level this approach is effective, but it brings a lot of complexity:

  1. How each sub-problem is assigned to a different machine
  2. How the problem is split into sub-problems

For the first question, it is natural to use a scheduler for machine allocation.


In web development, the scheduler is basically equivalent to the role of a load balancer, and there are many types and algorithms of load balancer, including:

  • Hardware load balancing and software load balancing. Hardware load balancing usually means dedicated load-balancing devices, such as F5 and Radware; software load balancing means load balancing implemented in software, such as Nginx and HAProxy.
  • Load-balancing algorithms:
    • Round Robin
    • Weighted Round Robin
    • Least Connections
    • Weighted Least Connections
    • Random
    • Hashing and Consistent Hashing (a minimal hash-ring sketch follows this list)
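
As a sketch of the last algorithm in the list, here is a minimal consistent-hash ring. The `Ring` type, the number of virtual nodes, and the node addresses are illustrative assumptions rather than the internals of any particular load balancer.

```go
// A minimal consistent-hash ring for request routing.
package main

import (
	"fmt"
	"hash/crc32"
	"sort"
	"strconv"
)

type Ring struct {
	keys  []uint32          // sorted hashes of virtual nodes
	nodes map[uint32]string // hash -> real node
}

func NewRing(nodes []string, virtualNodes int) *Ring {
	r := &Ring{nodes: make(map[uint32]string)}
	for _, n := range nodes {
		for i := 0; i < virtualNodes; i++ {
			h := crc32.ChecksumIEEE([]byte(n + "#" + strconv.Itoa(i)))
			r.keys = append(r.keys, h)
			r.nodes[h] = n
		}
	}
	sort.Slice(r.keys, func(i, j int) bool { return r.keys[i] < r.keys[j] })
	return r
}

// Get returns the node responsible for a key: the first virtual node
// clockwise from the key's hash on the ring.
func (r *Ring) Get(key string) string {
	h := crc32.ChecksumIEEE([]byte(key))
	i := sort.Search(len(r.keys), func(i int) bool { return r.keys[i] >= h })
	if i == len(r.keys) {
		i = 0 // wrap around the ring
	}
	return r.nodes[r.keys[i]]
}

func main() {
	ring := NewRing([]string{"10.0.0.1", "10.0.0.2", "10.0.0.3"}, 100)
	fmt.Println(ring.Get("user:42")) // the same key keeps hitting the same node
}
```

Virtual nodes smooth out the distribution, so removing one real node only remaps a small fraction of keys.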

High-performance databases

The table below shows Tencent Cloud MySQL 8.0 benchmark results in a partially cached scenario: only part of the data fits in the buffer pool, so queries have to read from and write to disk to update the cache.

| CPU (cores) | Memory (MB) | Concurrency | Rows per table | Total tables | SysBench TPS | SysBench QPS | avg_lat |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 1000 | 8 | 800000 | 6 | 582.76 | 11655.1 | 13.73 |
| 1 | 2000 | 8 | 800000 | 12 | 588.92 | 11778.4 | 13.58 |
| 2 | 4000 | 16 | 800000 | 24 | 899.06 | 17981.2 | 17.8 |
| 4 | 8000 | 32 | 800000 | 48 | 1915.83 | 38316.6 | 16.7 |
| 4 | 16000 | 32 | 6000000 | 13 | 1884.2 | 37684 | 16.98 |
| 8 | 16000 | 64 | 6000000 | 13 | 3356.71 | 67134.2 | 19.06 |
| 8 | 32000 | 64 | 6000000 | 25 | 3266.73 | 65334.6 | 19.59 |
| 16 | 32000 | 128 | 6000000 | 25 | 5370.18 | 107404 | 23.83 |
| 16 | 64000 | 128 | 6000000 | 49 | 5910.85 | 118217 | 21.65 |
| 16 | 96000 | 128 | 6000000 | 74 | 5813.94 | 116279 | 22.01 |
| 16 | 128000 | 128 | 6000000 | 98 | 5700.06 | 114001 | 22.45 |

The performance of a single MySQL node is limited, and in a high-concurrency scenario a single node struggles to handle the volume of database requests. Following the divide-and-conquer idea above, the data layer is therefore clustered and split in the following ways to improve performance.

  1. Read/write splitting: the master node handles write operations (such as inserts, updates, and deletes) and is the authoritative source of data; slave nodes handle read operations (such as queries), synchronize data from the master, and serve reads.

Data consistency issues may be caused by master-slave latency, which requires some trade-offs and optimization strategies.

  2. Sharding of databases and tables: a horizontal splitting strategy used to overcome the storage-capacity and performance limits of a single database. Database sharding: data is distributed across multiple database instances according to certain rules; each instance is a shard, and the shard key determines which shard a row is stored in. Table sharding: within each database instance, data is spread across multiple tables according to certain rules; each table is a shard table, and the shard key determines which table a row is stored in. (A shard-routing sketch appears after this list.)
  3. Hot and cold data separation
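
The sketch below illustrates the shard-routing idea from the sharding item above: hash the shard key to pick a database instance and then a table within it. The shard counts and the `order_db_X`/`order_tab_Y` naming are illustrative assumptions; in practice this routing is usually handled by a sharding middleware such as ShardingSphere or MyCat.

```go
// A minimal shard-routing sketch: the shard key (a user ID here) is hashed
// to select a database instance and a table within it.
package main

import (
	"fmt"
	"hash/fnv"
)

const (
	dbCount  = 4 // number of database instances (shards); illustrative
	tabPerDB = 8 // number of tables per instance; illustrative
)

func route(userID string) (db string, table string) {
	h := fnv.New32a()
	h.Write([]byte(userID))
	v := h.Sum32()
	dbIdx := v % dbCount
	tabIdx := (v / dbCount) % tabPerDB
	return fmt.Sprintf("order_db_%d", dbIdx), fmt.Sprintf("order_tab_%d", tabIdx)
}

func main() {
	db, tab := route("user_10086")
	fmt.Println(db, tab) // the same user always maps to the same shard and table
}
```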

Other considerations: choose databases based on cost, scale, and available resources, and of course other NoSQL systems can also be used.

Caching

Caching is a necessary and common means to build high-performance systems, and in many scenarios, relying solely on the performance of the storage system is not enough.


Caching reduces accesses to the database and cuts network I/O, thereby improving the performance of the entire system.

Common components used to build caches include Redis and Memcached. When using caches, take Redis as an example:

  1. Use clusters, plan capacity reasonably, and put monitoring in place
  2. Follow good key-naming conventions and keep keys reasonably short
  3. Avoid big keys
  4. Choose an appropriate caching policy

Using a cache makes the overall architecture more complex and introduces some new problems. The common ones, along with their solutions, are:

  1. Cache penetration: a query for data that does not exist bypasses the cache on every request and goes straight to the database, putting excessive pressure on it. Solutions include using a Bloom filter to intercept queries for non-existent keys, caching null values, borrowing the cache-breakdown solutions below, and so on. (A cache-aside sketch covering this and the next problem follows this list.)
  2. Cache breakdown: a hot key suddenly expires, so a large number of requests hit the database directly and overload it. Solutions include never expiring hot keys, using a mutex to rebuild the cache, and preloading hot data.
  3. Cache avalanche: a large amount of cached data expires at the same time, so a large number of requests hit the database directly and overload it. Solutions include staggering expiration times, using distributed locks to rebuild the cache, keeping a backup cache, and so on.
  4. Hot-key issues: data that is accessed extremely frequently, characterized by high frequency, high importance, and high volume. Mitigations include:
     • Cache warm-up: load hot data into the cache before startup or before peak periods to raise the hit rate.
     • Appropriate expiration times: give hot data longer expiration times to avoid frequent cache rebuilds.
     • Eviction policy: choose an eviction policy that keeps important hot data in the cache.
     • Cache prefetching: based on users' access patterns and behavior, load data that is likely to be accessed into the cache in advance.
     • Multi-level caching and horizontal splitting of hot keys.
  5. Cached data consistency: in a distributed environment, keeping cached data consistent is a challenge, and dirty data can end up in the cache. Solutions include cache update policies, cache locks, and cache expiration notifications.
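
A minimal cache-aside sketch for the penetration and breakdown problems above, assuming an in-memory map standing in for Redis and a hypothetical `loadFromDB` loader: it caches a null marker for missing keys and uses `singleflight` so only one goroutine per key rebuilds the cache.

```go
package main

import (
	"errors"
	"fmt"
	"sync"

	"golang.org/x/sync/singleflight"
)

var (
	errNotFound = errors.New("not found")
	nullMarker  = "<nil>" // cached placeholder for keys that do not exist

	mu    sync.RWMutex
	cache = map[string]string{} // stand-in for Redis; a real cache would add TTLs
	sf    singleflight.Group
)

func loadFromDB(key string) (string, error) {
	// Hypothetical database lookup.
	if key == "user:1" {
		return "Alice", nil
	}
	return "", errNotFound
}

func Get(key string) (string, error) {
	mu.RLock()
	v, ok := cache[key]
	mu.RUnlock()
	if ok {
		if v == nullMarker {
			return "", errNotFound // known-missing key, no database hit
		}
		return v, nil
	}
	// Only one goroutine per key reaches the database at a time.
	res, err, _ := sf.Do(key, func() (interface{}, error) {
		v, err := loadFromDB(key)
		mu.Lock()
		defer mu.Unlock()
		if errors.Is(err, errNotFound) {
			cache[key] = nullMarker // in practice, cache the miss with a short TTL
			return "", err
		}
		if err == nil {
			cache[key] = v
		}
		return v, err
	})
	if err != nil {
		return "", err
	}
	return res.(string), nil
}

func main() {
	fmt.Println(Get("user:1"))
	fmt.Println(Get("user:999")) // a repeated miss is served from the null cache
}
```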

Pooling

In high-performance applications, pooling is a common and effective technique for managing and reusing resources to improve the performance and efficiency of the system. Commonly used pooling techniques include the following (a connection-pool configuration sketch follows the list):

  1. Connection pool: Connection pool is a common pooling technique in scenarios such as database access and network communication. Connection pooling creates a certain number of connections in advance, and when database queries or network communication are required, connections can be obtained from the connection pool instead of creating a new connection each time, reducing the overhead of connection creation and destruction, and improving system performance.
  2. Thread pool: In multithreaded applications, thread pools can manage and reuse threads, avoid frequent creation and destruction of threads, reduce the overhead of thread switching, and improve the concurrency performance and response speed of the system.
  3. Object Pool: Object pools are used to manage and reuse object instances, avoid frequent creation and destruction of objects, and improve the memory utilization and performance of the system. Object pools can be used to create complex objects, such as database connections, HTTP connections, and threads.
  4. Memory Pool: A memory pool is a technology used to manage memory allocation and freeing by pre-allocating a contiguous piece of memory space and allocating it to applications on demand, reducing memory fragmentation and frequent memory allocation operations, improving memory management efficiency and system performance.
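
As a sketch of the connection-pool item, Go's `database/sql` pools connections internally; the DSN, the limits below, and the MySQL driver import are illustrative assumptions.

```go
package main

import (
	"database/sql"
	"log"
	"time"

	_ "github.com/go-sql-driver/mysql" // assumed MySQL driver
)

func main() {
	db, err := sql.Open("mysql", "user:pass@tcp(127.0.0.1:3306)/app")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Reuse a bounded set of connections instead of opening one per request.
	db.SetMaxOpenConns(100)                 // upper bound on concurrent connections
	db.SetMaxIdleConns(20)                  // connections kept warm for reuse
	db.SetConnMaxLifetime(30 * time.Minute) // recycle connections periodically

	if err := db.Ping(); err != nil {
		log.Fatal(err)
	}
}
```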

Server patterns

Reactor and Proactor are two design patterns that are commonly used to implement high-performance servers, which are useful when dealing with I/O-intensive tasks to improve the concurrency performance and responsiveness of the system.

  1. Reactor Pattern: The Reactor pattern is an event-driven design pattern that listens to and distributes events through an Event Loop, and when an event occurs, the Reactor invokes the appropriate handler to handle the event. The Reactor pattern is often used to implement event-driven network programming, such as event-based servers.
  2. Proactor pattern: the Proactor pattern is also event-driven; unlike the Reactor pattern, the I/O operations themselves are performed by the operating system or framework rather than by the application. The application is only notified when an I/O operation completes and then processes the result. The Proactor pattern is typically used to implement asynchronous I/O. (A minimal Reactor-style dispatch sketch follows.)
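
A conceptual, in-process sketch of the Reactor idea referenced above: a single loop receives events and dispatches them to registered handlers. Real reactors sit on top of epoll/kqueue and react to socket readiness; the `Event` and `Reactor` types here are illustrative only.

```go
package main

import "fmt"

type Event struct {
	Kind string // e.g. "read", "write"
	Data string
}

type Reactor struct {
	events   chan Event
	handlers map[string]func(Event)
}

func NewReactor() *Reactor {
	return &Reactor{events: make(chan Event, 64), handlers: map[string]func(Event){}}
}

// Register associates an event kind with its handler.
func (r *Reactor) Register(kind string, h func(Event)) { r.handlers[kind] = h }

// Submit delivers an event to the loop (in a real reactor this comes from I/O readiness).
func (r *Reactor) Submit(e Event) { r.events <- e }

// Run is the event loop: demultiplex incoming events and dispatch to handlers.
func (r *Reactor) Run() {
	for e := range r.events {
		if h, ok := r.handlers[e.Kind]; ok {
			h(e)
		}
	}
}

func main() {
	r := NewReactor()
	r.Register("read", func(e Event) { fmt.Println("handle read:", e.Data) })
	r.Submit(Event{Kind: "read", Data: "GET /index"})
	close(r.events) // stop the loop once buffered events are drained
	r.Run()
}
```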

Asynchronous

The asynchronous pattern improves system performance by processing requests asynchronously: instead of waiting for the result, the request is placed in a queue or other data structure and the call returns immediately. Request processing therefore does not block the main thread, which increases the throughput and responsiveness of the system.

Asynchronous processing is usually decoupled with message queues; it is essentially a deferred-processing technique that improves the throughput of the entire system.

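
A minimal sketch of the asynchronous pattern: the request path enqueues the task and returns immediately, while background workers drain the queue. The buffered channel stands in for a message queue such as Kafka or RabbitMQ; the `Order` type and worker count are illustrative.

```go
package main

import (
	"fmt"
	"sync"
)

type Order struct{ ID string }

var queue = make(chan Order, 1024) // bounded buffer = backpressure

// SubmitOrder is what the request path calls: enqueue and return at once.
func SubmitOrder(o Order) {
	queue <- o
	fmt.Println("accepted", o.ID) // respond to the client without waiting for processing
}

func worker(id int, wg *sync.WaitGroup) {
	defer wg.Done()
	for o := range queue {
		// Deferred, potentially slow work: persist, notify, settle, etc.
		fmt.Printf("worker %d processed order %s\n", id, o.ID)
	}
}

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go worker(i, &wg)
	}
	for i := 0; i < 10; i++ {
		SubmitOrder(Order{ID: fmt.Sprintf("o-%d", i)})
	}
	close(queue) // in a real system the queue lives as long as the service
	wg.Wait()
}
```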

CQRS

Command Query Responsibility Segregation (CQRS) is a software architecture pattern that separates the read operations (queries) and write operations (commands) of data into different models. This separation helps improve the scalability, performance, and security of the system.

The core concept of CQRS is to divide an application into two parts (a minimal sketch follows this list):

  1. Command Model: Responsible for handling data modification operations, such as creating, updating, or deleting. The command model typically modifies the data store and generates the corresponding events. These events can be used to update the query model or trigger additional business logic.
  2. Query Model: Responsible for handling read operations on data, such as fetching data or performing searches. The query model typically takes data from the data store and returns it to the client. Query models can be optimized for specific queries to improve performance.
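
A minimal CQRS sketch, assuming the write side emits events and the read side builds a denormalized view from them; all types and the in-memory stores are illustrative stand-ins.

```go
package main

import "fmt"

type Event struct {
	Kind    string
	Product string
	Qty     int
}

// Command model: validates and applies writes, emitting events.
type CommandModel struct{ events chan Event }

func (c *CommandModel) AddStock(product string, qty int) {
	// ... business validation would happen here ...
	c.events <- Event{Kind: "StockAdded", Product: product, Qty: qty}
}

// Query model: a denormalized view optimized for reads.
type QueryModel struct{ stock map[string]int }

func (q *QueryModel) Apply(e Event) {
	if e.Kind == "StockAdded" {
		q.stock[e.Product] += e.Qty
	}
}

func (q *QueryModel) Stock(product string) int { return q.stock[product] }

func main() {
	events := make(chan Event, 16)
	cmd := &CommandModel{events: events}
	qry := &QueryModel{stock: map[string]int{}}

	cmd.AddStock("sku-1", 5)
	cmd.AddStock("sku-1", 3)
	close(events)
	for e := range events { // in practice the read side consumes asynchronously
		qry.Apply(e)
	}
	fmt.Println(qry.Stock("sku-1")) // 8
}
```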

Event-driven & event-sourced

Event-Driven Architecture (EDA) is a software architecture pattern that enables asynchronous communication between applications, systems, or services through events. In an event-driven architecture, an event is the occurrence of a state change or specific condition that can trigger a response from other components or services. Event-driven architectures allow components or services to be independent of each other, which helps improve scalability, flexibility, and fault isolation.

The main components of an event-driven architecture include (a minimal in-process event-bus sketch follows this list):

  1. Event producer: The component or service that is responsible for creating and publishing events. Event producers don't care who subscribes to the events they publish, or how they are handled.
  2. Event consumer: A component or service that subscribes to and processes a specific event. Communication between event consumers and event producers is indirect, typically through event buses or message queues.
  3. Event bus/message queue: Middleware used to transmit events. An event bus, or message queue, is responsible for delivering events from producers to consumers while ensuring that events are delivered reliably, orderly, and securely.
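
A minimal in-process event-bus sketch, as referenced above: producers publish to a topic and every subscriber receives the event. In production the bus would be Kafka, RabbitMQ, or similar; the `Bus` type and topic names here are illustrative.

```go
package main

import (
	"fmt"
	"sync"
)

type Bus struct {
	mu   sync.RWMutex
	subs map[string][]chan string // topic -> subscriber channels
}

func NewBus() *Bus { return &Bus{subs: map[string][]chan string{}} }

func (b *Bus) Subscribe(topic string) <-chan string {
	ch := make(chan string, 16)
	b.mu.Lock()
	b.subs[topic] = append(b.subs[topic], ch)
	b.mu.Unlock()
	return ch
}

// Publish fans the event out to every subscriber of the topic.
func (b *Bus) Publish(topic, event string) {
	b.mu.RLock()
	defer b.mu.RUnlock()
	for _, ch := range b.subs[topic] {
		ch <- event
	}
}

func main() {
	bus := NewBus()
	billing := bus.Subscribe("order.created")
	shipping := bus.Subscribe("order.created")

	bus.Publish("order.created", "order-42")

	fmt.Println("billing got:", <-billing) // consumers are decoupled from the producer
	fmt.Println("shipping got:", <-shipping)
}
```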

In addition, there are more fine-grained performance techniques, including but not limited to batching, zero copy, efficient encoding, SQL tuning, and lock-free programming.

Availability and auto scaling

Availability refers to the ability of a system to perform its functions without interruption, represents the degree of availability of the system, and is one of the criteria when designing a system.

High availability is generally measured by an SLA and described in terms of "nines". Designing for high availability is a process of trade-offs, which is why system availability is always some number of nines, with that last one forever out of reach.

Basic idea

Architectural aspects

Redundancy, failover recovery, governance

Observational aspects

Build out logging, metrics, and tracing.

Tooling

Run drills, stress tests, fault reviews, and so on.


CAP Theory

CAP theory is an important theory in the field of distributed computing, which points out that in distributed systems, the three core indicators of consistency, availability, and partition tolerance cannot be optimized at the same time. In other words, when we design a distributed system, we have to make trade-offs among these three metrics.

The three indicators of the CAP theory are defined as follows:

  1. Consistency: Consistency means that all nodes in a distributed system have the same copy of data at the same time. To put it simply, when a client writes to data, other clients can immediately read the latest data.
  2. Availability: Availability refers to the ability of a distributed system to provide services to clients in both normal and failed situations. In other words, when a client makes a request, the system always returns a response within a limited amount of time, whether it succeeds or fails.
  3. Partition Tolerance: Partition tolerance refers to the ability of a distributed system to operate normally even when network partitioning (i.e., communication failures between nodes) occurs. In this case, the system may be inconsistent or unavailable, but it will do its best to keep it up and running.

According to the CAP theory, we can divide distributed systems into the following three types:

  1. CA (Consistency and Availability): This type of system guarantees consistency and availability without network partitioning. However, when there is a network partition, it can be difficult for such systems to maintain normal operation. Traditional relational databases, such as MySQL and PostgreSQL, often fall into this category.
  2. CP (Consistency and Partition Tolerance): This type of system guarantees consistency when network partitioning occurs, but at the expense of availability. When a network partition occurs, these systems may reject a subset of requests to ensure data consistency. Distributed databases such as ZooKeeper and Google Spanner fall into this category.
  3. AP (Availability and Partition Tolerance): This type of system guarantees availability when network partitioning occurs, but at the expense of consistency. When a network partition occurs, such systems may return stale data to guarantee responsiveness to clients. NoSQL databases such as Cassandra, Couchbase, and Amazon Dynamo fall into this category.

In other words, in practice, CA is basically not guaranteed.

The trade-off between C and A can occur repeatedly at very small granularity within the same system, and each decision may vary depending on the specific action or even because of the specific data or user involved.

BASE

BASE stands for Basically Available, Soft state, and Eventual consistency. The core idea is that even if strong consistency cannot be achieved (the consistency in CAP is strong consistency), the application can reach eventual consistency in a way that suits it.

It is mainly an extension of the AP approach and gives us a practical direction to work with.


Redundancy

Multi-machine high availability

The design goal is to continue to function normally when some hardware is damaged.

Commonly used redundancy architectures include dual-machine master/standby and master/slave setups, as well as clusters and partitions.

Master/standby and master/slave replication are generally implemented in the various storage components, such as the database, Redis, MongoDB, and Kafka.

Clusters and partitions are used in both computing and storage architectures, and clusters are divided into two types: mutual standby and independent, which are symmetric and asymmetric, respectively.


Disaster recovery - multi-active in different places

Building on the original single data center, additional centers replicate both applications and storage, forming remote dual (or multiple) centers.


Failover

Failover between peer nodes: Nginx and service-governance frameworks both support switching traffic to another node after one node fails.

For failover between non-peer nodes, heartbeat detection is used to implement active/standby switchover (for example, Redis Sentinel or cluster mode, or MySQL master/slave switchover).

In addition, configure timeouts, retry policies, and idempotent designs at the interface level (a retry-with-timeout sketch follows).
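
A sketch of those interface-level measures, assuming a hypothetical downstream URL: each attempt has its own timeout, failures are retried with exponential backoff, and an idempotency key (the header name is a common convention, not a standard) makes retries safe to repeat.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"time"
)

// callWithRetry posts to url with a per-attempt timeout, bounded retries, and
// an idempotency key so repeated attempts are safe on the server side.
func callWithRetry(url, idempotencyKey string, attempts int) error {
	var lastErr error
	for i := 0; i < attempts; i++ {
		ctx, cancel := context.WithTimeout(context.Background(), 500*time.Millisecond)
		req, err := http.NewRequestWithContext(ctx, http.MethodPost, url, nil)
		if err != nil {
			cancel()
			return err
		}
		req.Header.Set("Idempotency-Key", idempotencyKey) // the server deduplicates on this key

		resp, err := http.DefaultClient.Do(req)
		cancel()
		if err == nil {
			resp.Body.Close()
			if resp.StatusCode < 500 {
				return nil // success, or a client error that retrying will not fix
			}
			lastErr = fmt.Errorf("server returned %d", resp.StatusCode)
		} else {
			lastErr = err
		}
		time.Sleep(time.Duration(1<<i) * 100 * time.Millisecond) // exponential backoff
	}
	return fmt.Errorf("all %d attempts failed: %w", attempts, lastErr)
}

func main() {
	// The URL and idempotency key are hypothetical.
	if err := callWithRetry("http://payment.internal/charge", "order-42-charge", 3); err != nil {
		fmt.Println(err)
	}
}
```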

Governance policies

Throttling processing

Throttling is a strategy that controls the rate of requests and is designed to prevent the system from being overloaded. By restricting requests, you can ensure that the system can cope with bursts of traffic in high-concurrency scenarios and will not crash due to resource exhaustion.

  • The fixed-window counter is simple, but it can fail to limit traffic correctly at window boundaries.
  • The leaky bucket controls the consumer's rate by queueing, which suits instantaneous traffic bursts; under a sustained rise in traffic, however, queued requests easily time out and starve.
  • The token bucket allows a certain amount of burst traffic through, which carries risk if the downstream (callee) cannot handle it (a minimal token-bucket sketch follows this list).
  • The sliding-window counter limits traffic with relatively high accuracy.
  • Adaptive throttling.
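
A minimal token-bucket sketch, as referenced in the list: tokens accumulate at a fixed rate up to a burst capacity, and each request consumes one token or is rejected. The rate and capacity values are illustrative; `golang.org/x/time/rate` offers a production-ready equivalent.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

type TokenBucket struct {
	mu       sync.Mutex
	capacity float64   // maximum burst size
	rate     float64   // tokens added per second
	tokens   float64   // current token count
	last     time.Time // last refill time
}

func NewTokenBucket(rate, capacity float64) *TokenBucket {
	return &TokenBucket{capacity: capacity, rate: rate, tokens: capacity, last: time.Now()}
}

// Allow reports whether one request may pass right now.
func (b *TokenBucket) Allow() bool {
	b.mu.Lock()
	defer b.mu.Unlock()

	now := time.Now()
	b.tokens += now.Sub(b.last).Seconds() * b.rate // refill based on elapsed time
	if b.tokens > b.capacity {
		b.tokens = b.capacity
	}
	b.last = now

	if b.tokens >= 1 {
		b.tokens--
		return true
	}
	return false
}

func main() {
	limiter := NewTokenBucket(100, 20) // 100 req/s sustained, bursts of up to 20
	allowed, rejected := 0, 0
	for i := 0; i < 50; i++ {
		if limiter.Allow() {
			allowed++
		} else {
			rejected++
		}
	}
	fmt.Println("allowed:", allowed, "rejected:", rejected) // the burst passes, the rest is shed
}
```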

Circuit breaking

A circuit breaker is a strategy that automatically cuts off access to a faulty service to prevent fault propagation and system avalanches. When a service fails (for example, continuous failures or timeouts), the circuit breaker opens automatically to prevent further access to the service. While the breaker is open, requests are rejected quickly without consuming system resources. After a period of time, the breaker enters a half-open state and lets a few requests through to test whether the service has recovered. If the service is back to normal, the breaker closes; if it still fails, the breaker stays open.
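
A minimal circuit-breaker sketch with the closed/open/half-open states described above; the failure threshold and open duration are illustrative assumptions.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

type state int

const (
	closed state = iota
	open
	halfOpen
)

var ErrOpen = errors.New("circuit breaker is open")

type Breaker struct {
	mu           sync.Mutex
	state        state
	failures     int
	maxFailures  int           // consecutive failures before opening
	openUntil    time.Time     // when the open state may transition to half-open
	openDuration time.Duration
}

func (b *Breaker) Call(fn func() error) error {
	b.mu.Lock()
	if b.state == open {
		if time.Now().Before(b.openUntil) {
			b.mu.Unlock()
			return ErrOpen // fail fast, do not touch the faulty service
		}
		b.state = halfOpen // let a trial request through
	}
	b.mu.Unlock()

	err := fn()

	b.mu.Lock()
	defer b.mu.Unlock()
	if err != nil {
		b.failures++
		if b.state == halfOpen || b.failures >= b.maxFailures {
			b.state = open
			b.openUntil = time.Now().Add(b.openDuration)
		}
		return err
	}
	b.failures = 0
	b.state = closed // success closes the breaker again
	return nil
}

func main() {
	b := &Breaker{maxFailures: 3, openDuration: 5 * time.Second}
	for i := 0; i < 5; i++ {
		err := b.Call(func() error { return errors.New("downstream timeout") })
		fmt.Println(err) // after three failures, later calls fail fast with ErrOpen
	}
}
```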

Downgrade processing

Degradation is a strategy of temporarily shutting down some non-critical functions or lowering the quality of service when the system is overstressed or failing. Downgrading preserves the proper functioning of critical features while reducing the burden on the system. Downgrades can be triggered by different policies, such as error rate, response time, or system load. A degraded service can return default values, cached data, or an error message so that clients can handle the degraded situation.

Prerequisite: services must be graded by importance in advance.

Grayscale release

Grayscale (canary) release supports deploying to a small share of traffic by machine; observe system logs and business metrics, and roll out to the full fleet once operation is stable.

Observability

Metrics, logs, and traces are the three pillars of observability, providing capabilities such as architecture awareness, bottleneck location, and fault tracing. With observability we can gain more comprehensive, fine-grained insight into our systems and uncover deeper issues to improve availability.


Logging

With the ELK stack, the general logging pipeline involves instrumentation (log points), collection, processing, indexing, and retrieval. It is a complex big-data processing project in its own right.

Metrics

Mainstream metrics stacks include Prometheus, Grafana, InfluxDB, and Elastic.

Four golden signals: traffic (QPS), latency, errors, and saturation.

Metric design principle: RED (Rate, Errors, Duration).

Tracing

In a complex distributed system built on a microservice architecture, a single client request is processed by many microservices, which makes problems harder to locate. If a downstream service returns an error, we want to see the whole upstream call chain to help reproduce and solve the problem, much like gdb's backtrace shows a function's stack frames and their hierarchy. Tracing generates an association identifier, the trace ID, when the first call is triggered; by passing it along to all subsequent calls over RPC, we can stitch the whole call chain together. Tracing also uses spans to represent the relationships between the individual calls within a chain.

There are many open-source components commonly used to build end-to-end tracing, such as SkyWalking and Jaeger.
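
A minimal trace-propagation sketch: generate a trace ID at the entry point if none arrived, keep it in the request context, and forward it to downstream calls in a header. The `X-Trace-ID` header and the downstream URL are illustrative conventions, not the wire formats used by SkyWalking or Jaeger.

```go
package main

import (
	"context"
	"crypto/rand"
	"encoding/hex"
	"log"
	"net/http"
)

type ctxKey struct{}

func traceIDFrom(ctx context.Context) string {
	if v, ok := ctx.Value(ctxKey{}).(string); ok {
		return v
	}
	return ""
}

// middleware attaches an incoming or freshly generated trace ID to the context.
func middleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		id := r.Header.Get("X-Trace-ID")
		if id == "" {
			buf := make([]byte, 8)
			rand.Read(buf)
			id = hex.EncodeToString(buf) // a new trace starts here
		}
		ctx := context.WithValue(r.Context(), ctxKey{}, id)
		next.ServeHTTP(w, r.WithContext(ctx))
	})
}

func handler(w http.ResponseWriter, r *http.Request) {
	id := traceIDFrom(r.Context())
	log.Printf("trace=%s handling %s", id, r.URL.Path)

	// Forward the same trace ID to downstream services so their spans can be joined.
	req, _ := http.NewRequestWithContext(r.Context(), http.MethodGet, "http://inventory.internal/check", nil)
	req.Header.Set("X-Trace-ID", id)
	// http.DefaultClient.Do(req) would carry the trace onward.
	w.WriteHeader(http.StatusOK)
}

func main() {
	http.ListenAndServe(":8080", middleware(http.HandlerFunc(handler)))
}
```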

Toolchain

Identify as many risks as possible before failures occur, and harden and guard against them in a targeted way, rather than waiting for a failure to happen. The industry has relatively mature theories and tools for this, such as chaos engineering and full-link stress testing.

In summary, high-availability solutions are mainly considered from directions such as redundancy, governance, trade-offs, and operations. At the same time, a supporting on-call mechanism and fault-handling process are needed so that online problems can be followed up and dealt with promptly.

Elasticity

Elastic Architecture is a system design that automatically adapts to different loads and failure scenarios. The resilient architecture is highly scalable, available, and fault-tolerant, and is able to operate reliably in the face of traffic fluctuations, hardware failures, or other anomalies. The goal of a resilient architecture is to ensure that the system can automatically adjust resource allocation under different operating environments and business needs, so as to achieve a balance of optimized performance and cost.

To achieve the purpose of resilience, the following aspects need to be done:

  1. Scalability
  2. Fault tolerance
  3. Automation
  4. Resource management and control

Consistency issues

Having introduced CAP and BASE, we return to consistency, which is a very tricky problem in a distributed system. Ensuring that multiple nodes hold the same copy of data at a given moment is challenging, because node failures, network latency, and message loss all have to be dealt with.

Commonly used consistency solutions include:

  1. Two-Phase Commit (2PC): Two-phase commit is an atomic commit protocol used to ensure the consistency of distributed transactions. In two-phase commit, a Coordinator manages the Participants' commit process. The protocol has two phases: a prepare phase and a commit phase. In the prepare phase, the coordinator asks all participants whether they are ready to commit; in the commit phase, the coordinator decides whether to commit or roll back the transaction based on the participants' feedback. Two-phase commit ensures distributed consistency but can cause blocking and performance issues.
  2. Three-Phase Commit (3PC): Three-phase commit is an improved version of two-phase commit that reduces blocking by introducing a timeout mechanism and an additional phase. The protocol is divided into a preparation phase, a pre-commit phase, and a commit phase. Compared with two-phase commit, three-phase commit has better fault tolerance and performance, but it still has limitations, such as relying on a reliable network and clock synchronization.
  3. Paxos Algorithm: The Paxos algorithm is a distributed consensus algorithm based on message passing. It uses multiple rounds of voting to reach consensus among distributed nodes. The Paxos algorithm can tolerate a certain number of node failures, but in practice it may be affected by message latency and complexity.
  4. Raft algorithm: The Raft algorithm is an algorithm that provides strong consistency for distributed systems, with similar features to the Paxos algorithm, but easier to understand and implement. The Raft algorithm coordinates the consistency of distributed nodes by electing a leader. The leader is responsible for handling client requests, log replication, and state machine updates. When a leader fails, the other nodes automatically elect a new leader.
  5. Distributed locking: Distributed locking is a way of implementing mutually exclusive access to shared resources in a distributed system. By using distributed locks, you can ensure that only one node accesses the shared resource at a time, thus preserving consistency. Distributed locks can be implemented on top of databases, caches (such as Redis), or dedicated distributed coordination services (such as ZooKeeper). (A Redis-based sketch follows this list.)
  6. Eventual Consistency: Eventual Consistency is a method of relaxing consistency requirements, allowing distributed systems to have data inconsistencies for a short period of time, but eventually reach a consistent state. Eventual consistency can be achieved through technologies such as asynchronous replication, vector clocking, and Conflict-free Replicated Data Types (CRDT). Eventual consistency has better performance and scalability than strong consistency, but it can lead to data inconsistencies.
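
As a sketch of the distributed-locking item above, the snippet below uses go-redis to acquire a lock with SET NX plus a TTL and to release it with a small Lua script so the compare-and-delete is atomic. The key name, TTL, and Redis address are illustrative.

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/google/uuid"
	"github.com/redis/go-redis/v9"
)

const releaseScript = `
if redis.call("GET", KEYS[1]) == ARGV[1] then
  return redis.call("DEL", KEYS[1])
end
return 0`

func main() {
	ctx := context.Background()
	rdb := redis.NewClient(&redis.Options{Addr: "127.0.0.1:6379"})

	key := "lock:order:42"
	token := uuid.NewString() // unique owner token, so we never delete someone else's lock

	// Acquire: succeeds only if the key does not exist; the TTL guards against crashes.
	ok, err := rdb.SetNX(ctx, key, token, 10*time.Second).Result()
	if err != nil || !ok {
		fmt.Println("lock not acquired")
		return
	}

	// ... critical section: only one node executes this at a time ...

	// Release: compare-and-delete atomically on the Redis side.
	if err := rdb.Eval(ctx, releaseScript, []string{key}, token).Err(); err != nil {
		fmt.Println("release failed:", err)
	}
}
```

The per-owner token prevents one node from releasing a lock that another node acquired after the first node's TTL expired.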

Finally

High concurrency is indeed a complex, systemic problem, and space here is limited: topics such as distributed tracing, full-link stress testing, and flexible transactions are also technical points to consider. Moreover, different business scenarios call for different high-concurrency implementations, but the overall design ideas are broadly similar, and the approaches above can serve as a reference.


Link: https://juejin.cn/post/7355685027857694761
