
Common Design Scheme for High Concurrency Architecture (Collector's Edition)

Author: Prose thinks with the wind

Preface

For an application with hundreds of millions of users, high concurrency must be a core element of its architecture design.

In this article, we will introduce some common design options for high-concurrency architecture design.

Keywords: read/write splitting, data caching, cache updates, CQRS, data sharding, asynchronous writes

Key points for high-concurrency architecture design

High concurrency means that the system has to cope with a large number of requests. From the author's years of interview experience, when faced with the question "what is a high-concurrency architecture", many interviewees roughly answer that a system meets the requirements of high-concurrency architecture if it can cope with a large number of requests. When asked for specifics, the answers are often vague: how many requests per second count as high concurrency, how system performance should be measured, and so on.

To judge clearly whether a system's design satisfies high-concurrency requirements, we must first clarify the necessary conditions for forming a high-concurrency system, the metrics used to measure one, and the classification of high-concurrency scenarios, before formally presenting general high-concurrency architecture design schemes.

Necessary conditions for the formation of a high-concurrency system

  • High performance: Performance represents the parallel processing capability of a system. Under the same hardware conditions, higher performance means more hardware resources can be saved. Performance also directly affects user experience: if the system takes too long to respond, users will complain.
  • High availability: The system can provide services stably and normally for a long time, instead of frequent failures, downtime, and crashes.
  • Scalability: The system can be horizontally scaled to cope with increasing requests or even sudden surges.

We can compare the necessary conditions for a high-concurrency system to the attributes of a basketball player: "high performance" means the player performs strongly in games, "high availability" means the player can stay on the court and play consistently, and "scalability" means the player has good potential for future growth.

Metrics of a high-concurrency system

1. High performance indicators

A metric that easily comes to mind for indicating a system's performance is its average response time over a period of time. For example, if 10,000 requests are successfully responded to during a period, the system's average response time for that period is the average of those 10,000 response times.

However, the average has obvious flaws and is often mocked in statistical contexts. Suppose you and the legendary basketball superstar Yao Ming are placed in the same group: your height is 174 cm and Yao Ming's is 226 cm, so the group's average height is 2 m! That clearly describes neither of you well.

For example, if 9,900 of 10,000 requests have a response time of 1 ms and the other 100 requests have a response time of 100 ms, the average response time is only 1.99 ms, which completely masks the 100 ms response times of those 100 requests. The main disadvantage of the average is that it is easily skewed by extreme values: an unusually large value pulls it up, and an unusually small value pulls it down.

The author instead recommends the response time percentile, PCTn, as the measure of system performance: after sorting request response times from smallest to largest, PCTn is the response time at the nth percentile. Assuming the response times of 100 requests over a period of time are sorted from smallest to largest as shown in the figure below, the response time at the 99th percentile is 100 ms, i.e., PCT99 = 100 ms.
[Figure: 100 request response times sorted from smallest to largest]

The higher the percentile, the more sensitive the metric is to requests with long response times. For example, suppose the response times of 10,000 requests are measured:

  • PCT50 = 1 ms, which means that 50% of the 10,000 requests have a response time of less than 1 ms.
  • PCT99 = 800 ms, which means that 99% of the 10,000 requests have a response time of less than 800 ms.
  • PCT999 = 1.2s, which means that 99.9% of the 10,000 requests have a response time of less than 1.2s.

According to the author's empirical data, a high-concurrency system with an average response time within 200 ms and PCT99 within 1 s can basically meet the high-performance requirement. If a request's response time is below 200 ms, the user will not perceive any delay; if it exceeds 1 s, the user will clearly feel the delay.
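To make the PCTn definition concrete, here is a small Go sketch that computes a percentile from a list of response times using the nearest-rank method. The sample data and the exact rank convention are illustrative assumptions; real monitoring systems usually compute percentiles with streaming algorithms.

```go
package main

import (
	"fmt"
	"math"
	"sort"
	"time"
)

// percentile returns the response time at the given percentile (0 < p <= 100),
// using the nearest-rank method on the sorted samples.
func percentile(samples []time.Duration, p float64) time.Duration {
	sorted := append([]time.Duration(nil), samples...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	rank := int(math.Ceil(p/100.0*float64(len(sorted)))) - 1 // zero-based index
	if rank < 0 {
		rank = 0
	}
	return sorted[rank]
}

func main() {
	// 100 synthetic samples: 99 fast requests and 1 slow one.
	samples := make([]time.Duration, 0, 100)
	for i := 0; i < 99; i++ {
		samples = append(samples, time.Millisecond)
	}
	samples = append(samples, 100*time.Millisecond)

	fmt.Println("PCT50 =", percentile(samples, 50))   // 1ms
	fmt.Println("PCT99 =", percentile(samples, 99))   // 1ms: 99% of requests finished within 1ms
	fmt.Println("PCT100 =", percentile(samples, 100)) // 100ms: the single slow request
}
```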

2. High availability metrics

Availability = system uptime / total system running time. It indicates the proportion of time a system runs normally and can also be understood as the probability that the system is available to the outside world. We generally use a number of nines (N 9s) to describe how available a system is, as shown in the table below.

[Table: system availability levels expressed as N nines]

High availability requires the system to guarantee at least three or four nines of availability. In practice, many companies take the midpoint between three nines and four nines, 99.95% ("three nines five"), as the threshold for availability monitoring. When system availability drops below 99.95%, alerts are sent promptly so that maintainers can respond in time with measures such as availability remediation, capacity expansion, failure cause analysis, and system refactoring.
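As a rough illustration of these numbers (computed from 8,760 hours in a year): three nines (99.9%) corresponds to at most about 8.8 hours of unavailability per year, while four nines (99.99%) corresponds to about 53 minutes per year.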

3. Scalability metrics

When a sudden burst of traffic arrives, there is obviously no time to redesign the system architecture; the faster and more effective approach is to add nodes to the system cluster and horizontally expand its service capacity. Scalability = percentage increase in throughput / percentage increase in cluster nodes. In the ideal case, multiplying the number of cluster nodes multiplies the system's throughput by the same factor. Generally speaking, a system with 70%~80% scalability can basically meet scalability requirements.
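As a quick worked example of this formula: if the number of cluster nodes is increased by 100% and throughput rises by 75%, the scalability is 75% / 100% = 75%, which falls within the 70%~80% range mentioned above.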

Classification of high-concurrency scenarios

We use computers to implement various business functions, which are ultimately reflected in two kinds of operations on data: reads and writes. High-concurrency requests can therefore be classified into high-concurrency reads and high-concurrency writes.

For example, some business scenarios have many reads and few writes, so the focus is on solving high-concurrency reads; some scenarios mainly need to solve high-concurrency writes; and some need to handle high-concurrency reads and writes at the same time. The reason for dividing high-concurrency scenarios into read scenarios and write scenarios is that the two often call for different solutions.

Database read/write splitting

Most Internet applications have far more reads than writes: there are always more requests to browse posts than to publish them, and more requests to view products than to place orders. The high concurrent-request pressure on the database therefore comes mainly from read requests. We can split the database by request type into one that handles write requests (the write library) and one that handles read requests (the read library): all write requests go to the write library, the write library synchronizes the latest data to the read library after processing each write, and all read requests read from the read library. This is the idea behind database read/write splitting.

Database read/write splitting separates a large number of read requests from the database, reducing the pressure on database access and shortening the request response time.

Read/write splitting architecture

We usually use database master-slave replication to implement the read/write splitting architecture: the database master node serves as the "write library", the slave nodes serve as "read libraries", and one master can be connected to multiple slaves, as shown in the figure below.

[Figure: read/write splitting with one master and multiple slaves]

All major databases on the market have implemented master-slave replication technology.

Read/write request routing

In the database read/write splitting architecture, write requests must be routed to the master and read requests to the slaves. Generally, either of the following two approaches can be used.

1. Database-proxy-based approach

A database proxy node is added between the business service and the database servers, and all database operations from the service are forwarded through the proxy. After receiving a database operation request from the business service, the proxy classifies the SQL statement in the request: write operations (such as insert/delete/update statements) are forwarded to the database master, and read operations (such as select statements) are forwarded to any slave, completing the read/write splitting route. Open source projects such as MySQL-Proxy and MyCat (centralized proxies) and MySQL-Router (a local proxy) have implemented read/write splitting in this way.
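The heart of such a proxy is the routing decision itself. Below is a minimal, hypothetical Go sketch (not the implementation of any project named above) that classifies a SQL statement by its leading keyword and picks the master or a random slave accordingly; the addresses are placeholders.

```go
package main

import (
	"fmt"
	"math/rand"
	"strings"
)

// routeSQL decides which backend a statement should go to:
// select statements go to a random slave, everything else to the master.
func routeSQL(sql string, master string, slaves []string) string {
	stmt := strings.ToLower(strings.TrimSpace(sql))
	if strings.HasPrefix(stmt, "select") && len(slaves) > 0 {
		return slaves[rand.Intn(len(slaves))] // read request: any read library
	}
	return master // write request (insert/delete/update/DDL): write library
}

func main() {
	master := "10.0.0.1:3306"
	slaves := []string{"10.0.0.2:3306", "10.0.0.3:3306"}
	fmt.Println(routeSQL("SELECT * FROM posts WHERE id = 1", master, slaves))
	fmt.Println(routeSQL("INSERT INTO posts (title) VALUES ('hi')", master, slaves))
}
```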

2. In-application approach

The main difference from the database-proxy-based approach is that read/write splitting is performed inside the business service process. Database access frameworks such as gorm and shardingjdbc have implemented this form of read/write splitting.
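As an illustration of the in-application approach, here is a minimal sketch assuming gorm's dbresolver plugin; the DSNs are placeholders and the configuration is only a sketch of how such a setup typically looks.

```go
package main

import (
	"gorm.io/driver/mysql"
	"gorm.io/gorm"
	"gorm.io/plugin/dbresolver"
)

func main() {
	// The master (write library) is the default connection.
	db, err := gorm.Open(mysql.Open("user:pass@tcp(master:3306)/app"), &gorm.Config{})
	if err != nil {
		panic(err)
	}

	// Register the slaves (read libraries): reads go to a random replica,
	// writes keep going to the source.
	err = db.Use(dbresolver.Register(dbresolver.Config{
		Sources: []gorm.Dialector{mysql.Open("user:pass@tcp(master:3306)/app")},
		Replicas: []gorm.Dialector{
			mysql.Open("user:pass@tcp(slave1:3306)/app"),
			mysql.Open("user:pass@tcp(slave2:3306)/app"),
		},
		Policy: dbresolver.RandomPolicy{},
	}))
	if err != nil {
		panic(err)
	}

	// Queries are now routed to a replica; creates/updates/deletes go to the master.
	var count int64
	db.Table("posts").Count(&count)
}
```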

Master-slave latency and resolution

The read/write splitting architecture relies on database master-slave replication, and master-slave replication has a data replication delay (master-slave latency). During this delay the master and slaves are inconsistent, and the slaves cannot serve the latest data. There are three common solutions to the master-slave latency problem.

1. Synchronous data replication

By default, master-slave replication runs in asynchronous mode: the master returns success as soon as it has written the data, regardless of whether any slave has received it. We can configure replication to synchronous mode instead, in which the master does not return success until all slaves have received the data.

This ensures that, whenever a write succeeds, the latest data can be read from both the master and the slaves. The approach is simple: only the replication mode needs to be changed to synchronous, and the business services do not need to be modified.

However, since the master cannot return success until all slaves have received the data, the latency of write requests increases greatly and database throughput drops significantly. This solution has little practical value and is only suitable for scenarios with low request concurrency.

2. Forced read from master

Different business scenarios tolerate master-slave latency differently. For example, when user A has just posted a status, that status should appear when A browses his own homepage; this scenario tolerates little master-slave latency. When A's friend, user B, browses A's homepage, it is acceptable for B not to see A's latest status for a short while; this scenario can tolerate the latency.

We can classify business scenarios by their tolerance for master-slave latency: scenarios with high tolerance follow the normal read/write splitting logic, while in scenarios with low tolerance, read requests are forcibly routed to the database master, i.e., "forced read from master".

3. Session separation

For example, if a session performs a write to the database, that session's subsequent read requests are temporarily forced to the master for a short period of time, much like the "forced read from master" scheme; this ensures that each user's own writes are immediately visible to that user. The window during which reads are forced to the master can be set slightly longer than the time the database needs to complete replication, so that it covers the actual replication delay.
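A minimal Go sketch of the session separation idea (the type names and the 1-second window are assumptions for illustration): after a write, the session records a timestamp, and reads within the window are routed to the master.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// sessionRouter remembers, per session, when the last write happened and
// forces reads to the master for forceWindow after that write.
type sessionRouter struct {
	mu          sync.Mutex
	lastWrite   map[string]time.Time
	forceWindow time.Duration // should be slightly longer than the replication delay
}

func newSessionRouter(window time.Duration) *sessionRouter {
	return &sessionRouter{lastWrite: map[string]time.Time{}, forceWindow: window}
}

// OnWrite is called after the session performs a write on the master.
func (r *sessionRouter) OnWrite(sessionID string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.lastWrite[sessionID] = time.Now()
}

// PickForRead returns "master" if the session wrote recently, otherwise "slave".
func (r *sessionRouter) PickForRead(sessionID string) string {
	r.mu.Lock()
	defer r.mu.Unlock()
	if t, ok := r.lastWrite[sessionID]; ok && time.Since(t) < r.forceWindow {
		return "master"
	}
	return "slave"
}

func main() {
	router := newSessionRouter(1 * time.Second)
	router.OnWrite("session-42")
	fmt.Println(router.PickForRead("session-42")) // master: within the forced window
	time.Sleep(1100 * time.Millisecond)
	fmt.Println(router.PickForRead("session-42")) // slave: replication has caught up
}
```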

Local caching

In the computer world, caches are everywhere: CPU caches, DNS caches, browser caches, and so on. It is worth mentioning that in Taiwan "cache" is rendered with a term closer to "quick fetch", which more directly reflects its purpose: fast reads. The essence of caching is to trade space for time so that data can be read quickly.

Business services generally need to issue read requests to other services or to databases over the network. To improve read efficiency, a business service process can cache the data it obtains in its own local memory; when it later receives a request for the same data, it can read it directly from local memory, turning a network call into an efficient in-memory lookup. That is the main purpose of local caching. Local caching will be used extensively in the core service design chapters later in this book; this section focuses on the technical principles behind it.

Basic cache elimination strategy

Although caching trades space for time to speed up reads, memory is a scarce resource and the local cache cannot grow indefinitely; there is a trade-off between the space occupied and the time saved. The local cache therefore needs to evict some cached data automatically, and the eviction strategy should try to evict data that is no longer used, so as to keep the cache hit ratio high. The basic cache eviction strategies are as follows.

  • FIFO (First In First Out) policy: Evicts the data that entered the cache earliest. This is the simplest eviction strategy and can be implemented with a queue. However, its cache hit ratio is low: the more frequently a piece of data is accessed, the earlier it usually entered the cache and the earlier it gets evicted. This strategy is rarely used in practice.
  • LFU (Least Frequently Used) policy: Evicts the least frequently used data first. The LFU policy maintains an access count for each cached item; each access increments the count by 1, and the item with the lowest count is the eviction target. This strategy suits hot data that is accessed frequently within a short period, but newly cached data (whose count is still low) is easily evicted, while data that was accessed heavily in the past but not recently occupies the cache for a long time.
  • LRU (Least Recently Used) policy: Evicts the least recently used data first. This strategy is generally implemented with a combination of a doubly linked list and a hash table. The doubly linked list stores the cached data and always moves the most recently accessed item to the tail, so items are ordered in the list by last access time from oldest to newest, and the item evicted each time is the one at the head of the list. The hash table records each item's position in the doubly linked list for fast access. This strategy effectively improves the hit ratio for short-term hot data, but if cold data is accessed sporadically or in batches, hot data gets evicted and the cache hit ratio drops.

Both LRU and LFU have scenarios in which they cause a significant drop in the cache hit ratio. In recent years, more sophisticated and effective eviction strategies have emerged in the industry, such as W-TinyLFU.
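To make the LRU policy described above concrete, here is a minimal Go sketch built from container/list and a map, following the same orientation (most recently used at the tail, evict from the head); capacity handling is deliberately simplified.

```go
package main

import (
	"container/list"
	"fmt"
)

// entry is the payload stored in the doubly linked list.
type entry struct {
	key, value string
}

// lruCache keeps the most recently used entry at the tail of the list
// and evicts from the head.
type lruCache struct {
	capacity int
	ll       *list.List               // head = least recently used, tail = most recently used
	items    map[string]*list.Element // hash table: key -> position in the list
}

func newLRUCache(capacity int) *lruCache {
	return &lruCache{capacity: capacity, ll: list.New(), items: map[string]*list.Element{}}
}

func (c *lruCache) Get(key string) (string, bool) {
	if el, ok := c.items[key]; ok {
		c.ll.MoveToBack(el) // accessed: move to the tail (most recently used)
		return el.Value.(*entry).value, true
	}
	return "", false
}

func (c *lruCache) Put(key, value string) {
	if el, ok := c.items[key]; ok {
		el.Value.(*entry).value = value
		c.ll.MoveToBack(el)
		return
	}
	c.items[key] = c.ll.PushBack(&entry{key, value})
	if c.ll.Len() > c.capacity {
		oldest := c.ll.Front() // head of the list = least recently used
		c.ll.Remove(oldest)
		delete(c.items, oldest.Value.(*entry).key)
	}
}

func main() {
	cache := newLRUCache(2)
	cache.Put("a", "1")
	cache.Put("b", "2")
	cache.Get("a")      // "a" becomes the most recently used entry
	cache.Put("c", "3") // capacity exceeded: "b" is evicted
	_, ok := cache.Get("b")
	fmt.Println(ok) // false
}
```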

Distributed caching

Since the local cache stores data in the memory of the service process, it involves no network overhead and is therefore very fast. However, caching data in process memory also brings some limitations, listed below.

  • Unable to share: Local caches cannot be shared between multiple service processes.
  • Programming language limitations: The local cache is bound to the program, and the local cache component developed in Golang cannot be used directly by the server developed in Java.
  • Poor scalability: Because the service process carries the data, the service is stateful. Stateful services don't have good scalability.
  • Memory volatility: If the service process restarts, all cached data is lost.

We need a cache that can be shared by multiple processes, is independent of programming language, is scalable, and can persist data. Such a cache is a distributed cache.

Distributed cache selection

The mainstream distributed caching open source projects are Memcached and Redis, both of which are excellent caching products, and both have the ability to share cached data and be independent of programming languages. However, compared to Memcached, Redis is more popular, which is mainly reflected in the following.

  • Rich data types: Memcached only supports caching strings, while Redis supports strings, lists, sets, hashes, sorted sets, and other data types.
  • Data persistence: Redis supports data persistence through the RDB mechanism and AOF mechanism, while Memcached does not have data persistence capabilities.
  • High availability: Redis supports master-slave replication; after a server failure, a master-slave switchover keeps the caching service uninterrupted, giving Redis high availability.
  • Distributed capability: Memcached itself does not support distribution, so a distributed cache based on Memcached has to be implemented on the client side with a load-balancing algorithm such as consistent hashing. Redis provides the official decentralized solution Redis Cluster, and the industry also offers centralized solutions such as Codis and Twitter's Twemproxy.

Because Redis supports rich data types and data persistence, as well as high availability and scalability, it has become the first choice for distributed caching for most Internet applications.

How to use Redis caching

The logic of using Redis caching is as follows:

  1. Try to find the data in the Redis cache, and if the cache is hit, the data is returned.
  2. If the data is not found in the Redis cache, the data is read from the database.
  3. Save the data read from the database to the Redis cache and set an expiration time for this data.
  4. The next time you look for the same data in the Redis cache, the cache will be hit.
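A minimal Go sketch of this read path, assuming the go-redis client; the key name, the 10-second TTL, and the loadFromDB helper are hypothetical placeholders for illustration.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"

	"github.com/redis/go-redis/v9"
)

// loadFromDB is a hypothetical stand-in for the real database query.
func loadFromDB(ctx context.Context, key string) (string, error) {
	return "value-from-db", nil
}

func getWithCache(ctx context.Context, rdb *redis.Client, key string) (string, error) {
	// 1. Try the Redis cache first.
	val, err := rdb.Get(ctx, key).Result()
	if err == nil {
		return val, nil // cache hit
	}
	if !errors.Is(err, redis.Nil) {
		return "", err // a real Redis error, not just a miss
	}

	// 2. Cache miss: read the data from the database.
	val, err = loadFromDB(ctx, key)
	if err != nil {
		return "", err
	}

	// 3. Write the value back to the cache with an expiration time,
	//    so the next lookup for the same key hits the cache.
	if err := rdb.Set(ctx, key, val, 10*time.Second).Err(); err != nil {
		return "", err
	}
	return val, nil
}

func main() {
	rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})
	v, err := getWithCache(context.Background(), rdb, "post:42")
	if err != nil {
		panic(err)
	}
	fmt.Println(v)
}
```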

When saving data to the Redis cache, you need to set an appropriate expiration time for the data, which has the following two benefits.

  1. If you do not set an expiration time for cached data, the data keeps accumulating in Redis memory; in particular, cached data that is no longer accessed or has a very low hit rate occupies memory indefinitely and wastes resources. With an expiration time, Redis can automatically delete cached data that is no longer accessed, while frequently accessed data can have its expiration time renewed on each access, keeping the cache hit ratio high.
  2. When the database and the Redis cache become inconsistent because of some failure, the expiration time acts as a fallback. For example, if the expiration time of cached data is set to 10 seconds, then even if the database and the cache become inconsistent, the inconsistency lasts at most 10 seconds: the expiration time guarantees that the two data sources differ only within this window, so the data is eventually consistent.

The logic above contains a risky step: when the requested data is not in the Redis cache, the request goes to the database to read it, and if a large number of such requests reach the database at once, the database may be overwhelmed. There are only two possible reasons why a piece of data is absent from the Redis cache: the data has never been cached, or the cached copy has expired. Let us optimize for these two cases separately.

Cache penetration

When a user requests an invalid piece of data that does not even exist in the database, the Redis cache turns out to be useless:

  • Try to look for this data in the Redis cache, and if it hits, the data is returned.
  • If this data is not found in the Redis cache, the data is read from the database.
  • If this data is not found in the database either, empty data is eventually returned to the user.

As you can see, the Redis cache cannot stop such requests from reaching the database directly. If an attacker maliciously keeps sending requests for a piece of illegal data that does not exist, all of these requests penetrate the Redis cache and hit the database directly, eventually bringing the database down. This is known as "cache penetration".

To prevent cache penetration, when no data can be found in the database you can store a null value in the Redis cache to mark the data as empty. All subsequent requests for this data are then answered by the Redis cache, shielding the database from illegal requests.

However, if the attacker requests many different pieces of illegal data rather than a single one, this scheme stores a large number of useless null values in the Redis cache and may even evict valid data, greatly reducing the cache hit ratio and putting the database at risk again. In that case, a Bloom filter can be used to solve the cache penetration problem.

A Bloom filter consists of a binary vector of fixed length m and k hash functions. When a piece of data is added to the Bloom filter, the k hash functions compute k hash values for it, each taken modulo m, and the bits at the corresponding k positions of the binary vector are set to 1. To query whether a piece of data is in the Bloom filter, perform the same hash calculations and inspect those k positions of the vector:

  • If the value of any position is 0, the data to be queried must not exist.
  • If all k positions are 1, the queried data may exist. "May" because hash collisions are unavoidable, so false positives can occur in this case, but the false positive rate can be reduced by tuning the values of m and k.

Although the Bloom filter has a certain misjudgment of "data exists", it is accurate to determine that "data does not exist". The Bloom filter is a good way to prevent cache penetration: you add all the data in the database to the Bloom filter, and check whether the data is recorded in the Bloom filter when a user requests access to a piece of data but cannot find it in the Redis cache. If the Bloom filter deems the data to be non-existent, the user requests no more access to the database; If the Bloom filter deems the data to be possible, the user requests continued access to the database; If this data is not found in the database, a null value is set in the Redis cache. Although the Bloom filter has a certain misjudgment of "data existence", the false positive rate is low. Finally, there are very few null values set in the Redis cache, which will not affect the Redis cache hit ratio.

Cache avalanche

If a large portion of the data in the Redis cache expires at the same time, all requests are sent to the database. This situation is known as a "cache avalanche". The difference between a cache avalanche and cache penetration is that the former is caused by a large amount of cached data being absent at once, while the latter is caused by a single piece of cached data being absent.

There are generally two causes of a cache avalanche: a large amount of data sharing the same expiration time, or the Redis service going down. The fix for the first cause is simple: when setting the expiration time of cached data, add a random offset within a small preset range, so that most cached data does not expire at the same moment. The second cause comes down to Redis availability: choosing a highly available Redis cluster architecture greatly reduces the probability of a Redis outage.
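A small sketch of the randomized expiration idea; the 10-minute base TTL and 60-second jitter range are arbitrary illustrative values.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// ttlWithJitter returns the base expiration time plus a random offset,
// so that entries written together do not all expire at the same moment.
func ttlWithJitter(base, jitter time.Duration) time.Duration {
	return base + time.Duration(rand.Int63n(int64(jitter)))
}

func main() {
	for i := 0; i < 3; i++ {
		fmt.Println(ttlWithJitter(10*time.Minute, 60*time.Second))
	}
}
```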

Summary of high-concurrency read scenarios: CQRS

Whether it is database read/write splitting, local caching, or distributed caching, the essence is separating reads from writes, which is exactly the CQRS pattern often mentioned in microservice architectures. Command Query Responsibility Segregation (CQRS) is a pattern that separates data read operations from update operations: a query is a read operation, while a command is the general term for any operation that changes data, so create, delete, and update operations are all commands.

Brief architecture and implementation of CQRS

To avoid introducing concepts from microservice domain-driven design, the following diagram shows a simplified CQRS architecture.

[Figure: simplified CQRS architecture]

1) When the business service receives a command request (i.e., a write request) initiated by a client, it will hand over the request to the write data store for processing.

2) After the data change is completed in the write data store, the data change message is sent to the message queue.

3) The read data store is responsible for listening to the message queue, and when it receives the data change message, it writes the data to itself.

4) When the business service receives a query request (i.e., a read request) initiated by the client, it hands over the request to the read data store for processing.

5) The read data store returns the data that this request wants to access.
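A minimal Go sketch of this five-step flow, using a channel as a stand-in for the message queue and in-memory maps as stand-ins for the write and read data stores; all names here are illustrative assumptions.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// changeEvent is the data change message published by the write side.
type changeEvent struct {
	Key, Value string
}

type cqrs struct {
	writeStore map[string]string // stand-in for the write data store
	readStore  sync.Map          // stand-in for the read data store
	queue      chan changeEvent  // stand-in for the message queue
}

func newCQRS() *cqrs {
	c := &cqrs{writeStore: map[string]string{}, queue: make(chan changeEvent, 16)}
	// Step 3: the read side listens to the queue and applies data changes to itself.
	go func() {
		for ev := range c.queue {
			c.readStore.Store(ev.Key, ev.Value)
		}
	}()
	return c
}

// HandleCommand covers steps 1 and 2: write, then publish the change message.
func (c *cqrs) HandleCommand(key, value string) {
	c.writeStore[key] = value
	c.queue <- changeEvent{Key: key, Value: value}
}

// HandleQuery covers steps 4 and 5: serve reads from the read data store only.
func (c *cqrs) HandleQuery(key string) (string, bool) {
	v, ok := c.readStore.Load(key)
	if !ok {
		return "", false
	}
	return v.(string), true
}

func main() {
	c := newCQRS()
	c.HandleCommand("post:1", "hello")
	time.Sleep(10 * time.Millisecond) // allow the change message to propagate
	fmt.Println(c.HandleQuery("post:1"))
}
```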

"Write data store", "read data store", and "data transmission channel" are all general terms: the write and read data stores correspond to different concrete components in different high-concurrency scenarios, and the data transmission channel likewise takes different forms, such as a message queue or a scheduled task.

  • For database read/write splitting, the write data store is the master, the read data store is the slaves, and the data transmission channel takes the form of database master-slave replication.
  • In the distributed cache scenario, the write data store is the database, the read data store is the Redis cache, and the data transmission channel takes the form of message middleware that listens to the database's binlog change log.

Regardless of the scenario, the write data store should be a storage system suited to highly concurrent writes, the read data store should be one suited to highly concurrent reads, and the message queue serving as the data transmission channel should be robust enough to guarantee that no data is lost.
