
High availability - Isolation principle

Author: JD Cloud developer

Preface

Talking about high availability implies that low availability, and even unavailability, also exist. Whatever kind of availability we describe, there is an implicit consensus behind it: no system runs stably forever.

In fact, Turing examined a related question as early as 1936, known as the halting problem: can a program be written that decides, for any given program and input, whether that program will stop running in finite time? Turing settled the question with a concise but rigorous proof by contradiction. The details are not repeated here; the conclusion is that no such program exists. Unfortunately, the halting problem cannot be solved by computation.

So whether we rely on empirical consensus or on logical argument, we have to face reality: no system in the world stays stable all the time.

Given this reality, countless talented people and practical efforts have tried to reduce the probability of system unavailability as far as possible. So when we talk about high availability today, we are really exchanging experience gained through trial and error. Fortunately, there is a great deal of earlier work and practical experience for us to learn from.

1. High availability

The concept of high availability took shape gradually as computer systems developed and demand grew. With the rise of distributed systems and the spread of the Internet, requirements and techniques keep evolving, and high availability has become an indispensable part of modern software architecture.

There are many definitions of high availability, but they all revolve around resilience to uncertainty. In distributed systems, availability is defined as the proportion of time a system is in a normal state; if users cannot access the system, it is considered unavailable.
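The "proportion of time in a normal state" definition can be turned into a small calculation. The sketch below is illustrative only: the function names and the sample figures are assumptions, not taken from the article.

```python
def availability(uptime_seconds: float, downtime_seconds: float) -> float:
    """Proportion of the observation window in which the system was usable."""
    total = uptime_seconds + downtime_seconds
    if total <= 0:
        raise ValueError("observation window must be positive")
    return uptime_seconds / total


def allowed_downtime_minutes(target: float, window_days: float = 365.0) -> float:
    """Downtime budget, in minutes, implied by a target availability."""
    return (1.0 - target) * window_days * 24 * 60


# Assumed example: one hour of outage in a year, and a "four nines" target.
year_seconds = 365 * 24 * 3600
measured = availability(year_seconds - 3600, 3600)
budget = allowed_downtime_minutes(0.9999)
print(f"measured availability: {measured:.5f}")
print(f"yearly downtime budget at 99.99%: {budget:.2f} minutes")
```

Read this way, a target of 99.99% leaves a budget of under an hour of downtime per year, which is why the stages and techniques listed next all aim at shortening or avoiding outages.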

HA governance can be roughly divided into four stages: beforehand (before the fault occurs), onset (from the moment the fault occurs until the system or a person perceives it), during the event (from the occurrence of the fault until it has been handled), and afterwards (after the fault ends). Different techniques correspond to the different stages:

1. Beforehand: replicas, isolation, quotas, advance planning, and discovery

2. Onset: monitoring and alarms

3. During the event: degradation, rollback, emergency plans, and the failXXX family (such as failover and failfast)

4. Afterwards: review, reflection, and technical transformation

Compared with other high-availability principles, the isolation principle is general and easy to understand. It is a cornerstone of highly available systems, because its main role is protective: it improves the system's resistance to all kinds of "black swan events".

2. Isolation Principle

2.1. Definitions

As an abstract guiding principle, the isolation principle does not belong to any single research direction or field; it is a cross-disciplinary design concept. The word "isolation" already describes it well, as in bulkhead isolation.

To picture the role the isolation principle plays in high availability, think of a giant ship at sea and the compartments in its hull. In calm weather the ship sails steadily ahead. When it meets a serious accident, such as striking a reef or a collision, the separation and solidity of those compartments are what keep the ship afloat and moving.

Specifically, the isolation principle uses a range of design techniques to isolate system components, services, resources, and data from one another and to minimize their mutual impact. Risk is spread out, and the overall availability of the system improves in the face of uncertainty.

2.2. Implementation

Although the isolation principle is important and easy to understand, it is only an abstract principle or design approach, not a strict theoretical framework. In practice it is hard to state quantitatively that a system has been fully isolated.

As part of system design, microservices architecture, database systems, network security and isolation, service mesh, application isolation, environment isolation, and virtualization technology can all be regarded as concrete implementations of the isolation principle.

That description is rather abstract, so take the technologies we meet in daily development as examples. Service splitting, the Color gateway, and JSF belong to the microservices architecture domain; transaction isolation and database partitioning are concrete embodiments in database systems; JDOS and its orchestration and deployment are implementations of container technology; tenant isolation and vertical data centers are applications of environment isolation.

On closer analysis, products shaped by the isolation principle are everywhere in daily development. They can even give us the illusion that technology is naturally this way, and that adopting the latest technology is enough to guarantee high availability. Real-world experience tells us that keeping a system highly available is not that simple.

In fact, we can re-examine the day-to-day development process with the isolation principle in mind, and re-experience how the isolation principle is reflected in the existing technical architecture.

2.3. Experience and Principles

An abstract principle is usually distilled in engineering practice into a set of rules and recommendations. In real development and design, the isolation principle commonly takes the following forms. We can start with simple concurrent programming to get a concrete feel for it.

• Thread isolation

In daily development we always run into situations where the number of machines is limited. The usual answer is concurrent programming: several threads cooperate to finish a job. When the work becomes more complex, we generally customize a thread pool, build our own tooling, or bring in external open-source components.

The reason we use thread pools is to let multiple threads run independently, so that limited resources can do far more work. Thread isolation usually means thread pool isolation: threads for core business are kept in separate pools from threads for non-core business, so a problem with one kind of request does not affect the other pools. Examples include Netty's master-slave multi-threaded model and the thread model of Dubbo's ConnectionOrderedDispatcher.
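A minimal sketch of thread pool isolation, assuming two hypothetical workloads: a fast core task (`place_order`) and a slow non-core task (`send_coupon_push`). The pool sizes and sleep time are illustrative. The point is that a backlog in the non-core pool cannot consume the core pool's threads.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# One pool per business class: a flood of slow non-core work can only
# exhaust its own pool, never the threads reserved for core work.
core_pool = ThreadPoolExecutor(max_workers=4, thread_name_prefix="core")
non_core_pool = ThreadPoolExecutor(max_workers=2, thread_name_prefix="non-core")


def place_order(order_id: int) -> str:
    """Hypothetical core task: fast and business critical."""
    return f"order-{order_id}-ok"


def send_coupon_push(user_id: int) -> str:
    """Hypothetical non-core task: slow, for example a remote call."""
    time.sleep(0.2)
    return f"push-{user_id}-sent"


# Saturate the non-core pool: 6 slow tasks queue up behind 2 workers.
slow_futures = [non_core_pool.submit(send_coupon_push, uid) for uid in range(6)]

# The core request is still served immediately by its own pool.
start = time.monotonic()
core_result = core_pool.submit(place_order, 42).result(timeout=1.0)
core_latency = time.monotonic() - start

slow_results = [f.result() for f in slow_futures]
core_pool.shutdown()
non_core_pool.shutdown()
print(core_result, f"{core_latency:.3f}s", len(slow_results))
```

Had both workloads shared one two-thread pool, the core request would have waited behind the queued slow tasks. Separate pools trade some idle capacity for that protection.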

However, when the work or the requirements grow complex, adding more threads or thread pools is often not the wisest choice, because communication between threads is too tightly coupled to the shared environment. At that point we usually consider encapsulating groups of threads into processes and using multiple processes to carry out the complex work.

• Process isolation

Examples include splitting a project into sub-projects that are physically isolated from one another, and using namespaces, resource control, and other process isolation techniques. Front-end and back-end separation and container isolation both fall into this category.

But however elegant our program design is, one fatal problem cannot be avoided: the machine is the physical carrier of the program, and the program depends heavily on hardware resources. So people often upgrade the software architecture, for example by adopting microservices, to further decouple programs from hardware resources.

• Cluster isolation

Cluster deployment and rollout are now an indispensable part of daily development. Applications are deployed to multiple containers, and clusters are used to isolate different services so that they do not affect each other. This can be seen as a further step beyond process isolation. (The picture comes from the Internet)

Example diagram of cluster isolation

But this approach is still limited by the physical environment in which the machines are deployed. After all, on Earth, a single flap of a seagull's wings can change the weather forever.

• Data center isolation

Going one step further, we usually deploy machines in different data centers, such as the Langfang and Huitian data centers that we commonly use.

From program to machine, and from machine to physical space: although technology changes rapidly, we still rely on this simple method to resist risk. We could keep adding further layers of isolation, but the physical level can rest here for now, and we can shift the perspective to the dimension of data and traffic.

• Read and write isolation

Most Internet projects read far more than they write, and the conventional approach is read/write splitting. It keeps read logic from interfering with write logic, and it also expands read capacity, improving both performance and availability. The core idea of this architecture is the isolation of data and operations; only the technical means differ. Database and table sharding, hot/cold isolation, and similar techniques are all specific implementations of the same idea.
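As a rough sketch of read/write splitting, the router below sends writes to a primary and spreads reads across replicas in round-robin order. The class name, the node names, and the "statement starts with SELECT" rule are all simplifying assumptions; production middleware typically also deals with transactions and replication lag.

```python
import itertools


class ReadWriteRouter:
    """Route read statements to replicas and everything else to the primary."""

    def __init__(self, primary, replicas):
        self.primary = primary
        self.replicas = list(replicas)
        # Round-robin cursor over replicas; None when there are no replicas.
        self._cursor = itertools.cycle(self.replicas) if self.replicas else None

    def route(self, sql: str) -> str:
        is_read = sql.lstrip().lower().startswith("select")
        if is_read and self._cursor is not None:
            return next(self._cursor)
        return self.primary


# Hypothetical node names, for illustration only.
router = ReadWriteRouter("primary-db", ["replica-1", "replica-2"])
statements = [
    "SELECT * FROM orders WHERE id = 1",
    "UPDATE orders SET status = 'paid' WHERE id = 1",
    "select count(*) from orders",
    "SELECT * FROM orders WHERE id = 2",
]
targets = [router.route(sql) for sql in statements]
print(targets)
```

One design consequence worth noting: because replicas may lag behind the primary, a read that must observe a write made just before it is usually routed to the primary as well.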

• Hotspot isolation

Looking at business or traffic entry points in the same way, we can still find patterns, such as isolating hot businesses into their own systems or services. Flash sales and rush purchases are typical examples.

2.4. Business practices

The discussion above explored existing development technologies and experience. The isolation principle is indeed ubiquitous, but being ubiquitous in technology does not mean it is already in place in every system. Our own constantly iterated business systems are an example.

1. Vertical data center transformation

Of all the availability incidents I have seen, the one that impressed me most was a serious online incident I experienced when I first joined the company in 2022. In short, the data center hosting a certain application went down. It was a hardware problem in the data center, so the R&D engineers could do nothing about it, and the entire transaction service link failed completely. Similar problems have also turned up occasionally in later network disconnection drills and chaos engineering exercises. The impact of this kind of "black swan event" on a system is fatal, so after a long period of governance, I believe the systems owned by JD's R&D engineers have by now completed the vertical data center transformation.

Since then, no similar problem has occurred, which is probably the simplest and most direct embodiment of the isolation principle.

Background

Cross-data center disaster recovery means that services are not affected after a data center is completely disconnected, reducing the impact of data center failures and improving cross-data center disaster recovery capabilities.

Technical solutions

1) JSF vertical call transformation: the server provides a different alias for each data center, and the caller is configured with the alias of its own data center;

2) NP mounts multiple VIPs across data centers: the VIPs span data centers, and each VIP is in the same data center as its load balancer and service container IPs;

3) JIMDB vertical read transformation: Langfang application containers read Langfang JIMDB nodes and Huitian application containers read Huitian JIMDB nodes; random read mode must not be used;

4) ES dual data center active/standby clusters;

5) Peer-to-peer deployment across multiple data centers;

6) JSF services registered on Color through direct connection are changed to registered HTTP services. Where JSF is connected directly to the merging terminal, different service aliases can be configured in the merging terminal according to the vertical call principle for data centers.
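The "vertical call" idea in items 1 and 3 can be illustrated with a toy node selector. This is not the JSF or JIMDB API: the node inventory and function names below are hypothetical, and the sketch only contrasts random selection with same-data-center selection.

```python
import random

# Hypothetical inventory: data center name -> read nodes located there.
NODES = {
    "langfang": ["lf-node-1", "lf-node-2"],
    "huitian": ["ht-node-1", "ht-node-2"],
}


def pick_random_node(nodes=NODES):
    """Random read mode: may cross data centers, which item 3 rules out."""
    all_nodes = [node for group in nodes.values() for node in group]
    return random.choice(all_nodes)


def pick_vertical_node(caller_dc, nodes=NODES):
    """Vertical read mode: only nodes in the caller's own data center qualify."""
    local = nodes.get(caller_dc)
    if not local:
        raise LookupError(f"no nodes registered for data center {caller_dc!r}")
    return random.choice(local)


lf_choices = {pick_vertical_node("langfang") for _ in range(50)}
ht_choices = {pick_vertical_node("huitian") for _ in range(50)}
print(sorted(lf_choices), sorted(ht_choices))
```

With vertical selection, losing one data center affects only the callers deployed there, whereas random selection would spread the failure across callers in every data center.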

2. ES dual-cluster transformation

Looking at our applications in the same way, we also found some risks. One example is the ES dual cluster for accumulated orders. This ES stores the order information of the transaction link and provides the basic service through which external systems query order lists, so it has always been a top priority. It matters so much that we built a dedicated application on top of ES to handle ES queries and writes.

In the past, the order ES was already a dual cluster. Each cluster had one master and two slaves, traffic was split evenly between them in daily operation, and there was a complete degradation and traffic-switching plan. The fatal problem was that both clusters sat in the same data center. As with the risk before the vertical data center transformation, an incident had never actually happened, but the existence of such a risk was unacceptable.

So around last year's Double 11 we started the ES dual-cluster transformation. The two clusters now run stably in two independent data centers and provide data services.

Background

The order system is an important level 0 system in the golden link, yet its core process ran in a single data center on a single cluster, with weak disaster tolerance and poor recovery ability.

Technical solutions

The upper part of the solution is the traffic-switching logic we added to stay compatible with old data, and the lower part is the result of the vertical data center transformation.

Previously, the Massacre Orders ES application received asynchronous event messages from the Massacre Orders middleware, wrote them to the different clusters through separate message channels, and used the version number in the message body to guarantee eventual consistency of the data.

We kept this write logic and applied for two new sets of ES cluster services, one in Langfang and one in Huitian.
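The version-number rule described above can be sketched with an in-memory stand-in for a cluster. Everything here is an assumption made for illustration (the class, the event fields, the sample data); it shows only why "apply an event if and only if its version is newer" lets two clusters converge even when messages arrive in different orders.

```python
class InMemoryCluster:
    """Toy stand-in for one ES cluster: order_id -> latest event seen."""

    def __init__(self, name):
        self.name = name
        self.docs = {}

    def apply(self, event):
        """Apply the event only if its version is newer than the stored one."""
        current = self.docs.get(event["order_id"])
        if current is not None and current["version"] >= event["version"]:
            return False  # stale or duplicate message, safely ignored
        self.docs[event["order_id"]] = event
        return True


langfang = InMemoryCluster("langfang")
huitian = InMemoryCluster("huitian")

events = [
    {"order_id": "A1", "version": 1, "status": "created"},
    {"order_id": "A1", "version": 2, "status": "paid"},
    {"order_id": "A1", "version": 3, "status": "shipped"},
]

# Each cluster consumes its own channel, so arrival order can differ.
for event in events:
    langfang.apply(event)
for event in reversed(events):
    huitian.apply(event)

print(langfang.docs["A1"]["status"], huitian.docs["A1"]["status"])
```

Because stale versions are dropped rather than applied, replaying or reordering messages is harmless, which is what makes independent writes to two clusters acceptable.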

For the read logic, we switched to vertical calls by data center and cluster and kept the previous percentage-based traffic configuration. We also added vertical calls with mutual backup, so that network jitter or an incident in one data center does not reduce the availability of the vertical link.

If you are interested in detailed design ideas and implementation details, you can check out this governance article written by our group: ES High Availability - Dual Cluster Transformation

Use of disaster recovery read policies

| Policy | Selection scenario | Remark |
| --- | --- | --- |
| Percentage mode | The performance of one data center is limited | Reduce the percentage sent to the limited data center |
| Percentage mode | One data center is disconnected | Switch all traffic to the healthy data center |
| Vertical call with mutual backup | Everyday scenarios | Reduces the impact of timeouts caused by latency spikes and jitter |
| Vertical call | High-performance scenarios | Calls the nearest cluster for a fast response |
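The three read policies can be sketched as a single dispatch function. This is a simplified illustration, not the authors' implementation: the function signature, the mode names, and the stub clusters are all assumptions.

```python
import random


def read_order(order_id, caller_dc, clusters, mode, percent=None, rng=random.random):
    """Return (data_center, document) according to the chosen read policy.

    clusters maps a data center name to a callable that returns a document
    or raises an exception (for example on timeout).
    percent maps a data center name to its traffic share (percentage mode only).
    """
    if mode == "percentage":
        # Weighted pick; a share of 0 moves all traffic away from that cluster.
        point, cumulative = rng() * sum(percent.values()), 0.0
        for dc, share in percent.items():
            cumulative += share
            if point < cumulative:
                return dc, clusters[dc](order_id)
        raise RuntimeError("percent must contain a positive share")
    if mode == "vertical":
        return caller_dc, clusters[caller_dc](order_id)
    if mode == "vertical_with_backup":
        # Try the local cluster first, then fall back to the others.
        candidates = [caller_dc] + [dc for dc in clusters if dc != caller_dc]
        last_error = None
        for dc in candidates:
            try:
                return dc, clusters[dc](order_id)
            except Exception as exc:  # timeout, connection error, and so on
                last_error = exc
        raise last_error
    raise ValueError(f"unknown mode: {mode}")


def healthy(order_id):
    return {"order_id": order_id, "status": "paid"}


def broken(order_id):
    raise TimeoutError("langfang cluster timed out")


# Assumed scenario: the caller sits in Langfang, whose cluster is failing.
clusters = {"langfang": broken, "huitian": healthy}
served_by, doc = read_order("A1", "langfang", clusters, "vertical_with_backup")
switched_to, _ = read_order(
    "A1", "langfang", clusters, "percentage",
    percent={"langfang": 0, "huitian": 100},
)
print(served_by, switched_to, doc["status"])
```

In this sketch the backup mode hides a single failed call from the caller, while percentage mode is the manual lever for draining a data center that is known to be limited or disconnected.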

3. Traffic Isolation - Grouping

The two cases above are both forms of spatial isolation. Under strict scrutiny we can always find some problems in our systems, and in the same spirit we can switch the perspective to application traffic.

As an important level 0 system in the omni-channel golden link, accumulated orders supports online business formats and at the same time carries a large amount of offline business traffic.

But all of this traffic was served by the same deployed code. On the one hand, as described above, it is hard to guarantee that the code stays reliable through the constant iteration of daily requirements: once a change for an online or offline requirement goes wrong, the whole link becomes unavailable. More importantly, online and offline formats differ greatly in traffic peaks, timeliness requirements, and business models.

Therefore, in this scenario, the isolation of traffic is very important.

Background

To prevent online traffic from affecting real-time purchases in offline stores, traffic has to be distinguished, and dedicated container services are provided for offline scenarios. Besides accumulated orders, online/offline group isolation is also in place for quick rewind, transaction settlement, and other systems.
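A minimal sketch of group-based traffic isolation: requests carry a channel tag, and each channel is served by its own container group. The group names, container names, and the `channel` field are assumptions made for illustration.

```python
import itertools

# Hypothetical container groups; each group is deployed and scaled separately.
GROUPS = {
    "online": ["online-c1", "online-c2", "online-c3"],
    "offline": ["offline-c1", "offline-c2"],
}
_cursors = {name: itertools.cycle(members) for name, members in GROUPS.items()}


def pick_container(request):
    """Route by the request's channel tag; untagged traffic is treated as online."""
    group = "offline" if request.get("channel") == "offline_store" else "online"
    return group, next(_cursors[group])


requests = [
    {"order_id": "A1", "channel": "app"},
    {"order_id": "B1", "channel": "offline_store"},
    {"order_id": "A2"},
    {"order_id": "B2", "channel": "offline_store"},
]
routed = [pick_container(r) for r in requests]
print(routed)
```

Because the two groups share no containers, an online traffic peak or a faulty online release cannot take capacity away from offline store requests, and each group can be sized for its own peak.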

Technical solutions

4. Data isolation - hot and cold data archiving

Finally, I would like to end with an elephant in the room. When we examine high availability from the perspectives of data center, application, and traffic, we often wonder why the system was not built this way from the start. In fact, technology keeps advancing and the architecture of our systems keeps evolving in practice. Some problems only surface as the business grows richer; a software system grows along with the business it serves.

The application in this data isolation case is order UMS. Its function is very simple: it stores the tracking information for the entire life cycle of an order (its full name is whole-process order tracking information), the only writers are a few systems inside the order system, and it offers a simple query service to external callers. For a long time it needed no changes; the service ran continuously and stably, with no online problems or alarms.

But around last year's Double 11 we finally noticed this elephant in the room. We noticed it because it already held over 100 million order tracking records, so with some trepidation we had to complete the archiving transformation.

Background

1. 1,000,000,000+ records, a very large data volume

2. Frequent daily calls, QPS-50/ms

Therefore an archiving scheme with hot/cold isolation was designed: the hot storage table retains only order information from the last 90 days.
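The 90-day retention rule can be sketched with an in-memory SQLite database. The table names, columns, fixed "now" date, and sample rows are assumptions for illustration; the real system's storage and schema are not described in the article.

```python
import sqlite3
from datetime import datetime, timedelta

RETENTION_DAYS = 90  # from the text: the hot table keeps the last 90 days

conn = sqlite3.connect(":memory:")
for table in ("order_track_hot", "order_track_cold"):  # hypothetical names
    conn.execute(f"CREATE TABLE {table} (order_id TEXT, event TEXT, created_at TEXT)")


def archive(conn, now):
    """Move rows older than the retention window from the hot to the cold table."""
    cutoff = (now - timedelta(days=RETENTION_DAYS)).isoformat()
    with conn:  # commit both statements together, or roll both back
        conn.execute(
            "INSERT INTO order_track_cold "
            "SELECT * FROM order_track_hot WHERE created_at < ?",
            (cutoff,),
        )
        deleted = conn.execute(
            "DELETE FROM order_track_hot WHERE created_at < ?", (cutoff,)
        ).rowcount
    return deleted


now = datetime(2024, 1, 1)  # fixed date so the example is reproducible
rows = [
    ("A1", "created", (now - timedelta(days=10)).isoformat()),
    ("A2", "shipped", (now - timedelta(days=100)).isoformat()),
    ("A3", "signed", (now - timedelta(days=200)).isoformat()),
]
conn.executemany("INSERT INTO order_track_hot VALUES (?, ?, ?)", rows)
conn.commit()

moved = archive(conn, now)
hot_count = conn.execute("SELECT COUNT(*) FROM order_track_hot").fetchone()[0]
cold_count = conn.execute("SELECT COUNT(*) FROM order_track_cold").fetchone()[0]
print(moved, hot_count, cold_count)
```

Keeping the copy and the delete in one transaction avoids losing or duplicating rows if the job fails midway. At the data volumes mentioned above, such a job would normally run in small batches rather than in a single statement.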

Technical solutions

3. Conclusion

From the discussion of day-to-day development and business practice above, we can see that although the isolation principle is ubiquitous, it is not necessarily in place in every system, and a system that runs stably today will not necessarily do so forever. I think this is also where the value of R&D engineers lies: continuously paying attention to the system, continuously reviewing it, and continuously optimizing it.
