How to ensure the high availability of the system through the principle of weak dependency in the microservice architecture

Author: JD Cloud developer

Preface

When I first encountered the concept of high availability, the boundary between the "less dependency principle" and the "weak dependency principle" was vague to me, and I often confused the two. Both principles aim to reduce dependencies between modules, but they differ in important ways.

So, what is the essential difference between the "less dependency principle" and the "weak dependency principle"?

Both the principle of less dependency and the principle of weak dependency aim to improve the reliability and stability of the system, but the essential difference between them lies in the management and control of dependencies.

1. Less Dependency Principle: This principle applies at the system design stage and emphasizes module independence. Its purpose is to reduce the risk of fault propagation at the source by lowering the coupling between modules, so that each module can complete its specific function independently and unnecessary dependencies are avoided.

2. Weak Dependency Principle: This principle focuses on how to manage and control the dependencies between modules while the system is running, so that when one module fails, the other modules can still operate normally. With weak dependencies, the system gains better fault tolerance and higher availability.

Of course, these two principles are not absolute and are related to each other. We can adjust the dependencies between modules flexibly according to the complexity of the system and the actual needs. In practice, the two principles also work together to improve the availability of the system.

1. Architecture strategy based on the principle of weak dependency

Weak dependency principle: where a dependency is unavoidable, make it as weak as possible; the weaker, the better. If A strongly depends on B, then once B has a problem, A has a problem too, and both suffer. Therefore, any strong dependency should be turned into a weak dependency wherever possible, which directly reduces the probability of problems.

1. Microservice architecture

Module splitting: A microservices architecture splits a complex application into multiple independent, composable service modules. Each service has clear functional boundaries and responsibilities, and the services are loosely coupled. In this way, when one service fails, the other services can continue to operate independently, preserving the overall availability of the system.

Independent deployment: Each module has its own code repository and can be deployed and upgraded independently, without the cooperation of other modules. When a service fails or needs to be upgraded, the normal operation of other services is not affected. Especially for systems on the core transaction path (the "golden path"), try to provide independent data resources (DB, Redis) and deploy them in vertically isolated groups where possible.

Note that deployment isolation requires sufficient deployment resources and cooperation from upstream and downstream teams, so plan this work in advance. Module isolation also only makes sense on top of low coupling. If the coupling between components is tangled, isolating modules will only make the confusion worse.

2. Asynchronous communication

Asynchronous communication can be regarded as a further decoupling on top of module isolation. It weakens the strong dependencies that remain between physically separated modules, so that faults are less able to propagate, which improves system availability. Architecturally, asynchrony is mainly implemented with message queues: when one module fails, the other module can continue to process its tasks without much impact.

Take the functional architecture upgrade project I am working on as an example, specifically the scenario of creating an employee account. After a new employee submits an account creation request on the PC page, the employee information must be persisted, an SMS and an email confirming the creation must be sent to the employee, and the employee information must be synchronized to the human resources system.

If these microservices are invoked synchronously, a failure in any of the follow-up operations causes the whole request to fail, and the employee account cannot be created. With an asynchronous, message-queue-based architecture, the service sends one MQ message when the employee is created and immediately responds with "created successfully". The follow-up operations are completed by consuming the message. Even if one of them fails, it can be compensated later without affecting the employee creation flow.

Figure 1.1 Create an employee business process diagram
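
As a rough illustration of the flow in Figure 1.1, the following is a minimal in-memory sketch, not the project's actual code. The class `EmployeeAccountDemo`, the message type `EmployeeCreated`, and the handler steps are hypothetical names, and a local queue stands in for the real message queue.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.function.Consumer;

// In-memory sketch; a real system would publish to a message broker instead of a local queue.
class EmployeeAccountDemo {
    // Hypothetical message published once the employee record is persisted.
    static final class EmployeeCreated {
        final String employeeId;
        final String email;
        EmployeeCreated(String employeeId, String email) {
            this.employeeId = employeeId;
            this.email = email;
        }
    }

    static final Queue<EmployeeCreated> queue = new ArrayDeque<>();
    static final List<String> savedEmployees = new ArrayList<>();
    static final List<String> log = new ArrayList<>();

    // Strong part of the flow: persist, publish one message, answer the user at once.
    static String createEmployee(String id, String email) {
        savedEmployees.add(id);
        queue.add(new EmployeeCreated(id, email));
        return "created successfully";
    }

    // Consumer side: a failing follow-up step is recorded for later compensation
    // and does not stop the remaining steps or undo the account creation.
    static void consumeAll(List<Consumer<EmployeeCreated>> handlers) {
        EmployeeCreated msg;
        while ((msg = queue.poll()) != null) {
            for (Consumer<EmployeeCreated> handler : handlers) {
                try {
                    handler.accept(msg);
                } catch (RuntimeException e) {
                    log.add("needs compensation: " + e.getMessage());
                }
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(createEmployee("E1001", "new.hire@example.com"));
        List<Consumer<EmployeeCreated>> handlers = List.of(
            m -> log.add("notification sent to " + m.email),
            m -> { throw new IllegalStateException("HR sync failed for " + m.employeeId); });
        consumeAll(handlers);
        log.forEach(System.out::println);
    }
}
```

In this sketch the HR synchronization step fails, yet the employee record stays saved and the notification step still runs, which is the weak dependency the section describes.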

3. Interface abstraction

Over-coupling is the root of all evil in software design and the main culprit behind system availability problems. In a highly coupled system, touching one part affects the whole: any small change can lead to unexpected bugs and system crashes. When even basic functional maintenance is difficult, high availability is out of reach.

We can achieve loose coupling between modules by defining an abstract strategy interface, usually abstracted from multiple classes that share common behavior, and leaving the selection of the concrete implementation class to a factory class. In this way, a change in one module does not affect the normal operation of other modules. Interface abstraction is discussed in detail, with a concrete case, in "Actual scenario analysis" below.

4. Failover and fault tolerance

Set up a complete fault handling mechanism that covers fault detection, failover, and fault recovery. When a fault is detected, the system can quickly switch to a standby component or restore the service, preserving availability.

Data sharding: When data is stored, it is distributed across multiple storage nodes. When a storage node fails, data can be recovered from other nodes, improving data availability and fault tolerance.

Read/write splitting: In scenarios where weak consistency is acceptable, read operations are routed to the slave (replica) database and write operations to the primary database, which improves system performance and stability. Master-slave switchover is supported.

Downgrading: The fewer services a system strongly depends on, the higher its baseline stability. For systems that depend on other services mainly for their data rather than their logic, we can adopt a design strategy that removes the dependency. Specifically, the system persists a heterogeneous copy of the dependent service's data in its own database and keeps that copy updated asynchronously, which reduces its dependence on other systems and further improves stability. This approach has a drawback: the redundant copy can be inconsistent with the source data within a certain time window.

5. Loosely coupled business logic

Decouple pieces of business logic and make them independent of each other. For example, in the offline store warehouse system, the three store formats (specialty stores, large supermarkets, and super stores) currently share a common codebase. The design keeps the formats loosely coupled, and the implementations of the different business extension points are isolated from each other. When one piece of business logic fails, the others can continue to run, which limits the scope of the failure.

2. Actual scenario analysis

1. Case 1: Weak dependency on middleware

i. Weak dependency on message queues

In the daily development of distributed transactions, a single transaction may involve a series of operations such as RPC calls, DB writes, and sending external messages. A request at any step may fail, so a retry-on-failure strategy is needed to guarantee eventual consistency. For a simple application flow, it is enough to abort and roll back when an exception occurs. For a complex business flow this is not feasible: when a request fails, the upstream application may already have executed its part. This is especially true when several asynchronous flows are combined into one overall process, because the earlier flows may already have completed and cannot be rolled back.

For example, consider cancelling a document because of a stock-out during store warehouse production. If all the products are out of stock, or the user selected the "cancel order when out of stock" fulfilment strategy, the downstream interface is called to cancel the order. The order cancellation API may fail or throw an exception after the preceding operations (such as the RPC call and the DB write) have already completed. When the call fails, a UMP alarm is triggered and manual intervention is required.

Figure 2.1 Write-back OPC process for production out of stock

To avoid these problems and guarantee eventual consistency of the data, the first optimization used MQ in a self-produce, self-consume pattern for retries: the application sends a message to itself and retries the failed call when it consumes that message.

Figure 2.2 Using JMQ to handle failures of the cancellation callback API

With this approach, however, the stability of the business system is strongly tied to the stability of the JMQ middleware, which places higher demands on JMQ. To reduce the strong dependency on JMQ, keep the business running smoothly, improve the user experience, and reduce the pressure on R&D staff, we eventually built a task retry tool.

The core idea is to split the distributed transaction into local transactions. Concretely, tasks are stored in the database, with the business table and the task table kept in the same database, so that a database transaction guarantees strong consistency between the business operation and the persistence of the task. This decouples business operations from the middleware to a certain extent.

A callback mechanism decouples the caller from the underlying driver, which makes the component more flexible and less intrusive to business code.

Figure 2.3 Task retry component workflow
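
The idea behind the task retry component can be sketched roughly as follows. This is an illustrative in-memory model, not the component's real implementation: `TaskRetryDemo`, its methods, and the `CANCEL_ORDER` task type are hypothetical names, and two lists stand in for the business table and the task table, which in a real system would be written inside one database transaction.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// In-memory sketch: the two lists stand in for tables in the SAME database.
class TaskRetryDemo {
    enum Status { PENDING, DONE, FAILED }

    static final class Task {
        final String type;
        final String payload;
        Status status = Status.PENDING;
        int attempts = 0;
        Task(String type, String payload) { this.type = type; this.payload = payload; }
    }

    final List<String> businessTable = new ArrayList<>();
    final List<Task> taskTable = new ArrayList<>();
    final Map<String, Predicate<String>> callbacks = new HashMap<>();
    final int maxAttempts;

    TaskRetryDemo(int maxAttempts) { this.maxAttempts = maxAttempts; }

    // Business code registers one callback per task type; it returns true on success.
    void register(String type, Predicate<String> callback) { callbacks.put(type, callback); }

    // Stand-in for one local transaction: validate first, then write both rows together.
    // With a real database, both inserts would run inside a single transaction.
    void saveBusinessAndTask(String businessRow, String taskType, String payload) {
        if (!callbacks.containsKey(taskType)) {
            throw new IllegalArgumentException("no callback registered for " + taskType);
        }
        businessTable.add(businessRow);
        taskTable.add(new Task(taskType, payload));
    }

    // One polling pass: run each pending task's callback and record the outcome.
    void runPendingOnce() {
        for (Task task : taskTable) {
            if (task.status != Status.PENDING) continue;
            task.attempts++;
            boolean ok;
            try {
                ok = callbacks.get(task.type).test(task.payload);
            } catch (RuntimeException e) {
                ok = false; // an exception counts as a failed attempt
            }
            if (ok) task.status = Status.DONE;
            else if (task.attempts >= maxAttempts) task.status = Status.FAILED; // left for manual handling
        }
    }

    public static void main(String[] args) {
        TaskRetryDemo demo = new TaskRetryDemo(5);
        int[] calls = {0};
        demo.register("CANCEL_ORDER", orderId -> ++calls[0] >= 3); // fails twice, then succeeds
        demo.saveBusinessAndTask("order-42 marked out of stock", "CANCEL_ORDER", "order-42");
        for (int pass = 0; pass < 3; pass++) demo.runPendingOnce();
        Task task = demo.taskTable.get(0);
        System.out.println(task.status + " after " + task.attempts + " attempts"); // DONE after 3 attempts
    }
}
```

Because the task row is persisted together with the business row, a failed downstream call is retried from the database rather than depending on the message queue being available at that moment.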

ii. Weak dependency on databases

The second middleware scenario is weak dependency on the database. Database problems are not uncommon in daily development: network link problems, performance degradation caused by slow SQL statements, and outright failures. These problems often leave the core transaction path unable to work normally for a short period, which causes significant losses to the business. To cope with these situations, we considered introducing a disaster recovery mechanism that maintains a high transaction success rate and keeps order fulfilment timely under abnormal conditions.

The core idea of this solution is to store data temporarily in another storage medium (such as Redis) while DB operations are failing. The DB operations are then replayed through asynchronous MQ compensation to guarantee eventual consistency of the data.

Figure 2.4 Data disaster recovery solution

This solution significantly improves our ability to handle database operation exceptions and keeps the core transaction path running smoothly. With this disaster recovery strategy, we guarantee eventual data consistency and effectively reduce the impact of failures on the business.
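
A minimal sketch of this fallback-and-compensate idea is shown below, under stated assumptions: plain maps stand in for the database and for Redis, a local queue stands in for MQ, and the class `DbFallbackDemo` and its method names are hypothetical. Serving reads from the temporary store during the outage is also an assumption of this sketch, not something the original solution states.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// In-memory sketch: maps stand in for the DB and Redis, a queue stands in for MQ.
class DbFallbackDemo {
    final Map<String, String> db = new HashMap<>();
    final Map<String, String> tempStore = new HashMap<>();       // stand-in for Redis
    final Queue<String> compensationQueue = new ArrayDeque<>();  // stand-in for MQ
    boolean dbAvailable = true;                                   // toggled to simulate an outage

    private void writeDb(String id, String data) {
        if (!dbAvailable) throw new IllegalStateException("DB unavailable");
        db.put(id, data);
    }

    // Write path: if the DB write fails, park the data and enqueue a compensation message.
    void saveOrder(String id, String data) {
        try {
            writeDb(id, data);
        } catch (IllegalStateException e) {
            tempStore.put(id, data);
            compensationQueue.add(id);
        }
    }

    // Read path (assumption of this sketch): data not yet restored is served from the temp store.
    String readOrder(String id) {
        String pending = tempStore.get(id);
        return pending != null ? pending : db.get(id);
    }

    // Compensation consumer: replay parked writes; anything that still fails is re-queued.
    int compensate() {
        int restored = 0;
        int pendingMessages = compensationQueue.size();
        for (int i = 0; i < pendingMessages; i++) {
            String id = compensationQueue.poll();
            String data = tempStore.get(id);
            if (data == null) continue; // already restored by an earlier message
            try {
                writeDb(id, data);
                tempStore.remove(id);
                restored++;
            } catch (IllegalStateException e) {
                compensationQueue.add(id); // DB still down; try again on the next pass
            }
        }
        return restored;
    }

    public static void main(String[] args) {
        DbFallbackDemo demo = new DbFallbackDemo();
        demo.dbAvailable = false;
        demo.saveOrder("order-1", "paid");
        System.out.println(demo.readOrder("order-1")); // served from the temporary store
        demo.dbAvailable = true;
        System.out.println(demo.compensate() + " order(s) restored to the DB");
    }
}
```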

2. Case 2: Dependency inversion decoupling business logic

Figure 2.5 Defining an abstract interface for decoupling

i. Background

This code-level dependency optimization case is based on the dependency inversion principle and decouples business modules. During requirements iteration, we often depend directly on concrete classes for the sake of convenience, which means that high-level modules depend on low-level modules. This is extremely detrimental to extensibility: as new functions are added, the system becomes more and more bloated and its core functions become harder and harder to see, and the availability of the system suffers as a result.

Consider the following example. In the legacy code, the document cancellation logic is very complex, and the cancellation handling differs across document types, sources, and production links. The doCancel method is the high-level module, while the call to the cancel interface and the sending of the cancellation message are the low-level modules. This is a typical case of a high-level module depending directly on low-level modules.

ii. Implementation before optimization

The following code snippet is legacy technical debt in the system: it is highly coupled and has poor readability and extensibility.

Figure 2.6 Historical code

iii. Optimized implementation

The principle of weak dependency emphasizes that dependencies between modules should be made as weak as possible. This means that the interaction between modules should be as simple as possible and complex dependencies should be avoided. The core idea of our optimization is to abstract an interface using the factory pattern together with the template method pattern, so that the differences in follow-up processing for documents at different links are handled by separate implementations.

a. First, we define an abstract strategy class, AbstractDoCancelNodeStrategy, that breaks down the core process that follows a document cancellation into four steps, each defined as an abstract method.

Figure 2.7 Abstract policy class definition

b. Then, we create four concrete strategy classes that handle document cancellation at the different links. They provide different implementations of the same behavior, and the business selects the appropriate implementation class according to the conditions.

Figure 2.8 Different policy implementations

c. Next, we create a strategy factory that returns the cancellation strategy for each production link. Strategies are loaded at startup: when the project starts, an instance of each implementation class is put into a Map. At runtime, the implementation class is looked up by the key corresponding to the cancel node, and the corresponding logic is executed.

Figure 2.9 Strategy Factory

d. In this way, the high-level module depends on the abstract strategy class instead of on a specific strategy implementation. When business requirements change, for example when a cancellation message must be broadcast for a new document type, only a new strategy class needs to be implemented, and the code of the high-level module does not have to be modified.

Figure 2.10 High-level module code
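
Steps a to d can be sketched together as follows. Only the class name AbstractDoCancelNodeStrategy comes from the text above. The four step names, the two example strategies, the node keys, and the factory class are hypothetical, chosen purely for illustration, and the sketch shows two strategies rather than four to stay short.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class CancelStrategyDemo {
    // Template: doCancel fixes the order of the steps; subclasses fill in each step.
    // The four step names are hypothetical; the article does not name them.
    abstract static class AbstractDoCancelNodeStrategy {
        abstract String node(); // key of the production link this strategy handles

        final List<String> doCancel(String docId) {
            List<String> trace = new ArrayList<>();
            trace.add(updateStatus(docId));
            trace.add(releaseInventory(docId));
            trace.add(callCancelApi(docId));
            trace.add(sendCancelMessage(docId));
            return trace;
        }

        abstract String updateStatus(String docId);
        abstract String releaseInventory(String docId);
        abstract String callCancelApi(String docId);
        abstract String sendCancelMessage(String docId);
    }

    // Hypothetical strategy for documents cancelled before picking starts.
    static final class BeforePickingStrategy extends AbstractDoCancelNodeStrategy {
        String node() { return "BEFORE_PICKING"; }
        String updateStatus(String id) { return id + ": status=CANCELLED"; }
        String releaseInventory(String id) { return id + ": nothing reserved"; }
        String callCancelApi(String id) { return id + ": cancel API called"; }
        String sendCancelMessage(String id) { return id + ": cancel message sent"; }
    }

    // Hypothetical strategy for documents cancelled while picking is in progress.
    static final class PickingStrategy extends AbstractDoCancelNodeStrategy {
        String node() { return "PICKING"; }
        String updateStatus(String id) { return id + ": status=CANCELLED"; }
        String releaseInventory(String id) { return id + ": picked goods returned"; }
        String callCancelApi(String id) { return id + ": cancel API called"; }
        String sendCancelMessage(String id) { return id + ": cancel message sent"; }
    }

    // Factory: strategies are registered once at startup and looked up by node key.
    static final class CancelStrategyFactory {
        private final Map<String, AbstractDoCancelNodeStrategy> byNode = new HashMap<>();

        CancelStrategyFactory(List<AbstractDoCancelNodeStrategy> strategies) {
            for (AbstractDoCancelNodeStrategy s : strategies) byNode.put(s.node(), s);
        }

        AbstractDoCancelNodeStrategy get(String node) {
            AbstractDoCancelNodeStrategy s = byNode.get(node);
            if (s == null) throw new IllegalArgumentException("no strategy for node " + node);
            return s;
        }
    }

    // High-level module: depends only on the abstraction and the factory.
    static List<String> cancelDocument(CancelStrategyFactory factory, String node, String docId) {
        return factory.get(node).doCancel(docId);
    }

    public static void main(String[] args) {
        CancelStrategyFactory factory = new CancelStrategyFactory(
            List.of(new BeforePickingStrategy(), new PickingStrategy()));
        cancelDocument(factory, "PICKING", "DOC-7").forEach(System.out::println);
    }
}
```

Supporting a new link in this sketch means adding one more subclass and registering it with the factory; `cancelDocument` itself does not change.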

After this optimization, the high-level module is indirectly decoupled from the low-level logic of the specific RPC calls. The design also supports the open-closed principle well: new algorithms can be added flexibly without modifying the main process code. In short, an architecture built on abstractions is much more stable than one built on details. In daily development we should therefore program to interfaces wherever possible, and design the code structure top-down, starting with the overall design and then filling in the details.

3. Governance of strong and weak dependencies

Service dependencies are an important factor in the complexity of a system. As the business keeps iterating, service dependencies may become more and more complex, making the system difficult to maintain and extend. Without a clear picture of which dependencies are strong and which are weak, it is hard to carry out circuit breaking, degradation, and rate limiting, and we cannot effectively optimize and refactor the system or keep improving its stability. Service dependency governance therefore becomes critical. We need to check regularly whether our dependency model is reasonable, and identify and correct unreasonable dependencies. The governance process includes:

1. Dependency tagging: Through manual review of the code, all dependencies on the core path of the system are sorted out, and each one is analyzed and labeled as strong or weak.

2. Verification of strong and weak dependencies: Link failures are simulated by means such as chaos engineering. The core idea is to keep "making trouble" for the system in order to verify its capabilities: a scenario in which a dependent service fails is simulated to check whether the manual labels are correct.

3. Dependency governance: The goals of dependency governance are reflected in the following aspects:

•Filter out the parts that are not really strong dependencies and convert them to weak dependencies, minimizing the number of strong dependencies.

•Decouple strong dependencies, establish a degradation plan for core paths, and keep the plan up to date.

•Add reasonable exception handling for weak dependencies, and configure reasonable timeouts, circuit breakers, and rate limiting. For specific business scenarios, with the scenario as the smallest unit, write fallback logic that keeps losses under control, and configure a corresponding dynamic switch so that when an exception occurs, traffic can be switched to the fallback logic with one click.

•Weak dependencies support being switched off smoothly, so that in unexpected scenarios they can be sacrificed to protect the core flow.
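
The third and fourth points can be illustrated with a small sketch: a weak dependency is called with a timeout, any failure falls back to a default value, and a dynamic switch allows the dependency to be bypassed entirely. The class `WeakDependencyDemo`, the `forceFallback` flag, and the coupon example are hypothetical, and an `AtomicBoolean` stands in for a switch that would normally be pushed from a configuration center.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Supplier;

class WeakDependencyDemo {
    // Stand-in for a dynamic switch; true means "skip the dependency and use the fallback".
    static final AtomicBoolean forceFallback = new AtomicBoolean(false);

    // Calls a weak dependency; a timeout or an exception yields the fallback value
    // instead of failing the caller. Requires Java 9+ for completeOnTimeout.
    static <T> T callWeak(Supplier<T> remote, T fallback, long timeoutMillis) {
        if (forceFallback.get()) {
            return fallback; // one-click degradation: the dependency is not called at all
        }
        return CompletableFuture.supplyAsync(remote)
                .completeOnTimeout(fallback, timeoutMillis, TimeUnit.MILLISECONDS)
                .exceptionally(error -> fallback)
                .join();
        // Note: after a timeout the remote task keeps running in the background pool;
        // this sketch does not cancel it.
    }

    public static void main(String[] args) {
        // Healthy dependency: the real value is returned.
        System.out.println(callWeak(() -> "coupon-A", "no-coupon", 1000));

        // Failing dependency: the caller still gets a usable answer.
        System.out.println(callWeak(() -> { throw new IllegalStateException("down"); }, "no-coupon", 1000));

        // Switch flipped: the dependency is bypassed.
        forceFallback.set(true);
        System.out.println(callWeak(() -> "coupon-A", "no-coupon", 1000));
    }
}
```

A production setup would typically add a circuit breaker and rate limiting on top of this, usually through an existing library rather than hand-written code.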

Epilogue

Of course, beyond the measures above for building a highly available system on the principle of weak dependency, there are other high-availability architecture practices.

For example, the architecture design also needs to account for anomaly monitoring at different levels (business layer, application layer, middleware layer, and infrastructure layer). Data collection covers logs, event tracking points, link tracing, and so on, and alarms are sent to on-duty personnel by telephone, SMS, email, JD ME, and similar channels. With a sound monitoring system in place, the running state of the system is collected in real time and early warnings are raised, so that a potential failure can be detected in time and repaired.

In addition, after a new service with a large amount of changes goes live, you can use DUCC to control grayscale traffic switching and reduce the impact of coding errors. Switch a small share of traffic first, confirm that there is no problem, and then switch the traffic in full. Even if the program has bugs, the traffic can be switched back and the impact is kept within a small range. A system built this way has stronger fault tolerance and resilience in the face of failures, which helps keep the overall operation of the system stable and available.
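
Percentage-based grayscale switching can be sketched as below. This does not use the DUCC API: an `AtomicInteger` stands in for a value that a configuration center would push at runtime, and the class `GrayscaleSwitchDemo` and its methods are hypothetical.

```java
import java.util.concurrent.atomic.AtomicInteger;

class GrayscaleSwitchDemo {
    // Stand-in for a dynamically pushed config value: 0 = all old flow, 100 = all new flow.
    static final AtomicInteger newFlowPercent = new AtomicInteger(0);

    // Maps each order to a stable bucket in [0, 99], so the same order always
    // takes the same path for a given percentage.
    static boolean useNewFlow(String orderId) {
        int bucket = Math.floorMod(orderId.hashCode(), 100);
        return bucket < newFlowPercent.get();
    }

    static String handle(String orderId) {
        return useNewFlow(orderId) ? "new:" + orderId : "old:" + orderId;
    }

    public static void main(String[] args) {
        System.out.println(handle("order-1")); // percent is 0, so the old flow is used

        newFlowPercent.set(100);               // full cutover after observation
        System.out.println(handle("order-1"));

        newFlowPercent.set(0);                 // rollback: traffic returns to the old flow
        System.out.println(handle("order-1"));
    }
}
```

Raising the percentage gradually widens the share of orders on the new flow, and setting it back to 0 is the rollback path described above.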