
After the overhaul to an intra-city active-active architecture, downtime is no longer a worry......

Author: DBAplus Community

Outline

1. Background

2. Design ideas

3. Overall active-active architecture

4. The specific transformation plan

5. The rollout process

6. Project Results

7. New problems and follow-up

Easier said than done: we hit a great many problems in the course of going active-active, and it is hard to reproduce them all perfectly in retrospect. That is also why this write-up took so long; I kept hoping to reconstruct the thinking at the time, the details of each problem, and the solutions as faithfully as possible, but what I can actually offer is the solution we came to consider reasonable after many rounds of polishing and exploration.

In addition, containers, the release platform, underlying network operations, monitoring, and similar components are outside the scope (and beyond my technical remit) of this article, which focuses only on the design and transformation done by the business teams and the middleware components.

1. Background

In 2022, driven by anxiety and reflection about stability, the trading platform teamed up with the middleware platform to explore a geo-distributed multi-active project. Although the transformation of core applications and basic components was completed, it was never truly put into production, due to the pandemic and the push to cut costs and improve efficiency, and it also lacked sufficient testing and large-scale verification with online traffic.

Recently, peer companies have suffered serious failures from time to time, for example:


The many failures above are a wake-up call: we need to build the capability for rapid recovery. Problems are almost inevitable, but if the blast radius can be contained and the impact time shortened, the damage can be minimized.

On the one hand, every problem is exposed directly to C-end users, with a wide blast radius; unlike toB/toM scenarios, there is no way to avoid peak periods or keep incidents invisible. On the other hand, traffic and pressure are high, sudden traffic spikes and emergencies are common, and the stability string has to stay tight at all times.

Early on, to reduce the network overhead of cross-data-center calls, many applications were pinned to the availability zone of their storage components (DB/Redis/HBase) and of their core downstream dependencies.

To avoid long-term unavailability of Dewu's main transaction link in extreme cases, the team decided to launch an intra-city active-active project. The goal was to quickly build dynamic traffic-switching and rapid-recovery capabilities while keeping the transformation difficulty, workload, and extra cost low. After discussion, the team decided to bypass the most complex and error-prone part, data synchronization (two-way DB sync, two-way Redis sync, and so on), and to avoid DB write bans during traffic switches, which made the plan far more operable and implementable.

A more thorough approach is of course possible: each data center holds a full copy of the data and applications, so that if one data center fails, the other can take over traffic entirely within its own closed loop. But the increase in complexity and cost would be significant, so we did not take that road this time. Personally, I lean toward a fast, low-cost, low-risk implementation that gets the capability from 0 to 1, rather than a grand all-in-one solution that leaves you helpless if problems arise midway. At this stage, building a relatively low-risk, low-investment intra-city active-active setup, accumulating basic capabilities, and training the team along the way is a good deal: choose the solution best suited to the moment and solve the problem that currently ranks first.

A simple diagram illustrates the difference between our intra-city active-active solution and the industry's geo-distributed active-active solutions.

1. Geo-distributed active-active


Key features:

  • Storage exists in two copies, read and written separately in the two data centers and synchronized in both directions
  • Replication loops (data written in one center circling back through two-way sync) must be handled carefully
  • Data synchronization delay is noticeable, but each data center can basically serve calls within its own closed loop
  • User and merchant assets, such as user coupons and seller inventory, generally need to be maintained in a single data center (a gzone) to avoid overselling and overuse caused by data-sync problems
  • When switching traffic, writes to local data in the target data center must be banned to prevent dirty data

2. Intra-city active-active


Key features:

  • Storage keeps a single copy, so if the data center holding the data fails, the other data center cannot take over traffic normally (only scenarios that carry their own data source, such as CDN and cache, remain partially usable)
  • There is no need to worry about data with the nature of a central node, such as user coupons and inventory
  • There are many cross-data-center accesses, especially data-level reads and writes, which can cause a sharp increase in RT

Whether intra-city or geo-distributed, active-active or multi-active (dual-active is only the simplest multi-active scenario; the jump in difficulty from two-active to three-active is surely no smaller than the jump from level 1 to level 2 of 羊了个羊), the goals are the same:

  • Improve reliability: deploying services in different physical locations reduces the risk of a single point of failure. Even if one data center fails, the others can take over the service and ensure business continuity.
  • Load balancing: user request traffic can be distributed flexibly to avoid overloading a single data center, especially once a single cloud vendor's data center can no longer provide more resources as the business scales.
  • Disaster recovery: traffic scheduling can be switched to recover quickly from a data-center fault and reduce service interruption time.
  • Cloud cost: once the technology is mature, multi-active across data centers of the same cloud, across clouds, or even across cloud plus self-built IDCs reduces dependence on any one cloud vendor and yields some bargaining power; multi-active itself also opens up more possibilities for improving resource utilization.
  • Improve quality of service: especially in geo-distributed multi-active scenarios, distributing traffic across multiple centers can reduce network latency and deliver faster response times and higher quality of service.

2. Design ideas

Description: build a dual-cluster deployment at the application level across multiple availability zones (i.e., multiple physical data centers) of the cloud region, and use the blue-green release capability already rolled out at scale on the transaction link to perform dynamic traffic switching (covering HTTP, RPC, and DMQ [RocketMQ/Kafka]). Storage (Redis/DB) stays in a single data center (though it can be deployed across data centers), reducing the complexity of the solution and its implementation.

3. Overall active-active architecture


As you can see, the whole is divided into four layers at the architecture level:

  • Access layer: DNS domain name resolution + SLB active/standby + DLB + DAG multi-data center deployment to ensure high availability of the access layer. Among them, the strategy of controlling blue-green traffic based on user ID and traffic ratio is implemented in the DAG.
  • Application layer: applications are split into logical blue and green clusters through the transformation; blue-green same-color affinity keeps calls sticky within a zone and shields cross-zone calls.
  • Middleware layer: Multiple middleware components have different cross-AZ deployment strategies, data synchronization, and active handover policies, which are described in detail below.
  • Data layer: the data layer keeps one copy of the data and relies on automatic/manual master-slave switchover plus cross-zone deployment to keep services available during a data-center-level fault; this covers DB, Redis, and HBase.
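The access layer above mentions that the DAG controls blue-green traffic by user ID and a configured ratio. As a minimal sketch (function names, bucket count, and hash choice are my assumptions, not the real DAG implementation), the rule can be modeled as hashing a user into a fixed bucket and comparing against the blue-traffic percentage:

```python
# Hypothetical sketch of the access-layer (DAG) blue-green routing rule:
# a user ID is hashed into a stable bucket, and the configured blue-traffic
# ratio decides which logical cluster (and therefore which AZ) serves it.
import hashlib

BUCKETS = 100

def color_for_user(user_id: str, blue_ratio_percent: int) -> str:
    """Deterministically map a user to 'blue' or 'green'.

    The same user always lands in the same bucket, so its whole session
    stays in one zone; changing blue_ratio_percent re-partitions traffic.
    """
    digest = hashlib.md5(user_id.encode()).hexdigest()
    bucket = int(digest, 16) % BUCKETS
    return "blue" if bucket < blue_ratio_percent else "green"

# 0% blue sends everyone green; 100% sends everyone blue.
assert color_for_user("u-1001", 0) == "green"
assert color_for_user("u-1001", 100) == "blue"
```

Because the mapping is deterministic, adjusting the ratio moves whole buckets of users between zones without splitting any single user's requests across both.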

4. The specific transformation plan

This intra-city active-active effort involves three main parts: the active-active transformation on the transaction application side, the active-active transformation of the transaction's dependent applications, and the transformation of middleware & basic components. Each is described below:

1. Active-active transformation on the trading application side

1) Project scope

On the one hand, the invocation relationships between internal applications are complex, and the workload of differentiating, processing, and sorting them is extremely high. On the other hand, rapid business iteration keeps changing the dependencies, making the cost of maintaining that logic too high; and since strong and weak dependencies themselves change dynamically, asking the team to constantly re-identify which services should be active-active and which should stay single-point would push communication and execution costs even higher.

2) Business transformation ideas and plans

The complex link topology in the actual business scenario can be abstracted into the superposition and combination of the following typical atomic link topology (A-B-C).

(diagram)

Services A and C participate in active-active and need to be deployed across AZs. Service B does not participate and does not need cross-zone deployment.

Services A, B, and C need to identify traffic coloring and obey traffic scheduling.

  • Owners of the relevant services upgrade the unified base framework integrated in each service to a specified version, gaining access to the non-intrusive, zero-configuration, out-of-the-box suite of blue-green release capability components. This ensures the runtime traffic-scheduling capability based on blue-green release is fully integrated. In the diagram above, services A, B, and C all need this step.
  • Owners of the relevant services migrate the release mode through the release platform console. Once a service is migrated to blue-green release, the platform automatically deploys its pods across AZs and injects process-level metadata into the pods to support traffic scheduling. The blue-green release component then intervenes in traffic coloring and scheduling during the upstream caller's load balancing. In the diagram above, services A and C need this step.

Once this transformation is complete, traffic on the active-active link is routed to the nearest instances and closed within an availability zone: after traffic is colored, every hop on the link prefers instances in the downstream service cluster of the same color as the traffic (i.e., in the same availability zone).
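The "same color first" selection above can be sketched in a few lines (class and function names are illustrative, not the real component's API): each hop filters the downstream instance list by the traffic color in the request context and only falls back to the other color when no same-color instance is left.

```python
# Minimal sketch of color-aware instance selection during load balancing.
import random
from dataclasses import dataclass

@dataclass
class Instance:
    host: str
    color: str  # "blue" or "green", injected by the release platform

def pick_instance(instances, traffic_color, rng=random.Random(0)):
    # Prefer instances whose color (and hence zone) matches the traffic.
    same = [i for i in instances if i.color == traffic_color]
    pool = same if same else instances  # cross-color fallback keeps availability
    return rng.choice(pool)

pool = [Instance("10.0.0.1", "blue"), Instance("10.0.1.1", "green")]
assert pick_instance(pool, "blue").color == "blue"
# If every blue instance is gone, traffic degrades to cross-zone calls:
assert pick_instance([Instance("10.0.1.1", "green")], "blue").color == "green"
```

The fallback branch is what lets a surviving zone absorb all traffic when the other zone's cluster disappears, at the cost of temporarily cross-zone RT.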

2. Active-active transformation of transaction dependencies

Transaction-side applications alone cannot complete all P0 links; for example, placing an order relies on delivery-time services on the supply-chain side. Strongly depended-on services from external domains are therefore also included in the scope of the intra-city active-active transformation. Their transformation points are essentially the same and will not be repeated.

3. Middleware & Basic Components

1) Identify the availability zone of the machine resource

At the start of the project, we found that container pods and ECS instances lacked availability-zone identifiers, making it impossible to tell which zone a resource belonged to. We therefore worked with the O&M and monitoring teams to define a specification: the corresponding tags are written into environment variables, which is also the cornerstone that lets monitoring and logs reveal data-center markings.


2) Middleware RTOs

Intra-city active-active requires middleware to keep providing service even if a single AZ fails. The designed RTOs are as follows:

(table)

3) Active-active transformation scheme for main components

(1) DLB - self-developed traffic gateway

  • DLB is a stateless component that is deployed peer-to-peer in two zones.
  • When one of the zones fails, the faulty nodes are removed from the SLB instance's backend endpoints and traffic is sent to the healthy nodes, achieving fast failure recovery. This is expected to complete within seconds.

(2) Rainbow Bridge - Self-developed distributed relational database proxy


Automatic switchover is complicated on the one hand and prone to introducing extra risk on the other, and it also depends on the primary/standby switchover at the DB layer, so switchover is manual and expected to complete within minutes.

Currently, 99% of traffic goes through the cluster in Zone A and 1% through the cluster in Zone B. When an availability-zone fault occurs in Zone A, all traffic can be manually dispatched to the Zone B cluster, and the DB layer must complete its primary/standby switchover (a -> b).
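The 99:1 split with a manual failover switch can be sketched as a small weighted router (class name, weight keys, and the round-robin scheme are my assumptions; the real proxy's implementation is not shown in the article):

```python
# Hedged sketch of weighted routing between the Zone A and Zone B proxy
# clusters, plus a manual switch that sends everything to one zone.
class ProxyRouter:
    def __init__(self, weights):
        self.weights = dict(weights)   # e.g. {"zone_a": 99, "zone_b": 1}
        self._counter = 0

    def route(self):
        # Deterministic weighted round-robin over the configured weights.
        total = sum(self.weights.values())
        slot = self._counter % total
        self._counter += 1
        for zone, w in sorted(self.weights.items()):
            if slot < w:
                return zone
            slot -= w

    def failover(self, to_zone):
        # Manual switch: all traffic to one zone. The DB primary/standby
        # switchover must still be completed separately at the DB layer.
        self.weights = {to_zone: 100}

r = ProxyRouter({"zone_a": 99, "zone_b": 1})
picks = [r.route() for _ in range(100)]
assert picks.count("zone_a") == 99 and picks.count("zone_b") == 1
r.failover("zone_b")
assert all(r.route() == "zone_b" for _ in range(10))
```

Keeping 1% of live traffic on Zone B continuously exercises the standby path, so a failover does not hit a cold, never-tested cluster.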

(3) DMQ


Brokers are spread across zones at the shard level, together forming a complete cluster.

When an availability zone fails, the available shards of the cluster are reduced by half, and the cluster as a whole is available.

The DMQ transformation went through much trial and error. Initially it was done by creating multiple consumer groups on the consumer side, but that required multiple upgrades on the business side and would double the number of consumer groups, so we then decided to move the main transformation work inside the RocketMQ broker. A brief introduction follows:

  • Blue-green attributes

The number of queues on the broker is set to an even number >= 2. We treat the first half of the queues as the logical blue queues and the second half as the green queues. (Here too you can see that much of the processing logic in active-active is an either/or split; with more than two active sites the complexity rises further.)

  • producer

When selecting a queue, the producer chooses based on the blue-green color of its cluster environment.

  • Messages from the blue cluster are delivered to the first half of the broker's queues
  • Messages from the green cluster are delivered to the second half of the broker's queues

Within each half, queues are selected round-robin, without breaking the fault-tolerance logic the producer already supports.
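The queue split and colored selection above can be sketched as follows (function names are illustrative; the real logic lives inside the modified RocketMQ broker and client, which this plain-Python model only approximates):

```python
# Sketch: split an even queue list into blue/green halves and round-robin
# within the half that matches the producer's cluster color.
import itertools

def split_queues(queue_ids):
    # The broker's queue count must be even and >= 2.
    assert len(queue_ids) >= 2 and len(queue_ids) % 2 == 0
    half = len(queue_ids) // 2
    return {"blue": queue_ids[:half], "green": queue_ids[half:]}

def queue_selector(queue_ids, color):
    # Endless round-robin over the colored half only.
    return itertools.cycle(split_queues(queue_ids)[color])

queues = [0, 1, 2, 3]                      # broker configured with 4 queues
blue_next = queue_selector(queues, "blue")
assert [next(blue_next) for _ in range(4)] == [0, 1, 0, 1]
green_next = queue_selector(queues, "green")
assert [next(green_next) for _ in range(3)] == [2, 3, 2]
```

Consumers mirror this: a blue consumer subscribes only to the first half, a green consumer only to the second, so message flow stays within one color end to end.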

  • consumer

Consumers are similar. Blue consumers consume messages from blue queues. Green consumers consume messages from green queues.


(4) Kafka


ZK's ZAB protocol requires a majority, Math.floor(n/2)+1, of nodes to survive in order to elect a leader, so the cluster uses an odd node count and ZK is deployed across 3 AZs (similar to the NameServer above). The nodes are spread across the 3 availability zones with A:B:C = 2N:2N:1, keeping the cluster size odd at all times.
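The quorum arithmetic above is easy to verify directly: a cluster of n nodes stays writable only while floor(n/2)+1 nodes survive, and the 2N:2N:1 layout guarantees that losing any single AZ still leaves a quorum. A small check (helper names are mine):

```python
# Verify that the 2N:2N:1 three-AZ layout survives the loss of any one AZ.
def quorum(n: int) -> int:
    # Majority needed by ZAB/Raft-style consensus.
    return n // 2 + 1

def survives_az_loss(layout):
    total = sum(layout)
    # Writable iff, after losing any one AZ, the survivors still reach
    # quorum of the ORIGINAL cluster size.
    return all(total - az >= quorum(total) for az in layout)

assert survives_az_loss([2, 2, 1])    # N=1: 5 nodes; losing any AZ leaves >= 3
assert not survives_az_loss([3, 2])   # two AZs: losing the 3-node AZ kills quorum
```

This is why a two-AZ ZK deployment cannot tolerate losing its larger zone, and why the third, smaller AZ (the "1" in 2N:2N:1) is essential.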

Brokers are deployed peer-to-peer in two Availability Zones, and the master-slave partition is deployed across zones. When a zone fails, the partition leader is switched.

(5) ES

For Elasticsearch multi-zone deployment, you need to distinguish between data nodes and master nodes.

  • Data nodes: keep the node count in each AZ equal to balance the data, and use shard allocation awareness to keep primary and replica shards in different AZs.
  • Master nodes: deployed across at least three AZs so that losing any one AZ does not affect master election.

(6) Registry

PS: a self-developed distributed registry that achieves availability and data consistency via the Raft protocol, and carries the publish/subscribe responsibility for RPC services site-wide.

  • Proxy nodes are deployed in multiple zones to ensure multi-zone active-active deployment
  • The Raft nodes of the Sylas cluster are deployed in 3 zones to ensure active-active in multiple zones

4) Traffic allocation strategy

(1) RPC traffic

Ingress RPC traffic for active-active is adjusted on the DAG, which tries to allocate traffic by user ID.

  • Each application appends the current blue-green identifier to the request context;
  • If an application on the link is not part of active-active, the blue-green identifier is lost; there are two strategies:

a. Random assignment, but this breaks the integrity of the link;

b. Recompute from the user ID, but this requires adding handling for the Ark configuration.
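Strategy (b) can be sketched as follows (function and field names are my assumptions): when the identifier is missing from the context, recompute the color from the user ID with the same deterministic rule the DAG applies, so the rest of the link regains a consistent color instead of being randomized.

```python
# Sketch of recovering a lost blue-green identifier from the user ID.
import hashlib

def color_from_user(user_id: str, blue_ratio_percent: int) -> str:
    # Same deterministic bucketing rule assumed at the access layer.
    bucket = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % 100
    return "blue" if bucket < blue_ratio_percent else "green"

def resolve_color(ctx: dict, blue_ratio_percent: int) -> str:
    # Prefer the identifier carried in the request context; fall back to
    # recomputing it, preserving link integrity instead of randomizing.
    if ctx.get("bg_color") in ("blue", "green"):
        return ctx["bg_color"]
    return color_from_user(ctx["user_id"], blue_ratio_percent)

assert resolve_color({"bg_color": "green", "user_id": "u-1"}, 50) == "green"
# Lost identifier: recomputed deterministically, same user -> same color.
assert resolve_color({"user_id": "u-1"}, 50) == resolve_color({"user_id": "u-1"}, 50)
```

The trade-off the text notes still applies: the fallback needs access to the current traffic-ratio configuration (the Ark configuration) at every hop that might have to recompute.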


(2) MQ traffic ratio

Producers and consumers in the blue and green clusters are bound to their queues, so adjusting the message ratio of blue versus green producers adjusts the consumption-traffic ratio of the whole MQ. The producers' message ratio generally follows the RPC traffic, so adjusting the RPC traffic ratio also adjusts the MQ traffic ratio accordingly, with a lag of about 5-10s.

5. The rollout process

1. Preliminary preparation stage

The overall idea is determined:

  • Build active-active on top of the current blue-green release, so that every blue-green release is itself a traffic-switch drill; this avoids the capability rotting from long disuse and failing just when it is needed
  • Deploy the service layer active-active and leave the data layer untransformed; DB and Redis achieve high availability through their own master-slave switchover, with slave nodes distributed across availability zones
  • Transform all services in the transaction domain plus the external-domain services on the core link to active-active

Sort out all business scenarios, MQ scenarios, container deployment status, and database & cache master and slave node availability zones.

  • All transaction-domain services plus external services strongly depended on by core business scenarios: which specific scenarios are strong dependencies, whether they can be degraded, and whether fallbacks exist
  • MQ usage: DMQ, Kafka, or others, and whether you need to ensure the order of messages
  • The AZ where the current machine of all services is located and whether it is bound to a fixed AZ
  • The AZs where all databases in the transaction domain and the master and secondary nodes corresponding to Redis are located
  • Jobs that depend on ZooKeeper

Scope of Assessment:

  • Communicate and confirm with upstream/downstream non-transaction domains (which services must be included in the transformation scope; services not going active-active must have a fallback)
  • Upgrade the service JARs involved in active-active, and onboard services not yet connected to blue-green release
  • Optimize interfaces whose RT rises significantly with cross-zone calls

Whether some business scenarios need the nearest-read transformation for self-managed Redis:

  • The O&M side provides a nearest-read solution for self-managed Redis, but it sacrifices data consistency; each party evaluates whether to adopt it based on the actual business scenario and the interface's RT

2. Development & Validation Phase

1) Service JAR upgrades: support active-active blue-green traffic switching and blue-green MQ production & consumption

2) Build the active-active blue-green coloring test environment and improve the test process:

  • Environment setup itself: splitting services into blue-green clusters, binding availability zones, and configuring the machine ratio of the container blue-green clusters
  • Code version verification for the coloring environment, code admission rules, automatic branch-merge rules, test process flow, etc.
  • The coloring environment was set as the second-round (round2) test environment, normalizing active-active regression verification in daily iterations

3) Regression in the active-active blue-green coloring test environment:

  • Normal business process regression
  • Blue-green traffic-switch regression in the test environment
  • MQ production & consumption traffic-switch regression in the test environment
  • Record, compare, and optimize the RT situation of core business interfaces

4) Open the global channel of the active-active coloring environment and switch back the blue-green release channel:

  • Verify channel priority: the release channel takes precedence over the global channel

5) Split blue-green clusters in the pre-release environment

  • At this point, the pre-release environment has effectively completed the active-active transformation

6) Focus on verification in the pre-release environment, especially RT issues

7) For every online service in the active-active transformation, move one machine to Zone B to observe and verify RT increases:

  • Most trading-platform services were previously pinned to Zone A; each service had one instance deployed to Zone B so interface RT could be observed

8) Upgrade DMQ to blue-green 2.0, supporting consumption by blue-green marker

3. Online preparation & online stage

1) The log platform, monitoring platform, trace links, and containers are upgraded to support the blue-green marker

2) Switch the production DMQ to blue-green 2.0, supporting active-active consumption by blue-green marker

3) Switch database and Redis master nodes so that masters and slaves sit only in Zone A or Zone B

  • Most are already in Zones A and B, with exceptions; the key requirement is that master nodes must be in these two zones

4) Split online services into blue-green clusters (manually); the project officially goes live, with regression verification focused on RT issues

5) Expand the green cluster (Zone A) to 100% of machines while the blue cluster (Zone B) stays at 50%, with grayscale observation for 5 days

6) Targeted technical optimization of interfaces whose online RT rose

7) Iterative upgrade of active-active guarantee on the release platform

  • Added one-click support for adding services to the active-active blue-green clusters
  • Active-active blue-green clusters support batch expansion by zone (in the case of a single data center failure, the service in the surviving zone can be quickly pulled up)

8) The container platform supports container management and control multi-zone deployment

6. Project Results

The project went live on December 14, 2023. After 5 days (12.14-12.18) of observation and verification under high pre-Christmas traffic (DLB traffic reaching 77.8% of Double 11), no obvious anomalies were confirmed, and the online cluster was then scaled in. In some scenarios RT rose by a certain percentage (the data layer only does cross-AZ disaster recovery without nearby access, so all data-plane calls from the blue cluster must cross AZs), and a small technical project has been started to drive optimization.

In terms of actual results, after the cross-data-center traffic switch during the 12.22 release, the transaction link now has cross-data-center traffic-scheduling capability, as follows:

(diagram)

1. Traffic performance

(Zone A - Green Cluster, Zone B - Blue Cluster)

  • Traffic between the two zones' clusters reaches 50:50. However, because a few upstream/downstream applications and RocketMQ have not undergone the multi-active transformation, a small amount of traffic is still not strictly split
  • Core metrics: QPS / RT / error rate
  • Access to core infrastructure components

Since all data stores (DB, Redis, and HBase) are in Zone A, RT in Zone B has risen to some extent, overall by about 7-8ms (some scenarios query data multiple times per request); optimization is still being pushed forward.


2. Cost situation

Before services were stably deployed in Zone B, there was a 5-day parallel period (100% of resources in Zone A plus 50% in Zone B, 150% in total), which incurred a small extra cost.

After the grayscale parallel period ended, the extra 50% of resources in Zone A was released, and overall cost returned to the original baseline with no additional spend.

7. New problems and follow-up

1. In blue-green releases, if a downstream service has joined active-active but does not enter the release channel, its consumed traffic will skew: during an upstream traffic switch, RPC and MQ prefer calls within the current zone, so the traffic ratio of the other zone is affected.

2. For downstreams not yet in active-active, or storage/cache middleware (DB/HBase/Redis) without nearest-read enabled, RT in data center B is generally 5-8ms higher. Optimizations are being rolled out gradually.

3. Container management and control, as infrastructure, must keep working through a data-center-level failure and be able to complete scaling operations; that is, the container control plane itself needs multi-AZ deployment, which is still under construction.

4. In a data-center-level failure, whether enough spare resources exist for batch scale-out in the surviving data center (cloud vendors' own resources are tight, especially during big promotions).

5. Active-active linkage between multiple domains, such as transactions and searches:

  • Whether the active-active switchover of two large domains needs to be linked (linked: the blast radius is amplified, and the search/recommendation side is hard to scale out; not linked: each domain's active-active traffic becomes fragmented)
  • Whether the two large domains recognize the same blue-green markers (does each domain only close its own loop with same-zone access, or must this also be guaranteed across domains)

6. How to conduct realistic, non-destructive drills online.

The problems above are all new challenges brought by active-active, and we keep thinking about them and investing in solutions.

No matter what you do or how you do it, there will always be new problems, won't there? Keep a long-term view lol...

Author丨Alan, 英杰, Matt, 羊羽

Source丨Official Account: Dewu Technology (ID: gh_13ba5621e65c)

The DBAPLUS community welcomes contributions from technical personnel at [email protected]
