
Multi-swimlane practice in the test environment

Author: Flash Gene

1. Background

Youxian attaches great importance to test environment governance as a way to improve the efficiency of developers and testers. I have worked on test environment governance since 2018 and have been fortunate to witness several of its stages. In early 2018, the test environment consisted of a few virtual machines on which the services under test were deployed. Contention was frequent: a service branch that had just been deployed would be overwritten by someone else, and the wrong code was often discovered only after testing had finished, so efficiency was very low.

In mid-2018, the test environment was reorganized: more than a dozen sets of domain names and machines were initialized, environments were isolated by domain name, and each business line claimed an environment through the release system. This temporarily alleviated the problem of code being overwritten, but contention for environments remained severe.

In 2020, a second round of governance began. With the help of Docker's rapid scaling, a completely isolated full-link environment could be brought up through the self-developed environment management system (hereinafter the Aladdin system). Resources were well isolated, but as the number of full-link environments grew, a problem in any of them could require developers to investigate, which increased the maintenance cost for both development and testing. We are now carrying out a new round of test environment governance to improve the efficiency of using the test environment.

2. Introduction to multi-swimlanes

2.1. What are multi-swimlanes?

The multi-swimlane structure borrows from swimming competitions: all athletes share one pool, the pool is divided into lanes, and leaving your lane counts as a foul. Mapped onto system design, a complete service link is a swimlane and the request data is the athlete, which flows freely within its own lane without interfering with data in other lanes. The goal is to concentrate common services in the main swimlane and deploy only the changed services in branch swimlanes, which rely on the common services of the main swimlane, while request data is kept apart internally by physical or logical isolation.

2.2. Goals of multi-swimlanes

On the one hand, the goal is to ensure the stability of the test environment and improve testing efficiency. The test environment is unstable for several reasons: it runs code that is still under development; it contains dirty data; it receives little attention and lacks monitoring; and its machine resources cannot match those of production. The multi-swimlane solution therefore introduces the concepts of main and branch swimlanes. Stable code is deployed in the main swimlane, so developers only need to maintain one full-link environment, which reduces maintenance cost and improves the stability of the main swimlane.

On the other hand, it solves the contention problem and improves development efficiency. When multiple business lines develop multiple requirements at the same time, each needs an environment isolated from the others, but machines in the test environment are limited, so teams either fight over environments or share one, which lowers testing efficiency. The multi-swimlane solution therefore sets a rule: developers no longer have permission to deploy to the main swimlane, and the system automatically deploys the latest stable code there on a regular schedule. Developers deploy only the changed services in a branch swimlane and rely on the main swimlane for everything else, which reduces the amount of deployment and wasted resources. Because a branch swimlane is very lightweight, it can be created and destroyed very quickly.

2.3. Benefits of multi-swimlanes

Multiple business teams have verified that creating a test environment is at least 10 times faster. Creating a test environment that requires copying the underlying database and bringing up every service on the full link takes about 3 hours, whereas bringing up a branch swimlane takes only a few minutes. Because branch swimlanes are so lightweight, more of them can be created for different business groups, or for different requirements within the same business group. In short: faster, more numerous, and more stable.

3. Technical solution

3.1. System architecture


The architecture consists of three parts: the gateway layer, the RPC layer, and the data layer.

  1. The gateway layer is responsible for environment identification and environment ID injection. The front end isolates environments by test domain name, for example b2.missfresh.net/xx. When the request reaches the gateway, the environment ID b2 is parsed from the domain name, written into an HTTP header, and passed downstream.
  2. The RPC layer is responsible for service discovery and selection and for pass-through of the environment ID. It finds the services of the corresponding environment through service discovery, selects the service to call through a custom routing policy, and continues to pass the environment ID downstream.
  3. The data layer is responsible for data isolation and sharing in the test environment.
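The gateway step described above can be sketched in plain Java. This is only an illustration of the parsing and injection logic, not the actual Kong plug-in (Kong plug-ins are normally written in Lua). The header name `X-Env-Id` and the class name are assumptions made for the example; the domain suffix is taken from the example host in the text.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Sketch: derive the environment ID from the Host and inject it as an HTTP header. */
final class EnvHeaderInjector {
    // Hypothetical header name; the article does not name the real one.
    static final String ENV_HEADER = "X-Env-Id";
    // Test domain suffix, taken from the article's example host.
    static final String DOMAIN_SUFFIX = ".missfresh.net";

    /** Returns the first label of the host (for example "b2"), or null if the host does not match. */
    static String resolveEnv(String host) {
        if (host == null) {
            return null;
        }
        String h = host.toLowerCase(Locale.ROOT);
        int colon = h.indexOf(':'); // strip an optional port
        if (colon >= 0) {
            h = h.substring(0, colon);
        }
        if (!h.endsWith(DOMAIN_SUFFIX)) {
            return null;
        }
        String prefix = h.substring(0, h.length() - DOMAIN_SUFFIX.length());
        // Only a single non-empty label counts as an environment ID.
        return (prefix.isEmpty() || prefix.contains(".")) ? null : prefix;
    }

    /** Copies the request headers and adds the environment header when one can be resolved. */
    static Map<String, String> inject(String host, Map<String, String> headers) {
        Map<String, String> out = new LinkedHashMap<>(headers);
        String env = resolveEnv(host);
        if (env != null) {
            out.put(ENV_HEADER, env);
        }
        return out;
    }
}
```

Requests whose host does not match the pattern are passed through without the header, so downstream services can treat a missing header as "use the main swimlane".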

The logical structure is divided into the main swimlane and branch swimlanes.

  1. The main swimlane runs stable code for the full link and serves as the public environment, providing the default services for other environments so that the request link is always complete.
  2. A branch swimlane only needs to deploy the changed services. Unmodified services, such as the underlying commodity and inventory services and the sender, push, and other components, are served from the main swimlane, which reduces the maintenance cost of public services and improves efficiency.

Based on the process above, several components need to be modified:

  1. Environment ID injection at the gateway. The test environment uses an open-source traffic gateway (Kong), with a custom plug-in that parses the domain name, injects the environment ID into an HTTP header, and passes it downstream.
  2. Pass-through of the link ID. The open-source Pinpoint system is used, with a new or enhanced pass-through plug-in to carry the link ID along the call chain. Pinpoint was chosen because it is embedded into the service as a Java agent, so the service is unaware of it, and it can be upgraded transparently together with the deployment system, which is friendlier than using an SDK.
  3. Service awareness. All swimlanes share one ZooKeeper cluster so that services in every swimlane are discovered promptly, and every service registers with its environment ID.
  4. Service selection. A new Dubbo routing policy matches the environment ID of the service against the environment ID in the link.
  5. Data storage. If multiple environments share the underlying data, the code configures the database by domain name and DNS points the names at the same database; for example, b2.mysql.missfresh.net and b15.mysql.missfresh.net resolve to the same instance IP. If multiple environments isolate the underlying data, MySQL and Redis need an SDK wrapper that uses the environment ID to write data to different databases or instances, RocketMQ needs an SDK wrapper that sends messages to different queues by environment ID, and Elasticsearch can be fronted by a gateway whose routing forwards requests to different indexes or instances.
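For the isolated-data case in item 5, the core of such an SDK wrapper is a small routing decision keyed by the environment ID. The sketch below is an assumption about how that could look, not the team's SDK: the class name, the `_<env>` suffix convention, and the idea of an explicit list of isolated environments are all hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Sketch: choose a database (schema) name from the environment ID. */
final class StorageRouter {
    // Environments that own a private copy of the data; all others share the base schema.
    private final Set<String> isolatedEnvs;

    StorageRouter(String... isolatedEnvs) {
        this.isolatedEnvs = new HashSet<>(Arrays.asList(isolatedEnvs));
    }

    /**
     * Returns "<base>_<env>" for isolated environments (hypothetical naming rule)
     * and the shared base schema for every other environment.
     */
    String schemaFor(String baseSchema, String env) {
        if (env != null && isolatedEnvs.contains(env)) {
            return baseSchema + "_" + env;
        }
        return baseSchema;
    }
}
```

With `new StorageRouter("b15")`, a request carrying b15 would be sent to `orders_b15`, while b2 or a request without an environment ID would keep using the shared `orders` schema.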

The following is a breakdown of the isolation scheme of each module.

3.2. Service isolation

There are two directions for service isolation, physical isolation and logical isolation.

Solution 1: Physical isolation. Deploy a separate ZooKeeper for each environment, so that the consumer obtains the provider list only from its own ZooKeeper and calls those providers.


Solution 2: Logical isolation. Tag both the Provider and the Consumer with an identifier, and let the Consumer call the designated Provider through a custom load balancing algorithm.


The advantage of physical isolation is strong isolation. The disadvantage is that every environment needs its own ZooKeeper, which slows down environment creation. A further disadvantage concerns the flow shown by the green line in the architecture diagram in 3.1: for a request to return from the main swimlane to a branch swimlane, the services in the main swimlane must watch the ZooKeeper of every branch swimlane in order to track which services are alive there, which makes the main swimlane bloated and the implementation complex. We therefore chose logical isolation: an identifier is added when the provider registers, and after the consumer obtains the provider list, a custom load balancing algorithm finds the provider of the specified environment and calls it.

With the solution decided, what remained was to modify Dubbo. The first step is to inject the environment ID into the container, through either an environment variable or a local configuration file, so that after startup the service knows which swimlane its container belongs to. The second step is to read the current environment ID when the Provider registers and add a parameter (zone) identifying the current environment when ServiceConfig.doExportUrls() generates the registration URL. The generated registration URL looks like this:

dubbo://127.0.0.1:10080/com.missfresh.xxxxService?anyhost=true&application=mryx&bean.name=ServiceBean:com.missfresh.xxxxService:1.0&default.dispatcher=message&default.service.filter=notice&default.threadpool=fixed&default.threads=300&default.timeout=1000&dubbo=2.0.2&interface=com.missfresh.mpush.xxxxService&logger=slf4j&methods=xxxx&pid=1&registry=127.0.0.1:2181&revision=1.0.0&side=provider&timestamp=1622925206798&version=1.0&zone=b2
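To make the role of the zone parameter concrete, here is a minimal sketch that appends it to a registration URL and reads it back. It works on plain strings and deliberately does not use Dubbo's own URL class; the class and method names are hypothetical, and only the parameter name `zone` comes from the article.

```java
/** Sketch: add and read the "zone" parameter of a registration URL (plain strings, no Dubbo types). */
final class ZoneParam {
    static final String KEY = "zone";

    /** Appends zone=<value> to the query string; returns the URL unchanged when no zone is set. */
    static String withZone(String url, String zone) {
        if (zone == null || zone.isEmpty()) {
            return url;
        }
        return url + (url.contains("?") ? "&" : "?") + KEY + "=" + zone;
    }

    /** Returns the value of the zone parameter, or null when the URL has none. */
    static String zoneOf(String url) {
        int q = url.indexOf('?');
        if (q < 0) {
            return null;
        }
        for (String pair : url.substring(q + 1).split("&")) {
            int eq = pair.indexOf('=');
            if (eq > 0 && KEY.equals(pair.substring(0, eq))) {
                return pair.substring(eq + 1);
            }
        }
        return null;
    }
}
```

On the consumer side, reading this parameter from each provider URL is what lets the routing step compare the provider's swimlane with the environment ID carried in the link.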
           

The third step is to implement the routing algorithm. The logic is not difficult: the consumer finds the matching provider using the current environment ID. It can be implemented either as a Router or as a LoadBalance, and the two differ. For example, in a multi-registry scenario, a Router can route to different registries, whereas a LoadBalance can only choose among the providers already filtered by the router list and cannot dynamically choose between registries. This difference also matters for the later intra-city active-active solution, so we chose to implement it as a Router.
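The selection rule itself can be sketched without any Dubbo types: prefer providers whose zone matches the environment ID in the link, and fall back to the main swimlane when the branch has not deployed the service. This is a simplified stand-in for the logic a Router would run, under the assumption that the main swimlane is tagged with a zone value of `main`; the `Provider` class and all names here are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: zone-based provider selection with fallback to the main swimlane. */
final class ZoneRouter {
    // Hypothetical zone value for the main swimlane.
    static final String MAIN_ZONE = "main";

    /** Minimal stand-in for a registered provider: its address and the zone it registered with. */
    static final class Provider {
        final String address;
        final String zone;

        Provider(String address, String zone) {
            this.address = address;
            this.zone = zone;
        }
    }

    /**
     * Returns the providers in the request's zone; if there are none,
     * returns the providers of the main swimlane instead.
     */
    static List<Provider> route(List<Provider> providers, String requestZone) {
        List<Provider> sameZone = new ArrayList<>();
        List<Provider> mainZone = new ArrayList<>();
        for (Provider p : providers) {
            if (p.zone != null && p.zone.equalsIgnoreCase(requestZone)) {
                sameZone.add(p);
            } else if (MAIN_ZONE.equalsIgnoreCase(p.zone)) {
                mainZone.add(p);
            }
        }
        return sameZone.isEmpty() ? mainZone : sameZone;
    }
}
```

Because the decision depends on the environment ID carried in the link rather than on where the caller runs, a request that has passed through a main-swimlane service is still routed back to its branch swimlane as soon as that branch provides the next service.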


3.3. Message isolation

The physical structure of RocketMQ consists of the NameServer and the Broker, while its logical structure consists of the Topic and the Queue. Different isolation schemes can be designed around the physical and logical structures; the following three are available for reference.


3.3.1. Queue isolation

The idea of queue isolation is to have each swimlane use designated queues; for example, swimlane b1 sends and receives messages only on queue1 and queue2, and swimlane b2 only on queue3 and queue4. To achieve this, the load balancing algorithms of the producer and the consumer must be overridden so that messages are sent and received on the designated queues. The disadvantage is that queues must be added for every swimlane environment: increasing consumption capacity becomes tricky, and the number of queues has to grow dynamically as the number of swimlanes grows.
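The mapping from swimlane to queues, and why it scales poorly, can be shown with a small sketch. It assumes each swimlane owns a fixed-size, contiguous block of queue numbers, numbered from 1 as in the example above; the class and its allocation rule are hypothetical and only illustrate the idea.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: give each swimlane a fixed, contiguous block of queue numbers. */
final class LaneQueueAllocator {
    private final List<String> lanes;
    private final int queuesPerLane;

    LaneQueueAllocator(List<String> lanes, int queuesPerLane) {
        this.lanes = new ArrayList<>(lanes);
        this.queuesPerLane = queuesPerLane;
    }

    /** Queue numbers (1-based, as in the text) owned by the lane; empty if the lane is unknown. */
    List<Integer> queuesFor(String lane) {
        List<Integer> result = new ArrayList<>();
        int index = lanes.indexOf(lane);
        if (index < 0) {
            return result;
        }
        int first = index * queuesPerLane + 1;
        for (int i = 0; i < queuesPerLane; i++) {
            result.add(first + i);
        }
        return result;
    }

    /** Total number of queues the topic needs; grows linearly with the number of lanes. */
    int totalQueues() {
        return lanes.size() * queuesPerLane;
    }
}
```

With lanes b1 and b2 and two queues per lane, b1 owns queues 1 and 2 and b2 owns queues 3 and 4. Every additional swimlane, or any increase in per-lane capacity, changes `totalQueues()`, which is the scaling cost the text describes.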


3.3.2. Broker isolation

Broker isolation is similar in idea to queue isolation: each swimlane uses a designated broker, messages are isolated at the physical broker level, and each swimlane sends and receives messages only through its own broker. To achieve this, the broker, the producer, and the consumer must each be marked with the environment they belong to, which can be done through naming rules; the load balancing algorithms of the producer and consumer are then rewritten to give producers and consumers affinity with the broker of their environment. The downside is that every environment requires its own broker deployment.


3.3.3. Message gateway isolation

The idea of the message gateway solution is to hide queue selection and similar details behind an added proxy layer: the client only sends and receives messages tagged with the environment ID, and the gateway decides where messages go and how they are consumed. The disadvantage is that a gateway system has to be developed for the test environment, which takes a long time.


In the end, we chose broker isolation, for three reasons. First, queue isolation scales poorly. Second, building a message gateway for the test environment has a long cycle and a high development cost. Third, broker isolation lets us validate the intra-city active-active solution in advance, because its deployment structure matches the broker isolation structure: brokers isolate the messages of the two data centers, and the two data centers also need to consume each other's messages.

With the solution decided, what remained was to modify the MQ-SDK. The first step is to standardize the broker name and the instance names of the Producer and Consumer so that each name carries the environment ID. The rule is implemented as follows:

/**
 * Generates a unique identifier based on the current environment.
 *
 * @return {current environment}-{benchmark environment flag}-{PID}-{auto-increment counter for uniqueness}
 */
public static String genInstanceName() {
    // Start with {PID}-{counter}, then prepend the benchmark flag and the environment ID.
    String instanceName = String.valueOf(UtilAll.getPid()) + SPLIT + COUNT.incrementAndGet();
    instanceName = (Boolean.TRUE.toString().equalsIgnoreCase(benchmark) ? "1" : "0") + SPLIT + instanceName;
    instanceName = (StringUtils.isNotEmpty(zone) ? zone : DEFAULT_ZONE) + SPLIT + instanceName;
    return instanceName;
}

           

With this rule, the ClientId generated for each instance carries the environment ID.

The second step is to rewrite the Producer's load balancing algorithm. Implement the select method of the MessageQueueSelector interface to pick the designated queues from all queues, and pass the selector when sending. For example, if the current environment is b2, all queues whose broker name has the b2 prefix are returned. The main code is as follows:

protected List<MessageQueue> groupByZone(List<MessageQueue> mqs) {
    // Prefer the environment ID carried in the link
    String zone = Extractor.getGray();

    List<MessageQueue> localQueueList = new ArrayList<>(mqs.size());
    List<MessageQueue> benchmarkQueueList = new ArrayList<>(mqs.size());
    for (MessageQueue messageQueue : mqs) {
        String[] brokerNameArray = messageQueue.getBrokerName().split(MryxConfig.SPLIT);
        String queuePrefix = brokerNameArray[0];
        if (zone.equalsIgnoreCase(queuePrefix)) {
            // Queue of the current environment
            localQueueList.add(messageQueue);
        } else if (brokerNameArray.length > 2 && RocketMQConfig.IS_BENCHMARK.equals(brokerNameArray[1])) {
            // Queue of the benchmark (main) environment
            benchmarkQueueList.add(messageQueue);
        }
    }

    // Fall back in order: current environment, benchmark environment, all queues
    if (!localQueueList.isEmpty()) {
        return localQueueList;
    }
    if (!benchmarkQueueList.isEmpty()) {
        return benchmarkQueueList;
    }
    return mqs;
}

           

The third step is to rewrite the Consumer's load balancing algorithm. Implement the allocate method of the AllocateMessageQueueStrategy interface to select the designated queues for consumption. For example, if the current environment is b2, only queues whose broker name is prefixed with b2 are consumed, which isolates consumption. The main code is as follows:

@Override
public List<MessageQueue> allocate(String consumerGroup, String currentCID, List<MessageQueue> mqAll, List<String> cidAll) {
    // Group the message queues by the environment ID in their brokerName
    Map<String/*machine zone */, List<MessageQueue>> mr2Mq = new TreeMap<>();
    for (MessageQueue mq : mqAll) {
        String brokerMachineZone = machineRoomResolver.brokerDeployIn(mq);
        mr2Mq.putIfAbsent(brokerMachineZone, new ArrayList<>());
        mr2Mq.get(brokerMachineZone).add(mq);
    }

    // Group the consumers by the environment ID in their clientId
    Map<String/*machine zone */, List<String/*clientId*/>> mr2c = new TreeMap<>();
    // clientIds of the benchmark (main) environment
    List<String> benchmarkClientIds = new ArrayList<>();
    for (String cid : cidAll) {
        String consumerMachineZone = machineRoomResolver.consumerDeployIn(cid);
        mr2c.putIfAbsent(consumerMachineZone, new ArrayList<>());
        mr2c.get(consumerMachineZone).add(cid);
        if (machineRoomResolver.consumerIsBenchmark(cid)) {
            benchmarkClientIds.add(cid);
        }
    }

    List<MessageQueue> allocateResults = new ArrayList<>();
    // 1. Allocate the queues in the same zone as the current consumer
    String currentMachineZone = machineRoomResolver.consumerDeployIn(currentCID);
    List<MessageQueue> mqInThisMachineZone = mr2Mq.remove(currentMachineZone);
    List<String> consumerInThisMachineZone = mr2c.get(currentMachineZone);
    if (mqInThisMachineZone != null && !mqInThisMachineZone.isEmpty()) {
        allocateResults.addAll(allocateMessageQueueStrategy.allocate(consumerGroup, currentCID, mqInThisMachineZone, consumerInThisMachineZone));
    }

    // Find the queue lists whose zone has no consumer
    for (String machineZone : mr2Mq.keySet()) {
        if (mr2c.containsKey(machineZone)) {
            continue;
        }
        // 2. If benchmark consumers exist, assign the queues without consumers to the benchmark environment
        if (!benchmarkClientIds.isEmpty()) {
            if (machineRoomResolver.consumerIsBenchmark(currentCID)) {
                allocateResults.addAll(allocateMessageQueueStrategy.allocate(consumerGroup, currentCID, mr2Mq.get(machineZone), benchmarkClientIds));
            }
        } else {
            // 3. Without a benchmark environment, spread the queues without consumers across all consumers
            allocateResults.addAll(allocateMessageQueueStrategy.allocate(consumerGroup, currentCID, mr2Mq.get(machineZone), cidAll));
        }
    }
    return allocateResults;
}
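The allocate method relies on a machineRoomResolver that the article does not show. The sketch below is one possible reconstruction of its consumer-side methods, assuming the ClientId has RocketMQ's usual `ip@instanceName` shape and the instance name follows the `{zone}-{benchmark flag}-{PID}-{counter}` rule from genInstanceName. The separator `-` and the default zone `main` are assumptions, since the real values of SPLIT and DEFAULT_ZONE are not given.

```java
/** Sketch: recover the zone and the benchmark flag from a consumer's clientId. */
final class ClientIdZoneResolver {
    // Assumed separator, following the format in the genInstanceName Javadoc.
    static final String SPLIT = "-";
    // Assumed fallback zone for clientIds that do not follow the naming rule.
    static final String DEFAULT_ZONE = "main";

    /** Splits the instance-name part of "ip@instanceName" into its fields. */
    private static String[] instanceParts(String clientId) {
        int at = clientId.indexOf('@');
        String instanceName = at >= 0 ? clientId.substring(at + 1) : clientId;
        return instanceName.split(SPLIT);
    }

    /** Zone the consumer is deployed in, or DEFAULT_ZONE when the name has fewer than four fields. */
    static String consumerDeployIn(String clientId) {
        String[] parts = instanceParts(clientId);
        return parts.length >= 4 ? parts[0] : DEFAULT_ZONE;
    }

    /** True when the second field marks the consumer as part of the benchmark (main) environment. */
    static boolean consumerIsBenchmark(String clientId) {
        String[] parts = instanceParts(clientId);
        return parts.length >= 4 && "1".equals(parts[1]);
    }
}
```

This parsing only works if environment IDs never contain the separator themselves, which is one reason a strict naming rule for brokers and clients matters in this scheme.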

           

The fourth step, once the main flow works, is to handle the special cases. Because a branch swimlane is not a full link, its upstream producer or its downstream consumer may not be deployed. To keep the link unobstructed, the main swimlane needs fallback logic. If the producer is not deployed in the branch swimlane, the main swimlane's producer sends the message to the corresponding broker according to the link ID (the orange line in Figure 3.3.2). If the consumer is not deployed in the branch swimlane, the main swimlane consumes the message (the green line in Figure 3.3.2). This completes the message isolation solution.

3.4. Storage isolation

Storage isolation mainly uses physical isolation, which is relatively simple to implement. One approach is for the code to reference the database by domain name and to specify the real database address in the hosts configuration when the container is created. Another is to use a variable in the code's configuration: the container sets the environment variable, and the real address is substituted when the code reads the variable. Logical isolation may require modifying business code, which is cumbersome and risks business stability, so it is rarely used.
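The second approach, a configuration variable that is replaced from the container's environment, can be sketched as a small placeholder resolver. The `${NAME}` syntax, the class name, and the variable name `MYSQL_HOST` used below are assumptions for illustration; the article does not say which syntax or names the team uses.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch: replace ${NAME} placeholders in a configured address with environment values. */
final class PlaceholderResolver {
    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{([A-Za-z0-9_]+)\\}");

    /**
     * Resolves every ${NAME} in the template from the given map
     * (for example System.getenv()); unknown names are left untouched.
     */
    static String resolve(String template, Map<String, String> env) {
        Matcher m = PLACEHOLDER.matcher(template);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            String value = env.get(m.group(1));
            String replacement = value != null ? value : m.group(0);
            // quoteReplacement keeps '$' and '\' in the value from being read as group references.
            m.appendReplacement(sb, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(sb);
        return sb.toString();
    }
}
```

For instance, calling `resolve("jdbc:mysql://${MYSQL_HOST}:3306/orders", System.getenv())` in a container started with `MYSQL_HOST=b2.mysql.missfresh.net` would yield an address pointing at that host, so the same code can run in any swimlane without changes.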

Business requirements for data isolation in the test environment differ: some teams only want to maintain one copy of the underlying data, while others need isolated data to run automated tests. At present, swimlanes still share the underlying data storage. The advantage is that creating a new branch swimlane requires no new database and no data synchronization, which greatly speeds up requesting and destroying environments. For automated testing and other needs that require data isolation, we deploy a separate full-link environment.

4. Challenges

1. The swimlane solution relies on the pass-through capability of the Pinpoint link tracing system, and link information is lost in thread pool scenarios.

2. Component upgrades. Because Dubbo, MQ-client, and other components were modified, component versions must be unified, which involves migration and compatibility issues. We worked closely with the business teams on the changes and used automated testing to verify the stability of the upgraded components.
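The thread pool problem in the first challenge comes from context that lives in a thread-local: a pooled worker thread does not see the value set by the submitting thread. The sketch below shows the general capture-and-restore technique at the application level. It is not how Pinpoint solves it (an agent would instrument the executor instead), and the `EnvContext` class is hypothetical.

```java
import java.util.concurrent.Callable;

/** Sketch: carry a thread-local environment ID across a thread pool boundary. */
final class EnvContext {
    private static final ThreadLocal<String> ENV = new ThreadLocal<>();

    static void set(String env) {
        ENV.set(env);
    }

    static String get() {
        return ENV.get();
    }

    static void clear() {
        ENV.remove();
    }

    /**
     * Captures the caller's environment ID now and installs it on the worker
     * thread for the duration of the task, restoring the previous value afterwards.
     */
    static <T> Callable<T> wrap(Callable<T> task) {
        final String captured = ENV.get();
        return () -> {
            String previous = ENV.get();
            ENV.set(captured);
            try {
                return task.call();
            } finally {
                if (previous == null) {
                    ENV.remove();
                } else {
                    ENV.set(previous);
                }
            }
        };
    }
}
```

A task submitted as `pool.submit(EnvContext.wrap(task))` sees the submitter's environment ID, whereas the same task submitted without wrapping sees none. Restoring the previous value in the finally block keeps a reused worker thread from leaking one request's environment ID into the next.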

5. Future prospects

1. Automation of the test environment. Requesting and destroying swimlanes is not yet automated. An environment management panel is needed to show which swimlanes exist, who is using them, which services need to be deployed in each, and the health of those services. This makes environment management clear and improves everyone's efficiency.

2. Data isolation. The underlying storage of the swimlane environment is not yet isolated, so swimlanes are open only to developers, because testers are more sensitive to data isolation. We are therefore adapting components such as MySQL and Redis to route to different databases or instances by environment ID in order to achieve data isolation.

6. Summary

The multi-swimlane solution has been running in the test environment for some time. We ran into a number of problems along the way and, after some exploration, arrived at our own set of solutions. The solution builds on our own components and is implemented on top of the active-active solution, so some of the isolation choices are not optimal for a test environment in general, but they suit us well. I hope this offers some inspiration, and discussion is welcome.

Source-WeChat public account: Daily Youxian Technical Team

Source: https://mp.weixin.qq.com/s/bBZZjdC5dX8rMUFAhMw_Mg
