
Open source | Large-scale adoption of AREX, a traffic replay platform, at Ctrip

Author: Flash Gene

Lead

AREX is an open-source traffic replay platform from Ctrip, incubated within the ticket business unit (BU). Focusing on the core recording and playback path, the team has gone from building the basic solution to deploying and validating it in depth on core business lines. Through continuous iteration and optimization against the group's complex business scenarios, we have accumulated a lot of experience and achieved visible results. Since launch, more than 4,000 applications at Ctrip have been connected to AREX, and both the delivery rate and the number of defects have improved.

This article describes the challenges AREX encountered during its rollout inside Ctrip and how they were solved, and explains how AREX provides a one-stop traffic recording and playback solution that can be deployed quickly and adopted at low cost.

1. Background

Traffic recording and playback has broad applications in performance testing, regression testing, automated testing, and rapid diagnosis of production problems. It helps technical teams address stability and efficiency problems in the R&D process under complex business scenarios and system architectures.

However, putting the technique into practice raises many challenges, such as the difficulty of building the infrastructure, an initial investment that is out of proportion to the early benefit, and unclear adoption scenarios.

2. Solution

At present, most known open-source solutions are secondary developments based on Jvm-Sandbox-Repeater. The core principle is to record real production traffic and then play it back in a test environment to verify the correctness of the code logic. So one might ask: why "reinvent the wheel" when there is a proven solution?

First, JVM SandBox supports only a limited number of components, far from enough for the middleware and frameworks widely used within Ctrip. Its support for low-level JDK mechanisms is also incomplete: for example, passing context across asynchronous threads has to rely on third-party components.

Second, although Jvm-Sandbox-Repeater provides basic recording and playback functions, a complete business regression testing solution also needs a full backend system responsible for data collection, storage, and comparison.

Finally, sparse official documentation and low community activity meant we risked not getting timely official support during later secondary development.

Based on these considerations, we decided to develop AREX, our own traffic recording and playback platform:

1) It supports recording and playback for a wider range of middleware and components, and can simulate complex business scenarios such as local caches and the current time.

2) As a comprehensive solution, it comes with complete supporting facilities, such as a front-end interface, a playback service, and report analysis, providing a one-stop workflow covering traffic collection, traffic playback, comparison and verification, and report generation (as shown in the figure below).

[Figure: AREX one-stop workflow covering traffic collection, traffic playback, comparison and verification, and report generation]

In the following sections, we describe the challenges encountered during implementation, the solutions to them, and application examples within Ctrip, in the hope of providing practical guidance for your own work.

3. Technical challenges

3.1 Traffic collection in cross-thread and asynchronous scenarios

Traffic recording must collect every node on the call chain of a service request: not only the request and response at the main entry point, but also those of calls into frameworks such as Mybatis, Redis, and Dubbo. However, many of the company's projects use thread pools and asynchronous programming. Within a single request, for example, the main flow forks many subtasks or threads that work in parallel: some query Redis, some call RPC interfaces, some operate on databases, and so on. Underneath, this involves a large amount of thread switching.

We therefore need to ensure that operations performed in different threads are all collected under the same request. We solve this with trace propagation: by decorating the various thread pools and asynchronous frameworks, a recordId is passed between threads to link the operations together into one complete recorded use case. This covers CompletableFuture, ThreadPoolExecutor, and ForkJoinPool in Java, the thread pools used by third-party components such as Tomcat, Jetty, and Netty, and asynchronous frameworks such as Reactor and RxJava.
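The idea of passing a recordId between threads can be illustrated with a minimal sketch. It assumes a hypothetical TraceContext holder and a wrap helper that a decorated thread pool would apply to each submitted task; these names are illustrative and are not taken from the AREX code base.

```java
final class TraceContext {
    // Hypothetical per-thread holder for the recordId of the request being recorded.
    private static final ThreadLocal<String> RECORD_ID = new ThreadLocal<>();

    static void set(String recordId) { RECORD_ID.set(recordId); }
    static String get() { return RECORD_ID.get(); }
    static void clear() { RECORD_ID.remove(); }

    // Captures the recordId of the submitting thread and makes it visible
    // inside the worker thread for the duration of the task.
    static Runnable wrap(Runnable task) {
        final String captured = get();
        return () -> {
            String previous = get();
            set(captured);
            try {
                task.run();
            } finally {
                // Restore the worker's earlier state so pooled threads do not leak ids.
                if (previous == null) {
                    clear();
                } else {
                    set(previous);
                }
            }
        };
    }
}
```

A decorated executor would call TraceContext.wrap on every task before handing it to the underlying pool, so that a Redis query or RPC call made in the worker thread can be attributed to the original request.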

3.2 Replaying non-idempotent interfaces without generating dirty data

In key scenarios such as placing orders or calling third-party payment APIs, traffic playback must use mocks rather than perform real data interactions. This prevents the test from generating unwanted data and disrupting normal business processes. The core mechanism of playback is to intercept framework calls and mock them, returning the recorded data instead of issuing the real request. Because no real external interaction takes place during the test, such as a database write or a third-party service call, no dirty data is written during playback.

At present, our Java Agent supports open-source frameworks such as Spring, Dubbo, Redis, and Mybatis; please refer to the full list below.
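The intercept-and-mock mechanism can be sketched as follows. This is a simplified illustration under stated assumptions: MockingInvoker, its Mode enum, and the string key are hypothetical names, and an in-memory map stands in for the recorded data store. AREX itself intercepts framework calls through bytecode instrumentation rather than through an explicit wrapper like this.

```java
import java.util.Map;
import java.util.function.Supplier;

final class MockingInvoker {
    enum Mode { RECORD, REPLAY }

    private final Mode mode;
    private final Map<String, Object> recorded; // stand-in for the recorded data store

    MockingInvoker(Mode mode, Map<String, Object> recorded) {
        this.mode = mode;
        this.recorded = recorded;
    }

    // In RECORD mode the real call runs and its result is stored.
    // In REPLAY mode the real call is never executed; the stored result is returned.
    @SuppressWarnings("unchecked")
    <T> T invoke(String key, Supplier<T> realCall) {
        if (mode == Mode.REPLAY) {
            if (!recorded.containsKey(key)) {
                throw new IllegalStateException("No recorded response for " + key);
            }
            return (T) recorded.get(key);
        }
        T result = realCall.get();
        recorded.put(key, result);
        return result;
    }
}
```

Because the supplier is skipped entirely in REPLAY mode, a payment call or database write wrapped this way has no side effects during playback.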

[Figure: full list of frameworks supported by the AREX Java Agent]

3.3 Playback failure caused by login authentication and token expiration issues

During actual traffic playback we often hit the following problem: many web applications perform login authentication before an interface can be accessed. If authentication fails or the recorded token has expired, API access is denied, which can cause a large number of use cases to fail during playback. Some of these cases can be handled by configuring a whitelist, but we wanted a more general solution.

The ideal solution is to mock authentication frameworks such as Spring Security, Apache Shiro, and JWT during the playback process to bypass the authentication and token verification steps and ensure that the interface can be executed normally in the playback environment.

3.4 Playback of time-sensitive logic, such as payment timeouts

If the current time differs between recording and playback, timeout logic may behave differently in unexpected ways. For example, to decide whether an order has gone unpaid past its time limit, we typically check currentTime - orderCreateTime > 30 minutes. If the order had not timed out at recording time but is played back half an hour later, the change in system time may incorrectly trigger the payment-timeout logic.

To solve this problem, we capture the current time once during recording. During playback, we mock the classes related to the current time, such as Date, Calendar, LocalTime, and joda.time, so that the current time used during playback is the time captured at recording. This keeps time-related logic consistent between playback and recording, and so keeps the test results accurate and reliable.
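The effect of replaying with the recorded time can be shown with a small sketch. It swaps in an injectable java.time.Clock for illustration, whereas AREX mocks the time classes themselves; PaymentTimeoutCheck and replayClock are hypothetical names.

```java
import java.time.Clock;
import java.time.Duration;
import java.time.Instant;
import java.time.ZoneOffset;

final class PaymentTimeoutCheck {
    static final Duration LIMIT = Duration.ofMinutes(30);

    // Equivalent to: currentTime - orderCreateTime > 30 minutes
    static boolean isTimedOut(Instant orderCreateTime, Clock clock) {
        return Duration.between(orderCreateTime, clock.instant()).compareTo(LIMIT) > 0;
    }

    // During playback, "now" is pinned to the instant captured at recording time.
    static Clock replayClock(Instant recordedNow) {
        return Clock.fixed(recordedNow, ZoneOffset.UTC);
    }
}
```

With an order created at 10:00 and a recording made at 10:10, a playback run at 11:00 would report a timeout if it used the real clock, but reports no timeout when it uses replayClock with the recorded instant.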

3.5 Local Cache Issues

To improve performance, applications commonly keep frequently used data in a local cache for quick access. In traffic recording and playback, however, cache behavior can affect the playback results.

During recording, if the requested data is already cached, the system serves it directly from the cache without querying the database or an external interface. In the playback environment, the cache has not been pre-loaded, so the same request may cause the application to query the database or call an external interface. This produces a call that was never recorded, and playback fails.

To address this, we support popular caching frameworks such as Guava Cache and Caffeine Cache, so that caching behavior is simulated consistently during playback. Requests that hit the cache then return the same result as at recording time, avoiding unexpected new calls.

For applications that use custom caching frameworks, the AREX platform provides flexible configuration options that allow dynamic adaptation. Even non-standard caching implementations can therefore work with AREX and play back traffic correctly.

The solutions above are supported by default and generally need no additional handling. If a framework developed in-house also needs to be recorded and played back, support can be added in the form of plug-ins.

4. Adoption challenges

4.1 Simple installation and deployment, quick onboarding, and low access cost

AREX is a complete solution: in addition to the basic recording and playback functions, it includes supporting services such as the front end, scheduling, report analysis, and storage. To make it work out of the box and quick to adopt, AREX offers several deployment methods: one-click deployment, non-container deployment, and private cloud deployment.


AREX also supports a stand-alone mode, which can be used quickly without local installation.

4.2 Comply with the company's risk control and data security requirements

When recording real production traffic that involves security-related or commercially sensitive data, masking rules must be applied to transform the sensitive fields so that private data is reliably protected.


AREX masks data at the point where it is written to the database, protecting sensitive information. The masking implementation is packaged as a plug-in JAR and loaded through the SPI mechanism, so the encryption method is loaded dynamically.
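Loading a masking plug-in through SPI can be sketched with the standard java.util.ServiceLoader. The DataMasker interface, the MaskerLoader class, and the asterisk fallback are assumptions made for illustration; the source does not describe AREX's actual plug-in interface.

```java
import java.util.ServiceLoader;

// Hypothetical SPI interface that a plug-in JAR would implement and register
// under META-INF/services so that ServiceLoader can discover it.
interface DataMasker {
    String mask(String plain);
}

final class MaskerLoader {
    // Returns the first masker found on the classpath. If no plug-in is present,
    // falls back to replacing every character with '*' (an assumed default).
    static DataMasker load() {
        for (DataMasker masker : ServiceLoader.load(DataMasker.class)) {
            return masker;
        }
        return plain -> plain == null ? null : "*".repeat(plain.length());
    }
}
```

The storage layer would call MaskerLoader.load().mask(value) on each sensitive field before writing it, so changing the encryption method only requires swapping the plug-in JAR.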


4.3 Improve user experience and quickly locate problems

To reduce the effort of analyzing differences, AREX aggregates use cases that exhibit the same differences, which speeds up troubleshooting.
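One way to picture this aggregation is to group use cases by a signature built from their differing nodes. The sketch below assumes each failed case is described by a list of differing node paths; DiffAggregator and the signature format are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class DiffAggregator {
    // Input: caseId -> paths of the nodes that differed in that case.
    // Output: difference signature -> ids of all cases sharing that signature.
    static Map<String, List<String>> groupBySignature(Map<String, List<String>> diffsByCase) {
        Map<String, List<String>> groups = new TreeMap<>();
        for (Map.Entry<String, List<String>> entry : diffsByCase.entrySet()) {
            List<String> paths = new ArrayList<>(entry.getValue());
            Collections.sort(paths); // make the signature independent of path order
            String signature = String.join("|", paths);
            groups.computeIfAbsent(signature, key -> new ArrayList<>()).add(entry.getKey());
        }
        return groups;
    }
}
```

A reviewer then inspects one representative case per signature instead of every failed case.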


Users can quickly locate the problem area through the call chain, and noisy nodes such as timestamps, UUIDs, and IP addresses are filtered out to reduce interference.

For production problems in complex services that are difficult to reproduce locally, AREX also supports local debugging for quick troubleshooting.
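Noise reduction during comparison can be sketched as a diff that skips configured keys. This assumes flat key-value responses and a caller-supplied set of noisy keys; NoiseAwareComparator is a hypothetical name, and real responses would need a recursive comparison.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.Set;
import java.util.TreeSet;

final class NoiseAwareComparator {
    // Returns the keys whose values differ between recording and playback,
    // skipping keys that are known to vary on every run.
    static List<String> diff(Map<String, Object> recorded,
                             Map<String, Object> replayed,
                             Set<String> ignoredKeys) {
        Set<String> keys = new TreeSet<>(recorded.keySet());
        keys.addAll(replayed.keySet());
        List<String> differences = new ArrayList<>();
        for (String key : keys) {
            if (ignoredKeys.contains(key)) {
                continue; // noisy node such as a timestamp, UUID, or IP address
            }
            if (!Objects.equals(recorded.get(key), replayed.get(key))) {
                differences.add(key);
            }
        }
        return differences;
    }
}
```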

4.4 Maturity, safety, and reliability of the technical solution

AREX is built on Java Agent technology and uses ByteBuddy, a mature bytecode modification framework widely used in the industry. It is secure and stable, isolates its code from the application, and has a self-protection mechanism that automatically reduces the data collection frequency, or disables collection, when the system is busy. It has been running stably within Trip.com Group for more than two years and has been fully verified in production.
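The self-protection behavior can be pictured as a sampling rate that depends on system load. The sketch below is purely illustrative: the AdaptiveSampler name, the load metric, and both thresholds are assumptions, since the source does not state how AREX measures load or which values it uses.

```java
final class AdaptiveSampler {
    // Hypothetical thresholds on a load ratio between 0.0 and 1.0.
    static final double REDUCE_ABOVE = 0.7;
    static final double DISABLE_ABOVE = 0.9;

    // Returns the fraction of requests to record for the given load.
    static double effectiveRate(double baseRate, double load) {
        if (load >= DISABLE_ABOVE) {
            return 0.0;            // system is busy: stop recording
        }
        if (load >= REDUCE_ABOVE) {
            return baseRate / 2;   // system is under pressure: record less often
        }
        return baseRate;           // normal operation
    }
}
```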

5. Best Practices

The traffic recording and playback service is currently integrated into the company's CI/CD system as a standalone option:

1) First access: When onboarding to traffic recording and playback for the first time, you only need to select the Flight AREX Agent service in the CI Pipeline. The AREX startup script arex-agent.sh is then included in the release package when the application is packaged.

2) Release and agent loading: During application release, the startup script runs, pulls the latest arex-agent.jar, and mounts the AREX Agent by adding -javaagent:/arex-agent.jar to the JVM options.

3) Version control and grayscale release: The startup script pulls the arex-agent.jar version that matches the application's AppId. This enables grayscale release and on-demand loading; for example, only certain applications load the agent's new features.


Setting up playback for the first time is equally simple:

1) Create a pipeline: In GitLab or Jenkins, create a pipeline that calls the playback address provided by AREX in the ArexTest Job script, and run the pipeline on a schedule.


2) Automatic playback trigger: Traffic playback is triggered automatically after developers submit code.

3) Result push and release control: After playback completes, AREX pushes metrics such as the number of playback cases, pass rate, and failure rate to the relevant people for statistics and analysis. The code is allowed to be released to production only when the pass rate reaches the predefined standard.
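The release control in step 3 amounts to a threshold check on the pass rate. A minimal sketch follows, assuming a hypothetical ReleaseGate class; the required pass rate is supplied by the caller, as the source does not state the value Ctrip uses.

```java
final class ReleaseGate {
    // Pass rate as a fraction between 0.0 and 1.0.
    static double passRate(int passedCases, int totalCases) {
        if (totalCases <= 0) {
            throw new IllegalArgumentException("totalCases must be positive");
        }
        return (double) passedCases / totalCases;
    }

    // A release is allowed only if cases were replayed and enough of them passed.
    static boolean allowRelease(int passedCases, int totalCases, double requiredPassRate) {
        return totalCases > 0 && passRate(passedCases, totalCases) >= requiredPassRate;
    }
}
```

A pipeline job could call allowRelease with the counts reported by the playback run and fail the job when it returns false.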

The following figure shows how the AREX traffic recording and playback platform plays a role in all aspects of the company's R&D, testing and release, for your reference:

[Figure: role of the AREX traffic recording and playback platform across the company's R&D, testing, and release process]

In each iteration, tests run automatically after code is submitted and the test report is fed back, so developers and testers only need to focus on developing and verifying new features, free from maintaining cumbersome test data and scripts. Traffic replay supports targeted optimization at multiple stages throughout the software R&D life cycle, forming a closed loop of automated testing and continuous integration.


6. Results

Through continuous iteration and optimization against Trip.com Group's complex business scenarios, more than 4,000 applications have been connected to AREX, and both the delivery rate and the number of defects have improved.


7. Embrace open source

After long-term stable operation and verified reliability within Ctrip, we open-sourced the AREX platform (https://github.com/arextest) in 2023, hoping to help more enterprises adopt traffic recording and playback efficiently and cost-effectively.

Over the past year, we have been committed to building an open source community, and thousands of external users have connected to AREX.

AREX's vision is to ensure quality, reduce costs, and improve efficiency while requirements iterate rapidly. This vision has been proven in practice at Ctrip and by many open source users, and has brought significant business value.

Looking ahead, we will continue to rely on an active community to answer users' questions and to keep optimizing AREX. We sincerely invite every developer to join the community, try AREX, and witness its growth and progress.

About the Author

Ctrip's AREX team, the Ticket Quality Engineering Group, is responsible for developing automated testing tools and technologies to improve quality and efficiency.

Source-WeChat public account: Ctrip Technology

Source: https://mp.weixin.qq.com/s/q-j_pFObn4yw8WzxKE8-jg