
Experimental management refactoring and DDD practice of Volcano Engine A/B test platform


This article shares the upgrade of the experiment management architecture and the DDD practice of DataTester, the A/B testing platform under Volcano Engine's digital intelligence platform VeDI. One point up front: the first goal of code is to meet product requirements, and code that meets those requirements is good code. The code-quality assessment in this article is made purely from an architectural perspective, considering readability, maintainability, and extensibility.


By Wang Yanxin, Volcano Engine DataTester team

In the development of a product or code repository, if code quality is not controlled, no principles or specifications are introduced as constraints, and no corrective measures are taken in time, then development will generally follow the trajectory shown in the figure below.

[figure]
  • Early stage

In the project's early iterations, development is fast: a requirement can go through development, testing, and launch within a week, and development efficiency stays high. At this point, everything is still in order.

  • Middle stage

As features iterate, linkage and reuse logic emerges between modules and between functions; without refactoring, it slowly turns into technical debt. Add to this the growth and turnover of personnel: newcomers may not understand the original design ideas and cannot grasp a feature just by reading the code. Cognitive load begins to rise, and the team gradually finds that although more manpower has been invested, R&D efficiency keeps slowing down. The system's disorder starts to grow.

  • Late stage

Although efficiency has dropped, feature iteration continues. But even a small requirement that should be solvable in a day now involves changes in multiple places, and it is uncertain how many changes are needed to keep the system running correctly. By this point the cognitive load of the whole system is overloaded: it is no longer enough to write good code; one must also clearly understand the logic of the historical code, or a moment of carelessness introduces oncalls or complaints. As oncalls increase, they eat into R&D manpower, further reducing R&D efficiency and demanding extra time to repay technical debt. The system has become very chaotic and is on the verge of disorder.

  • Final stage

As the chaos worsens, the team's delivery capacity approaches zero; it can only maintain existing features and can hardly develop and launch new requirements in a short time. Product and technical development come to a standstill, and efficiency drops to almost zero. The system has become a complete mess.


In the early days of the DataTester project, effort estimates were largely accurate because requirements were simple and straightforward. But as the product scaled and its scenarios grew more complex, feature development clearly came to involve more and more dependencies and considerations.

The following figure briefly shows the relationship between functional modules and the system's growing entropy. From the initial programmatic experiments, to visual and multi-link experiments, to parent-child experiments and push experiments, and finally to the merger of the internal and external versions, the complexity of the whole system keeps rising; without timely measures, subsequent maintenance and extension would consume enormous manpower.

[figure]

Looking back at the history of software engineering, object orientation, microservices, and the various domain models all represent different strategies for coping with system complexity. As Professor John Ousterhout puts it in A Philosophy of Software Design, complexity is whatever makes software hard to understand and modify, and the history of software technology is a history of the struggle against complexity.


So what exactly is complexity? In his book, Ousterhout defines it as the factors that make software difficult to understand and modify. Complex systems usually exhibit three distinct symptoms, which he abstracts as follows:

  1. Change amplification: a seemingly simple change requires code modifications in many different places. Often developers did not refactor or extract common logic in time, instead copying and pasting code to save time and avoid touching stable modules; when requirements change, the same change then has to be made everywhere.
  2. Cognitive load: the system is expensive to learn and understand, which reduces developer productivity. High cognitive load means developers must spend more time and effort understanding how the system is structured and how it works.
  3. Unknown unknowns: developers do not know which code must be modified to make the system work, or whether a change will cause online problems. This is one of the most vexing manifestations of complexity, because it introduces uncertainty and risk.

The causes of complexity boil down to two: dependencies and obscurity. Too many external dependencies amplify the impact of changes and increase cognitive load, while obscurity of information increases the unknown unknowns. These in turn raise the system's complexity, which accelerates its "decay".


A system's slide from order toward disorder is inevitable, so can we only watch the code rot and do nothing?

Fortunately, the answer is no. Software engineering has been developing for more than 60 years; whatever problems we encounter, our predecessors have encountered before, and there are ample theories and methods to fight a system's gradual descent into chaos. As shown in the figure below, although the growth of system complexity is inevitable, timely refactoring can slow the pace of decay.

[figure]

Over time, DataTester's development has gone through multiple phases, each with its own technologies, methods, and challenges, and each with its own primary and secondary contradictions.

As a team develops, its organizational structure must also be adjusted in time to adapt to new environments and new challenges; change is the only constant. Much like team management, timely refactoring is critical in this ever-changing environment.

Refactoring refers to the process of adjusting and optimizing the internal structure of code without changing the external behavior of the software, with the aim of improving the readability, maintainability, and performance of the code. At different stages, refactoring has its own unique meaning and value.


For example, in DataTester's early iterations the goal was to launch features as quickly as possible to improve product competitiveness, so business iteration rightly took priority. As feedback and demand grew, more and more new features were launched.

No one can predict which features will be added or which business scenarios will appear, so if the code and architecture are not adjusted in step with product iteration, the slide into chaos inevitably accelerates.

Product delivery is evaluated along three dimensions: manpower, time, and quality, where time usually means "can it be delivered on schedule". Developing and launching a feature requires the cooperation of PM/BE/FE/UX/QA; here we focus mainly on problems seen from the BE perspective. Work is scheduled every two weeks, but schedules are hard to estimate accurately.

The causes of the problem can be divided into the following categories:

  • The PRD is not described thoroughly enough, and back-and-forth discussion quietly lengthens the development cycle
  • The technical solution is not considered rigorously enough, and some compatibility and adaptation issues are overlooked
  • Historical baggage means developing a new feature requires adaptations and adjustments in many places and can affect other features

The third problem means the "bad smells" in the code are already serious, so it is no surprise that estimated and actual workloads diverge widely. If the developers involved lack a deep understanding of the original functionality, the result is easy to imagine. In the optimistic case, developing a new function only requires work within its own module, which demands very good code encapsulation and isolation.


If refactoring is so important, why is it not taken seriously or executed in time? The common causes can be grouped into three categories:

It's not that I don't want to do it, it's that I don't know how to do it

  • The code has decayed badly, and there are no accumulated standards or guidance to follow
  • Staff turnover means the original design intent cannot be passed on

It's not that I don't want to do it, it's that everyone else does it

  • Services are seriously coupled and cannot be encapsulated or isolated

It's not that I don't want to do it, it's that I don't have time to do it

  • Lack of a long-term perspective; refactoring is seen as a waste of time that does not help the business
  • Refactoring shows no significant short-term benefit on the business side
  • Code quality is not taken seriously

As chaos grows, team productivity keeps falling toward zero. When productivity drops, management often does only one thing: add more people to the project in the hope of raising it. For tasks that can be decomposed, adding manpower can indeed shorten delivery time and improve efficiency.

However, in a complex system, newcomers are unfamiliar with the design; they do not know which modifications fit the design intent and which run against it. Moreover, they, like the rest of the team, are under terrible pressure to be productive. So they create more chaos and drive productivity further toward zero.

So adding manpower can improve overall progress and efficiency under certain conditions, but not absolutely, and especially not in a chaotic system.


There is a well-known picture on the Internet, shown below: the only valid measurement of code quality is the number of WTFs per minute uttered in the code review room.

The only valid measurement of code quality: WTFs/min.
[figure]

Of course, code quality should be considered from many angles: scalability, maintainability, testability, readability. And while there are many rules and specifications, over-engineering is not necessarily good code either; balancing business needs against specifications is itself an "art".

Code is the product of thinking, and different developers think differently, so good principles and specifications are needed as constraints. 道法术器 (Dao, Fa, Shu, Qi; roughly: the Way, laws, techniques, and tools) is a concept from ancient Chinese philosophy, often used to describe the basic principles and laws of the universe and of life. Can it also guide software development?

For the architecture design of software, we can also think about the following four levels, from top to bottom:

[figure]
  • Dao (the Way)

"Dao" refers to the fundamental principle of all things, the laws by which things operate, independent of anyone's subjective will. The law of entropy increase can be read as such a law: the structure of things inevitably decays over time. Entropy increase cannot be avoided, just as birth, aging, sickness, and death cannot be avoided, but there are means by which we can delay the arrival of "ultimate disorder".

  • Fa (law)

"Fa" refers to governing laws and methodologies, which in code development map to classical principles and ideas. After more than 60 years, software engineering has accumulated many guiding methodologies: the SOLID principles, the various design patterns, and the simple architectural ideas of abstraction, encapsulation, and isolation. These methodologies help us refactor code in time and reduce system complexity.

  • Shu (technique)

"Shu" refers to skills, techniques, and practical methods. In software development this means programming techniques, the use of frameworks, code architecture, and so on. Methodologies are ideological guidance that different people may interpret differently; putting them into practice requires concrete business frameworks and programming paradigms, such as domain-driven design, MVC architecture, dependency injection, and object orientation. These help us layer code properly and apply dependency inversion, yielding business code with high cohesion and low coupling.

  • Qi (tool)

"Qi" refers to the tools and resources used to practice and apply the levels above. In software development, tools include development tools, version control systems, automated testing tools, and so on: a microservice architecture achieves better functional isolation, while unit testing and CI/CD accelerate feature iteration and system refactoring.

Both the methodology and the tooling levels are now very mature. Thinking a little more while writing code, and refactoring business code appropriately once a feature is complete, should become a habit.

Below are some actual problems in the project: problematic from a specification and architecture perspective, yet good code from a business-needs perspective. Everyone is aware of the latent problems in this code ("it's not that I don't want to do it, it's that everyone else does it"), but a change to one module ripples through the whole system, new business keeps arriving, and no simple change can finish the job, so the "bad smell" only slowly worsens.

/ No Business Layering /

At present, the Python back-end code has no layering at all; it is standard procedural code, where a single function may run to hundreds or thousands of lines with all functionality stacked inside. Although some helper functions have been split out, the whole remains one procedural flow, with little to no encapsulation or isolation of business logic.

/ Looped / Repeated Queries /

Django's ORM in KOI makes external data very convenient to fetch, but that also causes external calls to proliferate. For example, several functions may all need Application data, yet only an app_id is passed around, so each of them is likely to issue another query; this is very common in KOI. And because Django's wrapping makes it easy to forget that a query is an external call, it is easy to end up querying the database inside a loop.
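A minimal Go sketch of this anti-pattern and its fix; the AppRepo interface, Application type, and method names are hypothetical, for illustration only:

type Application struct{ ID int64 }

type AppRepo interface {
    GetApplication(ctx context.Context, id int64) (*Application, error)
    ListApplications(ctx context.Context, ids []int64) ([]*Application, error)
}

// Anti-pattern: one hidden external call per loop iteration.
func loadAppsOneByOne(ctx context.Context, repo AppRepo, ids []int64) ([]*Application, error) {
    var apps []*Application
    for _, id := range ids {
        app, err := repo.GetApplication(ctx, id) // N round trips for N ids
        if err != nil {
            return nil, err
        }
        apps = append(apps, app)
    }
    return apps, nil
}

// Fix: batch the lookup into a single external call.
func loadAppsBatched(ctx context.Context, repo AppRepo, ids []int64) ([]*Application, error) {
    return repo.ListApplications(ctx, ids) // one round trip
}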

/ Redundant / Scattered Logic /

Different check functions are piled together, which makes it hard to unit-test any single check and hard to reuse one, so many check logics end up rewritten in multiple places, producing logical redundancy.
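As a hedged illustration of the alternative, each check can become a small, single-purpose Go function that is trivially unit-testable and reusable (both checks and their rules are hypothetical examples, not the platform's actual validations):

// CheckNameLength enforces a maximum name length.
func CheckNameLength(name string, max int) error {
    if len([]rune(name)) > max {
        return fmt.Errorf("name exceeds %d characters", max)
    }
    return nil
}

// CheckTrafficSum validates that version traffic ratios add up to 100%.
func CheckTrafficSum(ratios []int) error {
    sum := 0
    for _, r := range ratios {
        sum += r
    }
    if sum != 100 {
        return fmt.Errorf("traffic ratios must sum to 100, got %d", sum)
    }
    return nil
}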

/ Complex Function Responsibilities /

Continuing the example above, this verification function also contains business logic and data transformation, which makes subsequent changes harder to maintain and test. Data validation and business logic should be separated to ease later extension and testing.

/ Insufficient Abstraction /

Insufficient abstraction shows up as different entities performing similar operations without a unified encapsulation of the operation. For example, the start interface in the code below involves start operations for many experiment types, each inserting its own logic via if-else. With reasonable abstraction, each experiment type would implement a common start interface, the main business flow would contain no type-specific handling, and business isolation would be much better. Complex functionality is best hidden behind simple interfaces.
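A minimal Go sketch of the shape this abstraction could take (the interface and type names are hypothetical, not the platform's actual API):

// Each experiment type implements Starter; the main flow never branches on type.
type Starter interface {
    Start(ctx context.Context) error
}

type ProgrammingExperiment struct{ /* fields omitted */ }

func (e *ProgrammingExperiment) Start(ctx context.Context) error {
    // programming-experiment-specific start logic
    return nil
}

type VisualExperiment struct{ /* fields omitted */ }

func (e *VisualExperiment) Start(ctx context.Context) error {
    // visual-experiment-specific start logic
    return nil
}

// The main business process depends only on the abstraction.
func StartExperiment(ctx context.Context, exp Starter) error {
    return exp.Start(ctx)
}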

/ Severe Coupling /

At present, external calls are completely interleaved with business logic (no example needed here), so the business logic depends heavily on the outside world: a change to an interface or a field can break it. External dependencies should serve the business logic, and changes to them should not affect it; that is what dependency inversion achieves.
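A hedged sketch of the dependency-inversion direction in Go: the business logic owns an interface, and the external call becomes an implementation detail behind it (all names are hypothetical):

// The domain owns the interface; the repo layer provides the implementation.
type Metric struct {
    ID   int64
    Name string
}

type MetricFetcher interface {
    GetMetric(ctx context.Context, id int64) (*Metric, error)
}

// Business logic depends only on the interface; how metrics are fetched
// (RPC, database, cache) can change without touching this code.
func ValidateMetric(ctx context.Context, f MetricFetcher, id int64) error {
    m, err := f.GetMetric(ctx, id)
    if err != nil {
        return err
    }
    if m.Name == "" {
        return errors.New("metric has no name")
    }
    return nil
}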

Refactoring solves the problems above. Based on existing business scenarios, the experiment management module was restructured to build business code with high cohesion and low coupling, improving readability, maintainability, extensibility, and testability.

Improved extensibility greatly reduces the development time of subsequent internal features; encapsulation enables code reuse; isolation reduces interactions between functions and lowers the probability of bugs; dependency inversion and separation of concerns achieve high cohesion and low coupling in business code.

Through the architecture refactoring, the team's overall awareness of architecture and specifications improves, the overall technical level rises, and the team grows stronger.


Next we introduce the concrete work of the experiment management refactoring. DDD mainly answers where the code should be written; the concrete implementation details still depend on the specific business scenarios. This section describes the refactoring in terms of data structure definition, business validation, business logic, and domain object construction.

/ Module Overview /

The product's existing and upcoming functions are summarized in the following figure.

[figure]

/ Domain Modeling /

Based on DataTester's experiment functionality, the experiment domain can be subdivided into four domain modules: logs, experiments, experiment layer management, and workflow. Layer management and workflow are taken over by other service modules, so within the experiment repository the log module and the experiment core module need to be refactored and improved. However, layer management has only been adapted for the internal deployment, not yet the external one, so in this refactoring the layer-related logic is abstracted out to ease integration once the internal and external versions are unified.

  • Log domain

The log domain mainly exposes interfaces for fetching operation logs, and internally provides change tracking for domain objects, generating operation log records in the required format. Concretely, logs consist of two parts: the operation log and the global operation history. In addition, the change-tracking capability provided by the ChangeLog module is expected to optimize database operations by eliminating unnecessary save and update operations.
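A minimal sketch of what such change tracking might look like in Go (the structure and field names are assumptions for illustration): record old and new values per field, and skip persistence when nothing changed.

// FieldChange records one tracked modification on a domain object.
type FieldChange struct {
    Field    string
    OldValue interface{}
    NewValue interface{}
}

type ChangeLog struct {
    changes []FieldChange
}

// Track records a change only when the value actually differs.
func (c *ChangeLog) Track(field string, oldV, newV interface{}) {
    if reflect.DeepEqual(oldV, newV) {
        return
    }
    c.changes = append(c.changes, FieldChange{Field: field, OldValue: oldV, NewValue: newV})
}

// Dirty tells the repository whether any save/update is needed at all.
func (c *ChangeLog) Dirty() bool { return len(c.changes) > 0 }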

  • Experiment domain

The business logic of the experiment domain is more complex than the log domain's. Following the principles of extensibility and reusability, experiment functionality is divided into three parts: BaseExperiment, ExperimentExtension, and ExperimentPlugin. The module split is the result of a constant trade-off between isolation and reuse, that is, of the combined pull of the DRY principle and the open-closed principle. BaseExperiment is the most basic module, while ExperimentExtension and ExperimentPlugin are the main extension points. The UML class diagram shows that under this rich domain model the entities carry abundant business methods, and the model is highly self-expressive.

[figure]
  • BaseExperiment

As the name suggests, BaseExperiment covers the most basic and general capabilities of experiments. Besides common operations on experiments and versions, it includes experiment collections, special handling of demo experiments, email notifications, version management, metric management, and target audiences.

  • Version management

Version management, the core module of an experiment, can also be regarded as a basic capability, because an experiment is in essence the management of a set of differentiated configurations whose effects are observed against live traffic. Functions related to experiment versions are implemented or extended in this entity, such as disabling a single experiment group or editing page information in a visual experiment. The whitelist is closely tied to versions, so whitelist handling is placed under the version management module for unified processing. In addition, special Pangolin scenarios need to load preset version configurations, and such scenarios must be designed with generality in mind.

  • Experiment layer

Management of the experiment layer will be moved into the experiment repository for maintenance in the future. At present, the experiment contains only some simple operations and checks on the layer.

  • Metric management

Metric management, like version management, is an important module in the experiment process. It mainly handles the creation, deletion, modification, and querying of metrics and the maintenance of their associations, such as the association between experiments and metrics, or between experiments and metric groups.

  • Target audience

The target audience is the filter condition in the experiment module, used to configure routing for requesting users. Since it involves a lot of business logic, a separate TargetRule entity was extracted to handle it. In the future it will also be responsible for the backend parameter conversion of filter conditions and the creation of associated conditions, such as the association between filter conditions and cohorts, and server-side filter parameters.

/ Business Process /

Summarizing the main experiment flow reveals that any experiment operation can be abstracted into three steps: data validation, specific logical processing, and data persistence. This makes an extensible, pluggable code architecture possible. The main flow of creating an experiment is shown in the figure below, roughly divided by function type into three parts: validator, process, and save.

  • The validator validates the data and returns an error immediately on any mismatch.
  • process handles the business logic, including data transformation and building the aggregate root, and returns an error when something goes wrong.
  • save is the final persistence step; if persistence fails, an error is returned and the transaction is rolled back.
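Put together, the three-step flow could be sketched as follows in Go, with stub types and functions standing in for the real request and aggregate (names are illustrative, not the actual service code):

type CreateReq struct{ /* request fields */ }
type Experiment struct{ /* aggregate root */ }

func validate(ctx context.Context, req *CreateReq) error { return nil } // validator phase
func process(ctx context.Context, req *CreateReq) (*Experiment, error) { // build aggregate, transform data
    return &Experiment{}, nil
}
func save(ctx context.Context, exp *Experiment) error { return nil } // persistence inside a transaction

// CreateExperiment chains the three phases; each phase short-circuits on error,
// and a save error causes the surrounding transaction to roll back.
func CreateExperiment(ctx context.Context, req *CreateReq) (*Experiment, error) {
    if err := validate(ctx, req); err != nil {
        return nil, err
    }
    exp, err := process(ctx, req)
    if err != nil {
        return nil, err
    }
    if err := save(ctx, exp); err != nil {
        return nil, err
    }
    return exp, nil
}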

In the system layering, the main responsibilities of each module are roughly as shown in the diagram below. Each layer completes its business logic in its own way; the example code shown there is the concrete implementation of the domain service behind the experiment start interface.

[figure]
  • Removing step dependencies

In the experiment creation interaction, meta information is completed over several steps, and in the fourth step the experiment is converted from draft to debug. From the perspective of RESTful conventions, however, resource creation, update, and partial update (state modification) should be implemented as different operations, and binding creation actions to the steps of the experiment wizard severely limits extensibility. Some resource creation and update operations have already been decoupled from steps in the KONI code, but others remain strongly bound to them. This refactoring of the experiment RPC service abandons step dependencies entirely and decouples the service from steps completely.

As shown in the figure below, all fields can be created at once, or the overall data can be assembled across multiple steps, with data integrity verified before debugging.

[figure]

To achieve this, the field types in the experiment IDL must be adjusted: all fields except ID become optional, so that the service can tell which fields a given API call intends to update.
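In Go terms, optional IDL fields typically surface as pointer fields, letting the service distinguish "not provided" from a zero value. A hedged sketch with illustrative field names (not the actual IDL):

// All fields except ID are optional pointers; nil means "not set in this call".
type UpdateExperimentReq struct {
    ID          int64
    Name        *string
    Description *string
    Duration    *int32
}

type Experiment struct {
    ID          int64
    Name        string
    Description string
    Duration    int32
}

// applyUpdate touches only the fields the caller actually provided.
func applyUpdate(exp *Experiment, req *UpdateExperimentReq) {
    if req.Name != nil {
        exp.Name = *req.Name
    }
    if req.Description != nil {
        exp.Description = *req.Description
    }
    if req.Duration != nil {
        exp.Duration = *req.Duration
    }
}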

  • Automatic data validation

Data validation plays an important role in business logic code, determining whether the subsequent business logic can run correctly at all. By business logic and scenario, parameter validation can be divided into four parts: field validation, dependency validation, functional validation, and logical validation.

  • Field validation: common checks, such as whether the experiment name exceeds the length limit or whether the experiment type is legal.
  • Dependency validation: as the name suggests, checks that depend on other modules in the business logic, such as verifying that referenced metrics are valid.
  • Functional validation: for example, whether the user has permission on a resource, or whether configurations within the experiment conflict.
  • Logical validation: specific business logic, such as checking that a child experiment's start time and the parent experiment's end time form a valid time range.

First, consider how data validation is usually implemented; the figure below shows the validation method in the old ToB feature-flag code. When a request contains incomplete entities or value objects, there are many judgments about whether particular fields are set; moreover, the checks needed for creation and for update are handled separately and cannot be reused. This style works, but a newly added field may need synchronized changes in both the create and update validation functions, and any omission causes problems. We therefore need a mechanism that validates on demand, automatically assembling the validation according to which fields are set.

[figure]

Normally, validation logic is written ahead of the formal business logic; but since the data has dependencies and validation requires a complete domain model, the DataTester refactoring makes data validation part of the aggregate. To fully decouple business logic from data validation and better support later extension, the automatic validation mechanism is implemented through a Validator object.

Later, we will optimize according to actual conditions and split out the parameter checks that need no external calls, so that failures can be returned earlier.

As mentioned above, although creating a draft experiment is split into N steps, the business code should avoid binding validation to those steps as much as possible. A more general approach is used instead: if a parameter is set in the request, it is validated (all IDL fields except id are now optional). Such validation covers only the corresponding module's data and does not involve cross-step data linkage; custom business logic validations can be registered in the validation methods of the respective operations.

Besides each module's data passing validation, the cross-module dependencies must be valid and the relevant conflict checks must pass before an experiment can be enabled; editing an experiment likewise requires a full plausibility check. The experiment also has many state transition interfaces: debug, start, freeze, pause, and so on. The checks each operation needs may differ, but the set of checks per operation is fixed, so registering check functions on demand separates function definition from function use.

In the concrete implementation, the Validator class isolates function definition from function invocation, avoiding the poor readability, extensibility, and testability caused by coupling all the checks inside one function, as the code does today.
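The Validator mechanism might be sketched as follows, a simplified assumption of the real implementation (check names are hypothetical; Experiment is the aggregate type from the earlier sketch): checks are defined once, then registered per operation.

// CheckFunc is one registered validation step.
type CheckFunc func(ctx context.Context, exp *Experiment) error

// Validator holds the checks registered for one operation (start, freeze, pause, ...).
type Validator struct {
    checks []CheckFunc
}

func (v *Validator) Register(fns ...CheckFunc) *Validator {
    v.checks = append(v.checks, fns...)
    return v
}

// Run invokes the registered checks in order and fails fast on the first violation.
func (v *Validator) Run(ctx context.Context, exp *Experiment) error {
    for _, fn := range v.checks {
        if err := fn(ctx, exp); err != nil {
            return err
        }
    }
    return nil
}

// Hypothetical checks; their definition is separate from registration and invocation.
func checkTraffic(ctx context.Context, exp *Experiment) error       { return nil }
func checkMetrics(ctx context.Context, exp *Experiment) error       { return nil }
func checkLayerConflict(ctx context.Context, exp *Experiment) error { return nil }

// The start operation registers exactly the checks it needs.
func startValidator() *Validator {
    return new(Validator).Register(checkTraffic, checkMetrics, checkLayerConflict)
}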

  • Build aggregates on demand

Different requests have different requirements for aggregate construction: freezing an experiment needs only the basic experiment information, while stopping an experiment must also consider the versions and the parent-child experiment information.

[figure]

The modules to build are configured through a JSON-like tree structure, and each new module, or sub-element of a module, must be implemented in the constructor. Below is the Go configuration used for starting an experiment, followed by its equivalent JSON representation.

func GetModuleConfigMapForExpStart() map[string]interface{} {
    return map[string]interface{}{
       constant.ApplicationModule: map[string]interface{}{
          constant.ModuleList: []string{
             constant.ApplicationInfoModule,
          },
       },
       constant.ExperimentModule: map[string]interface{}{
          constant.ModuleList: []string{
             constant.BaseExperimentModule, constant.VersionModule, constant.RunningExpListAtSameLayerModule},
       },
       constant.ExtensionModule: map[string]interface{}{
          constant.ParentChildModule: map[string]interface{}{
             constant.ParentExperimentModule: map[string]interface{}{
                constant.ModuleList: []string{
                   constant.ParentBaseExperimentModule},
             },
          },
          constant.IntelligentModule: map[string]interface{}{
             constant.ModuleList: []string{
                constant.IntelligentTrafficMapModule},
          },
       },
       constant.PluginModule: map[string]interface{}{
          constant.RolloutModule: map[string]interface{}{
             constant.ModuleList: []string{
                constant.RolloutModule,
             },
          },
          constant.ExperimentWorkflowModule: map[string]interface{}{
             constant.ModuleList: []string{
                constant.ExperimentWorkflowModule,
             },
          },
       },
    }
}           
{
    "Application":{
        "ModuleList":[
            "ApplicationInfo"
        ]
    },
    "Experiment":{
        "ModuleList":[
            "BaseExperimentEntity",
            "Version",
            "RunningExpListAtSameLayer"
        ]
    },
    "Extension":{
        "Intelligent":{
            "ModuleList":[
                "TrafficMap"
            ]
        },
        "ParentChild":{
            "ParentExperiment":{
                "ModuleList":[
                    "ParentBaseExperiment"
                ]
            }
        }
    },
    "Plugin":{
        "ExperimentWorkflow":{
            "ModuleList":[
                "ExperimentWorkflow"
            ]
        },
        "Rollout":{
            "ModuleList":[
                "Rollout"
            ]
        }
    }
}           

As shown above, the domain object built here consists of four parts: application information, experiment basics, extensions, and plugins. The application information is mainly the commonly used app_id and product_id. The experiment basics construct the version information, the traffic information of the layer the experiment sits on, and the list of experiments running on the same layer. The extension part needs the parent experiment's basic, version, and traffic-map information for parent-child experiments, and the layer's traffic map for intelligent experiments. The plugin module fetches the rollout and experiment workflow information. Each of these keys needs a corresponding implementation. Specifically, to unify function signatures, the repo implementations are wrapped as closures, as in the following example of building the extension module:

type BaseFactory struct {
    repoImpl   domain.IRepository
    baseExp    *base.Experiment
    moduleTree interface{}
    *builder.ParentChildBuilder
    *builder.IntelligentBuilder
    *builder.MultiRoundExperimentBuilder
    *plugin.CanaryControlBuilder
}
func NewBaseFactory(repoImpl domain.IRepository, baseExp *base.Experiment, moduleTree interface{}) *BaseFactory {
    return &BaseFactory{
       repoImpl:                    repoImpl,
       baseExp:                     baseExp,
       moduleTree:                  moduleTree,
       ParentChildBuilder:          builder.NewParentChildBuilder(baseExp, repoImpl, moduleTree),
       IntelligentBuilder:          builder.NewIntelligentBuilder(baseExp, repoImpl, moduleTree),
       MultiRoundExperimentBuilder: builder.NewMultiRoundExperimentBuilder(baseExp, repoImpl, moduleTree),
    }
}
func (b *BaseFactory) BuildExtension(ctx context.Context) (*extension.Extension, error) {
    if b.moduleTree == nil {
       return nil, nil
    }
    entityMap := make(map[string]interface{})
    for k := range b.moduleTree.(map[string]interface{}) {
       err := b.getExtensionModuleMap()[k](ctx, entityMap)
       if err != nil {
          return nil, err
       }
    }
    return extension.NewExtension(b.baseExp, entityMap), nil
}
func (b *BaseFactory) getExtensionModuleMap() map[string]func(ctx context.Context, entityMap map[string]interface{}) (err error) {
    return map[string]func(ctx context.Context, entityMap map[string]interface{}) (err error){
       constant.ParentChildModule: func(ctx context.Context, entityMap map[string]interface{}) (err error) {
          entityMap[constant.ParentChildEntity], err = b.BuildParentChildEntity(ctx)
          return err
       },
       constant.IntelligentModule: func(ctx context.Context, entityMap map[string]interface{}) (err error) {
          entityMap[constant.IntelligentEntity], err = b.BuildIntelligentEntity(ctx)
          return err
       },
       constant.MultiRoundExperimentModule: func(ctx context.Context, entityMap map[string]interface{}) (err error) {
          if !b.baseExp.NeedLoadMultiRoundExperiment() {
             return nil
          }
          entityMap[constant.MultiRoundExperimentEntity], err = b.BuildMultiRoundExpEntity(ctx)
          return err
       },
    }
}           
  • Business logic processing

The business logic part is divided into three parts according to the division of the experimental domain.

  • The business logic of BaseExperiment is relatively simple and can be added as appropriate; in principle, these entities and value objects were already created when the aggregate was built.
  • ExperimentExtension extends by experiment type, each type performing its own business logic for its actual scenario.
  • ExperimentPlugin provides advanced, experiment-granularity feature extensions; it is extended through the chain-of-responsibility pattern, completing the business logic of each operation behind an abstract interface.

Take the business logic of starting an experiment as an illustration. Because the code is object-oriented, the business model is a "rich domain model" in which each entity carries a rich set of methods. The start operation, following the division above, runs through the three modules, each of which directly or indirectly provides a Start method to complete the overall flow. In the extension and plugin parts, the aggregate depends on a unified abstract interface regardless of the concrete implementation: whether it is a programmatic or a visual experiment, or adds precise circuit breaking or experiment approval, the corresponding business logic is completed through the same method call.

On the other hand, since extensions and plugins are isolated behind interfaces, adding a new module or modifying one becomes very easy: you focus only on the current change without worrying about affecting other modules.
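The plugin chain might look roughly like this, a sketch under the assumption of a simple sequential chain (the actual plugin interface is richer; Experiment is the aggregate type from the earlier sketches):

// ExperimentPlugin is the abstract extension point for experiment-granularity features.
type ExperimentPlugin interface {
    OnStart(ctx context.Context, exp *Experiment) error
}

// pluginChain calls every registered plugin in order; any failure aborts the start.
type pluginChain struct {
    plugins []ExperimentPlugin
}

func (c *pluginChain) OnStart(ctx context.Context, exp *Experiment) error {
    for _, p := range c.plugins {
        if err := p.OnStart(ctx, exp); err != nil {
            return err
        }
    }
    return nil
}

// Adding a feature such as approval or precise circuit breaking means adding one
// plugin implementation; the main start flow does not change.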


During business execution, the overall logic proceeds top-down, step by step. Splitting each layer into single-responsibility functions or methods resembles organizational management: it makes functions as reusable as possible, and single-purpose functions are easier to test. This top-down structure is very close to the data-flow focus of the "Waterloo programming style" quoted below: the upper layer only needs to split the task well, and the lower layers implement as required. In the end a developer only needs to focus on implementing a single function, which greatly reduces cognitive load and helps avoid latent bugs.

I stumbled upon an extremely powerful programming philosophy: ignore the code, which is just a whole bunch of instructions for the computer to follow. Instead, focus on the data and figure out how it flows. -- Waterloo Programming Style

  • External service calls

In some DataTester business scenarios, third-party dependencies must be called after the business logic finishes. Calls that apply uniformly, such as the message queue messages sent for real-time effectiveness, go through a unified processing interface in the domain service. Some cases are special: when a visual experiment starts, the circle-selection simulator must be called to create a heat map, whose ID is saved to the version table, while other experiment types may need no such external operation. To accommodate this differentiation, an external service invocation module was added to the domain service.

External service calls currently come in two kinds: one differentiated by business scenario (ToB or internal), the other by experiment type, as the relevant UML class diagram shows. New business scenarios can be extended here in the future.
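As a hedged illustration of the type-based dispatch, the module could map experiment types to handler implementations behind one interface (all names are hypothetical; dispatch by deployment scenario would look the same):

// ExternalCaller runs type-specific side effects after the domain logic.
type ExternalCaller interface {
    AfterStart(ctx context.Context, exp *Experiment) error
}

// visualCaller: per the text above, a visual experiment creates a heat map on start.
type visualCaller struct{ /* circle-selection simulator client */ }

func (v *visualCaller) AfterStart(ctx context.Context, exp *Experiment) error {
    // call the simulator to create the heat map, then record its ID on the version
    return nil
}

// noopCaller: experiment types without external side effects.
type noopCaller struct{}

func (noopCaller) AfterStart(ctx context.Context, exp *Experiment) error { return nil }

// callerFor selects the handler for an experiment type.
func callerFor(expType string) ExternalCaller {
    if expType == "visual" {
        return &visualCaller{}
    }
    return noopCaller{}
}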

  • Data persistence

Data persistence still follows the form given in the reference architecture specification, with transactions placed at the domain service layer. Besides database access, calls to related dependencies, such as calls between microservices, are uniformly encapsulated in the repo layer; adding a new dependency only requires injecting it, which eases unified management.

[figure]

Although this architecture supports DataTester's current business scenarios and needs, it is not an "ultimate architecture": many subtle points of differentiation remain, and as business scenarios keep growing richer, the current design may need further extension. It does reserve room to scale; if a special experiment type is added in the future whose creation flow cannot reuse the current main process, the relevant logic can be promoted into the extension module.

The architecture refactoring has now reached a critical stage: the technical restructuring cannot proceed further without a redesign of the product form and full unification of product functionality. Otherwise, the refactoring gains already achieved may be eroded, the new architectural problems may go unsolved at the root, and the internal-external merger may turn into mere "stitching".

The refactoring has been completed and rolled out to all environments, efficiently supporting ever-faster iteration across dozens of experiment types. Going forward, domain objects will be classified and refined so that they can serve as a tool library, with new functionality realized through different combinations and arrangements to support new business scenarios and requirements. Further, service capabilities will be opened up as plugins and accumulated in a plugin marketplace, a win-win between the middle-office capabilities and the BP business side.

Refactoring Effect:

  1. 30% increase in demand development efficiency
  2. Approx. 50% performance improvement

The above is how the Volcano Engine A/B testing platform DataTester upgraded its experiment management architecture; we hope it inspires you.

As a core product of Volcano Engine's digital intelligence platform VeDI, DataTester grew out of ByteDance's long-term accumulation of technology and business practice. DataTester has served hundreds of enterprises, including well-known brands such as Midea, Get, BSH Home Appliances, and Leke Fitness, which have benefited from its scientific decision support across multiple business links and achieved continuous business growth and optimization.

Author: DataTester

Source-WeChat public account: ByteDance Data Platform

Source: https://mp.weixin.qq.com/s/Ca780IZraMas5PwwiHlaQA
