
Understanding the Industry's Online-Offline Colocation Technology

Author | Shu Chao

Editor | Cai Fangfang

Preface

In 2021, under the combined weight of macro trends such as slowing global economic growth, the ebb and flow of the pandemic, continued friction between China and the United States, and tightening platform regulation in China, many Internet companies saw their market values fall sharply and their losses widen, and news of layoffs surfaced from time to time. In 2022, cost reduction and efficiency improvement will therefore only become more of an industry trend.

With business form and investment held constant, an obvious way to cut costs and improve efficiency is to raise the utilization of existing resources. The main reasons resource utilization is low today are the following:

Coarse resource estimation: R&D teams focus on iterating product requirements quickly and stably, so when a service is deployed its resources are generally sized for peak traffic. Most online services, however, show pronounced tidal patterns, leaving resource utilization low (under 10%) for most of the day and wasting capacity.

Unbalanced resource consumption within a cluster: servers rarely consume resources evenly. Online services, especially fan-out services on the main request path, often run short of CPU and bandwidth at peak while memory sits idle. The spare memory cannot be combined with a matching share of other idle resources to form a meaningful computing unit.

Isolated deployment: because data center costs differ greatly between regions (for example, eastern versus western China) and because of capacity-planning constraints, many enterprises fully separate their online and offline data centers. Offline jobs in different AZs, or even different regions, therefore cannot be consolidated, and the resource pools cannot be interconnected.

As an effective way to improve resource utilization and reduce costs, online-offline colocation (running online services and offline jobs on shared resources) has been widely recognized and recommended across the industry.

What Is Online-Offline Colocation

An enterprise's IT environment typically runs two broad types of workloads: online services and offline jobs.

Online services: long-running, with tidal patterns in traffic and resource utilization, latency-sensitive, and bound by strict SLAs; examples include feed services and e-commerce transaction services.

Offline jobs: run for bounded periods with high resource utilization while running, latency-insensitive, and fault-tolerant; interrupted jobs can generally be rerun. Examples include MapReduce and Spark jobs in the Hadoop ecosystem.

Because online service utilization rises and falls so visibly, the main colocation scenario is to fill the idle capacity that online services leave at various times with offline jobs, cutting an enterprise's ever-growing costs. (Note: offline-to-online colocation is described separately.)


Figure 1: Colocation schematic

The Cost Value of Online-Offline Colocation

To make the cost value concrete, consider a small or mid-sized enterprise with 1,000 machines of 4 cores and 8 GB each, for a total of 4,000 cores and 8,000 GB. If average utilization per machine is 10%, the compute actually used is 4,000 × 10% = 400 cores and 8,000 × 10% = 800 GB. If colocation raises utilization to 20%, only 500 machines are needed. Assuming an average price of 300 yuan per core per year for CPU and 180 yuan per GB per year for memory, that saves 2,000 × 300 + 4,000 × 180 = 1.32 million yuan per year.

The cost value of colocation is thus clear, quantifiable, and substantial.
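The back-of-the-envelope arithmetic above can be checked with a short script (a sketch using the illustrative prices from the example, not real quotes):

```python
# Annual savings from doubling per-machine utilization via colocation,
# using the example figures above (prices are illustrative).
machines = 1000
cores_per, mem_per = 4, 8            # 4-core, 8 GB machines
util_before, util_after = 0.10, 0.20
cpu_price, mem_price = 300, 180      # yuan/core/year, yuan/GB/year

# The same real workload fits on fewer machines at higher utilization.
used_cores = machines * cores_per * util_before               # 400 cores
machines_after = int(used_cores / (cores_per * util_after))   # 500 machines
saved_machines = machines - machines_after                    # 500 machines

savings = saved_machines * (cores_per * cpu_price + mem_per * mem_price)
print(savings)  # 1320000 yuan/year, i.e. 1.32 million
```

Each retired machine saves 4 × 300 + 8 × 180 = 2,640 yuan per year, matching the 1.32 million total in the text.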

In industry practice, Google has used colocation to raise resource utilization from 10% to 60%, saving hundreds of millions of dollars per year. Alibaba and other large companies have likewise more than tripled resource utilization through colocation, with considerable cost savings.

Technical Barriers to Online-Offline Colocation

Despite the clear cost value, only a handful of leading companies have actually put colocation into production. The reason is that colocation touches low-level technical problems in service observability, scheduling and deployment, disaster recovery, and governance, as well as non-technical problems such as organizational cost accounting and cross-department collaboration, so the bar to adoption is high. The main challenges are roughly the following:

Observability system

Observability, simply put, is the ability to infer a system's internal state from its outputs. Those outputs are generally metrics, traces, and logs, and observability is the cornerstone of healthy system operation. Pursuing higher resource utilization under colocation requires decisions driven by real-time metric feedback. In the distributed, cloud-native era, however, observability runs into the following obstacles:

Cloud-native systems adjust service capacity and scale dynamically at any time, which greatly increases the cost of collecting and transmitting data on each node, and in extreme cases even interferes with the performance of the service itself.

For observability output to inform decisions, it must be combined, fitted, and modeled along various dimensions, through analyses ranging from decision trees to machine learning. With large service fleets changing in real time, both the latency and the accuracy of this analysis face serious challenges.

Visualizing observability data and its derived analyses (BI reports and the like) requires deep customization to the shape and needs of the business; it is highly complex, and ready-made tools are scarce.

Scheduling decisions

Scheduling decisions are the core determinant of colocation's effect. There are three main approaches:

Whole-machine time sharing: run offline jobs at fixed times (for example, after midnight) and hand the resources back to online services during the day. Splitting along the time dimension is simple and easy to reason about, but the overall utilization gain is limited.

Partial resource sharing: divide a single machine's resources into online, offline, and shared pools, with isolation and reservations set up in advance. Splitting along the single-machine resource dimension is finer-grained than time sharing, but it is only meaningful on machines with large resource specifications.

Full resource sharing: reuse machine resources more efficiently and automatically through timely and accurate resource forecasting, fast reaction to resource changes, and service-assurance measures that can shift resource watermarks. No resource ownership is preassigned; decisions are driven entirely by real-time metrics.

The first approach is a static decision, which places relatively low demands on the underlying observability system but high demands on the scheduler's availability and performance. The latter two are dynamic decisions: they improve utilization more than static decisions, but place correspondingly higher demands on the supporting systems.
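A toy sketch can contrast the static and dynamic styles (the schedule window, reserve ratio, and function names are invented for illustration, not taken from any real scheduler):

```python
from datetime import time

def static_decision(now: time) -> str:
    """Whole-machine time sharing: a fixed schedule, no metrics needed.
    Offline jobs own the machine from midnight to 06:00 (illustrative window)."""
    return "offline" if time(0, 0) <= now < time(6, 0) else "online"

def dynamic_decision(cpu_util: float, reserve: float = 0.2) -> float:
    """Full sharing: lend whatever is idle beyond a safety reserve,
    driven by a real-time utilization metric."""
    return max(0.0, 1.0 - cpu_util - reserve)

print(static_decision(time(2, 30)))  # inside the offline window
print(dynamic_decision(0.3))         # ~0.5 of the machine is lendable
```

The static rule needs no observability at all; the dynamic rule is only as good as the freshness and accuracy of `cpu_util`, which is why full sharing demands a stronger observability system.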

Scheduling execution

Because online services and offline jobs work differently, they are usually handled by different schedulers (for example, Kubernetes and YARN). In a colocated cluster the online and offline schedulers run side by side; when resources tighten they conflict over the same resources and can only retry, which badly hurts scheduler throughput and scheduling performance, and ultimately the scheduling SLA. Moreover, for large-scale batch scheduling, native Kubernetes supports only online-service scheduling, which adds adaptation cost.

Resource isolation

A container is essentially a restricted process: namespaces isolate processes, and cgroups limit resources. In the cloud-native era most workloads are isolated and limited by containers, but once resources are oversold in a colocated setup, contention for CPU, memory, and other resources can still occur.

For CPU, a common practice to protect online service stability is core binding: pinning an online service to specific logical cores so other workloads cannot occupy them. But core binding is unfriendly to services that need parallel computation, where the number of available cores directly determines parallel efficiency.

For memory, offline jobs tend to read large volumes of file data, causing the operating system to build up page cache; and the stock operating system manages page cache globally, not per container.
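In practice these limits are enforced through cgroup interfaces. The sketch below only derives the cgroup v2 value strings involved; actually writing them under /sys/fs/cgroup requires root on a real host, so the example stops at string construction:

```python
def cpu_max(cores: float, period_us: int = 100_000) -> str:
    # cgroup v2 "cpu.max": a hard CPU-bandwidth cap, "<quota_us> <period_us>".
    # e.g. 2 cores with a 100ms period -> quota of 200ms of CPU time per period.
    return f"{int(cores * period_us)} {period_us}"

def cpuset_cpus(core_ids) -> str:
    # cgroup v2 "cpuset.cpus": pin (bind) a group to specific logical cores.
    return ",".join(str(c) for c in sorted(core_ids))

print(cpu_max(2))           # "200000 100000" -> at most 2 cores of bandwidth
print(cpuset_cpus({3, 1}))  # "1,3"           -> only runs on cores 1 and 3
# A real node agent would write these strings to
# /sys/fs/cgroup/<group>/cpu.max and cpuset.cpus (privileged operation).
```

`cpuset.cpus` is the core-binding mechanism discussed above, and `cpu.max` is the softer bandwidth cap that avoids binding but allows contention; colocation systems typically mix both. Per-container page cache control, by contrast, has no equivalent stock knob, which is exactly the memory problem noted above.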

Resource protection when workloads conflict

Online services and offline jobs are different kinds of work; once both run on the same node, resource interference arises, and when resources tighten or traffic spikes, online services will be disturbed by offline jobs competing for resources. The overriding goal of colocation is to maximize single-machine utilization while preserving the SLAs of both online services and offline jobs:

For online services, metric fluctuations during traffic peaks must be kept within 5% of their pre-colocation level;

For offline jobs, lower priority must not mean starvation or frequent eviction, which would hurt total runtime and the offline job's SLA.
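These two guarantees can be phrased as a toy interference check (the function names and the offline thresholds are invented for illustration; only the 5% figure comes from the text above):

```python
def online_sla_ok(baseline_p99_ms: float, current_p99_ms: float,
                  max_fluctuation: float = 0.05) -> bool:
    """Online guarantee: key metrics must stay within 5% of the
    pre-colocation baseline, even at traffic peaks."""
    return current_p99_ms <= baseline_p99_ms * (1 + max_fluctuation)

def offline_sla_ok(evictions_per_day: int, runtime_ratio: float) -> bool:
    """Offline guarantee: jobs must not be starved or evicted so often
    that total runtime suffers (both thresholds are illustrative)."""
    return evictions_per_day <= 3 and runtime_ratio >= 0.9

print(online_sla_ok(100.0, 104.0))  # True: within the 5% budget
print(online_sla_ok(100.0, 110.0))  # False: offline jobs must back off
```

In real systems a failed online check would trigger the avoidance or stop-loss actions discussed later (throttling or evicting offline tasks), while a failed offline check would signal that eviction policy is too aggressive.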

Service scaling capability

When many services are colocated on one machine or container host, a single machine failure can affect a dozen or even dozens of services, so services must be able to scale smoothly and quickly enough to migrate within minutes. Stateful storage services in particular may even require re-architecting toward storage-compute separation, which brings its own set of problems, including data consistency and response latency.

Department walls

Within an enterprise, machines are generally assigned to fixed product lines, and cost and utilization are accounted per product line, so machines normally do not flow between departments. Introducing colocation inevitably breaks these department walls and requires cost and utilization accounting that can be both aggregated and decomposed, so that colocation's large cost value is reflected accurately and operations can be refined continuously. The figure below shows a cost breakdown from a Meituan department after it refined its cost operations:


Figure 2: Cost metric breakdown

An Analysis of the Industry's Colocation Solutions

Solution dimensions

Analyzing the industry's current colocation solutions, we can abstract three dimensions along which they divide:

By isolation type, solutions divide into exclusive-kernel and shared-kernel, depending on whether the colocated workloads share a kernel. Workloads mixed on the same physical machine share a kernel; workloads on different physical machines each have an exclusive kernel.

By deployment substrate, solutions divide into physical-machine deployment and container deployment.

By scheduling decision, solutions divide into static and dynamic. The criterion is whether the inputs to the scheduling decision depend on real-time metrics gathered at runtime: if so, the decision is dynamic; otherwise it is static. Dynamic decisions achieve higher utilization, but require solid resource protection against sudden events.

Combining these three dimensions, the modes seen in practice today are mainly three: exclusive kernel + physical machine + static decisions, exclusive kernel + container + dynamic decisions, and shared kernel + container + dynamic decisions.

Exclusive kernel + physical machine + static decisions

This combination is the entry-level option for colocation, for example running services directly on physical machines and moving machines between pools on a fixed schedule.

Its advantage is that it captures the cost dividend of colocation quickly. Technically, it sidesteps the complex resource-isolation problems described above; the scheduling decisions are clear-cut, the deployment rhythm is predictable, and the whole system can be kept relatively simple.

Its disadvantage is that it leaves much of colocation's utilization potential untapped, and today it is used mainly by start-ups. Alibaba's early practice of replacing all offline worker nodes with online services during major promotions can also be seen as an approximate version of this mode.

Exclusive kernel + container + dynamic decisions

In this mode, developers deploy services on a cloud-native platform and select certain metrics (mostly ones with tidal traffic patterns) to measure service load; the platform then scales the number of service instances according to business-specified rules. When a traffic trough arrives and instance counts shrink, online services release large amounts of fragmented resources; the platform defragments them, consolidates the fragments into whole machines, and leases those machines to offline jobs.

A typical example is ByteDance's solution, whose architecture is shown below:


Figure 3: ByteDance's colocation architecture

ByteDance relies on Kubernetes and business quotas to colocate at whole-machine granularity, transferring nodes between clusters to raise overall utilization. The main idea:

When the online trough arrives, elastic scale-in reduces the replica count of almost every service; pods across the cluster become sparse, and the cluster's overall deployment watermark drops. Once it falls below a set threshold, the control plane selects some online nodes by certain rules, evicts their pods onto other nodes, marks the nodes unschedulable, and finally adds them to the offline cluster through a transfer mechanism. Conversely, when the online peak arrives, the reverse process runs: after notifying offline tasks and ensuring their withdrawal, the control plane returns the nodes to the online cluster and marks them schedulable again, reclaiming the resources.

ByteDance's solution reflects the company's own business complexity: it uses custom quotas rather than Kubernetes' standard HPA; deployment is a two-phase mix (offline-job machines serve online traffic during the day, and online servers are transferred to offline jobs at night); only container deployment is supported; and the solution is not open source.

In short, under exclusive kernel + container + dynamic decisions, each service defines scaling rules at deployment time. At traffic troughs the platform scales services in by rule and online services free fragmented resources; once the fragments reach a threshold, the online-to-offline transfer logic runs: fragments are consolidated into whole physical machines, which are leased to offline use. As traffic recovers, the reclamation logic runs: offline tasks are gradually evicted and the resources are taken back. Because online traffic is strongly tidal, transfers can also be scheduled by time and quantity; for example, a fixed number of offline machines can be moved to online service for the 19:00-23:00 evening peak and returned to offline use at 0:00.
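As a rough illustration, the transfer and reclamation loop might look like the following sketch (hypothetical thresholds and data structures; ByteDance's actual controller is not open source):

```python
def reconcile(nodes, deploy_level, low=0.4, high=0.7):
    """One pass of a toy node-transfer controller.
    nodes: dict of node name -> 'online' | 'offline' pool membership.
    deploy_level: fraction of online capacity currently occupied by pods.
    Returns a description of the action taken."""
    if deploy_level < low:
        # Trough: pick an online node, evict its pods elsewhere,
        # cordon it, and lease it to the offline cluster.
        for name, pool in nodes.items():
            if pool == "online":
                nodes[name] = "offline"   # evict + cordon + transfer
                return f"transferred {name} to offline"
    elif deploy_level > high:
        # Peak: drain offline tasks from a node and reclaim it for online use.
        for name, pool in nodes.items():
            if pool == "offline":
                nodes[name] = "online"    # drain offline jobs + uncordon
                return f"reclaimed {name} for online"
    return "no action"

cluster = {"node-a": "online", "node-b": "online"}
print(reconcile(cluster, 0.2))  # trough: node-a moves to the offline pool
print(reconcile(cluster, 0.9))  # peak: node-a is reclaimed for online use
```

The real controller must additionally wait for evictions to complete, respect pod disruption budgets, and rate-limit transfers, but the core loop is this watermark-driven reconciliation.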

Shared kernel + container + dynamic decisions

The biggest difference from the previous mode is that resource-transfer rules are determined dynamically. In a large enterprise with tens of thousands of services, it is impractical to require every online service to define its own scaling rules. Moreover, when deploying, service owners prioritize stability and size resources for peak traffic, wasting large amounts of capacity during troughs.

Typical examples are the solutions of Baidu, Tencent, and Kuaishou; take Tencent's as an example:


Figure 4: Tencent Caelus system architecture

Tencent's colocation system, Caelus, relies on Kubernetes to deploy offline tasks as containers on Kubernetes nodes, transferring spare resources from online service nodes to offline jobs. Its features include task classification, scheduling enhancement, resource reuse, resource profiling, memory isolation, task avoidance, and interference detection. The solution is compatible with the Hadoop ecosystem, but does not support transferring offline resources back to online services. Within Tencent it has been applied at scale to advertising, storage, big data, and machine learning, raising resource utilization by 30% on average and saving hundreds of millions in costs; for colocated Hadoop-class offline jobs, CPU utilization rose by roughly 60%.

In the shared kernel + container + dynamic decisions mode, there are two resource views:

From the online service's perspective, the view is the node's total capacity, for example the 126 CPU cores in total on the current physical machine;

From the offline job's perspective, the view is the node's idle load, for example the 64 CPU cores currently idle on that machine.

When the service is deployed:

Online services are scheduled according to resource capacity;

Offline jobs are scheduled according to node load;

The difficulty of this mode lies in resource isolation: avoiding or limiting the impact of offline jobs on online services is what makes or breaks the approach. The major vendors have invested heavily here, in more comprehensive metric collection, smarter load prediction, more reasonable online resource headroom, and finer-grained eBPF-based control. Even so, under a shared kernel the impact of offline on online can never be fully avoided, so these solutions generally ship with interference detection: when an online service is detected to be affected, the damage is cut off promptly.
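The two resource views can be made concrete with a toy node model (the safety reserve is invented; the capacity and idle figures echo the 126-core, 64-idle example above):

```python
class Node:
    def __init__(self, capacity_cores: int):
        self.capacity = capacity_cores   # what the online scheduler sees
        self.allocated = 0               # sum of declared online requests
        self.used = 0.0                  # real-time usage from metrics

    def online_schedulable(self, request: int) -> bool:
        # Online view: schedule against declared capacity.
        return self.allocated + request <= self.capacity

    def offline_schedulable(self, request: float, reserve: float = 8) -> bool:
        # Offline view: schedule against actually idle cores,
        # keeping a safety reserve for online traffic bursts.
        return request <= self.capacity - self.used - reserve

node = Node(capacity_cores=126)        # the 126-core machine above
node.allocated, node.used = 100, 54.0  # heavily allocated, lightly used
print(node.online_schedulable(30))     # False: only 26 cores unallocated
print(node.offline_schedulable(60))    # True: 126 - 54 - 8 = 64 idle cores
```

The gap between `allocated` and `used` is exactly the overcommit that shared-kernel colocation monetizes, and the `reserve` term is the crude stand-in for the load prediction and interference detection that real systems need.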

A Fast Open-Source Path to Colocation

The analysis above suggests two paths. A team with strong R&D capability can solve nearly all the technical barriers listed earlier and pursue the shared kernel + container + dynamic decisions combination for the ultimate utilization and cost optimization. A team with less low-level expertise that wants to adopt colocation quickly, safely, and cheaply, and bank part of the cost dividend first, should prefer the exclusive kernel + container + dynamic decisions combination.

For the scenarios most enterprises actually face, we have released a one-stop colocation solution built from two core components, the compute scheduling engine BridgX and the intelligent operations engine CudgX, as shown below:


Figure 5: Architecture of an open-source highly available colocation solution

BridgX: a compute scheduling engine that provides the basic scheduling capabilities colocation needs at the compute layer, including cross-cloud scheduling, Kubernetes container and large-scale bare-metal server slicing, and fast auto scaling.

CudgX: an intelligent operations engine that provides automated load testing, service metric measurement, capacity evaluation, and related capabilities for colocation, accurately profiling the resource usage of colocated workloads.

The overall workflow is as follows:

CudgX collects service metrics and maintains service and node redundancy through configured redundancy rules. At traffic troughs, CudgX scales service nodes in, triggering the transfer logic of the online-offline integration module; at traffic peaks, it scales service nodes out, triggering the module's reclamation logic.

CudgX also evaluates online resource redundancy. When redundancy drops too low, CudgX triggers the Kubernetes scaling logic and, with BridgX applying for resources, completes online scale-out. When redundancy returns to normal, the resources are returned, keeping costs controlled while capacity stays sufficient.

The integration module carries out the transfer and reclamation logic. When it finds enough reclaimable fragmented capacity among the online resources, it takes stock of the fragments, consolidates them into complete physical machines, and transfers them to offline jobs. When it finds online redundancy insufficient, it triggers the reclamation logic and takes back the resources previously transferred to offline jobs.
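The consolidation step amounts to checking whether scattered idle fragments add up to whole machines' worth of capacity, sketched below (a toy model with invented machine and threshold sizes; real consolidation must actually migrate or evict workloads):

```python
def whole_machines_from_fragments(idle_cores_per_node, machine_size=32,
                                  threshold=64):
    """If fragmented idle cores across nodes exceed a threshold, report how
    many whole machines' worth can be consolidated and leased to offline
    jobs. Below the threshold, consolidation is not worth the churn."""
    total = sum(idle_cores_per_node)
    if total < threshold:
        return 0
    return total // machine_size

fragments = [6, 12, 9, 20, 17, 8]  # idle cores scattered across six nodes
print(whole_machines_from_fragments(fragments))  # 72 idle cores -> 2 machines
```

The threshold plays the role of the redundancy rules above: it keeps the module from thrashing nodes back and forth over small, transient pockets of idle capacity.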

This solution isolates resources at kernel granularity (exclusive kernels): online and offline workloads are completely separated, so there is no way for offline jobs to affect online services.

GitHub Links:

Bridgx:https://github.com/galaxy-future/bridgx

Cudgx:https://github.com/galaxy-future/cudgx

Summary

Colocation is widely recognized for its clear effect on raising resource utilization and reducing costs, but it is also a large, complex undertaking involving many components and many collaborating teams, and a process of continuous optimization: the large companies invested considerable time in research before rolling it out broadly. We hope that by drawing on the industry's mature colocation schemes, more enterprises can land colocation in a cloud-native, low-barrier, one-stop way, free their businesses from cost pressure, and succeed.

About the Author:

Shu Chao is CTO of Xinghan Future, responsible for the company's overall R&D. From 2010 to 2015 he was a technical expert at Tencent, responsible for Weibo micro-groups, news-feed advertising, and other projects. From 2015 to 2021 he was head of the infrastructure R&D team, storage center architect, and a senior technical expert at Meituan, responsible for the development and evolution of Meituan's company-wide cloud-native service governance system.

Upcoming Events

To deepen the exchange on colocation technology, Xinghan Future will hold its second Beijing Meetup at 1:30 p.m. on January 23, with the following talks on "Auto Scaling & Core Technologies of Online-Offline Colocation":

Xinglong, senior R&D engineer on the cloud-native team of Baidu's Infrastructure Department, will share "Baidu Cloud-Native Colocation Technology Demystified";

Shu Chao, CTO of Xinghan Future, will share "How to Achieve Online-Offline Colocation with the Help of Auto Scaling";

Hu Zhongxiang, CPO of Xinghan Future, will give a detailed product walkthrough of CudgX, the intelligent operations engine supporting general-purpose web auto scaling.
