
Beating Best-Fit hands down: how Alibaba optimized Sigma's online scheduling policy to save hundreds of millions

Abstract:

In 2018, the GMV of Double 11 reached a new historic high of 213.5 billion yuan. Compared with ten years ago, GMV has grown more than 360-fold and the peak transaction rate more than 1,200-fold. Correspondingly, the number of supporting systems has grown explosively, and the complexity and difficulty of supporting Double 11 have been rising exponentially.

As the container scheduling system for the entire Alibaba Group, Sigma successfully supported the allocation of all of the Group's containers during Double 11 (more than 20 businesses, including transaction-line middleware, databases, and advertising), and it is a key piece of the low-level infrastructure behind Alibaba's operations systems. Sigma is already the core controller of online services in every Alibaba data center, managing host resources at the scale of millions of machines; its importance is self-evident, and the quality of its algorithms affects the business stability and resource utilization of the whole Group.

When a user requests the computing resources a container needs (such as CPU, memory, and disk) from the scheduling system, the scheduler is responsible for picking physical machines that satisfy all the specifications and deploying the containers on them. For the same resource demand, the quality of the scheduling policy determines how well the cluster's computing resources are utilized. This article gives a brief introduction to applying swarm reinforcement learning to scheduling-policy optimization.

Authors:

Wang Mengchang: Algorithm Expert, Machine Intelligence Technology, DAMO Academy

Han Tang: Technical Expert, Alibaba Group

1. Computing resource scheduling and online policies

When a user requests the computing resources a container needs (such as CPU, memory, and disk) from Sigma, the scheduler is responsible for picking physical machines that satisfy all the specifications and deploying the containers on them. Usually the physical machines that meet the requirements are not unique and sit at different utilization levels, and different ways of allocating them lead to different final allocation rates. A core task of the scheduler is therefore to pick, according to some policy, the most suitable physical machine from the many candidates.

In the literature, computing resource scheduling is usually formulated as a vector bin packing problem. If the number of containers of each application is known in advance (as in a large promotion scenario), the scheduler can generate an optimized placement for all containers in one shot; the problem can then be stated as an integer program and solved with a general-purpose solver or a purpose-built algorithm. If application requests arrive at Sigma one after another (as in daily operation), the scheduler must make a placement decision online as each request arrives; the problem can then be modeled as a Markov decision process (MDP), whose optimal policy can in principle be obtained by value iteration or policy iteration.
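For the offline (batch) case, a textbook vector bin packing integer program makes the formulation concrete; the notation below is illustrative and not taken from the original article:

$$
\begin{aligned}
\min\; & \sum_{j} y_j \\
\text{s.t.}\; & \sum_{i} r_i^{d}\, x_{ij} \le C^{d}\, y_j && \forall j,\ \forall d \in \{\text{CPU},\ \text{memory},\ \dots\} \\
& \sum_{j} x_{ij} = 1 && \forall i \\
& x_{ij},\, y_j \in \{0,1\}
\end{aligned}
$$

Here x_{ij}=1 means container i is placed on machine j, y_j=1 means machine j is used, r_i^d is container i's demand in resource dimension d, and C^d is the per-machine capacity in that dimension.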

The most common scheduling policies are First-Fit (FF) and Best-Fit (BF). With First-Fit, the scheduler deploys the container on the first physical machine encountered during the scan that satisfies all the requirements; Best-Fit instead picks, among the feasible machines, the one with the highest allocation level. For the classic bin packing problem (i.e., the one-dimensional case), First-Fit and Best-Fit both have an approximation ratio of 1.7, meaning each guarantees that the number of machines used does not exceed 170% of the optimum; for vector bin packing in two or more dimensions, there is in theory no polynomial-time algorithm with such a guaranteed approximation ratio. When one resource dimension is clearly the bottleneck and the other dimensions generally have slack, the effective dimensionality can be treated as one, and First-Fit or Best-Fit usually achieves a decent allocation rate; once the bottleneck is not concentrated in a single dimension, however, the effectiveness of both policies becomes a big question mark.
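A minimal sketch of the two selection rules (not Sigma's actual implementation; the host/request dictionaries and the remaining-capacity proxy for "allocation level" are simplifications):

```python
from typing import Optional

def fits(host: dict, req: dict) -> bool:
    """A host is feasible if every requested dimension fits in its free capacity."""
    return all(host["free"][d] >= req[d] for d in req)

def first_fit(hosts: list, req: dict) -> Optional[dict]:
    """Return the first feasible host in scan order."""
    for host in hosts:
        if fits(host, req):
            return host
    return None

def best_fit(hosts: list, req: dict) -> Optional[dict]:
    """Return the feasible host with the highest current allocation level
    (approximated here as the smallest total remaining capacity)."""
    feasible = [h for h in hosts if fits(h, req)]
    if not feasible:
        return None
    return min(feasible, key=lambda h: sum(h["free"].values()))

# Example: Best-Fit prefers the tighter host.
hosts = [{"name": "h1", "free": {"cpu": 8, "mem": 32}},
         {"name": "h2", "free": {"cpu": 4, "mem": 8}}]
print(best_fit(hosts, {"cpu": 4, "mem": 8})["name"])  # prints "h2"
```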

Beyond the resource dimensions, real scheduling must also account for fault tolerance and interference isolation. For example, the containers of one application must not all land on the same physical machine, and many applications allow at most one instance per machine; some applications are mutually incompatible (for example, they contend for the same resources) to the point of seriously hurting each other's performance, so they must not be co-located on the same machine. These constraints make the common policies increasingly ill-suited. Through repeated manual trial and error, the team barely withstood the pressure of several large promotion build-outs. But as the businesses kept expanding, the online container fleet grew larger and larger, resources became ever tighter, and manual parameter tuning gradually could not keep up.
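Such constraints can be viewed as a feasibility filter applied before the fit check. The sketch below is only illustrative; field names such as `max_per_host` and `conflicts_with` are hypothetical and not Sigma's API:

```python
def satisfies_constraints(host: dict, req: dict) -> bool:
    """Reject hosts that would violate spread or anti-affinity rules."""
    deployed = host.get("apps", {})  # app name -> instance count on this host
    app = req["app"]
    # Spread: at most N instances of the same app per host (often N = 1).
    if deployed.get(app, 0) >= req.get("max_per_host", 1):
        return False
    # Anti-affinity: applications that interfere with each other must not be co-located.
    if any(other in req.get("conflicts_with", set()) for other in deployed):
        return False
    return True
```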

To free the scheduling engineers from parameter tuning and let the limited resources withstand greater pressure, the Decision Intelligence algorithm team of DAMO Academy's Machine Intelligence Technology lab (M.I.T.) and the Sigma scheduling team began a close collaboration, studied the online scheduling policy problem, and developed an algorithm based on swarm reinforcement learning (SwarmRL).

2. Online scheduling model

Denote the specification of the container to be deployed by a vector p∈P, the cluster state at allocation time by a vector s∈S, and the set of candidate physical machines by A ⊆ 𝒜; a policy can then be expressed as a function π: S×P→𝒜 (π∈Π). When policy π selects physical machine a=π(s,p) to host the container, the immediate cost of this choice is r(a), and the new cluster state s′ is determined jointly by s, p, and the action a, written s′=L(s,p,a). Let p′ be the specification of the next arriving container; for online scheduling, p′ is a random quantity. Introducing a discount factor γ∈[0,1], the Bellman equation of the system is:

[Equation: Bellman equation of the system]

The optimal scheduling policy can then be expressed as:

[Equation: optimal scheduling policy]
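The original equation images are not preserved in this copy. From the definitions above, and treating r as a cost to be minimized, the Bellman equation and the optimal policy take the standard forms (a reconstruction, not the article's original rendering):

$$
V^{\pi}(s,p) = r\big(\pi(s,p)\big) + \gamma\, \mathbb{E}_{p'}\!\left[ V^{\pi}\big(L(s,p,\pi(s,p)),\, p'\big) \right],
\qquad
\pi^{*} = \operatorname*{arg\,min}_{\pi \in \Pi} V^{\pi}(s,p).
$$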

In theory, stochastic gradient descent lets us search the policy space Π for a better policy, but to optimize further, or even to obtain the globally optimal policy, other methods are needed, especially when the optimal policy may be multi-modal.

3. Swarm reinforcement learning: SwarmRL

To keep the optimization from getting stuck in poor local optima while still converging quickly, we designed the algorithm within a swarm reinforcement learning framework. Compared with traditional reinforcement learning methods, the algorithm uses multiple agents to explore the policy space, and the agents learn from one another, which gives the algorithm the ability to escape local traps. Estimating the state values V^π^ requires an accurate Sigma simulator, so the team built Cerebro, a "full-fidelity" simulator, on top of the Sigma scheduler.

The algorithm first randomly initializes the policies of a swarm of agents and, for each policy, obtains a state-value estimate from the simulator, recording the current global best policy. In each subsequent iteration, every agent updates its own local best policy, then updates its current policy with reference to both its local best and the swarm's current global best, simulates again to estimate the value of the new policy, and updates the global best. The loop repeats until the convergence condition is met.

To estimate the state value for each agent, samples (several randomly drawn cluster snapshots and scale-out request sequences) together with the agent's current policy are fed into the Cerebro simulator; tracing the trajectory of the cluster state during simulation yields the total cost of that trajectory, and averaging the total costs over multiple samples gives the state-value estimate for that policy.
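A compact sketch of the loop is given below. It is not the production code: policies are reduced to parameter vectors, and `estimate_value` is a synthetic stand-in for the Cerebro simulation (which in reality replays sampled cluster snapshots and request sequences and averages the trajectory costs):

```python
import random

NUM_AGENTS, DIM, ITERATIONS = 8, 4, 50
W, C1, C2 = 0.7, 1.5, 1.5          # inertia, self-learning, social-learning factors

def estimate_value(theta: list) -> float:
    """Stand-in for Cerebro: average total trajectory cost over sampled scenarios.
    A synthetic quadratic cost is used here so the sketch runs end to end."""
    return sum((x - 0.3) ** 2 for x in theta)

def clip(theta: list) -> list:
    """Feasibility-preserving map: pull parameters back into the feasible box."""
    return [min(1.0, max(-1.0, x)) for x in theta]

# Initialize the swarm: current policy, velocity, and local best for each agent.
agents = [[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(NUM_AGENTS)]
velocity = [[0.0] * DIM for _ in range(NUM_AGENTS)]
local_best = [list(a) for a in agents]
local_best_val = [estimate_value(a) for a in agents]
g = min(range(NUM_AGENTS), key=lambda i: local_best_val[i])
global_best, global_best_val = list(local_best[g]), local_best_val[g]

for _ in range(ITERATIONS):
    for i, theta in enumerate(agents):
        xi1, xi2 = random.random(), random.random()
        # Velocity mixes inertia with pulls toward the local and global best policies.
        velocity[i] = [W * v + C1 * xi1 * (lb - x) + C2 * xi2 * (gb - x)
                       for v, x, lb, gb in zip(velocity[i], theta, local_best[i], global_best)]
        agents[i] = clip([x + v for x, v in zip(theta, velocity[i])])
        val = estimate_value(agents[i])          # simulate the new policy
        if val < local_best_val[i]:
            local_best[i], local_best_val[i] = list(agents[i]), val
            if val < global_best_val:
                global_best, global_best_val = list(agents[i]), val

print("best value:", round(global_best_val, 4))
```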

[Figure: estimating a policy's state value with the Cerebro simulator]

In SwarmRL, the direction and step size of a policy's evolution are expressed as a "velocity" (v). The velocity update depends on the differences between the agent's current policy (π) and both its local best policy (πL) and the swarm's global best policy (πG), and is regulated by parameters such as the inertia factor w, the self-learning factor C~1~, and the social-learning factor C~2~:

[Equation: velocity and policy update]
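The formula image is missing here; in the usual particle-swarm form, and reading π as the policy's parameter vector, the update consistent with the description above is:

$$
v \leftarrow w\,v + C_1\,\xi_1\,(\pi_L - \pi) + C_2\,\xi_2\,(\pi_G - \pi),
\qquad
\pi \leftarrow \Phi(\pi + v).
$$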

where ξ1, ξ2 ∈ [0,1] are random quantities and Φ is a feasibility-preserving map that pulls agents drifting out of the feasible region back into it. During the iterations, the local best policy (πL) and the swarm's global best policy (πG) are continually updated:

[Equation: update of the local and global best policies]
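Again reconstructing the missing image in the standard form, with V^π^ the estimated (cost) value of policy π and the swarm containing N agents:

$$
\pi_L \leftarrow \operatorname*{arg\,min}_{\pi' \in \{\pi_L,\; \pi\}} V^{\pi'},
\qquad
\pi_G \leftarrow \operatorname*{arg\,min}_{\pi' \in \{\pi_L^{(1)},\,\dots,\,\pi_L^{(N)}\}} V^{\pi'}.
$$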

4. Applying the algorithm

Let us first compare the algorithms on a small, randomly generated example. The example involves 30 applications (see the table below); the container specifications are mainly 4c8g (4 CPU cores, 8 GB RAM) and 8c16g, and all hosts are 96c512g.

[Table: container specifications and counts of the 30 applications]

If the order and number of requests were known in advance at scheduling time (a "God's-eye view", i.e., after-the-fact placement), the optimal solution obtained by integer programming achieves an allocation rate of 94.44% (which is also the upper bound for any scheduling policy on this example) using 15 hosts in total. The placement is:

[Figure: optimal placement computed by integer programming (15 hosts)]

In reality, the position of each request in the sequence and its container count are revealed only when it reaches Sigma. Scheduling dynamically with Best-Fit yields an allocation rate of 70.83% using 20 hosts, with the following placement:

[Figure: placement produced by Best-Fit (20 hosts)]

Scheduling dynamically with the policy learned by SwarmRL yields an allocation rate of 94.44% using 15 hosts, with the final placement:

[Figure: placement produced by the SwarmRL policy (15 hosts)]

On this example, the policy learned by SwarmRL (94.44%) matches the optimal "God's-eye view" placement (the upper bound) and clearly outperforms Best-Fit (70.83%), an improvement of 23.61 percentage points.

We then randomly generated a larger request data set: 3K requests covering 5K containers in total, with the specification distribution shown below.

[Figure: distribution of container specifications across the 3K requests / 5K containers]

At this scale the integer programming model has too many variables to solve for the "God's-eye view" optimum in a short time. Compared with Best-Fit (and with the manually tuned policy), the new policy learned by the algorithm performs as follows:

[Table: hosts used and allocation rate of the learned policy vs. Best-Fit and the manually tuned policy]

Relative to Best-Fit, the new policy saves 13 hosts (4.48%) and improves the allocation rate by 4.30%; relative to the manually tuned policy, it saves 7 hosts (2.46%) and improves the allocation rate by 2.36%.

Since application requests arrive in random order in practice, we shuffled the requests to generate several different request sequences and then applied each of the three policies to allocate dynamically under every order:

[Figure: host counts and allocation rates of the three policies under randomized request orders]

Across the different request orders, the range of Best-Fit's host count is 39 hosts; compared with the manually tuned policy's 84, it is relatively stable, fluctuating about half as much. The manually tuned policy's average allocation rate drops to as low as 81.85%, versus 93.44% under the original order, which shows that its performance is unstable and fluctuates sharply. The learned policy, by contrast, is remarkably stable: its host-count range is only 3 hosts, roughly one thirtieth of the manually tuned policy's fluctuation, and its allocation rate is on average 13.78% higher than the manually tuned policy's and 3.02% higher than Best-Fit's.
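The robustness test above amounts to replaying shuffled copies of the same request stream through each policy; a minimal harness along these lines (with a hypothetical `simulate` function standing in for Cerebro) could look like:

```python
import random
import statistics

def run_robustness_test(requests, policies, simulate, trials=20, seed=0):
    """Shuffle the request stream `trials` times and record, for each policy,
    the number of hosts used and the allocation rate reported by `simulate`."""
    rng = random.Random(seed)
    results = {name: [] for name in policies}
    for _ in range(trials):
        order = list(requests)
        rng.shuffle(order)
        for name, policy in policies.items():
            hosts_used, alloc_rate = simulate(policy, order)
            results[name].append((hosts_used, alloc_rate))
    for name, runs in results.items():
        hosts = [h for h, _ in runs]
        rates = [r for _, r in runs]
        print(f"{name}: host range = {max(hosts) - min(hosts)}, "
              f"mean allocation rate = {statistics.mean(rates):.2%}")
    return results
```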

5. Summary and outlook

In terms of raising the allocation rate and saving resources, the SwarmRL algorithm produces policies that outperform the commonly used (and manually tuned) ones, and it does so with fairly stable performance. After the algorithm was deployed to production, the peak allocation rate of the public resource pool improved markedly compared with before.

As CPU sharing and the co-location of online and offline workloads roll out, new scenarios will involve more objectives beyond the allocation rate, such as spreading and load balancing, and some of these objectives even conflict with one another. SwarmRL's mechanism is naturally suited to policy optimization with multiple objectives and can conveniently construct a Pareto front in the policy space. We will therefore continue to study online scheduling policies in these new scenarios, tap SwarmRL's full potential, and further improve Sigma's scheduling capability.

References

  • David Simchi-Levi, Xin Chen and Julien Bramel (2014). The Logic of Logistics: Theory, Algorithms, and Applications for Logistics Management (3rd ed). Springer
  • Richard S. Sutton and Andrew G. Barto (2017). Reinforcement Learning: An Introduction. The MIT Press
  • Hitoshi Iima, Yasuaki Kuroe and Kazuo Emoto (2011). Swarm reinforcement learning methods for problems with continuous state-action space, IEEE ICSMC
  • Yossi Azar, Ilan R. Cohen, Amos Fiat and Alan Roytman (2016). Packing small vectors. SODA'16
  • Yossi Azar, Ilan R. Cohen, Seny Kamara and Bruce Shepherd (2013). Tight bounds for online vector bin packing. STOC'13

English version (thanks to Yaxiong Zhao of Google for the translation)

Beating Best-Fit hands down: How Alibaba optimizes Sigma’s online scheduling policy to save hundreds of millions

Abstract:

On 2018's Singles Day (Double 11), Alibaba once again set a new GMV record of 213.5 billion yuan. Compared to a decade ago, our GMV has increased by more than 360 times, and the peak transaction rate has increased by more than 1,200 times. To support such explosive growth, the number of systems has increased exponentially; their complexity and the difficulty of working with them have been, and still are, increasing exponentially.

Sigma is Alibaba's foundational container orchestration system supporting all Alibaba Group subsidiaries, and a critical component of Alibaba's production infrastructure. During Singles Day, Sigma scheduled and managed all containers of the entire Alibaba Group (more than 20 businesses, including transaction-line middleware, databases, and advertising). Sigma is at the core of managing all of Alibaba's online services across all data centers and controls millions of physical servers. Its algorithms determine the stability of Alibaba's business operations and the utilization of hardware resources; its importance is self-evident.

When a user requests resources (such as CPU, RAM, and disk) from Sigma to run containers, the Sigma scheduler selects the physical servers to host them. For the same resource requirements, the quality of the scheduling policy determines the resource utilization. This paper introduces the application of swarm reinforcement learning to optimizing the scheduling policy.

Authors:

Wang Mengchang: Algorithm Expert, Machine Intelligence Technology, DAMO Academy

Han Tang: Technical Expert, Alibaba Group

1. Computing resource scheduling and online scheduling policy

When a user requests resources (such as CPU, RAM, and disk) from Sigma to run containers, the Sigma scheduler selects the physical servers to host them. The servers that meet the requirements are usually not unique, and their current utilizations vary, so different placement decisions lead to different resulting utilizations. Therefore, one of the core tasks of the Sigma scheduler is to select the most suitable physical server from the many candidates according to a certain policy.

In the literature, computing resource scheduling is generally modeled as a vector bin packing problem. If the number of containers and their requirements are known in advance (as in a large promotion event), the scheduler can generate the optimal placement for all containers at once. In this case, the problem can be modeled as an integer programming problem, which can be solved with a general-purpose solver or a specially developed algorithm. If scheduling requests arrive at Sigma continuously (as in normal daily operation), the scheduler must compute the placement for each request online as it arrives. In this case, the problem can be modeled as a Markov Decision Process (MDP), and in principle the optimal scheduling policy can be obtained through value iteration or policy iteration.

The most common scheduling policies include First-Fit (FF) and Best-Fit (BF). In First-Fit, the first physical server that meets all the requirements of a container is selected; in Best-Fit, the feasible machine with the highest utilization is selected. For the classic bin packing problem (i.e., the one-dimensional vector packing problem), the approximation ratios of First-Fit and Best-Fit are both 1.7; that is, both ensure that the number of machines required does not exceed 170% of the optimal solution. For vector packing in two or more dimensions, there is in theory no polynomial-time algorithm with a guaranteed approximation ratio. When only one resource dimension is the bottleneck in the fleet and the other resource dimensions are generally abundant, the effective dimensionality can be regarded as 1, and First-Fit or Best-Fit can usually achieve good utilization. Once the bottleneck is not concentrated in a single dimension, however, the quality of these two policies becomes a big question mark.

In addition to the resource dimensions, scheduling in practice must also consider restrictions such as fault tolerance and interference isolation. For example, the containers of an application are not allowed to all be deployed on the same physical machine, and many applications allow at most one instance per physical machine. There can also be conflicts between applications (such as resource contention) that seriously affect their performance, so they are not allowed to run on the same machine. These restrictions make the common policies even less suitable in practice. Through repeated trial and error with manual parameter tuning, Sigma barely withstood the pressure of several large promotion events. However, with Alibaba's rapid business expansion, the container fleet keeps growing, resources have become tighter and tighter, and manual parameter tuning is gradually falling behind.

In order to free our colleagues working on Sigma scheduling from the tedious manual parameter tuning, and let the limited resources be utilized more efficiently, the Machine Intelligence Lab of DAMO Academy and the Sigma team have started a collaboration. We studied the optimal policy for online scheduling, and developed an algorithm based on swarm reinforcement learning (SwarmRL).

2. Online scheduling model

The specification of the container to be deployed is a vector p∈P. At allocation time, the cluster state is a vector s∈S, the set of candidate physical machines is A ⊆ 𝒜, and a policy can be expressed as a function π: S×P→𝒜 (π∈Π). When physical machine a=π(s,p) is selected according to policy π to host the container, the immediate cost of the selection is r(a), and the new cluster state s′ is determined by the previous state s, the container specification p, and the action a, denoted s′=L(s,p,a). Let p′ be the specification of the container that arrives next; for online scheduling, p′ is a random quantity. Introducing the discount factor γ∈[0,1], the Bellman equation of the system is:

[Equation: Bellman equation of the system]

The optimal scheduling policy can be expressed as:

[Equation: optimal scheduling policy]

In theory, we can use stochastic gradient descent to search for better policies in the policy space. However, to optimize further, or even to obtain the globally optimal policy, other methods are needed, especially when the optimal policy may be multi-modal.

3. Swarm Reinforcement Learning: SwarmRL

In order to avoid poor local optima while converging quickly, we designed the algorithm within the Swarm Reinforcement Learning (SwarmRL) framework. Compared with traditional reinforcement learning methods, SwarmRL employs multiple agents to explore the policy space of the problem, and the agents learn from one another, which allows the learning process to escape local optima. To accurately estimate each state value (V^π^), a high-fidelity Sigma simulator is essential; we developed a full-fidelity simulator called Cerebro based on the Sigma scheduler.

The algorithm first randomly initializes a group of agent policies. For each policy, the algorithm obtains a state-value estimate from simulation and records the current global best policy. In each subsequent iteration, every agent updates its own local best policy, then updates its current policy with reference to its local best policy and the swarm's current global best policy, simulates again to obtain the state-value estimate of the new policy, and updates the global best policy. The process repeats until the convergence condition is met.

To obtain the state-value estimate of a given policy for each agent, samples (several randomly drawn cluster-state snapshots and pending container request sequences) and the agent's current policy are fed to the simulator Cerebro, and the trajectory of the cluster state during the simulation is tracked to obtain the total cost of the trajectory. Averaging the total costs of the trajectories over multiple samples yields the state-value estimate of the corresponding policy.

[Figure: estimating a policy's state value with the Cerebro simulator]

In SwarmRL, the direction and step size of a policy's evolution are expressed as a "velocity" (v). The velocity update depends on the differences between the agent's current policy (π) and both its local best policy (πL) and the swarm's global best policy (πG), and is regulated by parameters such as the inertia factor w, the self-learning factor C~1~, and the social-learning factor C~2~:

[Equation: velocity and policy update]

where ξ1, ξ2∈[0,1] are random variables, and Φ is a feasibility-preserving map that pulls agents leaving the feasible region back into it. During the iterations, the local best policy (πL) and the swarm's global best policy (πG) are continually updated:

[Equation: update of the local and global best policies]

4. Algorithm application

Let's first compare the algorithms in a small experiment with randomly generated input. The experiment involves 30 applications (see the table below); the container specifications are mainly 4c8g (4 CPU cores and 8 GB RAM) and 8c16g, and all hosts are 96c512g.

[Table: container specifications and counts of the 30 applications]

If the order and specification of all requests are known at scheduling time (the "God's-eye view", i.e., after-the-fact placement), the optimal solution obtained by integer programming achieves a utilization of 94.44% (which is also the upper bound for any scheduling policy on this input) using a total of 15 hosts. The placement is as follows:

[Figure: optimal placement from integer programming (15 hosts)]

In reality, the order and specification of each request are revealed only when it arrives at Sigma. The Best-Fit policy achieves a utilization of 70.83% and requires a total of 20 hosts. The placement is as follows:

[Figure: placement produced by Best-Fit (20 hosts)]

The policy learned with SwarmRL achieves a utilization of 94.44% and requires a total of 15 hosts. The placement is as follows:

[Figure: placement produced by the SwarmRL policy (15 hosts)]

In this experiment, the performance of the learned policy (94.44%) matches that of the "God's-eye view" optimum (the upper bound) and is much better than Best-Fit (70.83%), an improvement of 23.61 percentage points.

We then randomly generated a larger request data set: 3K requests and 5K containers in total. The distribution of specifications is as follows:

[Figure: distribution of container specifications across the 3K requests / 5K containers]

It is impossible to obtain the "God's-eye view" optimum for this input, because the integer program is too large to solve in a reasonably short time, so we omit the theoretical upper bound here. Compared with Best-Fit (and the manually tuned policy), the learned policy performs as follows:

[Table: hosts used and utilization of the learned policy vs. Best-Fit and the manually tuned policy]

Compared with Best-Fit, the learned policy saves 13 machines (4.48%) and increases utilization by 4.30%. Compared with the manually tuned policy, it saves 7 machines (2.46%) and increases utilization by 2.36%.

Considering that application requests arrive in random order in real-world scenarios, we randomly reordered the requests to generate multiple different request sequences and then applied each of the three policies to every sequence:

[Figure: host counts and utilization of the three policies under randomized request orders]

Across the different orderings of the same set of requests, Best-Fit has a relatively small spread of 39 machines in the number of required hosts, roughly half of the manually tuned policy's spread of 84 machines. The average utilization of the manually tuned policy drops to as low as 81.85%, compared with 93.44% under the original order, so its performance is unstable and fluctuates severely. The performance of the learned policy, in contrast, is quite stable: its spread in the number of required machines is only 3, about one thirtieth of the manually tuned policy's fluctuation, and its utilization is on average 13.78% higher than that of the manually tuned policy and 3.02% higher than Best-Fit's.

5. Summary and outlook

From the perspective of increasing utilization and saving resources, SwarmRL can learn policies that perform better than the common (and manually tuned) policies, and the learned policies have relatively stable performance. After the learned policy was deployed to production, the peak utilization of the public resource pool improved significantly compared with before.

With the increasing adoption of CPU sharing and the co-location of online and offline workloads, new scenarios will require optimizing for more objectives in addition to utilization, such as spread scheduling and load balancing, and some of these objectives even conflict with each other. SwarmRL's mechanism is naturally suited to policy optimization with multiple objectives, as it can conveniently construct a Pareto front in the policy space. Therefore, we will continue to study the online scheduling policy problem in these new scenarios, fully realize the potential of SwarmRL, and further improve Sigma's scheduling capability.

References

● David Simchi-Levi, Xin Chen and Julien Bramel (2014). The Logic of Logistics: Theory, Algorithms, and Applications for Logistics Management (3rd ed). Springer

● Richard S. Sutton and Andrew G. Barto (2017). Reinforcement Learning: An Introduction. The MIT Press

● Hitoshi Iima, Yasuaki Kuroe and Kazuo Emoto (2011). Swarm reinforcement learning methods for problems with continuous state-action space, IEEE ICSMC

● Yossi Azar, Ilan R. Cohen, Amos Fiat and Alan Roytman (2016). Packing small vectors. SODA'16

● Yossi Azar, Ilan R. Cohen, Seny Kamara and Bruce Shepherd (2013). Tight bounds for online vector bin packing. STOC'13
