
Paper:《CatBoost: unbiased boosting with categorical features》的翻譯與解讀(一)

《CatBoost: unbiased boosting with categorical features》的翻譯與解讀

原論文連結

https://arxiv.org/pdf/1706.09516.pdf https://dl.acm.org/doi/pdf/10.5555/3327757.3327770

作者:Liudmila Prokhorenkova¹ ², Gleb Gusev¹ ², Aleksandr Vorobev¹, Anna Veronika Dorogush¹, Andrey Gulin¹(¹Yandex, Moscow, Russia;²Moscow Institute of Physics and Technology, Dolgoprudny, Russia){ostroumova-la, gleb57, alvor88, annaveronika, gulin}@yandex-team.ru

Abstract

This paper presents the key algorithmic techniques behind CatBoost, a new gradient  boosting toolkit. Their combination leads to CatBoost outperforming other publicly  available boosting implementations in terms of quality on a variety of datasets.  Two critical algorithmic advances introduced in CatBoost are the implementation  of ordered boosting, a permutation-driven alternative to the classic algorithm, and  an innovative algorithm for processing categorical features. Both techniques were  created to fight a prediction shift caused by a special kind of target leakage present  in all currently existing implementations of gradient boosting algorithms. In this  paper, we provide a detailed analysis of this problem and demonstrate that proposed  algorithms solve it effectively, leading to excellent empirical results.

本文介紹了CatBoost(一種新的梯度增強工具包)背後的關鍵算法技術。它們的結合使得CatBoost在各種資料集上的品質優于其他公開可用的提升實作。CatBoost中引入的兩個關鍵算法改進是有序增強(ordered boosting)的實作,這是一種由排列驅動的、替代經典算法的方法,以及一種處理類别特征的創新算法。這兩種技術都是為了應對目前所有梯度增強算法實作中存在的一種特殊類型的目标洩漏所造成的預測偏移。在本文中,我們對這一問題進行了詳細的分析,并證明所提出的算法有效地解決了這一問題,并取得了良好的實證結果。

1 Introduction

Gradient boosting is a powerful machine-learning technique that achieves state-of-the-art results in a  variety of practical tasks. For many years, it has remained the primary method for learning problems  with heterogeneous features, noisy data, and complex dependencies: web search, recommendation  systems, weather forecasting, and many others [5, 26, 29, 32]. Gradient boosting is essentially a  process of constructing an ensemble predictor by performing gradient descent in a functional space.  It is backed by solid theoretical results that explain how strong predictors can be built by iteratively  combining weaker models (base predictors) in a greedy manner [17].  

We show in this paper that all existing implementations of gradient boosting face the following  statistical issue. A prediction model F obtained after several steps of boosting relies on the targets  of all training examples. We demonstrate that this actually leads to a shift of the distribution of  F(xk) | xk for a training example xk from the distribution of F(x) | x for a test example x. This  finally leads to a prediction shift of the learned model. We identify this problem as a special kind of  target leakage in Section 4. Further, there is a similar issue in standard algorithms of preprocessing  categorical features. One of the most effective ways [6, 25] to use them in gradient boosting is  converting categories to their target statistics. A target statistic is a simple statistical model itself, and  it can also cause target leakage and a prediction shift. We analyze this in Section 3.  

梯度增強是一種功能強大的機器學習技術,可以在各種實際任務中獲得最先進的結果。多年來,它一直是學習具有異質特征、噪聲資料和複雜依賴關系問題的主要方法:網絡搜尋、推薦系統、天氣預報等[5,26,29,32]。梯度提升本質上是一個通過在函數空間中執行梯度下降來構造集成預測器的過程。它得到了可靠的理論結果的支援,這些結果解釋了如何通過以貪婪的方式疊代地組合較弱的模型(基礎預測器)來建構強大的預測器[17]。

我們在這篇文章中表明,所有現有的梯度提升的實作都面臨以下統計問題。經過幾步增強得到的預測模型F依賴于所有訓練示例的目标。我們證明,這實際上導緻訓練示例xk的F(xk) | xk的分布偏離測試示例x的F(x) | x的分布。這最終導緻學習模型的預測偏移。在第4節中,我們將此問題定義為一種特殊的目标洩漏。此外,在分類特征預處理的标準算法中也存在類似的問題。在梯度提升中使用它們的最有效方法之一[6,25]是將類别轉換為它們的目标統計量。目标統計量本身是一個簡單的統計模型,它也可能導緻目标洩漏和預測偏移。我們將在第3節對此進行分析。

In this paper, we propose the ordering principle to solve both problems. Relying on it, we derive ordered boosting, a modification of the standard gradient boosting algorithm, which avoids target leakage (Section 4), and a new algorithm for processing categorical features (Section 3). Their combination is implemented as an open-source library called CatBoost (for “Categorical Boosting”), which outperforms the existing state-of-the-art implementations of gradient boosted decision trees — XGBoost [8] and LightGBM [16] — on a diverse set of popular machine learning tasks (see Section 6).

在本文中,我們提出了排序原則來解決這兩個問題。在此基礎上,我們得到有序增強(ordered boosting),它是對标準梯度增強算法的一種改進,可避免目标洩漏(第4節),以及一種用于處理分類特征的新算法(第3節)。兩者的組合被實作為一個名為CatBoost(取自“Categorical Boosting”,類别提升)的開源庫,它在各種流行的機器學習任務上優于現有最先進的梯度提升決策樹實作XGBoost[8]和LightGBM[16](見第6節)。

2 Background

Assume we observe a dataset of examples $\mathcal{D} = \{(\mathbf{x}_k, y_k)\}_{k=1..n}$, where $\mathbf{x}_k = (x_k^1, \dots, x_k^m)$ is a random vector of $m$ features and $y_k \in \mathbb{R}$ is a target, which can be either binary or a numerical response. Examples $(\mathbf{x}_k, y_k)$ are independent and identically distributed according to some unknown distribution $P(\cdot, \cdot)$. The goal of a learning task is to train a function $F: \mathbb{R}^m \to \mathbb{R}$ which minimizes the expected loss $\mathcal{L}(F) := \mathbb{E}\, L(y, F(\mathbf{x}))$. Here $L(\cdot, \cdot)$ is a smooth loss function and $(\mathbf{x}, y)$ is a test example sampled from $P$ independently of the training set $\mathcal{D}$.

A gradient boosting procedure [12] iteratively builds a sequence of approximations $F^t: \mathbb{R}^m \to \mathbb{R}$, $t = 0, 1, \dots$ in a greedy fashion. Namely, $F^t$ is obtained from the previous approximation $F^{t-1}$ in an additive manner: $F^t = F^{t-1} + \alpha h^t$, where $\alpha$ is a step size and the function $h^t: \mathbb{R}^m \to \mathbb{R}$ (a base predictor) is chosen from a family of functions $H$ in order to minimize the expected loss:
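The display equation that follows (Equation (1)) is not shown in this copy; reconstructed from the definitions above, the minimization reads:

$$h^t = \underset{h \in H}{\arg\min}\; \mathcal{L}(F^{t-1} + h) = \underset{h \in H}{\arg\min}\; \mathbb{E}\, L\big(y, F^{t-1}(\mathbf{x}) + h(\mathbf{x})\big). \qquad (1)$$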

假設我們觀察一個示例資料集 $\mathcal{D} = \{(\mathbf{x}_k, y_k)\}_{k=1..n}$,其中 $\mathbf{x}_k = (x_k^1, \dots, x_k^m)$ 是由 m 個特征組成的随機向量,$y_k \in \mathbb{R}$ 是目标,可以是二進制響應,也可以是數值響應。示例 $(\mathbf{x}_k, y_k)$ 是獨立的,且根據某個未知分布 $P(\cdot,\cdot)$ 同分布。學習任務的目标是訓練函數 $F: \mathbb{R}^m \to \mathbb{R}$,使期望損失 $\mathcal{L}(F) := \mathbb{E}\,L(y, F(\mathbf{x}))$ 最小。這裡 $L(\cdot,\cdot)$ 是平滑損失函數,$(\mathbf{x}, y)$ 是獨立于訓練集 $\mathcal{D}$ 從 P 中采樣的測試示例。

梯度提升程式[12]以貪婪的方式疊代地建構一系列近似值 $F^t: \mathbb{R}^m \to \mathbb{R}$,$t = 0, 1, \dots$。即,$F^t$ 以加法方式由先前的近似值 $F^{t-1}$ 獲得:$F^t = F^{t-1} + \alpha h^t$,其中 $\alpha$ 是步長,函數 $h^t: \mathbb{R}^m \to \mathbb{R}$(基礎預測器)從函數族 H 中選取,以最小化期望損失:

最小化問題通常通過牛頓法來解決,即在 $F^{t-1}$ 處使用 $\mathcal{L}(F^{t-1} + h^t)$ 的二階近似,或採取(負)梯度步。兩種方法都是函數空間中的梯度下降[10,24]。特别地,梯度步 $h^t$ 的選取方式使得 $h^t(\mathbf{x})$ 近似于 $-g^t(\mathbf{x}, y)$,其中 $g^t(\mathbf{x}, y) := \frac{\partial L(y, s)}{\partial s}\big|_{s = F^{t-1}(\mathbf{x})}$。通常,使用最小二乘近似:
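The least-squares approximation referred to here (Equation (2), also cited in the footnotes below) is likewise missing from this copy; reconstructed from the surrounding definitions:

$$h^t = \underset{h \in H}{\arg\min}\; \mathbb{E}\left(-g^t(\mathbf{x}, y) - h(\mathbf{x})\right)^2. \qquad (2)$$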

CatBoost is an implementation of gradient boosting, which uses binary decision trees as base predictors. A decision tree [4, 10, 27] is a model built by a recursive partition of the feature space $\mathbb{R}^m$ into several disjoint regions (tree nodes) according to the values of some splitting attributes $a$. Attributes are usually binary variables that identify that some feature $x^k$ exceeds some threshold $t$, that is, $a = \mathbb{1}_{\{x^k > t\}}$, where $x^k$ is either a numerical or a binary feature, in the latter case $t = 0.5$.² Each final region (leaf of the tree) is assigned to a value, which is an estimate of the response $y$ in the region for the regression task or the predicted class label in the case of a classification problem.³ In this way, a decision tree $h$ can be written as

$$h(\mathbf{x}) = \sum_{j=1}^{J} b_j \mathbb{1}_{\{\mathbf{x} \in R_j\}}, \qquad (3)$$

where $R_j$ are the disjoint regions corresponding to the leaves of the tree.

CatBoost是梯度提升的一種實作,它使用二進制決策樹作為基礎預測器。決策樹[4,10,27]是通過根據某些分裂屬性 a 的取值,將特征空間 $\mathbb{R}^m$ 遞歸劃分為若干不相交的區域(樹節點)而建立的模型。屬性通常是二進制變量,用于辨別某個特征 $x^k$ 是否超過某個門檻值 t,即 $a = \mathbb{1}_{\{x^k > t\}}$,其中 $x^k$ 是數值特征或二進制特征,在後一種情況下 t = 0.5。²每個最終區域(樹的葉子)被賦予一個值:對于回歸任務,它是該區域内響應 y 的估計;對于分類問題,它是預測的類别标簽。³這樣,決策樹 h 可以寫成

$$h(\mathbf{x}) = \sum_{j=1}^{J} b_j \mathbb{1}_{\{\mathbf{x} \in R_j\}}, \qquad (3)$$

其中 $R_j$ 是與樹的葉子相對應的不相交區域。
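To make the boosting loop of Section 2 concrete, here is a minimal Python sketch of the classic (non-ordered) gradient boosting described above, with regression trees as base predictors and squared loss, so the negative gradient is simply the residual. It only illustrates Equations (1)–(3); it is not CatBoost's implementation, and the helper names `fit_gbdt`/`predict_gbdt` are ours.

```python
# Minimal sketch of classic gradient boosting with regression trees (squared loss).
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gbdt(X, y, n_iter=100, lr=0.1, max_depth=3):
    F0 = float(np.mean(y))                     # F^0: constant initial approximation
    F = np.full(len(y), F0)
    trees = []
    for _ in range(n_iter):
        neg_grad = y - F                       # -g^t(x_k, y_k) for squared loss
        h = DecisionTreeRegressor(max_depth=max_depth).fit(X, neg_grad)  # least-squares fit
        F = F + lr * h.predict(X)              # F^t = F^{t-1} + alpha * h^t
        trees.append(h)
    return F0, trees

def predict_gbdt(model, X, lr=0.1):
    F0, trees = model
    return F0 + lr * sum(h.predict(X) for h in trees)
```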

3 Categorical features類别特征

3.1 Related work on categorical feature與類别特征相關的工作

A categorical feature is one with a discrete set of values called categories that are not comparable to each other. One popular technique for dealing with categorical features in boosted trees is one-hot encoding [7, 25], i.e., for each category, adding a new binary feature indicating it. However, in the case of high cardinality features (like, e.g., “user ID” feature), such technique leads to infeasibly large number of new features. To address this issue, one can group categories into a limited number of clusters and then apply one-hot encoding. A popular method is to group categories by target statistics (TS) that estimate expected target value in each category. Micci-Barreca [25] proposed to consider TS as a new numerical feature instead. Importantly, among all possible partitions of categories into two sets, an optimal split on the training data in terms of logloss, Gini index, MSE can be found among thresholds for the numerical TS feature [4, Section 4.2.2] [11, Section 9.2.4]. In LightGBM [20], categorical features are converted to gradient statistics at each step of gradient boosting. Though providing important information for building a tree, this approach can dramatically increase (i) computation time, since it calculates statistics for each categorical value at each step, and (ii) memory consumption to store which category belongs to which node for each split based on a categorical feature. To overcome this issue, LightGBM groups tail categories into one cluster [21] and thus loses part of the information. Besides, the authors claim that it is still better to convert categorical features with high cardinality to numerical features [19]. Note that TS features require calculating and storing only one number per one category.

類别特征是取值于一組離散值(稱為類别)的特征,這些類别彼此之間不具有可比性。一種用于處理增強樹中分類特征的流行技術是one-hot編碼[7,25],即針對每個類别,添加一個訓示它的新二進制特征。但是,在高基數特征(例如“使用者ID”特征)的情況下,這種技術會導致數量多到不可行的新特征。為了解決此問題,可以將類别分組為有限數量的叢集,然後再應用one-hot編碼。一種流行的方法是按目标統計量(TS)對類别進行分組,目标統計量估計每個類别中的預期目标值。Micci-Barreca [25]建議將TS視為一個新的數值特征。重要的是,在將類别劃分為兩個集合的所有可能劃分中,就對數損失、基尼系數、MSE而言,訓練資料上的最優分割可以在數值TS特征的門檻值中找到[4, 4.2.2節][11, 9.2.4節]。在LightGBM [20]中,分類特征在梯度增強的每個步驟中被轉換為梯度統計量。盡管此方法為建構樹提供了重要資訊,但它會顯著增加(i)計算時間,因為它在每個步驟都為每個分類取值計算統計量,以及(ii)記憶體消耗,因為對于基于分類特征的每次分裂,都需要存儲哪個類别屬于哪個節點。為了克服這個問題,LightGBM將尾部類别分組為一個叢集[21],是以會丢失部分資訊。此外,作者認為將具有高基數的分類特征轉換為數值特征仍然更好[19]。請注意,TS特征只需要為每個類别計算和存儲一個數字。

Thus, using TS as new numerical features seems to be the most efficient method of handling  categorical features with minimum information loss. TS are widely-used, e.g., in the click prediction  task (click-through rates) [1, 15, 18, 22], where such categorical features as user, region, ad, publisher  play a crucial role. We further focus on ways to calculate TS and leave one-hot encoding and gradient  statistics out of the scope of the current paper. At the same time, we believe that the ordering principle  proposed in this paper is also effective for gradient statistics.

是以,使用TS作為新的數值特征,似乎是在資訊損失最小的前提下處理分類特征的最有效方法。TS被廣泛使用,例如在點選預測任務(點選率)[1,15,18,22]中,使用者、區域、廣告、釋出者等分類特征起着至關重要的作用。我們在後文關注計算TS的方法,而將one-hot編碼和梯度統計量置于本文的讨論範圍之外。同時,我們認為本文提出的排序原則對梯度統計量也是有效的。

² Alternatively, non-binary splits can be used, e.g., a region can be split according to all values of a categorical feature. However, such splits, compared to binary ones, would lead to either shallow trees (unable to capture complex dependencies) or to very complex trees with exponential number of terminal nodes (having weaker target statistics in each of them). According to [4], the tree complexity has a crucial effect on the accuracy of the model and less complex trees are less prone to overfitting.

³ In a regression task, splitting attributes and leaf values are usually chosen by the least-squares criterion. Note that, in gradient boosting, a tree is constructed to approximate the negative gradient (see Equation (2)), so it solves a regression problem.

另外,也可以使用非二進制分割,例如,一個區域可以根據一個分類特征的所有值進行分割。然而,與二進制拆分相比,這樣的拆分要麼導緻較淺的樹(無法捕獲複雜的依賴項),要麼導緻具有指數級終端節點數量的非常複雜的樹(每個樹中都有較弱的目标統計資訊)。根據[4],樹的複雜性對模型的準确性有至關重要的影響,較不複雜的樹不容易發生過拟合。

在一個回歸任務中,分裂屬性和葉子值通常是由最小二乘準則選擇的。注意,在梯度提升中,構造一棵樹來近似負梯度(見式(2)),是以它解決了一個回歸問題。

3.2 Target statistics目标統計

As discussed in Section 3.1, an effective and efficient way to deal with a categorical feature $i$ is to substitute the category $x^i_k$ of the $k$-th training example with one numeric feature equal to some target statistic (TS) $\hat{x}^i_k$. Commonly, it estimates the expected target $y$ conditioned by the category: $\hat{x}^i_k \approx \mathbb{E}(y \mid x^i = x^i_k)$.

如第3.1節所述,處理分類特征 i 的一種高效且有效的方法,是用一個等于某個目标統計量(TS)$\hat{x}^i_k$ 的數值特征替換第 k 個訓練示例的類别 $x^i_k$。通常,它估計以類别為條件的預期目标:$\hat{x}^i_k \approx \mathbb{E}(y \mid x^i = x^i_k)$。

Greedy TS 貪婪的TS

A straightforward approach is to estimate $\mathbb{E}(y \mid x^i = x^i_k)$ as the average value of $y$ over the training examples with the same category $x^i_k$ [25]. This estimate is noisy for low-frequency categories, and one usually smoothes it by some prior $p$:
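The smoothed TS formula that follows (Equation (4)) is not shown in this copy; reconstructed from the text:

$$\hat{x}^i_k = \frac{\sum_{j=1}^{n} \mathbb{1}_{\{x^i_j = x^i_k\}}\, y_j + a\,p}{\sum_{j=1}^{n} \mathbb{1}_{\{x^i_j = x^i_k\}} + a}, \qquad (4)$$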

where $a > 0$ is a parameter. A common setting for $p$ is the average target value in the dataset [25]. The problem of such a greedy approach is target leakage: feature $\hat{x}^i_k$ is computed using $y_k$, the target of $\mathbf{x}_k$. This leads to a conditional shift [30]: the distribution of $\hat{x}^i \mid y$ differs for training and test examples. The following extreme example illustrates how dramatically this may affect the generalization error of the learned model. Assume the $i$-th feature is categorical, all its values are unique, and for each category $A$, we have $P(y = 1 \mid x^i = A) = 0.5$ for a classification task. Then, in the training dataset, $\hat{x}^i_k = \frac{y_k + ap}{1 + a}$, so it is sufficient to make only one split with threshold $t = \frac{0.5 + ap}{1 + a}$ to perfectly classify all training examples. However, for all test examples, the value of the greedy TS is $p$, and the obtained model predicts $0$ for all of them if $p < t$ and predicts $1$ otherwise, thus having accuracy $0.5$ in both cases. To this end, we formulate the following desired property for TS:

一個簡單的方法是將 $\mathbb{E}(y \mid x^i = x^i_k)$ 估計為具有相同類别 $x^i_k$ 的訓練示例上 y 的平均值[25]。對于低頻類别,這個估計是有噪聲的,人們通常通過某個先驗 p 來平滑它:

其中 $a > 0$ 是參數。p 的常見設定是資料集中的平均目标值[25]。這種貪婪方法的問題是目标洩漏:特征 $\hat{x}^i_k$ 是使用 $\mathbf{x}_k$ 的目标 $y_k$ 計算的。這導緻條件偏移[30]:$\hat{x}^i \mid y$ 的分布在訓練和測試示例上不同。以下極端示例說明了這可能多麼嚴重地影響學習模型的泛化誤差。假設第 i 個特征是分類的,其所有取值都是唯一的,并且對于每個類别 A,在分類任務中有 $P(y = 1 \mid x^i = A) = 0.5$。那麼,在訓練資料集中,$\hat{x}^i_k = \frac{y_k + ap}{1 + a}$,是以只需在門檻值 $t = \frac{0.5 + ap}{1 + a}$ 處做一次分割,就足以對所有訓練示例進行完美分類。但是,對于所有測試示例,貪婪TS的值均為 p;若 $p < t$,所得模型對所有測試示例都預測 0,否則都預測 1,是以在兩種情況下準確率都是 0.5。為此,我們為TS制定了以下所需屬性:

P1 $\mathbb{E}(\hat{x}^i \mid y = v) = \mathbb{E}(\hat{x}^i_k \mid y_k = v)$, where $(\mathbf{x}_k, y_k)$ is the $k$-th training example.

In our example above, $\mathbb{E}(\hat{x}^i_k \mid y_k) = \frac{y_k + ap}{1 + a}$ and $\mathbb{E}(\hat{x}^i \mid y) = p$ are different. There are several ways to avoid this conditional shift. Their general idea is to compute the TS for $\mathbf{x}_k$ on a subset of examples $\mathcal{D}_k \subset \mathcal{D} \setminus \{\mathbf{x}_k\}$ excluding $\mathbf{x}_k$:
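The subset-based TS (Equation (5), referenced throughout this section) is not shown in this copy; reconstructed from the text, with $\mathcal{D}_k$ as defined above:

$$\hat{x}^i_k = \frac{\sum_{\mathbf{x}_j \in \mathcal{D}_k} \mathbb{1}_{\{x^i_j = x^i_k\}}\, y_j + a\,p}{\sum_{\mathbf{x}_j \in \mathcal{D}_k} \mathbb{1}_{\{x^i_j = x^i_k\}} + a}. \qquad (5)$$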

P1 $\mathbb{E}(\hat{x}^i \mid y = v) = \mathbb{E}(\hat{x}^i_k \mid y_k = v)$,其中 $(\mathbf{x}_k, y_k)$ 是第 k 個訓練示例。在上面的示例中,$\mathbb{E}(\hat{x}^i_k \mid y_k) = \frac{y_k + ap}{1 + a}$ 與 $\mathbb{E}(\hat{x}^i \mid y) = p$ 是不同的。有幾種方法可以避免這種條件偏移。它們的一般思路是,在排除 $\mathbf{x}_k$ 的示例子集 $\mathcal{D}_k \subset \mathcal{D} \setminus \{\mathbf{x}_k\}$ 上計算 $\mathbf{x}_k$ 的TS:
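Before turning to the specific choices of $\mathcal{D}_k$ below, here is a small Python sketch of the leaky greedy TS of Equation (4) for contrast: it is computed over the full training set and therefore exhibits the target leakage just described. `greedy_ts` is a hypothetical helper, not part of any library API.

```python
# Greedy (leaky) target statistic: (sum of targets in the category + a*p) / (count + a).
import pandas as pd

def greedy_ts(cat: pd.Series, y: pd.Series, a: float = 1.0) -> pd.Series:
    p = y.mean()                                    # prior: average target in the dataset
    grouped = y.groupby(cat).agg(["sum", "count"])  # per-category target sum and count
    encoded = (grouped["sum"] + a * p) / (grouped["count"] + a)
    return cat.map(encoded)                         # replace each category by its statistic

# e.g. df["user_id_ts"] = greedy_ts(df["user_id"], df["y"])  (df is a hypothetical DataFrame)
```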

Holdout TS 留出TS

One way is to partition the training dataset into two parts $\mathcal{D} = \hat{\mathcal{D}}_0 \sqcup \hat{\mathcal{D}}_1$ and use $\mathcal{D}_k = \hat{\mathcal{D}}_0$ for calculating the TS according to (5) and $\hat{\mathcal{D}}_1$ for training (e.g., applied in [8] for the Criteo dataset). Though such holdout TS satisfies P1, this approach significantly reduces the amount of data used both for training the model and calculating the TS. So, it violates the following desired property:

P2 Effective usage of all training data for calculating TS features and for learning a model.

一種方法是將訓練資料集分割成兩部分 $\mathcal{D} = \hat{\mathcal{D}}_0 \sqcup \hat{\mathcal{D}}_1$,使用 $\mathcal{D}_k = \hat{\mathcal{D}}_0$ 按照式(5)計算TS,并使用 $\hat{\mathcal{D}}_1$ 進行訓練(例如,[8]中應用于Criteo資料集)。盡管這種holdout TS滿足P1,但這種方法顯著減少了用于訓練模型和計算TS的資料量。是以,它違反了以下期望的性質:

P2有效使用所有訓練資料來計算TS特征和學習模型。

Leave-one-out TS留一的TS

At first glance, a leave-one-out technique might work well: take $\mathcal{D}_k = \mathcal{D} \setminus \{\mathbf{x}_k\}$ for training examples $\mathbf{x}_k$ and $\mathcal{D}_k = \mathcal{D}$ for test ones [31]. Surprisingly, it does not prevent target leakage. Indeed, consider a constant categorical feature: $x^i_k = A$ for all examples. Let $n^+$ be the number of examples with $y = 1$, then $\hat{x}^i_k = \frac{n^+ - y_k + ap}{n - 1 + a}$ and one can perfectly classify the training dataset by making a split with threshold $t = \frac{n^+ - 0.5 + ap}{n - 1 + a}$.

乍一看,留一法技術可能會很好用:對于訓練示例 $\mathbf{x}_k$ 取 $\mathcal{D}_k = \mathcal{D} \setminus \{\mathbf{x}_k\}$,對于測試示例取 $\mathcal{D}_k = \mathcal{D}$ [31]。令人驚訝的是,它不能防止目标洩漏。實際上,考慮一個恒定的分類特征:對于所有示例,$x^i_k = A$。設 $n^+$ 為 y = 1 的示例數,則 $\hat{x}^i_k = \frac{n^+ - y_k + ap}{n - 1 + a}$,通過在門檻值 $t = \frac{n^+ - 0.5 + ap}{n - 1 + a}$ 處進行分割,可以對訓練資料集進行完美分類。
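A tiny numerical illustration of the argument above, assuming a single constant category and the leave-one-out encoding; the toy data are ours.

```python
# Leave-one-out TS on a constant categorical feature still leaks the target:
# the encoding (n_plus - y_k + a*p) / (n - 1 + a) takes one value per target class.
import numpy as np

y = np.array([1, 0, 1, 1, 0, 0, 1, 0])        # toy targets; the categorical feature is constant
a = 1.0
p = y.mean()
n, n_plus = len(y), int(y.sum())
loo_ts = (n_plus - y + a * p) / (n - 1 + a)    # leave-one-out encoding of the single category
print(sorted(set(loo_ts)))                     # two distinct values: a single split separates the classes
```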

Ordered TS 有序TS

CatBoost uses a more effective strategy. It relies on the ordering principle, the central idea of the paper, and is inspired by online learning algorithms which get training examples sequentially in time [1, 15, 18, 22]. Clearly, the values of TS for each example rely only on the observed history. To adapt this idea to a standard offline setting, we introduce an artificial “time”, i.e., a random permutation $\sigma$ of the training examples. Then, for each example, we use all the available “history” to compute its TS, i.e., take $\mathcal{D}_k = \{\mathbf{x}_j : \sigma(j) < \sigma(k)\}$ in Equation (5) for a training example and $\mathcal{D}_k = \mathcal{D}$ for a test one. The obtained ordered TS satisfies the requirement P1 and allows using all training data for learning the model (P2). Note that, if we use only one random permutation, then preceding examples have TS with much higher variance than subsequent ones. To this end, CatBoost uses different permutations for different steps of gradient boosting, see details in Section 5.

CatBoost使用了更有效的政策。它依賴于本文的中心思想——排序原則,并受到線上學習算法的啟發,這類算法按時間順序依次擷取訓練示例[1,15,18,22]。顯然,每個示例的TS值只依賴于已觀察到的曆史。為了使這一思想適應于标準的離線設定,我們引入了一個人工的“時間”,即訓練示例的随機排列σ。然後,對于每個示例,我們使用所有可用的“曆史”來計算它的TS,即在式(5)中,對訓練示例取 $\mathcal{D}_k = \{\mathbf{x}_j : \sigma(j) < \sigma(k)\}$,對測試示例取 $\mathcal{D}_k = \mathcal{D}$。得到的有序TS滿足要求P1,并允許使用所有訓練資料學習模型(P2)。請注意,如果我們只使用一個随機排列,那麼排在前面的示例的TS方差要比後面的示例大得多。為此,CatBoost對梯度增強的不同步驟使用不同的排列,詳見第5節。
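A minimal Python sketch of the ordered TS idea under a single random permutation; CatBoost itself draws different permutations at different boosting steps and adds further machinery, and the helper `ordered_ts` is ours.

```python
# Ordered TS: encode each example using only the targets of examples preceding it
# in a random permutation (the artificial "time"), smoothed by the prior p.
import numpy as np
from collections import defaultdict

def ordered_ts(cats, y, a=1.0, seed=0):
    cats, y = np.asarray(cats), np.asarray(y, dtype=float)
    p = y.mean()                                   # prior: average target value
    perm = np.random.default_rng(seed).permutation(len(y))
    sums, counts = defaultdict(float), defaultdict(int)
    encoded = np.empty(len(y))
    for idx in perm:                               # D_k = examples already seen in this "history"
        c = cats[idx]
        encoded[idx] = (sums[c] + a * p) / (counts[c] + a)
        sums[c] += y[idx]                          # example idx joins the history only afterwards
        counts[c] += 1
    return encoded
```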

4 Prediction shift and ordered boosting預測偏移和有序提升

4.1 Prediction shift預測偏移

In this section, we reveal the problem of prediction shift in gradient boosting, which was neither recognized nor previously addressed. As in the case of TS, prediction shift is caused by a special kind of target leakage. Our solution is called ordered boosting and resembles the ordered TS method. Let us go back to the gradient boosting procedure described in Section 2. In practice, the expectation in (2) is unknown and is usually approximated using the same dataset $\mathcal{D}$:
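The empirical approximation referred to here (Equation (6)) is not shown in this copy; reconstructed from Equation (2) and the surrounding text:

$$h^t = \underset{h \in H}{\arg\min}\; \frac{1}{n}\sum_{k=1}^{n}\left(-g^t(\mathbf{x}_k, y_k) - h(\mathbf{x}_k)\right)^2. \qquad (6)$$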

在本節中,我們揭示了梯度增強中預測偏移的問題,這個問題此前既沒有被認識到,也沒有被解決。與TS的情況一樣,預測偏移是由一種特殊的目标洩漏引起的。我們的解決方案稱為有序提升(ordered boosting),類似于有序TS方法。讓我們回到第2節中描述的梯度增強過程。在實踐中,式(2)中的期望是未知的,通常使用同一個資料集 $\mathcal{D}$ 來近似:

Now we describe and analyze the following chain of shifts:  

1. the conditional distribution of the gradient $g^t(\mathbf{x}_k, y_k) \mid \mathbf{x}_k$ (accounting for randomness of $\mathcal{D} \setminus \{\mathbf{x}_k\}$) is shifted from that distribution on a test example $g^t(\mathbf{x}, y) \mid \mathbf{x}$;

2. in turn, the base predictor $h^t$ defined by Equation (6) is biased from the solution of Equation (2);

3. this, finally, affects the generalization ability of the trained model $F^t$.

As in the case of TS, these problems are caused by the target leakage. Indeed, gradients used at each step are estimated using the target values of the same data points the current model $F^{t-1}$ was built on. However, the conditional distribution $F^{t-1}(\mathbf{x}_k) \mid \mathbf{x}_k$ for a training example $\mathbf{x}_k$ is shifted, in general, from the distribution $F^{t-1}(\mathbf{x}) \mid \mathbf{x}$ for a test example $\mathbf{x}$. We call this a prediction shift.

現在我們來描述和分析以下的偏移鏈:

1. 将梯度g t (xk, yk) | xk(考慮D \ {xk}的随機性)的條件分布從該分布移到一個檢驗例g t (x, y) | x上;

2. 進而,由式(6)定義的基礎預測器 $h^t$ 相對于式(2)的解産生了偏差;

3. 這最終影響了訓練後的模型 $F^t$ 的泛化能力。

與TS的情況一樣,這些問題是由目标洩漏引起的。實際上,每一步使用的梯度,都是利用建構目前模型 $F^{t-1}$ 時所用的同一批資料點的目标值來估計的。然而,訓練示例 $\mathbf{x}_k$ 的條件分布 $F^{t-1}(\mathbf{x}_k) \mid \mathbf{x}_k$ 通常偏離測試示例 $\mathbf{x}$ 的分布 $F^{t-1}(\mathbf{x}) \mid \mathbf{x}$。我們稱之為預測偏移。