XGBoost Parameters Explained

This article is based on Complete Guide to Parameter Tuning in XGBoost (with codes in Python). It builds on a translation of that guide, with my own additions in a few places.

Advantages of XGBoost

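Regularization
- The standard GBM implementation has no regularization.
- XGBoost is also known as a "regularized boosting" technique.
Parallel processing
- XGBoost implements parallel processing.
- XGBoost is a boosting method, so in principle its trees cannot be built in parallel, because each tree depends on the ones before it. What XGBoost parallelizes is the work at the feature level, not at the tree level.
- XGBoost can also run on Hadoop.
High flexibility
- XGBoost lets users define custom optimization objectives and evaluation criteria.
Handling missing values
- XGBoost has its own built-in routine for handling missing values.
Tree pruning
- GBM stops splitting a node as soon as it encounters a negative loss reduction, which is a greedy strategy (pre-pruning).
- XGBoost keeps splitting until max_depth is reached and only then prunes backward, removing the splits that have no positive gain (post-pruning).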
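The difference between the two pruning strategies can be sketched on a toy tree. This is only an illustration under simplifying assumptions, not XGBoost's or GBM's actual code: the `Split` class, the helper functions, and the gain values (-2 followed by +10) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Split:
    """A candidate split with its loss reduction (gain); None children are leaves."""
    gain: float
    left: Optional["Split"] = None
    right: Optional["Split"] = None


def pre_prune(node):
    """Greedy: refuse a split (and everything below it) once its gain is not positive."""
    if node is None or node.gain <= 0:
        return None
    return Split(node.gain, pre_prune(node.left), pre_prune(node.right))


def post_prune(node):
    """Grow first, then walk bottom-up and drop non-positive splits whose children are leaves."""
    if node is None:
        return None
    left, right = post_prune(node.left), post_prune(node.right)
    if node.gain <= 0 and left is None and right is None:
        return None
    return Split(node.gain, left, right)


def count_splits(node):
    """Number of splits that survive in the tree."""
    if node is None:
        return 0
    return 1 + count_splits(node.left) + count_splits(node.right)


# Hypothetical tree: a split with gain -2 whose left child splits again with gain +10.
tree = Split(gain=-2.0, left=Split(gain=10.0))
print(count_splits(pre_prune(tree)))   # 0: the greedy strategy never sees the +10 split
print(count_splits(post_prune(tree)))  # 2: both splits are kept
```

In this toy case the pre-pruning strategy gives up at the first negative gain, while the post-pruning strategy keeps both splits because the deeper one is worthwhile.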
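Built-in cross-validation
- XGBoost can run cross-validation at every iteration of the boosting process, so the optimal number of boosting iterations can be determined in a single run.
- With GBM we have to use a grid search to find the optimal number.
Continuing training from an existing model
- This is very useful in certain scenarios.
- The sklearn implementation of GBM also has this feature.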
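To make the built-in cross-validation point concrete: once a per-iteration CV error is available, choosing the number of boosting rounds reduces to a scan over that sequence. The following is a minimal sketch of that selection step only. The function name `best_num_rounds`, the patience rule, and the error values are assumptions made for illustration, not XGBoost's implementation.

```python
def best_num_rounds(cv_errors, early_stopping_rounds=3):
    """Return the 1-based round with the lowest CV error.

    Scanning stops once the error has not improved for
    `early_stopping_rounds` consecutive rounds.
    """
    best_round, best_error = 0, float("inf")
    for i, err in enumerate(cv_errors):
        if err < best_error:
            best_round, best_error = i, err
        elif i - best_round >= early_stopping_rounds:
            break
    return best_round + 1


# Hypothetical mean CV error after each boosting round.
errors = [0.40, 0.31, 0.27, 0.25, 0.26, 0.26, 0.27, 0.24]
print(best_num_rounds(errors))                            # 4
print(best_num_rounds(errors, early_stopping_rounds=10))  # 8
```

With a patience of 3 the scan stops before it reaches the late improvement in round 8, which shows why the patience value itself matters.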

XGBoost Parameters

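General Parameters
1. booster [default=gbtree]
  - gbtree: tree-based model
  - gblinear: linear model
2. silent [default=0]
  - 0 prints running messages, 1 suppresses them. Keeping it at 0 is recommended because the messages help in understanding the model.
3. nthread [defaults to the maximum number of available threads if not set]
  - Controls parallelism. To run on all cores, leave it unset and let the system decide.
4. num_pbuffer: size of the prediction buffer, set automatically by the system
5. num_feature: feature dimension used in boosting, set automatically by the system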
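Booster Parameters (there are two types of booster: tree and linear)
1. eta [default = 0.3]
  - Similar to the learning rate in GBM.
  - Range: [0, 1], typically set to 0.01~0.2
2. gamma [default = 0, alias: min_split_loss]
  - The minimum loss reduction required to split a leaf node further. The larger the value, the more conservative the model and the less likely it is to overfit.
  - Range: [0, ∞]
3. max_depth [default = 6]
  - The maximum depth of a tree. The larger the value, the more complex the model and the more likely it is to overfit. 0 means no limit.
  - Range: [0, ∞]
4. min_child_weight [default = 1]
  - The minimum sum of instance weights required in a child node. If a split would produce a leaf whose sum of instance weights is below this value, the node is not split further.
  - In linear regression, this simply corresponds to the minimum number of instances required in each node.
  - The larger the value, the more conservative the model.
5. max_delta_step [default = 0]
  - Maximum delta step we allow each tree’s weight estimation to be. If the value is set to 0, it means there is no constraint.
  - Rarely used, but in logistic regression with extremely imbalanced classes, tuning this value can help.
6. subsample [default = 1]
  - Row subsampling: the fraction of training instances sampled for each tree. Range: (0, 1]
7. colsample_bytree [default = 1]
  - Column subsampling: the fraction of features sampled when each tree is built. Range: (0, 1]
8. colsample_bylevel [default = 1]
  - The fraction of columns sampled for the splits at each level of a tree. Rarely used. Range: (0, 1]
9. alpha [default = 0]
  - L1 regularization on the weights
10. lambda [default = 1]
  - L2 regularization on the weights
11. tree_method [default = 'auto'] See the XGBoost paper for details
  - This is not how trees are constructed but the algorithm used to find node splits. The options are:
  - 'auto': Use a heuristic to choose the faster one.
    - For small to medium datasets, exact greedy will be used.
    - For very large datasets, the approximate algorithm will be chosen.
    - Because the old behavior was to always use exact greedy on a single machine, the user gets a message when the approximate algorithm is chosen, to notify them of this choice.
  - 'exact': Exact greedy algorithm.
  - 'approx': Approximate greedy algorithm using sketching and histogram.
  - 'hist': Fast histogram optimized approximate greedy algorithm. It uses some performance improvements such as bins caching.
  - 'gpu_exact': GPU implementation of exact algorithm.
  - 'gpu_hist': GPU implementation of hist algorithm.
12. scale_pos_weight [default = 1]
  - Controls the balance between the weights of positive and negative samples. Very useful when the classes are imbalanced.
  - A common way to compute it: sum(negative cases) / sum(positive cases)
13. [Linear Booster] has lambda, alpha, and lambda_bias (L2 regularization on the bias term; there is no L1 regularization on the bias because it is not important).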
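How gamma, lambda, and min_child_weight interact is easiest to see in the split gain formula from the XGBoost paper, where G and H are the sums of first- and second-order gradients in each child. The sketch below is a simplified illustration: the function names and input numbers are hypothetical, and it ignores alpha (L1) and every other detail of the real implementation.

```python
def split_gain(g_left, h_left, g_right, h_right, reg_lambda=1.0, gamma=0.0):
    """Loss reduction of a split, following the gain formula in the XGBoost paper."""
    def score(g, h):
        # A larger lambda shrinks the score of every node.
        return g * g / (h + reg_lambda)

    parent = score(g_left + g_right, h_left + h_right)
    return 0.5 * (score(g_left, h_left) + score(g_right, h_right) - parent) - gamma


def split_allowed(g_left, h_left, g_right, h_right,
                  reg_lambda=1.0, gamma=0.0, min_child_weight=1.0):
    """A split is kept only if both children are heavy enough and the gain is positive."""
    if h_left < min_child_weight or h_right < min_child_weight:
        return False
    return split_gain(g_left, h_left, g_right, h_right, reg_lambda, gamma) > 0


# Hypothetical gradient statistics for the two children of a candidate split.
stats = dict(g_left=-4.0, h_left=2.0, g_right=4.0, h_right=2.0)
print(split_allowed(**stats))                        # True
print(split_allowed(**stats, gamma=6.0))             # False: gain falls below gamma
print(split_allowed(**stats, min_child_weight=3.0))  # False: children are too light
```

Raising any of the three parameters makes this check harder to pass, which is why larger values give a more conservative model.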
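Learning Task Parameters
1. objective [default=reg:linear] Defines the learning objective.
  - Commonly used: "reg:linear", "reg:logistic", "binary:logistic"
  - A custom objective function can be supplied; it must provide the first- and second-order derivatives (gradient and Hessian).
2. base_score: rarely needs to be changed
3. eval_metric [default depends on the objective]
  - Multiple evaluation metrics can be passed. In Python, take care to pass the parameters as a list of pairs rather than a map (dict), because a dict cannot hold the eval_metric key more than once.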
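Two points from the list above can be sketched in plain Python: what a custom objective has to compute, and why a list of pairs is used for several evaluation metrics. The function name `logistic_obj` and the toy labels are assumptions for illustration. In the real library, a custom objective receives the predictions together with the training data object rather than a plain list of labels.

```python
import math


def logistic_obj(margins, labels):
    """Gradient and Hessian of the log loss with respect to the raw margin scores."""
    grad, hess = [], []
    for margin, y in zip(margins, labels):
        p = 1.0 / (1.0 + math.exp(-margin))  # sigmoid turns the margin into a probability
        grad.append(p - y)                   # first-order derivative
        hess.append(p * (1.0 - p))           # second-order derivative
    return grad, hess


# Hypothetical imbalanced labels: three negatives, one positive.
labels = [0, 0, 0, 1]
scale_pos_weight = labels.count(0) / labels.count(1)

# A list of pairs can repeat a key, so both metrics are kept.
params = [
    ("objective", "binary:logistic"),
    ("scale_pos_weight", scale_pos_weight),
    ("eval_metric", "auc"),
    ("eval_metric", "logloss"),
]

# Converting to a dict silently keeps only the last eval_metric.
as_dict = dict(params)
print(as_dict["eval_metric"])  # logloss

grad, hess = logistic_obj([0.0, 0.0, 0.0, 0.0], labels)
print(grad[3], hess[3])  # -0.5 0.25
```

The same labels also give the common scale_pos_weight estimate, here 3 negatives per positive.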

References:

1. Complete Guide to Parameter Tuning in XGBoost (with codes in Python)

2. XGBoost Parameters
