
Detailed explanation of XGBoost parameters

Basic usage

We first list the parameters that can be specified in XGBoost; detailed explanations of each follow below.

There are three classes of parameters in total: general parameters, booster parameters, and task parameters. The sketch below shows where each class goes in a training call.
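
As a quick orientation, here is a minimal sketch, assuming the native Python API; the toy data and all specific values are illustrative assumptions, not from the original article:

```python
import numpy as np
import xgboost as xgb

# Toy data stands in for a real dataset (illustrative only).
X = np.random.rand(100, 10)
y = np.random.randint(2, size=100)
dtrain = xgb.DMatrix(X, label=y)

params = {
    # general parameters: which booster and how many threads
    'booster': 'gbtree',
    'nthread': 4,
    # booster parameters: control the individual trees
    'eta': 0.3,
    'max_depth': 6,
    # task parameters: define the learning objective and evaluation
    'objective': 'binary:logistic',
    'eval_metric': 'auc',
    'seed': 0,
}
bst = xgb.train(params, dtrain, num_boost_round=10)
```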

General Parameters

  • booster [default=gbtree]
    • Which booster to use: gbtree (tree-based) or gblinear (linear); the sketch after this list runs both on the same data
  • silent [default=0]
    • 0 means print running messages, 1 means silent mode
  • nthread
    • Number of threads used to run XGBoost; defaults to the maximum number of threads available
  • num_pbuffer [set automatically by XGBoost, no need to be set by the user]
    • Size of the prediction buffer, normally set to the number of training instances. The buffer is used to save the prediction results of the last boosting step.
  • num_feature [set automatically by XGBoost, no need to be set by the user]
    • Feature dimension used in boosting, set to the maximum dimension of the features
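
A minimal sketch of the gbtree/gblinear choice, assuming the native Python API; the toy regression data and parameter values are illustrative assumptions, not from the original article:

```python
import numpy as np
import xgboost as xgb

# Toy regression data (illustrative only).
X = np.random.rand(200, 5)
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * np.random.randn(200)
dtrain = xgb.DMatrix(X, label=y)

# Train once with each booster: gbtree fits trees, gblinear fits a linear model.
for booster in ('gbtree', 'gblinear'):
    params = {
        'booster': booster,
        'silent': 1,              # 1 = quiet, matching the parameter above
        'nthread': 2,
        'objective': 'reg:linear',
    }
    bst = xgb.train(params, dtrain, num_boost_round=20,
                    evals=[(dtrain, 'train')])
```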

Booster Parameters

  • eta [default=0.3, can be thought of as the learning rate]
    • Step size shrinkage used in updates to prevent overfitting. After each boosting step, the weights of new features can be obtained directly; eta shrinks these weights to make the boosting process more conservative.
    • range: [0,1]
  • gamma [default=0, alias: min_split_loss]
    • Minimum loss reduction required to make a further partition on a leaf node of the tree. The larger the value, the more conservative the algorithm.
    • range: [0,∞]
  • max_depth [default=6]
    • Maximum depth of a tree
    • range: [1,∞]
  • min_child_weight [default=1]
    • Minimum sum of instance weights needed in a child. If a partition step produces a leaf node whose sum of instance weights is less than min_child_weight, the building process gives up further partitioning. In linear regression mode, this simply corresponds to a minimum number of instances per node. The larger the value, the more conservative the algorithm.
    • range: [0,∞]
  • max_delta_step [default=0]
    • Maximum delta step allowed for each tree's weight estimation. A value of 0 means no constraint; a positive value makes each update step more conservative. This parameter is usually not needed, but it can help in logistic regression when the classes are extremely imbalanced; setting it to a value in 1-10 can help control the update.
    • range: [0,∞]
  • subsample [default=1]
    • Subsample ratio of the training instances. Setting it to 0.5 means XGBoost randomly samples half of the training data to grow trees, which helps prevent overfitting.
    • range: (0,1]
  • colsample_bytree [default=1]
    • Subsample ratio of columns (features) used when constructing each tree
    • range: (0,1]
  • colsample_bylevel [default=1]
    • Subsample ratio of columns for each split, at each level of the tree. This parameter is rarely used, since subsample and colsample_bytree can serve a similar purpose.
    • range: (0,1]
  • lambda [default=1, alias: reg_lambda]
    • L2 regularization term on weights
  • alpha [default=0, alias: reg_alpha]
    • L1 regularization term on weights
  • tree_method, string [default=‘auto’]
    • The tree construction algorithm used in XGBoost (see the description in the reference paper)
    • Distributed and external memory version only support approximate algorithm.
    • Choices: {‘auto’, ‘exact’, ‘approx’}
      • ‘auto’: Use a heuristic to choose the faster one.
        • For small to medium datasets, the exact greedy algorithm will be used.
        • For very large datasets, the approximate algorithm will be chosen.
        • Because the old behavior was to always use exact greedy on a single machine, the user will get a message when the approximate algorithm is chosen, to make this choice explicit.
      • ‘exact’: Exact greedy algorithm.
      • ‘approx’: Approximate greedy algorithm using sketching and histogram.
  • sketch_eps, [default=0.03]
    • This is only used for approximate greedy algorithm.
    • This roughly translates into O(1 / sketch_eps) bins. Compared to directly selecting the number of bins, this comes with a theoretical guarantee on sketch accuracy.
    • Usually the user does not have to tune this, but consider setting it to a lower number for a more accurate enumeration.
    • range: (0, 1)
  • scale_pos_weight, [default=1]
    • Controls the balance of positive and negative weights; when the classes are highly imbalanced, setting it to a positive value can make the algorithm converge faster
    • A value worth considering: sum(negative cases) / sum(positive cases); the sketch after this list computes this ratio from the labels. See the Higgs Kaggle competition demo for examples: R, py1, py2, py3
  • updater, [default=‘grow_colmaker,prune’]
    • A comma-separated string defining the sequence of tree updaters to run, providing a modular way to construct and modify the trees. This is an advanced parameter that is usually set automatically, depending on some other parameters. However, it can also be set explicitly by a user. The following updater plugins exist:
      • ‘grow_colmaker’: non-distributed column-based construction of trees.
      • ‘distcol’: distributed tree construction with column-based data splitting mode.
      • ‘grow_histmaker’: distributed tree construction with row-based data splitting based on global proposal of histogram counting.
      • ‘grow_local_histmaker’: based on local histogram counting.
      • ‘grow_skmaker’: uses the approximate sketching algorithm.
      • ‘sync’: synchronizes trees in all distributed nodes.
      • ‘refresh’: refreshes tree’s statistics and/or leaf values based on the current data. Note that no random subsampling of data rows is performed.
      • ‘prune’: prunes the splits where loss < min_split_loss (or gamma).
    • In a distributed setting, the implicit updater sequence value would be adjusted as follows:
      • ‘grow_histmaker,prune’ when dsplit=‘row’ (or default) and prob_buffer_row == 1 (or default); or when data has multiple sparse pages
      • ‘grow_histmaker,refresh,prune’ when dsplit=‘row’ and prob_buffer_row < 1
      • ‘distcol’ when dsplit=‘col’
  • refresh_leaf, [default=1]
    • This is a parameter of the ‘refresh’ updater plugin. When this flag is true, tree leaves as well as tree node stats are updated. When it is false, only node stats are updated.
  • process_type, [default=‘default’]
    • A type of boosting process to run.
    • Choices: {‘default’, ‘update’}
      • ‘default’: the normal boosting process which creates new trees.
      • ‘update’: starts from an existing model and only updates its trees. In each boosting iteration, a tree from the initial model is taken, a specified sequence of updater plugins is run for that tree, and a modified tree is added to the new model. The new model would have either the same or a smaller number of trees, depending on the number of boosting iterations performed. Currently, the following built-in updater plugins can be meaningfully used with this process type: ‘refresh’, ‘prune’. With ‘update’, one cannot use updater plugins that create new trees.
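
To make these knobs concrete, here is a hedged sketch that cross-validates one conservative configuration on toy imbalanced data, including the sum(negative cases) / sum(positive cases) rule of thumb for scale_pos_weight mentioned above; the specific values are illustrative assumptions, not recommendations from the article:

```python
import numpy as np
import xgboost as xgb

# Imbalanced toy labels: roughly 10% positives (illustrative only).
X = np.random.rand(500, 20)
y = (np.random.rand(500) < 0.1).astype(int)
dtrain = xgb.DMatrix(X, label=y)

params = {
    'eta': 0.1,               # smaller step size => more conservative updates
    'max_depth': 4,
    'min_child_weight': 5,    # larger => more conservative splits
    'gamma': 1.0,             # minimum loss reduction to split further
    'subsample': 0.8,         # row subsampling against overfitting
    'colsample_bytree': 0.8,  # column subsampling per tree
    'lambda': 1.0,            # L2 regularization (alias reg_lambda)
    'alpha': 0.0,             # L1 regularization (alias reg_alpha)
    # rule of thumb from the text: sum(negative cases) / sum(positive cases)
    'scale_pos_weight': float((y == 0).sum()) / max(int((y == 1).sum()), 1),
    'objective': 'binary:logistic',
    'eval_metric': 'auc',
}

cv = xgb.cv(params, dtrain, num_boost_round=200, nfold=5,
            early_stopping_rounds=10, seed=0)
print(cv.tail(1))  # CV AUC at the round where early stopping kicked in
```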

Task Parameters

  • objective [default=reg:linear] This parameter defines the loss function to be minimized. The most commonly used values are:
    • “reg:linear” -- linear regression
    • “reg:logistic” -- logistic regression
    • “binary:logistic” -- logistic regression for binary classification, outputs the predicted probability (not the class)
    • “binary:logitraw” -- outputs the raw score before the logistic transformation
    • “count:poisson” -- Poisson regression for count data, outputs the mean of the Poisson distribution
      • max_delta_step is set to 0.7 by default in Poisson regression (used to safeguard optimization)
    • “multi:softmax” -- sets XGBoost to do multiclass classification with the softmax objective; you also need to set num_class (the number of classes)
    • “multi:softprob” -- outputs a probability matrix of dimension ndata * nclass (the sketch after this list uses this objective)
    • “rank:pairwise” -- sets XGBoost to do ranking tasks by minimizing the pairwise loss
    • “reg:gamma” --gamma regression with log-link. Output is a mean of gamma distribution. It might be useful, e.g., for modeling insurance claims severity, or for any outcome that might be gamma-distributed
    • “reg:tweedie” --Tweedie regression with log-link. It might be useful, e.g., for modeling total loss in insurance, or for any outcome that might be Tweedie-distributed.
  • base_score [default=0.5]
    • The initial prediction score of all instances (global bias)
    • For a sufficient number of iterations, changing this value will not have much effect.
  • eval_metric [default is chosen automatically according to the objective function]
    • The available choices are:
      • “rmse”: root mean square error
      • “mae”: mean absolute error
      • “logloss”: negative log-likelihood
      • “error”: binary classification error rate
      • “error@t”: error rate computed with a classification threshold of t instead of the default 0.5
      • “merror”: multiclass classification error rate, computed as #(wrong cases)/#(all cases)
      • “mlogloss”: multiclass logloss
      • “auc”: area under the ROC curve, for ranking evaluation
      • “ndcg”: Normalized Discounted Cumulative Gain
      • “map”: Mean Average Precision
      • “ndcg@n”, “map@n”: n can be assigned as an integer to cut off the top positions in the lists for evaluation
      • “ndcg-”, “map-”, “ndcg@n-”, “map@n-”: in XGBoost, NDCG and MAP evaluate the score of a list without any positive samples as 1. By adding “-” to the evaluation metric, XGBoost will evaluate these scores as 0, to be consistent under some conditions.
      • “poisson-nloglik”: negative log-likelihood for Poisson regression
      • “gamma-nloglik”: negative log-likelihood for gamma regression
      • “gamma-deviance”: residual deviance for gamma regression
      • “tweedie-nloglik”: negative log-likelihood for Tweedie regression (at a specified value of the tweedie_variance_power parameter)
  • seed [default=0]
    • Random number seed.
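
To show how objective, num_class, eval_metric, and seed fit together, here is a minimal multiclass sketch, assuming the native Python API; the toy data and values are illustrative assumptions, not from the article:

```python
import numpy as np
import xgboost as xgb

# Three toy classes (illustrative only).
X = np.random.rand(300, 8)
y = np.random.randint(3, size=300)
dtrain = xgb.DMatrix(X, label=y)

params = {
    'objective': 'multi:softprob',  # per-class probabilities, ndata * nclass
    'num_class': 3,                 # required by the multi:* objectives
    'eval_metric': 'mlogloss',      # overrides the objective's default metric
    'seed': 42,
}
bst = xgb.train(params, dtrain, num_boost_round=20)
probs = bst.predict(dtrain)         # shape (300, 3), rows sum to 1
```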
