How to Evaluate Machine Learning Models

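This post is a brief translation and summary of the blog post "How to Evaluate Machine Learning Models". The original analyzes and discusses the evaluation of machine learning models from four angles: evaluation metrics, testing mechanisms, hyperparameter tuning, and A/B testing. Readers who are able to should read the original post directly.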

1. Evaluation Metrics

1.1 Classification metrics

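Suppose there are 100 positive examples and 200 negative examples. Of these, 20 positives are misclassified as negative and 5 negatives are misclassified as positive.

A. Accuracy: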

Accuracy: (80 + 195) / (100 + 200) = 91.7%

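B. Confusion matrix:

Consider a confusion matrix when the classes are imbalanced or when different kinds of misclassification carry different costs.

Misclassification rate for positives: 20 / (20 + 80) = 20%

Misclassification rate for negatives: 5 / (5 + 195) = 2.5%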

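C. Per-class accuracy:

Consider this when the data is imbalanced:

Average per-class accuracy: (80% + 97.5%) / 2 = 88.75%

If one class has very few examples, the result is unreliable, because the accuracy estimate for that class has a large variance.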
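The arithmetic for accuracy, the two misclassification rates, and per-class accuracy can be checked with a few lines of Python. This is only a sketch built on the counts from the worked example; the variable names are illustrative.

```python
# Counts from the worked example: 100 positives, 200 negatives.
tp, fn = 80, 20    # positives classified correctly / incorrectly
tn, fp = 195, 5    # negatives classified correctly / incorrectly

# Accuracy: share of all examples that were classified correctly.
accuracy = (tp + tn) / (tp + fn + tn + fp)

# Per-class misclassification rates, read off the confusion matrix.
pos_error_rate = fn / (tp + fn)
neg_error_rate = fp / (tn + fp)

# Per-class accuracy: average of the accuracy computed within each class.
per_class_accuracy = (tp / (tp + fn) + tn / (tn + fp)) / 2

print(f"accuracy           = {accuracy:.4f}")
print(f"positive error     = {pos_error_rate:.4f}")
print(f"negative error     = {neg_error_rate:.4f}")
print(f"per-class accuracy = {per_class_accuracy:.4f}")
```

Because the negative class is twice as large, overall accuracy (about 91.7%) sits closer to the negative-class accuracy than the per-class average (88.75%) does.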

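D. Log-loss:

When the classifier outputs a numeric probability, consider log-loss. Log-loss is a "soft" measure of accuracy that takes the classifier's confidence into account.

Log-loss is the cross-entropy between the distribution of the true labels and the predicted distribution. It is closely related to relative entropy (Kullback-Leibler divergence) in information theory.

Minimizing the cross-entropy therefore tends to maximize classification accuracy.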
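To make the "soft" aspect concrete, here is a minimal sketch of binary log-loss. The function name, the clipping constant `eps`, and the sample probabilities are assumptions made for illustration.

```python
import math

def log_loss(y_true, p_pred, eps=1e-15):
    """Average binary cross-entropy; y_true in {0, 1}, p_pred = P(y = 1)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip so that log(0) never occurs
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)

labels = [1, 0, 1]
# Both sets of predictions are 100% accurate at a 0.5 threshold,
# but the first one is more confident and gets the lower loss.
confident = log_loss(labels, [0.9, 0.1, 0.8])
hesitant = log_loss(labels, [0.6, 0.4, 0.55])
print(confident, hesitant)
```

Two classifiers with identical accuracy can therefore be told apart by log-loss.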

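E. AUC (Area Under the Curve):

The curve here is the ROC curve, shown in the figure below:

The ROC curve shows the sensitivity of the classifier by plotting the true positive rate against the false positive rate.

(Original: The ROC curve shows the sensitivity of the classifier by plotting the rate of true positives to the rate of false positives. In other words, it shows you how many correct positive classifications can be gained as you allow for more and more false positives.)

AUC condenses the ROC curve into a single number, which makes classifiers easy to compare.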
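One way to compute AUC without drawing the curve uses a standard equivalence: AUC equals the probability that a randomly chosen positive is scored above a randomly chosen negative. The sketch below uses that pairwise form; the function name and the sample scores are illustrative, and the quadratic loop is only meant for small inputs.

```python
def auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
print(auc(labels, scores))  # 8 of the 9 positive/negative pairs are ordered correctly
```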

1.2 Ranking metrics

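A. Precision and Recall

B. Precision-Recall curve and F1 score

Plotting precision against recall as K varies (K being the number of top-ranked items considered) gives the precision-recall curve.

Summarizing the precision-recall curve as a single number:

Harmonic mean (F1 score): F1 = 2 * precision * recall / (precision + recall)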
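As a sketch of how these quantities are computed for a ranked list, the following assumes precision is the share of the top K items that are relevant, recall is the share of all relevant items that appear in the top K, and F1 is their harmonic mean. The function name and the toy item IDs are hypothetical.

```python
def precision_recall_f1_at_k(ranked_items, relevant, k):
    """Precision, recall and F1 over the top-k items of a ranked list."""
    top_k = ranked_items[:k]
    hits = sum(1 for item in top_k if item in relevant)
    precision = hits / k
    recall = hits / len(relevant)
    # Harmonic mean of precision and recall; defined as 0 when there are no hits.
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1

ranked = ["a", "b", "c", "d", "e"]   # ranker output, best first
relevant = {"a", "c", "f"}           # ground-truth relevant items

for k in range(1, len(ranked) + 1):
    print(k, precision_recall_f1_at_k(ranked, relevant, k))
```

Looping over K as above yields the points of the precision-recall curve; fixing K and taking F1 collapses one point into a single score.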

C. NDCG

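Cumulative Gain (CG): sums up the relevance of the top k items.

Discounted Cumulative Gain (DCG)

The discount factor is:

Normalized Discounted Cumulative Gain (NDCG): the DCG divided by the DCG of the ideal ranking, so the score lies between 0 and 1.

How should NDCG be understood? In a search engine, a user issues a query and the engine returns a list of results. How do we measure how good that list is? Two criteria come to mind:

1. We want the most relevant results at the top of the ranking. Most users read from top to bottom, so putting the most relevant results first minimizes the user's reading time.

2. We want the list as a whole to be as relevant to the query as possible.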
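Both criteria can be seen in a small NDCG sketch. The discount used here, 1 / log2(i + 1) for the item at 1-based position i, is a common convention and an assumption on my part; the source does not show which discount the original article uses. Function names and relevance grades are illustrative.

```python
import math

def dcg(relevances, k):
    """Discounted cumulative gain of the top-k items.

    Assumes the common 1 / log2(position + 1) discount, positions starting at 1.
    """
    return sum(rel / math.log2(i + 1)
               for i, rel in enumerate(relevances[:k], start=1))

def ndcg(relevances, k):
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0

good_order = [3, 2, 1, 0]   # most relevant result first
bad_order = [0, 1, 2, 3]    # same results, worst ordering
print(ndcg(good_order, 4), ndcg(bad_order, 4))
```

The two lists contain the same results, so their cumulative gain is equal; only the discounting by position separates them.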

1.3 Regression metrics

A. RMSE (Root Mean Square Error)
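RMSE is the square root of the mean squared difference between predictions and true values. A minimal sketch, with an illustrative function name and made-up numbers:

```python
import math

def rmse(y_true, y_pred):
    """Root mean square error between two equal-length sequences."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

print(rmse([3, 5, 7], [2, 5, 9]))      # errors of 1, 0 and 2
print(rmse([3, 5, 7], [2, 5, 27]))     # one large miss dominates the score
```

Because errors are squared before averaging, a single large miss can dominate the result.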

B. Quantiles of errors

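RMSE is hard to interpret, so the distribution (quantiles) of the absolute percentage error is used to quantify the error instead: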
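A sketch of this idea: compute the absolute percentage error of every prediction, then read off quantiles such as the median. The nearest-rank percentile helper below is one of several possible quantile definitions and is an assumption, as are the sample values.

```python
import math

def abs_pct_errors(y_true, y_pred):
    """Absolute percentage error per example (true values must be non-zero)."""
    return [abs(t - p) * 100 / abs(t) for t, p in zip(y_true, y_pred)]

def percentile(values, q):
    """Nearest-rank percentile for q in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]

y_true = [100, 200, 50, 400, 80]
y_pred = [110, 190, 50, 100, 84]
errors = abs_pct_errors(y_true, y_pred)   # 10%, 5%, 0%, 75%, 5%

print("median error:", percentile(errors, 50))
print("90th percentile error:", percentile(errors, 90))
```

The median stays at 5% even though one prediction is off by 75%, which is the kind of robustness a mean-based score does not have.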

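C. "Almost correct" predictions

The percentage of predictions whose error falls within a chosen threshold, for example less than 10% of the true value.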
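A minimal sketch of this metric, using the 10% threshold from the example as the default. The function name and the sample data are illustrative.

```python
def fraction_almost_correct(y_true, y_pred, tolerance=0.10):
    """Share of predictions whose relative error is below `tolerance`."""
    close = sum(1 for t, p in zip(y_true, y_pred)
                if abs(t - p) < tolerance * abs(t))
    return close / len(y_true)

y_true = [100, 200, 50, 400, 80]
y_pred = [108, 190, 50, 100, 84]   # relative errors: 8%, 5%, 0%, 75%, 5%
print(fraction_almost_correct(y_true, y_pred))
```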

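2. Testing Mechanisms

Three stages:

Validation happens during the prototyping phase and is used to compare the quality of different hyperparameter settings.

Offline testing happens in the relatively static period before deployment. The test data is labeled data collected in the past.

Online testing happens once the product is live and involves A/B testing.

2.1 Hold-out datasets

Hold-out testing or validation assumes the data is independent and identically distributed (i.i.d.). A randomly chosen part of the dataset is held out as the test set, and the rest is used as the training set.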
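A random hold-out split can be sketched in a few lines. The function name, the default 20% test fraction, and the fixed seed are assumptions chosen for illustration.

```python
import random

def holdout_split(data, test_fraction=0.2, seed=0):
    """Shuffle a copy of `data` and split it into (train, test)."""
    rng = random.Random(seed)          # fixed seed keeps the split reproducible
    shuffled = list(data)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

train, test = holdout_split(range(10), test_fraction=0.3)
print("train:", train)
print("test: ", test)
```

Shuffling before the cut is what makes the split random; it is only appropriate under the i.i.d. assumption stated above.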

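2.2 Cross validation

The most common variant is k-fold cross-validation: the data is split into k folds, each fold serves in turn as the test set while the remaining folds form the training set, and the results are averaged at the end.

It is used when the dataset is small and rarely when the dataset is large.

When data is plentiful, hold-out validation is faster and usually good enough.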
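The fold bookkeeping of k-fold cross-validation can be sketched as follows. `k_fold_indices` and `cross_validate` are hypothetical helpers, and `score_fn` stands in for whatever "train on these indices, score on those" routine is being evaluated. The sketch uses contiguous folds, so the data should be shuffled beforehand.

```python
def k_fold_indices(n, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds over range(n)."""
    # Spread the remainder so fold sizes differ by at most one.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n))
        yield train_idx, test_idx
        start += size

def cross_validate(n, k, score_fn):
    """Average score_fn(train_idx, test_idx) over the k folds."""
    scores = [score_fn(train, test) for train, test in k_fold_indices(n, k)]
    return sum(scores) / len(scores)

for train_idx, test_idx in k_fold_indices(10, 3):
    print("test fold:", test_idx)
```

Every example lands in exactly one test fold, so each data point is used for testing once and for training k - 1 times.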

2.3 Bootstrap and jackknife

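To estimate the variance of a validation score, consider cross-validation or the bootstrap.

Jackknife resampling is essentially k-fold cross-validation with k equal to the number of data points in the training set (leave-one-out).

The bootstrap resamples the data with replacement. When N points are drawn with replacement from a dataset of size N, the expected fraction of distinct original examples in the bootstrap set is 1 - 1/e ≈ 63.2%.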
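The 63.2% figure is easy to check by simulation. The sketch below draws N indices with replacement and counts how many distinct ones appear; the function name, sample size, and seed are arbitrary choices.

```python
import math
import random

def unique_fraction(n, seed=0):
    """Fraction of distinct items in one bootstrap sample of size n."""
    rng = random.Random(seed)
    sample = [rng.randrange(n) for _ in range(n)]   # draw with replacement
    return len(set(sample)) / n

frac = unique_fraction(100_000)
print(f"simulated: {frac:.4f}   theory: {1 - 1 / math.e:.4f}")
```

The examples that never get drawn (about 36.8%) are often used as a ready-made validation set for that bootstrap replicate.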

2.4 Summary

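3. Hyperparameter Tuning

Hyperparameter tuning is an optimization process, somewhat like model training. The two differ, however. When training a model, the quality of the model parameters can be written as a mathematical expression. The quality of a hyperparameter setting, by contrast, depends on the outcome of model training and therefore cannot be written as a closed-form formula.

3.1 Grid search

Grid search is simple and easy to run in parallel.
It is expensive, however, because the number of grid points grows quickly with the number of hyperparameters; running the evaluations in parallel helps offset the cost.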
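A bare-bones grid search might look like the sketch below. `grid_search` is a hypothetical helper, and `toy_score` is a made-up stand-in for "train a model with these hyperparameters and return its validation score".

```python
import itertools

def grid_search(param_grid, evaluate):
    """Evaluate every combination in `param_grid`; return the best one."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in itertools.product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)          # each call is independent of the others
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

def toy_score(params):
    # Hypothetical validation score, highest at lr=0.1 and depth=4.
    return -(params["lr"] - 0.1) ** 2 - (params["depth"] - 4) ** 2

grid = {"lr": [0.01, 0.1, 1.0], "depth": [2, 4, 8]}
best, score = grid_search(grid, toy_score)
print(best, score)
```

Since the evaluations do not depend on one another, the loop body is the part that can be distributed across workers. With 3 values per hyperparameter the grid already has 9 points, and each added hyperparameter multiplies that count.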

3.2 Random search

Random search [8] evaluates only a random subset of the grid points. Bergstra and Bengio showed experimentally that 60 trials are usually enough.

The moral of the story is: if the close-to-optimal region of hyperparameters occupies at least 5% of the grid surface, then random search with 60 trials will find that region with high probability
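The reasoning behind the number 60 is a short probability calculation: if each independent random trial lands in the near-optimal region with probability 5%, the chance that at least one of n trials hits it is 1 - (1 - 0.05)^n. The function name below is illustrative.

```python
def prob_at_least_one_hit(region_fraction, trials):
    """Chance that at least one of `trials` uniform random draws
    falls inside a region covering `region_fraction` of the search space."""
    return 1 - (1 - region_fraction) ** trials

for n in (10, 30, 60):
    print(n, round(prob_at_least_one_hit(0.05, n), 3))
```

With 60 trials the probability is a little over 95%, and it does not depend on how many hyperparameters there are.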

3.3 Smart hyperparameter tuning

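Smart tuning evaluates a few hyperparameter settings, assesses their quality, and then decides where to sample next. Because the process is inherently sequential, it parallelizes poorly.

Three common approaches:

- Derivative-free optimization: uses heuristics to choose the next sampling location.

- Bayesian optimization: uses a second function, the response surface model (that is, the response function), to choose the next sampling location.

- Random forest smart tuning: same idea as above.

[6] uses a Gaussian process to model the response function and generate the next proposals.

SMAC [7] approximates the response surface by training a regression forest. New sample points are drawn from the region the random forest considers optimal. It usually performs better than Gaussian processes when the hyperparameters are categorical.

Derivative-free optimization applies in settings where no derivative is readily available. The idea is to use a set of random points to approximate the gradient and find the most promising search direction. Common methods include genetic algorithms and Nelder-Mead.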

3.4 Common hyperparameter tuning packages

Commonly used hyperparameter tuning packages:

- Grid search and random search: GraphLab Create, scikit-learn.

- Bayesian optimization using Gaussian processes: Spearmint (from Snoek et al.)

- Bayesian optimization using Tree-structured Parzen Estimators: Hyperopt (from Bergstra et al.)

- Random forest tuning: SMAC (from Hutter et al.)

- Hyper gradient: hypergrad (from Maclaurin et al.)

4. A/B testing

Briefly, A/B testing involves the following steps:

1. Split into randomized control/experimentation groups.

2. Observe behavior of both groups on the proposed methods.

3. Compute test statistics.

4. Compute p-value.

5. Output decision.
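Steps 3 to 5 can be sketched for one common case. The sketch assumes the metric is a conversion rate, uses a pooled two-proportion z-test, and decides at a 0.05 significance level. None of these choices come from the source, and the counts are made up.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se                           # step 3: test statistic
    p_value = math.erfc(abs(z) / math.sqrt(2))     # step 4: two-sided p-value
    return z, p_value

# Hypothetical counts: control A converts 200 of 2000 users, variant B 260 of 2000.
z, p_value = two_proportion_z_test(200, 2000, 260, 2000)

alpha = 0.05                                       # assumed significance level
decision = "B differs from A" if p_value < alpha else "no significant difference"
print(f"z = {z:.3f}, p = {p_value:.4f}: {decision}")   # step 5: decision
```

The right test statistic depends on the metric and its distribution; a z-test on proportions is only one option.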

Let's close with a passage from the author.