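What is cross-validation?

1 What is cross-validation?
- 1.1 The idea
- 1.2 Common questions
2 Why use cross-validation?
3 Cross-validation in Python
- 3.1 Simple cross-validation
- 3.2 S-fold cross-validation
- 3.3 Leave-one-out cross-validation
4 References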

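1 What is cross-validation?

1.1 The idea

Basic idea:

Split the original dataset into groups: one part serves as the training set used to fit the model, and the other part serves as the test set used to evaluate it.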

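Cross-validation is a model selection method (see Li Hang, Statistical Learning Methods). It comes in three forms:

Simple cross-validation. Split the data into a training set and a test set at a fixed ratio, for example 7:3.
S-fold cross-validation. Partition the data into S disjoint subsets of equal size. Train the model on S-1 of the subsets and test it on the remaining one. Repeat this S times, once for each choice of test subset, then select the model with the smallest average test error over the S runs.
Leave-one-out cross-validation. The special case S = n, where n is the number of samples. It is typically used when data is scarce, because there is too little of it to split any further.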
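To make the S-fold procedure concrete, here is a minimal sketch of the index bookkeeping in plain Python. The helper name `s_fold_indices` and its parameters are made up for illustration; in practice scikit-learn's splitters do this for you.

```python
import random

def s_fold_indices(n_samples, n_folds, seed=0):
    """Yield (train_idx, test_idx) pairs for S-fold cross-validation (illustrative helper)."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # shuffle once so the folds are random
    # Fold i takes every n_folds-th shuffled index starting at position i,
    # so the folds are disjoint and their sizes differ by at most one.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i in range(n_folds):
        test_idx = folds[i]
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, test_idx

splits = list(s_fold_indices(n_samples=10, n_folds=5))
for train_idx, test_idx in splits:
    print(len(train_idx), len(test_idx))  # prints "8 2" for each of the 5 folds
```

Every sample lands in the test part exactly once. Calling the helper with `n_folds` equal to `n_samples` gives leave-one-out splits.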

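Note 1: Because cross-validation is used for model selection, you apply the procedure above to each candidate model (SVM, LR, GBDT, and so on), compare the resulting errors, and pick the model with the smallest error.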

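Note 2: The three methods above are meant for situations where data is limited. If data is plentiful, a simpler approach is to split it into three parts:

Training set. Used to train the models.
Validation set. Used for model selection.
Test set. Used for the final evaluation of the learning method.

Then select the model with the smallest prediction error on the validation set.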

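Note 3: With big data you no longer need a 7:3 split. You still divide the data into training, validation, and test sets, but the proportions change. With 1 million samples, for example, a 98:1:1 split is perfectly reasonable: 10,000 samples are enough for the validation set and another 10,000 for the test set, which leaves more data for training the model.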
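One way to get the 98:1:1 split is to call `train_test_split` twice. This is only a sketch: the arrays below are placeholders rather than real data, and the variable names are assumptions. Passing an integer as `test_size` asks for an absolute number of rows instead of a fraction.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_samples = 1_000_000
X_big = np.arange(n_samples).reshape(-1, 1)  # placeholder features (row ids)
y_big = np.zeros(n_samples, dtype=int)       # placeholder labels

# Step 1: hold out 20,000 rows (2%) for validation + test.
X_train, X_hold, y_train, y_hold = train_test_split(
    X_big, y_big, test_size=20_000, random_state=23)

# Step 2: split the held-out rows evenly into validation and test sets.
X_val, X_test, y_val, y_test = train_test_split(
    X_hold, y_hold, test_size=10_000, random_state=23)

print(len(X_train), len(X_val), len(X_test))  # 980000 10000 10000
```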

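1.2 Common questions

Once you use cross-validation, do you still need to split the data into a training set and a test set?

Answer 1: Yes. Keep the test set clean. Running either simple or K-fold cross-validation on the training set is fine (K-fold is the same procedure as the S-fold cross-validation described above). The goal is to see how the model does on the training data: if it performs poorly even there, the problem is probably hard to learn, and the model will most likely perform poorly on the test set too. The final verdict still comes from the test set, which is what you use to judge how good the model is. So simple or K-fold cross-validation are both fine, and you can even skip them; what matters most is the test set.

Answer 2: In competitions such as Kaggle, the test set has no labels, so only the training set can be used for evaluation. There are again two options: an ordinary split, or K-fold cross-validation. K-fold is the recommended one: take the mean of the scores on each fold's held-out part to measure the model's predictive performance. Since this is the only labelled data we have, we want to use as much of its information as possible.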
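The workflow from both answers can be sketched as follows: set the test set aside, run K-fold cross-validation inside the training set, average the per-fold scores, and touch the test set only once at the end. The data is synthetic (`make_classification`), and the model, metric, and fold count are illustrative assumptions, not choices made in this post.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, train_test_split

# Synthetic stand-in for a labelled dataset
X_all, y_all = make_classification(n_samples=500, n_features=10, random_state=23)

# Keep the test set aside; it is not used during cross-validation.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.3, stratify=y_all, random_state=23)

# 5-fold cross-validation inside the training set only
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=23)
fold_aucs = []
for fit_idx, val_idx in skf.split(X_tr, y_tr):
    model = LogisticRegression(max_iter=1000)
    model.fit(X_tr[fit_idx], y_tr[fit_idx])
    val_prob = model.predict_proba(X_tr[val_idx])[:, 1]
    fold_aucs.append(roc_auc_score(y_tr[val_idx], val_prob))
cv_auc = float(np.mean(fold_aucs))  # mean over the held-out folds

# Final check: refit on the whole training set, score once on the test set.
final_model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
test_auc = roc_auc_score(y_te, final_model.predict_proba(X_te)[:, 1])
print('CV AUC: %.3f, test AUC: %.3f' % (cv_auc, test_auc))
```

In the Kaggle setting there is no labelled test set, so `cv_auc` is the only estimate you have; the last three lines simply drop out.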

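2 Why use cross-validation?

Cross-validation estimates a model's predictive performance, in particular how a trained model behaves on new data, which helps to limit overfitting to some extent.
It makes full use of the information in the data, instead of relying on a single 7:3 split.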

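3 Cross-validation in Python

3.1 Simple cross-validation

from sklearn.model_selection import train_test_split
import pandas as pd

# Load the telecom churn dataset, then inspect its shape and first rows
df = pd.read_csv('data/telecom_churn.csv')
print(df.shape)
df.head()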

(3463, 20)

subscriberID	churn	gender	AGE	edu_class	incomeCode	duration	feton	peakMinAv	peakMinDiff	posTrend	negTrend	nrProm	prom	curPlan	avgplan	planChange
19164958	1	20	2	12	16	113.666667	-8.0	1	1	1
1	39244924	1	1	20	21	5	274.000000	-371.0	1	2	1	3	2	2	1	1
2	39578413	1	11	1	47	3	392.000000	-784.0	1	3	3	1
3	40992265	1	43	4	12	31.000000	-76.0	1	2	1	3	3	1
4	43061957	1	1	60	9	14	129.333333	-334.0	1	3	3

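# Features: every column except the subscriber ID and the label
X = df.drop(['subscriberID', 'churn'], axis=1)
y = df['churn'].values

# 70/30 train/test split; random_state fixes the shuffle so the split is reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=23)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)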

(2424, 18) (1039, 18) (2424,) (1039,)

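Train the model on the training set, then judge it by its performance on the test set. For how to evaluate it, see the earlier post: Machine Learning | Evaluation Metrics.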
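As a rough sketch of that train-then-evaluate step, the snippet below fits a model on the training part and scores it on the test part. It uses synthetic, imbalanced data in place of the churn file, and logistic regression, accuracy, and AUC are assumptions chosen for illustration. `stratify` keeps the class ratio the same in both parts, which tends to matter for imbalanced labels such as churn.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 18 features, roughly 20% positives
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=18, weights=[0.8, 0.2], random_state=23)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=23)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Accuracy uses hard predictions; AUC uses the predicted probability of class 1.
test_acc = accuracy_score(y_te, model.predict(X_te))
test_auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
print('test accuracy: %.3f, test AUC: %.3f' % (test_acc, test_auc))
```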

3.2 S-fold cross-validation

import warnings
warnings.filterwarnings('ignore')  # silences all warnings, including convergence warnings

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
clf = LogisticRegression()
gbdt = GradientBoostingClassifier()
# 10-fold cross-validation for each model, scored by ROC AUC
scores1 = cross_val_score(clf, X, y, cv=10, scoring='roc_auc')
scores2 = cross_val_score(gbdt, X, y, cv=10, scoring='roc_auc')
print('Logistic regression, 10-fold CV, mean AUC: %.2f' % scores1.mean())
print('GBDT, 10-fold CV, mean AUC: %.2f' % scores2.mean())

Logistic regression, 10-fold CV, mean AUC: 0.92
GBDT, 10-fold CV, mean AUC: 0.95

So at first glance, 10-fold cross-validation suggests that GBDT predicts better than LR on this dataset.
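When two mean scores are close, it helps to look at the spread across folds and to make sure both models see exactly the same folds. Below is a sketch under those assumptions, on synthetic data rather than the churn file; the fixed-seed `StratifiedKFold` and the reduced `n_estimators` are illustrative choices, not settings from this post.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=23)

# A shuffled splitter with a fixed seed yields the same 10 folds for every model.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=23)

models = {
    'LR': LogisticRegression(max_iter=1000),
    'GBDT': GradientBoostingClassifier(n_estimators=50, random_state=23),
}

results = {}
for name, model in models.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring='roc_auc')
    results[name] = (scores.mean(), scores.std())
    print('%s: mean AUC %.3f (std %.3f)' % (name, scores.mean(), scores.std()))
```

If the gap between the two means is small compared with the standard deviations, the ranking of the models is not very reliable.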

3.3 Leave-one-out cross-validation

(3463, 20)

from sklearn.model_selection import LeaveOneOut
loo = LeaveOneOut()
for train, test in loo.split(X, y):
    print('%s - %s' % (train.shape, test.shape))

(3462,) - (1,)
(3462,) - (1,)
(3462,) - (1,)
(3462,) - (1,)
(3462,) - (1,)
......
(3462,) - (1,)
(3462,) - (1,)
(3462,) - (1,)
(3462,) - (1,)

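As the output shows, leave-one-out cross-validation uses n-1 samples for training and 1 sample for validation in each round. Here the training set has 3462 samples and the test set has 1.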
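The splitter can also be passed straight to `cross_val_score` as `cv`. A small sketch on synthetic data follows; the sample size and model are assumptions. Accuracy is used as the metric here because AUC needs both classes in the scored fold, which a single-sample fold cannot provide.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Small synthetic dataset: leave-one-out fits one model per sample,
# so it gets expensive quickly as n grows.
X_small, y_small = make_classification(n_samples=40, n_features=5, random_state=23)

loo_scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_small, y_small,
    cv=LeaveOneOut(), scoring='accuracy')

print(len(loo_scores))  # 40: one score (0.0 or 1.0) per left-out sample
print('LOO accuracy: %.3f' % loo_scores.mean())
```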

# Take the training rows (indices from the last split of the loop)
df.iloc[train, :].head()

subscriberID	churn	gender	AGE	edu_class	incomeCode	duration	feton	peakMinAv	peakMinDiff	posTrend	negTrend	nrProm	prom	curPlan	avgplan	planChange
19164958	1	20	2	12	16	113.666667	-8.0	1	1	1
1	39244924	1	1	20	21	5	274.000000	-371.0	1	2	1	3	2	2	1	1
2	39578413	1	11	1	47	3	392.000000	-784.0	1	3	3	1
3	40992265	1	43	4	12	31.000000	-76.0	1	2	1	3	3	1
4	43061957	1	1	60	9	14	129.333333	-334.0	1	3	3

# Take the test row (the single left-out sample of the last split)
df.iloc[test, :].head()

subscriberID	churn	gender	AGE	edu_class	incomeCode	duration	feton	peakMinAv	peakMinDiff	posTrend	negTrend	nrProm	prom	curPlan	avgplan	planChange	posPlanChange	negPlanChange	call_10000
3462	77861800	24	46	7	1	321.333333	-4.0	1	2	2

4 References

https://blog.csdn.net/aliceyangxi1987/article/details/73532651
For the scoring parameter, see https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter
For the cross_val_score parameters, see https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html
