Exploring the optimal hyperparameters of five machine learning models (a financial dataset)

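1. The grid-search step is itself prone to overfitting: the search has already seen the whole training set while choosing the best parameters, and the model is then refit on that same training set with those parameters, so in effect it has peeked at the data before modeling. Scores measured on the training set are therefore optimistic.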

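2. One remedy to try is splitting the data into three sets (training, validation, and test) and evaluating the model with the best parameters on the validation set.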
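The three-way split in point 2 could be sketched roughly as follows. This is only an illustration: the data is synthetic (generated with `make_classification` as a stand-in for the real dataset), and the 60/20/20 proportions and the candidate `C` values are assumptions, not choices made in this article.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data (assumption, not the article's dataset).
X_demo, y_demo = make_classification(n_samples=1000, n_features=20,
                                     weights=[0.75, 0.25], random_state=0)

# Hold out 20% as the test set, then split the remainder 75/25,
# which gives 60% train / 20% validation / 20% test overall.
X_rest, X_te, y_rest, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=0)

# Choose C by validation AUC only; the test set is not touched here.
val_auc = {}
for C in [0.05, 0.1, 0.5, 1]:
    clf = LogisticRegression(C=C, max_iter=1000).fit(X_tr, y_tr)
    val_auc[C] = roc_auc_score(y_val, clf.predict_proba(X_val)[:, 1])
best_C = max(val_auc, key=val_auc.get)

# Final, one-time estimate on the untouched test set.
final = LogisticRegression(C=best_C, max_iter=1000).fit(X_tr, y_tr)
test_auc = roc_auc_score(y_te, final.predict_proba(X_te)[:, 1])
print(best_C, round(test_auc, 4))
```

Because the test set plays no part in choosing `best_C`, its AUC is a less biased estimate than any score computed on data the search has already seen.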

Importing packages

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score,precision_score,recall_score,f1_score,roc_auc_score,roc_curve,auc
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from xgboost.sklearn import XGBClassifier
from lightgbm.sklearn import LGBMClassifier

Loading the data

Understanding the data

# Separate out the label column as y; the remaining 88 columns become X
y=data['status']
X=data.drop('status',axis=1)
# Shape of X and the class distribution of y
print('X.shape:',X.shape)
print('y distribution:',y.value_counts())

X.shape: (4754, 88)
y distribution: 0    3561
1    1193
Name: status, dtype: int64
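The counts above mean that roughly a quarter of the samples are positive (status 1), so a model that always predicts 0 would already reach about 75% accuracy. A small sketch of that arithmetic, plus a stratified 70/30 split that keeps the same ratio on both sides. The label vector is rebuilt from the printed counts and the use of `stratify` is an assumption for illustration; the article's own split is not stratified.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Class counts copied from the output above.
n_neg, n_pos = 3561, 1193
pos_rate = n_pos / (n_neg + n_pos)
print(round(pos_rate, 4))        # share of status == 1
print(round(1 - pos_rate, 4))    # accuracy of always predicting 0

# Label vector with the same counts (the features are irrelevant here).
y_demo = np.array([0] * n_neg + [1] * n_pos)
idx = np.arange(len(y_demo))
idx_tr, idx_te = train_test_split(idx, test_size=0.3,
                                  stratify=y_demo, random_state=0)

# With stratify, the positive rate is nearly identical in both parts.
print(round(y_demo[idx_tr].mean(), 4), round(y_demo[idx_te].mean(), 4))
```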

Data preparation

# First drop features that are clearly uninformative: id_name, custid, trade_no, bank_card_no
X.drop(['id_name','custid','trade_no','bank_card_no'],axis=1,inplace=True)
print(X.shape)
# Select the numeric features
X_num=X.select_dtypes('number').copy()
print(X_num.shape)
# Fill missing values with the column means
X_num.fillna(X_num.mean(),inplace=True)
# Inspect the non-numeric variables
X_str=X.select_dtypes(exclude='number').copy()
X_str.describe()
# Replace reg_preference_for_trad with dummy variables (missing values are filled with the mode first);
# the other three non-numeric variables are dropped
X_str['reg_preference_for_trad'] = X_str['reg_preference_for_trad'].fillna(X_str['reg_preference_for_trad'].mode()[0])
X_str_dummy = pd.get_dummies(X_str['reg_preference_for_trad'])
X_str_dummy.head()
# Combine the numeric variables with the nominal (string) dummy variables
X_cl = pd.concat([X_num,X_str_dummy],axis=1,sort=False)
#X_cl.shape
print(X_cl.head())

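# Standardization and min-max normalization
from sklearn import preprocessing
# Min-max scaling to [0, 1]; min_max_data is computed here but not used by the models below
min_max_scale = preprocessing.MinMaxScaler()
min_max_data = min_max_scale.fit_transform(X_cl)

# Standardize to zero mean and unit variance; this is the version used for modeling.
# Note: the scaling is applied to the full dataset, before the train/test split.
X_cl = preprocessing.scale(X_cl)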

(4754, 84)
(4754, 80)
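In the preparation code above, `preprocessing.scale` is applied to all 4754 rows before the train/test split, so the test rows contribute to the means and variances used for scaling. One common way to avoid this is to put the scaler inside a `Pipeline`, so that it is refit on the training folds only during cross-validation. A minimal sketch on synthetic data; the step names `scaler` and `clf` are arbitrary labels chosen here, and this is not the procedure the article's results were produced with.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X_cl and y (assumption).
X_demo, y_demo = make_classification(n_samples=600, n_features=15, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.3, random_state=0)

# The scaler is a pipeline step, so each CV fold fits it on its own training part.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(solver="liblinear")),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>.
param_grid = {"clf__C": [0.05, 0.1, 0.5, 1], "clf__penalty": ["l1", "l2"]}
search = GridSearchCV(pipe, param_grid, cv=5, scoring="roc_auc")
search.fit(X_tr, y_tr)

print(search.best_params_)
test_score = search.score(X_te, y_te)   # ROC AUC, because scoring='roc_auc'
print(round(test_score, 4))
```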

Modeling and evaluation (hyperparameter tuning: grid search)

# Split into training and test sets at a 70/30 ratio
random_state = 1118
X_train,X_test,y_train,y_test = train_test_split(X_cl,y,test_size=0.3,random_state=random_state)
print(X_train.shape)
print(X_test.shape)

def gridSearch_vali(model,param_grid,cv=5):
    # Run a cross-validated grid search on the training set, scored by ROC AUC,
    # and return the best parameter combination
    print("parameters:{}".format(param_grid))
    grid_search = GridSearchCV(estimator=model,param_grid=param_grid,cv=cv,scoring='roc_auc')
    grid_search.fit(X_train,y_train)
    print(grid_search.best_params_)
    return grid_search.best_params_
    


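"""
logistic regression
"""
lr_param_temp = {'C':[0.05,0.1,0.5,1],'penalty':['l1','l2']}
# liblinear supports both l1 and l2 penalties; the default lbfgs solver in recent scikit-learn rejects l1
lr = LogisticRegression(solver='liblinear')
lr.set_params(**gridSearch_vali(lr,lr_param_temp))
lr.fit(X_train,y_train)
# predict() returns hard 0/1 labels, so the AUC printed here is computed from labels, not probabilities
y_train_pred = lr.predict(X_train)
y_test_pred = lr.predict(X_test)
print('Train：{:.4f}'.format(roc_auc_score(y_train, y_train_pred)))
print('Test：{:.4f}'.format(roc_auc_score(y_test, y_test_pred)))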

"""
decision tree
"""
dtc_param_temp = {'max_depth':[3,4,5,6]}
dtc = DecisionTreeClassifier()
dtc.set_params(**gridSearch_vali(dtc,dtc_param_temp))
dtc.fit(X_train,y_train)
y_train_pred = dtc.predict(X_train)
y_test_pred = dtc.predict(X_test)
print('Train：{:.4f}'.format(roc_auc_score(y_train, y_train_pred)))
print('Test：{:.4f}'.format(roc_auc_score(y_test, y_test_pred)))

"""
svm
"""
# Note: gamma is ignored by the linear kernel, so only C actually varies in this grid
svm_param_temp = {"gamma":[0.01,0.1],"C":[0.01,1]}
svm = SVC(kernel='linear',probability=True)
svm.set_params(**gridSearch_vali(svm,svm_param_temp))
svm.fit(X_train,y_train)
y_train_pred = svm.predict(X_train)
y_test_pred = svm.predict(X_test)
print('Train：{:.4f}'.format(roc_auc_score(y_train, y_train_pred)))
print('Test：{:.4f}'.format(roc_auc_score(y_test, y_test_pred)))

"""
xgboost
"""
xgbc_param_temp = {'max_depth':[5,10],'learning_rate':[0.1,1]}
xgbc = XGBClassifier()
xgbc.set_params(**gridSearch_vali(xgbc,xgbc_param_temp))
xgbc.fit(X_train,y_train)
y_train_pred = xgbc.predict(X_train)
y_test_pred = xgbc.predict(X_test)
print('Train：{:.4f}'.format(roc_auc_score(y_train, y_train_pred)))
print('Test：{:.4f}'.format(roc_auc_score(y_test, y_test_pred)))

"""
lightgbm
"""
lgbc_param_temp = {'max_depth':[5,10],'num_leaves':[20,50]}
lgbc = LGBMClassifier()
lgbc.set_params(**gridSearch_vali(lgbc,lgbc_param_temp))
lgbc.fit(X_train,y_train)
y_train_pred = lgbc.predict(X_train)
y_test_pred = lgbc.predict(X_test)
print('Train：{:.4f}'.format(roc_auc_score(y_train, y_train_pred)))
print('Test：{:.4f}'.format(roc_auc_score(y_test, y_test_pred)))
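Each model above is configured with `set_params` from the returned best parameters and then fit again. `GridSearchCV` already refits the best configuration on the whole training set by default (`refit=True`), so the fitted model can also be taken from `best_estimator_` directly. A sketch of that variant on synthetic data; `grid_search_fit` is a hypothetical helper name introduced here, not part of the article's code.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X_train / y_train (assumption).
X_demo, y_demo = make_classification(n_samples=600, n_features=15, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.3, random_state=0)

def grid_search_fit(model, param_grid, X, y, cv=5):
    """Hypothetical variant of gridSearch_vali that returns the fitted search."""
    search = GridSearchCV(estimator=model, param_grid=param_grid,
                          cv=cv, scoring="roc_auc")
    search.fit(X, y)
    return search

search = grid_search_fit(DecisionTreeClassifier(random_state=0),
                         {"max_depth": [3, 4, 5, 6]}, X_tr, y_tr)

best_tree = search.best_estimator_    # already refit on all of X_tr
print(search.best_params_)
print(round(search.best_score_, 4))   # mean cross-validated AUC of the best setting
print(best_tree.get_depth())          # never exceeds the selected max_depth
```

`best_score_` is also worth reporting next to the train and test figures, since it is computed on folds the model was not fit on.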

(3327, 85)
(1427, 85)

parameters:{'C': [0.05, 0.1, 0.5, 1], 'penalty': ['l1', 'l2']}
{'C': 0.05, 'penalty': 'l1'}
Train：0.6524
Test：0.6120
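The Train and Test figures in this article come from `roc_auc_score` applied to the output of `predict()`, that is, to hard 0/1 labels. With hard labels the ROC curve has a single interior point, and the resulting value equals the balanced accuracy at the default threshold. ROC AUC is more commonly computed from continuous scores such as `predict_proba(...)[:, 1]`, which is also what `scoring='roc_auc'` uses inside the grid search. The sketch below compares the two on synthetic data (an assumption; the real dataset is not reproduced here).

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced, noisy stand-in for the real data (assumption).
X_demo, y_demo = make_classification(n_samples=1000, n_features=20,
                                     weights=[0.75, 0.25], flip_y=0.1,
                                     random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.3, random_state=0)

clf = LogisticRegression(solver="liblinear").fit(X_tr, y_tr)

# AUC from hard labels, as done in the article's print statements.
auc_from_labels = roc_auc_score(y_te, clf.predict(X_te))
# AUC from predicted probabilities of the positive class.
auc_from_scores = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
# Balanced accuracy of the same hard labels, for comparison.
bal_acc = balanced_accuracy_score(y_te, clf.predict(X_te))

print(round(auc_from_labels, 4), round(bal_acc, 4), round(auc_from_scores, 4))
```

The first two printed values coincide, which is why label-based AUC figures are not directly comparable with the probability-based AUC that the grid search optimizes.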


parameters:{'max_depth': [3, 4, 5, 6]}
{'max_depth': 4}
Train：0.6598
Test：0.6029


parameters:{'gamma': [0.01, 0.1], 'C': [0.01, 1]}
{'C': 0.01, 'gamma': 0.01}
Train：0.6108
Test：0.5935
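The SVM here is created with `kernel='linear'`, and in scikit-learn `gamma` is a coefficient of the `rbf`, `poly` and `sigmoid` kernels only. The `gamma` values in the grid therefore should not change the fitted model, and the reported best `gamma` of 0.01 is most likely just the first of several tied candidates. A small check on synthetic data (an assumption, used only to illustrate the point):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

# Same C, different gamma, linear kernel: gamma is not used by this kernel.
svc_a = SVC(kernel="linear", C=0.01, gamma=0.01).fit(X_demo, y_demo)
svc_b = SVC(kernel="linear", C=0.01, gamma=0.1).fit(X_demo, y_demo)
same_linear = np.allclose(svc_a.decision_function(X_demo),
                          svc_b.decision_function(X_demo))

# With an RBF kernel the same change in gamma does alter the model.
rbf_a = SVC(kernel="rbf", C=1, gamma=0.01).fit(X_demo, y_demo)
rbf_b = SVC(kernel="rbf", C=1, gamma=0.1).fit(X_demo, y_demo)
same_rbf = np.allclose(rbf_a.decision_function(X_demo),
                       rbf_b.decision_function(X_demo))

print(same_linear, same_rbf)
```

To make `gamma` matter, the kernel would have to be switched to `rbf` (or `kernel` added to the grid); with the linear kernel the grid can be reduced to `C` alone.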


parameters:{'max_depth': [5, 10], 'learning_rate': [0.1, 1]}
{'learning_rate': 0.1, 'max_depth': 5}
Train：0.8898
Test：0.6352


parameters:{'max_depth': [5, 10], 'num_leaves': [20, 50]}
{'max_depth': 5, 'num_leaves': 20}
Train：0.8732
Test：0.6315
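To make the overfitting visible, the train/test gap can be computed from the figures printed above. The numbers in this sketch are copied from those results; only the subtraction and sorting are new.

```python
# AUC values (train, test) copied from the printed results above.
results = {
    "logistic regression": (0.6524, 0.6120),
    "decision tree": (0.6598, 0.6029),
    "svm": (0.6108, 0.5935),
    "xgboost": (0.8898, 0.6352),
    "lightgbm": (0.8732, 0.6315),
}

# Gap between training and test score for each model.
gaps = {name: round(train - test, 4) for name, (train, test) in results.items()}

# Print the models from largest to smallest gap.
for name, gap in sorted(gaps.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:20s} gap = {gap:.4f}")
```

XGBoost and LightGBM have the best test scores but also by far the largest gaps (about 0.25 and 0.24), while the three simpler models stay within about 0.06, which is consistent with the overfitting concern raised at the top of this post.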
