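The complete modeling workflow

Today I'm writing up the logistic regression scorecard modeling workflow I have pieced together recently. It is roughly right as it stands, and I will keep filling in details over time.

Table of contents

- 1. Importing packages
- 2. Data and preprocessing
- 3. Data description
- 4. Feature engineering
- 5. Modeling
- 5. Building the scorecard and evaluating scores
- 6. Stability analysis
- 7. Saving the results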

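1. Importing packages

Apart from the standard packages, the rest are internal modules that a senior colleague put together and that I have since adapted to my own style.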

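import pandas as pd
import numpy as np
import statsmodels.api as sm # used to add the constant term
import matplotlib.pyplot as plt # distribution plots
import seaborn as sns
import importlib # for reloading packages
import warnings
warnings.filterwarnings('ignore')

from pandas.io.excel import ExcelWriter # write results to Excel
# Internal development package; var_select is used in section 4 and is assumed to live in the same package
from riskmodels import detector, evaluate, models, scorecard, utils, var_select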

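2. Data and preprocessing

Start by loading the data. Large datasets can be stored as CSV, but a CSV round trip sometimes changes column types, so I now save the data downloaded from Hive in pickle format.

The dataset will typically need at least these fields: a time column (month), a product identifier (product_no), a primary key (id), and the label (target).

Next, decide whether to drop grey (indeterminate) samples. Sometimes they are needed for analysis; if not, they can be removed before the data is read in.

Then replace missing-value placeholders such as -999 and None.

After that, convert the types of any columns that need special handling.

Finally, rename variables where necessary.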

total_data=pd.read_pickle('data.pkl')
total_data=detector.data_pro(total_data)
# rename returns a new DataFrame, so the result has to be assigned back
total_data=total_data.rename(columns={'a.create_date':'create_date'})
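Since `detector.data_pro` is an internal helper, here is a minimal pandas-only sketch of the kind of cleaning described above: swapping placeholder codes for NaN, converting string columns that are really numeric, and confirming that a pickle round trip keeps the dtypes. The column names, the sentinel values, and the `clean_missing` helper are all hypothetical and are not taken from the internal package.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical raw extract: column names and sentinel codes are illustrative only.
raw = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "month": ["2021-11", "2021-12", "2022-01", "2022-02"],
    "income": [5000.0, -999.0, 7200.0, np.nan],
    "age": ["35", "41", "-999", "28"],  # numbers that arrived as strings
    "y": [0, 1, 0, 0],
})

def clean_missing(df, numeric_sentinels=(-999,), string_sentinels=("-999", "None", "")):
    """Replace placeholder codes with NaN and convert fully numeric string columns."""
    out = df.copy()
    for col in out.columns:
        s = out[col]
        if pd.api.types.is_numeric_dtype(s):
            out[col] = s.mask(s.isin(list(numeric_sentinels)))
        else:
            s = s.mask(s.isin(list(string_sentinels)))
            as_number = pd.to_numeric(s, errors="coerce")
            # Convert only if every non-missing value parsed as a number.
            out[col] = as_number if as_number.notna().sum() == s.notna().sum() else s
    return out

clean = clean_missing(raw)

# Pickle keeps the dtypes exactly as they were when the frame was saved.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.pkl")
    clean.to_pickle(path)
    restored = pd.read_pickle(path)
dtypes_preserved = restored.dtypes.equals(clean.dtypes)
```

The conversion is deliberately conservative: a column such as `month` that does not parse as a number is left untouched rather than turned into NaN.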

3. Data description

This step is mostly descriptive: how the samples are distributed by month and by product, how the dataset is split, and so on.

# Monthly summary
t_m=detector.sample_stats(total_data,'month',target='y')

# Split the data
total_data['month']=pd.to_datetime(total_data['month'])

# Summary of the train/OOT split for each candidate cut-off month
split_stat_dat=detector.split_stat(total_data,'month',target='y')
# Pick the most reasonable cut-off month, then split
train_df, test_df,oot_df=detector.train_test_oot(total_data,'month',oot_date='2022-02-01')

# Rebuild one complete dataset for later whole-sample analysis: distributions, score ranking, etc.
train_df['tto']=1
test_df['tto']=2
oot_df['tto']=3
t_data=pd.concat([train_df,test_df,oot_df],axis=0)
t_data.sort_index(inplace=True)

# Distribution across the three datasets
# (the monthly call above is spelled sample_stats; use whichever name the package actually exposes)
set_stat=detector.sample_stat(t_data,'tto',target='y')
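The split helpers above are internal, so the following is only a sketch of what a time-based train/test/OOT split might look like in plain pandas. It assumes rows dated on or after the cut-off become OOT and the earlier rows are split at random; the function name, the 30% test share, and the demo data are hypothetical.

```python
import numpy as np
import pandas as pd

def train_test_oot_split(df, date_col, oot_date, test_size=0.3, seed=42):
    """Rows on/after oot_date form the OOT set; earlier rows are split randomly."""
    dates = pd.to_datetime(df[date_col])
    cutoff = pd.Timestamp(oot_date)
    oot = df[dates >= cutoff]
    dev = df[dates < cutoff]
    test = dev.sample(frac=test_size, random_state=seed)
    train = dev.drop(test.index)  # assumes a unique index
    return train, test, oot

# Hypothetical data: six months with 100 rows each and roughly 10% bad rate.
rng = np.random.default_rng(0)
months = pd.date_range("2021-09-01", periods=6, freq="MS")
demo = pd.DataFrame({
    "month": months.repeat(100),
    "y": rng.binomial(1, 0.1, size=600),
})

# Monthly sample count and bad rate, the kind of table used to choose the cut-off.
monthly = demo.groupby("month")["y"].agg(n="count", bad="sum", bad_rate="mean")

train_part, test_part, oot_part = train_test_oot_split(demo, "month", "2022-02-01")

# Tag each subset and stitch them back together; assign() returns copies,
# which avoids writing into a slice of the original frame.
full = pd.concat([
    train_part.assign(tto=1),
    test_part.assign(tto=2),
    oot_part.assign(tto=3),
]).sort_index()
```

Looking at `monthly` before fixing `oot_date` makes it easier to check that the OOT window has enough bad samples to evaluate on.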

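4. Feature engineering

This step is mainly feature selection: single-value rate, missing rate, binning, IV, PSI, risk-trend consistency, correlation, and so on.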

# Candidate X variables (numeric only); list the non-feature columns to drop inside []
fea_0=total_data.drop([],axis=1).columns.to_list()
# Descriptive statistics for the candidate features
det_data=detector.detect(total_data[fea_0])
# Filter on missing rate and single-value rate
na_repeat,na_stat,repeat_stat=evaluate.distribution_statistics(total_data,fea_0)
fea_1=evaluate.na_select(na_repeat,fea_0)
fea_2=evaluate.repeat_select(na_repeat,fea_1)

# WOE binning
bins=scorecard.woebin(train_df,x=fea_2,y='y',method=['hist','chi2']) # the internal parameters of this function were modified
brks,spe=scorecard.woebin_breaks(bins) # retrieve the binning result
woe_df,iv_df=scorecard.sc_bins_to_df(bins)

# Filter on IV and monotonicity ('单调性' is the monotonicity column)
# Each condition needs its own parentheses because & binds more tightly than >
fea_3=iv_df[(iv_df['IV']>0.02) & (iv_df['单调性'].isin(['increasing','decreasing']))].index.to_list()

# PSI filter
var_psi=scorecard.woebin_psi(train_df,oot_df,bins={k:v for k,v in bins.items() if k in fea_3})
var_psi_t=var_psi[['variable','psi']].groupby('variable').mean().reset_index()
fea_4=var_psi_t.loc[var_psi_t['psi']<0.1,'variable'].to_list()

# Risk-trend consistency filter
var_risk_consist=var_select.risk_trends_consistency(oot_df,sc_bins={v:bins[v] for v in fea_4},target='y')

fea_5=[k for k,v in var_risk_consist.items() if v==1]

# Correlation filter
fea_6=var_select.correlation_select(train_df, fea_5)
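To make the IV and monotonicity filter more concrete, here is a small sketch of how WOE and IV can be computed for one variable once the break points are known. It assumes the convention WOE = ln(bad share / good share); the internal `scorecard` module may use the opposite sign, which flips WOE but leaves IV unchanged. The `woe_iv_table` function and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd

def woe_iv_table(x, y, breaks):
    """Bin x at the given break points and return per-bin WOE plus the total IV.

    Assumed convention: WOE = ln(bad share / good share).
    """
    binned = pd.cut(x, bins=[-np.inf, *breaks, np.inf], right=False)
    grouped = pd.DataFrame({"bin": binned, "y": y}).groupby("bin", observed=True)["y"]
    tab = pd.DataFrame({"bad": grouped.sum(), "good": grouped.count() - grouped.sum()})
    bad_share = tab["bad"] / tab["bad"].sum()
    good_share = tab["good"] / tab["good"].sum()
    # A bin with zero goods or zero bads gives an infinite WOE; real binning
    # code usually merges or smooths such bins first.
    tab["woe"] = np.log(bad_share / good_share)
    tab["bin_iv"] = (bad_share - good_share) * tab["woe"]
    return tab, float(tab["bin_iv"].sum())

# Toy variable with three bins of 100 rows and 20, 10, and 5 bads respectively.
x = np.repeat([20, 40, 60], 100)
y = np.concatenate([
    np.r_[np.ones(20), np.zeros(80)],
    np.r_[np.ones(10), np.zeros(90)],
    np.r_[np.ones(5), np.zeros(95)],
]).astype(int)

table, iv = woe_iv_table(x, y, breaks=[30, 50])

# The same two checks used in the filter above: enough IV and a monotonic WOE trend.
is_monotonic = table["woe"].is_monotonic_increasing or table["woe"].is_monotonic_decreasing
keep_variable = iv > 0.02 and is_monotonic
```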

5. Modeling

The modeling step mainly uses stepwise logistic regression.

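train_X=scorecard.woebin_ply(train_df[fea_6],bins,value='woe')
train_y=train_df['y']
test_X=scorecard.woebin_ply(test_df[fea_6],bins,value='woe')
test_y=test_df['y']
oot_X=scorecard.woebin_ply(oot_df[fea_6],bins,value='woe')
oot_y=oot_df['y']

# Stepwise logistic regression
# The WOE columns are assumed to carry a '_woe' suffix, matching the var[:-4] slicing used in section 6
_, fea_7=models.stepwise_lr(train_X,train_y.values,kfold=5,features=[f+'_woe' for f in fea_6],selection_criterion=1,watch_sample=(test_X,test_y.values))
# selection_criterion is the model evaluation criterion; AUC is the usual choice

# Fit the logistic regression and refine it: drop variables whose p-value exceeds the threshold,
# and output the model summary, VIF, and the correlation coefficients between variables
lr_model_result,selected_variables,model_perf,cor_f=models.lr_select(train_X,train_y,fea_7,p_limit=0.5)

# Check model performance on the test set and OOT
model_ef,train_pred,test_pred,oot_pred=models.model_effect(lr_model_result,selected_variables,train_X,test_X,oot_X,train_y,test_y,oot_y)

# model_ef holds the performance on train, test, and OOT. The KS values on the three sets should be
# as close as possible, ideally within 3 points of each other, and KS should preferably be above 0.2.
# If the results fall short, go back and keep adjusting.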

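When the performance is not good enough, the adjustments I can think of so far are to relax the monotonicity filter a little and then revise the binning by hand so that each variable stays monotonic and consistent with the trend the business expects.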
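The KS rule of thumb above can be written down as a small check. This is only a sketch: `ks_statistic` is a plain NumPy implementation of the usual definition (the largest gap between the cumulative bad and cumulative good distributions), not the internal `models.model_effect`, and the thresholds are the ones stated above, read as KS of at least 0.2 and a spread of at most 0.03 across the three sets. The simulated scores are hypothetical.

```python
import numpy as np

def ks_statistic(y_true, y_score):
    """Largest gap between cumulative bad and cumulative good shares, scanning by score."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    order = np.argsort(-y_score, kind="mergesort")
    y_sorted = y_true[order]
    s_sorted = y_score[order]
    cum_bad = np.cumsum(y_sorted) / y_sorted.sum()
    cum_good = np.cumsum(1 - y_sorted) / (1 - y_sorted).sum()
    # Only evaluate the gap after the last row of each group of tied scores.
    last_of_tie = np.r_[s_sorted[1:] != s_sorted[:-1], True]
    return float(np.max(np.abs(cum_bad - cum_good)[last_of_tie]))

def ks_is_acceptable(ks_by_set, min_ks=0.2, max_gap=0.03):
    """Apply the rule of thumb: every KS >= min_ks and the spread <= max_gap."""
    values = list(ks_by_set.values())
    return min(values) >= min_ks and (max(values) - min(values)) <= max_gap

# Hypothetical predictions: bad samples score about one standard deviation higher.
rng = np.random.default_rng(0)

def simulate(n):
    y = rng.binomial(1, 0.1, size=n)
    score = rng.normal(loc=y * 1.0, scale=1.0)
    return y, score

ks_by_set = {name: ks_statistic(*simulate(n))
             for name, n in [("train", 7000), ("test", 3000), ("oot", 2000)]}
acceptable = ks_is_acceptable(ks_by_set)
```

If `acceptable` comes back False, that is the signal to return to binning and feature selection as described above.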

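5. Building the scorecard and evaluating scores

Once the model is built, convert it into a scorecard and look at several indicators under different views, such as the score distribution, how well the score ranks risk, and how it performs month by month.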

# Build the scorecard
score_card=scorecard.make_scorecard(bins,lr_model_result.params.to_dict())

# First look at how well the score ranks risk
train_split,test_split,oot_split=utils.score_badrate(lr_model_result,selected_variables,train_pred,test_pred,oot_pred,train_y,test_y,oot_y)

If the ranking is poor here, go back and adjust again until the bad rate on OOT is properly ordered across score bands.
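I have not shown what `scorecard.make_scorecard` and `utils.pfk` do internally, so the sketch below uses the common points-to-double-the-odds (PDO) scaling as a stand-in; it may differ from the internal implementation. It assumes the model predicts the probability of bad, that WOE = ln(bad share / good share), and hypothetical scaling parameters (600 points at bad:good odds of 1:20, 20 points to double the odds). The coefficients and WOE values are made up for illustration.

```python
import numpy as np

def scaling(base_score=600, base_odds=1 / 20, pdo=20):
    """Return (offset, factor) so that score = offset - factor * ln(bad odds)."""
    factor = pdo / np.log(2)
    offset = base_score + factor * np.log(base_odds)
    return offset, factor

def prob_to_score(p_bad, **kwargs):
    """Map a predicted bad probability to a score; higher score means lower risk."""
    offset, factor = scaling(**kwargs)
    return offset - factor * np.log(p_bad / (1 - p_bad))

def scorecard_points(intercept, coefs, woe_by_var, **kwargs):
    """Split the score into base points plus points per variable bin."""
    offset, factor = scaling(**kwargs)
    base_points = offset - factor * intercept
    points = {var: {b: -factor * coefs[var] * w for b, w in woes.items()}
              for var, woes in woe_by_var.items()}
    return base_points, points

# Hypothetical fitted model and WOE tables.
intercept = -2.5
coefs = {"age_woe": 0.8, "income_woe": 1.1}
woe_by_var = {
    "age_woe": {"<30": 0.6, ">=30": -0.3},
    "income_woe": {"<5000": 0.4, ">=5000": -0.5},
}
base_points, points = scorecard_points(intercept, coefs, woe_by_var)

# One applicant: adding up card points should match scoring the model probability.
applicant = {"age_woe": "<30", "income_woe": ">=5000"}
score_from_card = base_points + sum(points[v][b] for v, b in applicant.items())
log_odds = intercept + sum(coefs[v] * woe_by_var[v][b] for v, b in applicant.items())
p_bad = 1 / (1 + np.exp(-log_odds))
score_from_prob = prob_to_score(p_bad)
```

Because the score is linear in the log-odds, the per-bin points add up to exactly the same value as converting the predicted probability, which is what lets the scorecard be applied as a lookup table.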

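6. Stability analysis

Once the model is settled and ranks well on OOT, the next step is to check its stability. This is done mainly on the full dataset: whether the score distribution is approximately normal, whether the good and bad samples are well separated, and whether the model's performance is stable from month to month.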

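# Apply the WOE transform to the full dataset; var[:-4] strips the '_woe' suffix to recover the raw column names
tt_data=scorecard.woebin_ply(t_data[[var[:-4] for var in selected_variables]],bins,value='woe')
tt_data=sm.add_constant(tt_data)
tt_data['pred']=lr_model_result.predict(tt_data)
tt_data['score']=utils.pfk(lr_model_result,tt_data['pred'])
tt_data['y']=t_data['y']
tt_data['month']=t_data['month']
# If there is a product number, add it as well so products can be analyzed later

# Ranking of the score on the full dataset
t_split,sp_breaks=utils.split_performance_table(tt_data['y'],tt_data['score'])

# Score distribution on the full dataset
# histplot is used because displot is figure-level and does not accept ax, hist, or norm_hist
df_fig,ax1=plt.subplots(figsize=(12,9))
sns.histplot(tt_data[tt_data['y']==1]['score'],ax=ax1,kde=True,stat='density',bins=10,color='green',label='bad')
sns.histplot(tt_data[tt_data['y']==0]['score'],ax=ax1,kde=True,stat='density',bins=10,color='red',label='good')
ax1.legend()
plt.savefig('总体数据分布')
# Monthly model performance, mainly KS
month_ks=evaluate.anyue_ks(tt_data,'y')

# With multiple products, it is also worth checking model performance per product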

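With multiple products, what you are building is usually a general-purpose model. Because the customer populations of different projects tend to differ, a general model can end up performing poorly. In that case, build a general model only across projects whose populations are similar, and build a dedicated model for each project that stands apart.

At this point, if normality and stability both hold, most of the model development work is done. What remains is to validate the model against other labels.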
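Besides monthly KS, one common way to quantify score stability is the PSI of each month's score distribution against a baseline such as the training sample. The sketch below is a generic NumPy implementation, not the internal `scorecard.woebin_psi`; the decile binning, the `psi` helper, and the simulated scores are assumptions. The 0.1 cut-off used for variables in section 4 is often applied to scores as well, as a rule of thumb.

```python
import numpy as np
import pandas as pd

def psi(expected, actual, n_bins=10, eps=1e-6):
    """Population stability index of `actual` against `expected`.

    Bins are the quantiles of the baseline; eps guards against empty bins.
    """
    expected = np.asarray(expected, dtype=float)
    actual = np.asarray(actual, dtype=float)
    # Interior cut points from baseline quantiles; duplicate cuts are dropped.
    cuts = np.unique(np.quantile(expected, np.linspace(0, 1, n_bins + 1)[1:-1]))
    n_cells = len(cuts) + 1
    e = np.bincount(np.searchsorted(cuts, expected, side="right"), minlength=n_cells) / len(expected)
    a = np.bincount(np.searchsorted(cuts, actual, side="right"), minlength=n_cells) / len(actual)
    e, a = np.clip(e, eps, None), np.clip(a, eps, None)
    return float(np.sum((a - e) * np.log(a / e)))

# Hypothetical scores: one month drawn from the baseline distribution, one shifted by 40 points.
rng = np.random.default_rng(1)
baseline = rng.normal(600, 50, size=5000)
monthly = pd.DataFrame({
    "month": ["2022-01"] * 2000 + ["2022-02"] * 2000,
    "score": np.r_[rng.normal(600, 50, size=2000), rng.normal(640, 50, size=2000)],
})

psi_by_month = monthly.groupby("month")["score"].apply(lambda s: psi(baseline, s))
```

A month whose PSI stands out from the others is worth checking against the product mix, since a change in population often shows up here before it shows up in KS.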

7. Saving the results

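# The .xlsx extension lets pandas pick the Excel engine; the file name means "model development summary"
writer=ExcelWriter('模型开发结果汇总.xlsx')
t_m.to_excel(writer,sheet_name='样本',startrow=1) # sheet '样本' = samples
na_stat.to_excel(writer,sheet_name='变量统计',startrow=1) # sheet '变量统计' = variable statistics
...
writer.close()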
