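How a Credit Scoring System Works, Part 3

Preface

How a Credit Scoring System Works, Part 1

How a Credit Scoring System Works, Part 2: Binning Logic

Plotting the correlation coefficient heatmap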

import matplotlib.pyplot as plt
import seaborn as sns

corr = train.corr()  # correlation coefficients between all variables
xticks = ['x0', 'x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8', 'x9', 'x10']  # x-axis labels
yticks = list(corr.index)  # y-axis labels
fig = plt.figure(figsize=(10, 10))
ax = fig.add_subplot(1, 1, 1)
sns.heatmap(corr, annot=True, cmap='rainbow', ax=ax,
            annot_kws={'size': 12, 'weight': 'bold', 'color': 'blue'})  # draw the correlation heatmap
ax.set_xticklabels(xticks, rotation=0, fontsize=12)
ax.set_yticklabels(yticks, rotation=0, fontsize=12)
plt.show()

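The figure above shows that the correlations between the variables are all fairly small, although the one between NumberOfOpenCreditLinesAndLoans and NumberRealEstateLoansOrLines is relatively high, at 0.43.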
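Reading the strongest pairs off a heatmap by eye gets tedious with many variables. The sketch below shows one way to list them numerically. The helper name `top_correlated_pairs` and the small random `demo` frame are made up for illustration; with the real data you would pass `train` instead.

```python
from itertools import combinations

import numpy as np
import pandas as pd


def top_correlated_pairs(frame, n=3):
    """Return the n column pairs with the largest absolute correlation."""
    corr = frame.corr()
    # combinations() visits each unordered pair once and skips the diagonal
    pairs = [(a, b, float(corr.loc[a, b])) for a, b in combinations(corr.columns, 2)]
    pairs.sort(key=lambda item: abs(item[2]), reverse=True)
    return pairs[:n]


# Hypothetical demo data: 'b' is built from 'a', 'c' is independent noise
rng = np.random.default_rng(0)
a = rng.normal(size=500)
demo = pd.DataFrame({
    'a': a,
    'b': a + rng.normal(scale=0.5, size=500),
    'c': rng.normal(size=500),
})
top_pairs = top_correlated_pairs(demo)
for first, second, value in top_pairs:
    print(first, second, round(value, 2))
```

The pair built to be related ('a' and 'b') comes out on top, which is the same reading the heatmap gives for the two loan-count variables.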

Showing each feature's IV on a bar chart

ivlist = [ivx1, ivx2, ivx3, ivx4, ivx5, ivx6, ivx7, ivx8, ivx9, ivx10]  # IV of each variable
index = ['x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8', 'x9', 'x10']  # x-axis labels
fig1 = plt.figure(figsize=(8, 8))
ax1 = fig1.add_subplot(1, 1, 1)
x = np.arange(len(index)) + 1
ax1.bar(x, ivlist, width=0.4)  # draw the bar chart
ax1.set_xticks(x)
ax1.set_xticklabels(index, rotation=0, fontsize=12)
ax1.set_ylabel('IV(Information Value)', fontsize=12)
# Write the IV value above each bar
for a, b in zip(x, ivlist):
    plt.text(a, b + 0.01, '%.4f' % b, ha='center', va='bottom', fontsize=10)
plt.show()

The usual criteria for judging a variable's predictive power from its IV are:

< 0.02: unpredictive
0.02 to 0.1: weak
0.1 to 0.3: medium
0.3 to 0.5: strong
> 0.5: suspicious

The IV values of DebtRatio, MonthlyIncome, NumberRealEstateLoansOrLines, and NumberOfDependents are noticeably low.
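To make the thresholds concrete, here is a minimal sketch of how an IV can be computed from per-bin good/bad counts and then mapped onto the bands above. The function names and the counts are hypothetical, and the sketch assumes every bin contains at least one good and one bad sample (otherwise the logarithm is undefined). The variables `ivx1` to `ivx10` used earlier come from the previous part and may have been computed differently.

```python
import math


def information_value(good_counts, bad_counts):
    """IV = sum over bins of (good share - bad share) * WOE of the bin."""
    good_total = sum(good_counts)
    bad_total = sum(bad_counts)
    iv = 0.0
    for good, bad in zip(good_counts, bad_counts):
        good_share = good / good_total
        bad_share = bad / bad_total
        woe = math.log(good_share / bad_share)  # assumes no empty bins
        iv += (good_share - bad_share) * woe
    return iv


def iv_strength(iv):
    """Map an IV onto the rule-of-thumb bands listed in the text."""
    if iv < 0.02:
        return 'unpredictive'
    if iv < 0.1:
        return 'weak'
    if iv < 0.3:
        return 'medium'
    if iv <= 0.5:
        return 'strong'
    return 'suspicious'


# Hypothetical counts for a variable split into four bins
demo_iv = information_value([400, 300, 200, 100], [10, 20, 30, 40])
print(round(demo_iv, 4), iv_strength(demo_iv))
```

Because each term multiplies the share difference by a WOE of the same sign, the IV does not depend on whether WOE is defined as good over bad or bad over good.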

WOE transformation

The Weight of Evidence (WOE) transformation allows a logistic regression model to be converted into the standard scorecard format.

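from pandas import Series

# Replace each raw value with the WOE of the bin it falls into
def replace_woe(series, cut, woe):
    woe_values = []  # named so as not to shadow the built-in list
    i = 0
    while i < len(series):
        value = series[i]
        # Start from the last bin and walk down towards the first one
        j = len(cut) - 2
        m = len(cut) - 2
        while j >= 0:
            if value >= cut[j]:
                j = -1  # bin found, leave the inner loop
            else:
                j -= 1
                m -= 1
        woe_values.append(woe[m])
        i += 1
    return woe_values

train['RevolvingUtilizationOfUnsecuredLines'] = Series(
    replace_woe(train['RevolvingUtilizationOfUnsecuredLines'], cutx1, woex1))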

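This code takes the value in each row,

compares it in turn with the 0.75, 0.5, and 0.25 quantile cut points, from the highest down,

and, as soon as the value is greater than or equal to a cut point, replaces it with the WOE of the bin that starts at that cut point.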
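The same lookup can be written without explicit loops. The sketch below is an alternative, not the code used in this series: it swaps the nested while loops for `np.searchsorted`, and it assumes `cut` is sorted in ascending order with `-inf` as its first element, as quantile cut lists usually are. The cut points and WOE values are invented for the demo.

```python
import numpy as np


def replace_woe_vectorized(values, cut, woe):
    """Look up the WOE of the bin each value falls into."""
    values = np.asarray(values, dtype=float)
    lower_bounds = np.asarray(cut[:-1], dtype=float)  # left edge of every bin
    # side='right' counts the lower bounds that are <= value;
    # subtracting 1 gives the index of the last such bound, i.e. the bin
    bin_index = np.searchsorted(lower_bounds, values, side='right') - 1
    return np.asarray(woe)[bin_index]


# Hypothetical cut points (quartiles) and per-bin WOE values
demo_cut = [float('-inf'), 0.03, 0.15, 0.55, float('inf')]
demo_woe = [-1.2, -0.5, 0.3, 1.1]
demo_result = replace_woe_vectorized([0.0, 0.03, 0.2, 0.9], demo_cut, demo_woe)
print(demo_result.tolist())
```

A value that equals a cut point lands in the bin starting at that cut point, which matches the `value >= cut[j]` test in the loop version.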

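Cross-validation can be used for model selection

Concept

Split the training data set into K parts; K is typically 10
Take each part in turn as the validation set, train the classifier on the remaining parts, and measure its accuracy on the validation set
The mean accuracy over the K runs is taken as the classifier's average accuracy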
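The three steps above can be sketched by hand with scikit-learn's `KFold`. This is only an illustration on synthetic data from `make_classification`; the names `X_demo`, `y_demo`, and `fold_accuracy` are made up here and are not the training data used in this series.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Synthetic stand-in for the real training data (an assumption for the demo)
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)

K = 10
fold_accuracy = []
splitter = KFold(n_splits=K, shuffle=True, random_state=0)
for train_idx, valid_idx in splitter.split(X_demo):
    # Train on K-1 parts, validate on the part that was held out
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_demo[train_idx], y_demo[train_idx])
    fold_accuracy.append(clf.score(X_demo[valid_idx], y_demo[valid_idx]))

# The mean over the K runs is the cross-validated accuracy
mean_accuracy = float(np.mean(fold_accuracy))
print(len(fold_accuracy), round(mean_accuracy, 3))
```

`cross_val_score`, used later in this post, wraps this loop in a single call.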

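Importing the libraries

# Logistic regression
from sklearn.linear_model import LogisticRegression
# Ensemble classifiers
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier, RandomForestClassifier
# k-nearest neighbors classifier
from sklearn.neighbors import KNeighborsClassifier
# cross_val_score is needed by cvDictGen below
from sklearn.model_selection import cross_val_score

Creating the classifier instances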

# Arguments that only restated a default under a name that newer scikit-learn releases
# renamed or deprecated (multi_class='ovr', base_estimator=None, loss='deviance',
# max_features='auto') are left out, so the defaults apply on old and new versions alike.
knMod = KNeighborsClassifier(n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, p=2,
                             metric='minkowski', metric_params=None)

lrMod = LogisticRegression(penalty='l1', dual=False, tol=0.0001, C=1.0, fit_intercept=True,
                           intercept_scaling=1, class_weight=None, random_state=None, solver='liblinear', max_iter=100,
                           verbose=2)

adaMod = AdaBoostClassifier(n_estimators=200, learning_rate=1.0)

gbMod = GradientBoostingClassifier(learning_rate=0.1, n_estimators=200, subsample=1.0,
                                   min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_depth=3,
                                   init=None, random_state=None, max_features=None, verbose=0)

rfMod = RandomForestClassifier(n_estimators=10, criterion='gini', max_depth=None, min_samples_split=2,
                               min_samples_leaf=1, min_weight_fraction_leaf=0.0,
                               max_leaf_nodes=None, bootstrap=True, oob_score=False, n_jobs=1, random_state=None,
                               verbose=0)

Scoring the classifiers with cross_val_score

# The inputs are the training data

def cvDictGen(functions, scr, X_train=X_train, Y_train=Y_train, cv=10, verbose=1):
    cvDict = {}
    for func in functions:
        # cross_val_score runs the whole cross-validation procedure, so the data need not be split by hand
        # cv sets how many folds the data is divided into
        cvScore = cross_val_score(func, X_train, Y_train, cv=cv, verbose=verbose, scoring=scr)
        # key: classifier class name; value: [mean score, standard deviation]
        cvDict[str(func).split('(')[0]] = [cvScore.mean(), cvScore.std()]

    return cvDict

cvD = cvDictGen(functions=[knMod, lrMod, adaMod, gbMod, rfMod], scr='roc_auc')

def cvDictNormalize(cvDict):
    # Express every mean and standard deviation relative to the first classifier in the dict
    cvDictNormalized = {}
    baseline = cvDict[list(cvDict.keys())[0]]
    for key in cvDict.keys():
        cvDictNormalized[key] = ['{:0.2f}'.format(cvDict[key][0] / baseline[0]),
                                 '{:0.2f}'.format(cvDict[key][1] / baseline[1])]
    return cvDictNormalized


cvDictNormalize(cvD)

This gives

cvD:

{'KNeighborsClassifier': [0.5887365163416062, 0.011300179653818953], 'LogisticRegression': [0.8500902765971645, 0.0036164412715674102], 'AdaBoostClassifier': [0.8583319753215507, 0.004050825383307547], 'GradientBoostingClassifier': [0.8639129158346284, 0.003503053433053003], 'RandomForestClassifier': [0.7803945135123486, 0.010025212199131]}

Each entry holds the mean and the standard deviation of the cross-validation scores

Normalized results (each value divided by the corresponding value of the first classifier):

{'KNeighborsClassifier': ['1.00', '1.00'], 'LogisticRegression': ['1.44', '0.33'], 'AdaBoostClassifier': ['1.46', '0.36'], 'GradientBoostingClassifier': ['1.47', '0.31'], 'RandomForestClassifier': ['1.33', '0.87']}

Hyperparameter optimization

Importing the libraries

from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint

AdaBoost model

Search parameters

ada_param = {'n_estimators': [10, 50, 100, 200, 400],
             'learning_rate': [0.1, 0.05]}

Training the model

randomizedSearchAda = RandomizedSearchCV(estimator=adaMod, param_distributions=ada_param, n_iter=5,
                                         scoring='roc_auc', cv=None, verbose=2).fit(X_train, Y_train)
                                         

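RandomizedSearchCV parameters:
estimator: the learner to be tuned (adaMod here)
param_distributions: a dict giving the search range of each parameter (ada_param here)
scoring='roc_auc': the evaluation metric is set to ROC AUC
n_iter: the number of parameter combinations sampled (5 here); a larger value makes it more likely that a good combination is found, but the search takes longer
cv=None: use the default cross-validation, which is 5-fold
n_jobs: the number of CPUs used for training; by default one CPU is used, and n_jobs=-1 uses all of them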
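For a quick feel of how these parameters interact, here is a small self-contained sketch on synthetic data. The data set, the reduced search space, and names such as `param_space` and `search` are assumptions made for the demo, chosen small so that it runs in seconds; it is not the search performed in this post.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for X_train / Y_train
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

param_space = {
    'n_estimators': [10, 20, 40],   # a list is sampled uniformly
    'max_depth': randint(1, 4),     # a distribution: integers 1, 2 or 3
    'learning_rate': [0.1, 0.05],
}

search = RandomizedSearchCV(
    estimator=GradientBoostingClassifier(random_state=0),
    param_distributions=param_space,
    n_iter=4,            # only 4 combinations are tried
    scoring='roc_auc',
    cv=3,                # 3 folds per combination, so 12 fits in total
    random_state=0,      # makes the sampled combinations repeatable
)
search.fit(X_demo, y_demo)

best_params, best_score = search.best_params_, search.best_score_
print(best_params, round(best_score, 4))
```

`best_score_` is the mean cross-validated score of the best combination, not a score on held-out test data.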

Training log:

Fitting 5 folds for each of 5 candidates, totalling 25 fits
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[CV] n_estimators=10, learning_rate=0.1 ..............................
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.5s remaining:    0.0s
[CV] ............... n_estimators=10, learning_rate=0.1, total=   0.5s
[CV] n_estimators=10, learning_rate=0.1 ..............................
[CV] ............... n_estimators=10, learning_rate=0.1, total=   0.5s
[CV] n_estimators=10, learning_rate=0.1

The search returns:

randomizedSearchAda.best_params_, randomizedSearchAda.best_score_

best_params_ = {'n_estimators': 200, 'learning_rate': 0.1}

Score of the best estimator (mean cross-validated ROC AUC)
best_score_ = 0.8583

Gradient boosting (GB) model

# Note: newer scikit-learn releases call the 'deviance' loss 'log_loss'
gbParams = {'loss': ['deviance', 'exponential'],
            'n_estimators': [10, 50, 100, 200, 400],
            'max_depth': randint(1, 5),
            'learning_rate': [0.1, 0.05]}

randomizedSearchGB = RandomizedSearchCV(estimator=gbMod, param_distributions=gbParams, n_iter=10,
                                        scoring='roc_auc', cv=None, verbose=2).fit(X_train, Y_train)

randomizedSearchGB.best_params_, randomizedSearchGB.best_score_

Training log:

Fitting 5 folds for each of 1 candidates, totalling 5 fits
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[CV] learning_rate=0.1, loss=exponential, max_depth=2, n_estimators=200 
[CV]  learning_rate=0.1, loss=exponential, max_depth=2, n_estimators=200, total=  12.4s
[CV] learning_rate=0.1, loss=exponential, max_depth=2, n_estimators=200 
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   12.4s remaining:    0.0s
[CV]  learning_rate=0.1, loss=exponential, max_depth=2, n_estimators=200, total=  12.8s

randomizedSearchGB.best_params_, randomizedSearchGB.best_score_


best_params_= {'learning_rate': 0.1, 'loss': 'exponential', 'max_depth': 2, 'n_estimators': 200}

best_score_=0.8634

Refitting on the training data with the best classifiers

bestGbModFitted = randomizedSearchGB.best_estimator_.fit(X_train, Y_train)

bestAdaModFitted = randomizedSearchAda.best_estimator_.fit(X_train, Y_train)

Re-evaluating the models

cvDictHPO = cvDictGen(functions=[bestGbModFitted, bestAdaModFitted], scr='roc_auc')

cvDictHPO: {'GradientBoostingClassifier': [0.8266652916225066, 0.0051271698988066315], 'AdaBoostClassifier': [0.8551661366453513, 0.003975186847574813]}

Normalizing the results

cvDictNormalize(cvDictHPO)

cvDictNormalized: {'GradientBoostingClassifier': ['1.00', '1.00'], 'AdaBoostClassifier': ['1.03', '0.78']}

Plotting the ROC curve

Code

def plotCvRocCurve(X, y, classifier, nfolds=5):
    import numpy as np
    # roc_curve computes the ROC curve, auc the area under it
    from sklearn.metrics import roc_curve, auc
    # StratifiedKFold: stratified k-fold splitting, which keeps the class proportions
    # in every training and test fold the same as in the original data set
    from sklearn.model_selection import StratifiedKFold
    import matplotlib.pyplot as plt

    # nfolds sets the number of stratified folds
    cv = StratifiedKFold(n_splits=nfolds, random_state=0, shuffle=True)

    mean_tpr = 0.0
    # 100 evenly spaced FPR values in [0, 1], used as a common grid for averaging the folds
    mean_fpr = np.linspace(0, 1, 100)

    i = 0
    # Each iteration yields the row positions of one training fold and one test fold
    for train, test in cv.split(X, y):
        # Fit the classifier that was passed in on the training fold,
        # then predict class probabilities for the test fold
        probas_ = classifier.fit(X.iloc[train], y.iloc[train]).predict_proba(X.iloc[test])
        # For every threshold, roc_curve returns the FPR (share of actual negatives predicted
        # positive) and the TPR (same as recall: share of actual positives predicted positive).
        # E.g. with a threshold of 0.2, scores below 0.2 are predicted negative and
        # scores of 0.2 or more are predicted positive.
        fpr, tpr, thresholds = roc_curve(y.iloc[test], probas_[:, 1])
        # Piecewise-linear interpolation of this fold's TPR (y values)
        # onto the common FPR grid (x values)
        mean_tpr += np.interp(mean_fpr, fpr, tpr)
        mean_tpr[0] = 0.0
        roc_auc = auc(fpr, tpr)
        plt.plot(fpr, tpr, lw=1, label='ROC fold %d (area = %0.2f)' % (i, roc_auc))
        i = i+1

    plt.plot([0, 1], [0, 1], '--', color=(0.6, 0.6, 0.6), label='Luck')

    # Average the interpolated curves over the folds
    mean_tpr /= cv.n_splits
    mean_tpr[-1] = 1.0
    mean_auc = auc(mean_fpr, mean_tpr)
    plt.plot(mean_fpr, mean_tpr, 'k--',
             label='Mean ROC (area = %0.2f)' % mean_auc, lw=2)

    plt.xlim([-0.05, 1.05])
    plt.ylim([-0.05, 1.05])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('CV ROC curve')
    plt.legend(loc="lower right")
    fig = plt.gcf()
    fig.set_size_inches(15, 5)

    plt.show()

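Code walkthrough

1. Split the original data into training and test sets, stratified so that the class proportions are preserved

For example, split the original data into 5 training sets and 5 test sets

a Line the data set up in sequence

b Cut it, in element order, into K subsets

c Let each subset serve once as the test set, so that every sample appears in a test set

Notes:

a Without shuffling, this split has no randomness, because each test set is taken in order from front to back

b The data set therefore needs to be shuffled before the method is used (setting shuffle=True in KFold, or in StratifiedKFold as in the code above, is enough)

c random_state defaults to None; if it is set to an integer, every run produces the same result.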
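The difference between an ordered split and a stratified one is easy to see on a toy label vector. The sketch below uses an invented, deliberately sorted data set (80 negatives followed by 20 positives) and a hypothetical helper `positive_rate_per_fold`; it only illustrates the notes above.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Toy data: 20% positives, sorted so that all positives come last
y_demo = np.array([0] * 80 + [1] * 20)
X_demo = np.arange(100).reshape(-1, 1)


def positive_rate_per_fold(splitter):
    """Share of positive labels in each test fold produced by the splitter."""
    return [float(y_demo[test_idx].mean())
            for _, test_idx in splitter.split(X_demo, y_demo)]


# Plain KFold without shuffling cuts the data in order
plain_rates = positive_rate_per_fold(KFold(n_splits=5))
# StratifiedKFold keeps the 20% positive share in every fold
stratified_rates = positive_rate_per_fold(
    StratifiedKFold(n_splits=5, shuffle=True, random_state=0))

print(plain_rates)
print(stratified_rates)
```

With the ordered split, four test folds contain no positives at all and the last one contains nothing else, so a ROC curve could not even be computed per fold; the stratified split gives every fold the same 20% share.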

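2. Train the model on the training data, then use it to predict the test data

3. From the true labels of the test data and the predictions for it, compute the TPR and FPR at every threshold

4. Plot the ROC curve from the TPR and FPR values

Each fold's training and test data produce one ROC curve once they have been processed

Computing the AUC

The final ROC curve

The area value shown in the legend is the AUC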
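Steps 3 and 4 can be followed on four samples. The labels and scores below are invented for the example, and `rates_at` is a hypothetical helper that recomputes one ROC point by hand so it can be compared with what `roc_curve` returns.

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])


def rates_at(threshold):
    """TPR and FPR when scores >= threshold are predicted positive."""
    predicted_positive = y_score >= threshold
    tpr = float((predicted_positive & (y_true == 1)).sum() / (y_true == 1).sum())
    fpr = float((predicted_positive & (y_true == 0)).sum() / (y_true == 0).sum())
    return tpr, fpr


# One threshold by hand: scores 0.4 and 0.8 are predicted positive
manual_tpr, manual_fpr = rates_at(0.4)

# All thresholds at once, plus the area under the resulting curve
fpr, tpr, thresholds = roc_curve(y_true, y_score)
roc_auc = auc(fpr, tpr)

print(manual_tpr, manual_fpr)
print(fpr.tolist(), tpr.tolist(), roc_auc)
```

At the threshold 0.4 one of the two positives and one of the two negatives are flagged, giving the point (FPR 0.5, TPR 0.5) that also appears in the arrays returned by `roc_curve`.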

Finding the best point on the ROC curve

Code

def rocZeroOne(y_true, y_predicted_proba):
    from sklearn.metrics import roc_curve
    from scipy.spatial.distance import euclidean
    import matplotlib.pyplot as plt

    fpr, tpr, thresholds = roc_curve(y_true, y_predicted_proba[:, 1])

    best = [0, 1]  # the ideal top-left corner: FPR = 0, TPR = 1
    dist = []
    for (x, y) in zip(fpr, tpr):
        dist.append(euclidean([x, y], best))

    # Index of the ROC point closest to the ideal corner. The threshold is looked up by this
    # index, because several ROC points can share the same FPR or the same TPR.
    bestIndex = dist.index(min(dist))
    bestPoint = [fpr[bestIndex], tpr[bestIndex]]
    bestCutOff = thresholds[bestIndex]

    print('\n' + 'Best point on the ROC curve: TPR = {:0.3f}%, FPR = {:0.3f}%'.format(bestPoint[1] * 100, bestPoint[0] * 100))
    print('\n' + 'Best cutoff: {:0.4f}'.format(bestCutOff))

    plt.plot(dist)
    plt.xlabel('Index')
    plt.ylabel('Euclidean Distance to the perfect [0,1]')
    fig = plt.gcf()
    fig.set_size_inches(15, 5)

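How it works

1. The function takes two arguments: the true labels of the test data and the predicted probabilities for the same test data

2. The point on the ROC curve closest to the top-left corner is taken as the best position

The top-left corner is the point (0, 1)

In mathematics, the Euclidean distance (or Euclidean metric) is the "ordinary" straight-line distance between two points in Euclidean space

The corresponding function in the Python library is

from scipy.spatial.distance import euclidean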
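The distance search can also be done on whole arrays. The sketch below is a variant that uses `np.hypot` and `np.argmin` in place of the scipy `euclidean` loop; the function name `closest_to_corner` and the ten labelled scores are made up for the example.

```python
import numpy as np
from sklearn.metrics import roc_curve


def closest_to_corner(y_true, scores):
    """Return (fpr, tpr, threshold, distance) of the ROC point nearest to (0, 1)."""
    fpr, tpr, thresholds = roc_curve(y_true, scores)
    # Straight-line distance from every ROC point to the corner FPR = 0, TPR = 1
    dist = np.hypot(fpr - 0.0, tpr - 1.0)
    best = int(np.argmin(dist))
    return float(fpr[best]), float(tpr[best]), float(thresholds[best]), float(dist[best])


# Hypothetical labels and predicted probabilities of the positive class
y_true = np.array([1, 1, 1, 0, 1, 1, 0, 0, 0, 0])
scores = np.array([0.95, 0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2, 0.1])

best_fpr, best_tpr, best_cutoff, best_dist = closest_to_corner(y_true, scores)
print(best_fpr, best_tpr, best_cutoff, round(best_dist, 3))
```

Here a cutoff of 0.55 catches all five positives while flagging one of the five negatives, so the nearest point is (FPR 0.2, TPR 1.0) at a distance of 0.2 from the corner.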

Logistic regression

Logistic regression results