多元線性回歸模型檢驗和預測

一、概述

（F檢驗）顯著性檢驗：檢測自變量是否真正影響到因變量的波動。

（t檢驗）回歸系數檢驗：單個自變量在模型中是否有效。

二、回歸模型檢驗

檢驗回歸模型的好壞常用的是F檢驗和t檢驗。F檢驗驗證的是偏回歸系數是否不全為0（或全為0），t檢驗驗證的是單個自變量是否對因變量的影響是顯著的（或不顯著）。

F檢驗和t檢驗步驟：

提出問題的原假設和備擇假設
在原假設的條件下，構造統計量
根據樣本資訊，計算統計量的值
對比統計量的值和理論F分布的值，計算統計量的值超過理論值，則拒絕原假設，否則接受原假設

待檢驗資料集先構造成向量的模式：

可表示成如下形式：

其中β 為n×1的一維向量。

1) 假設

F檢驗假設：

t檢驗假設：

H0為原假設，H1為備擇假設。F檢驗拒絕原假設的條件為計算的F檢驗的值大于查到的理論F值。t檢驗可以通過P值和拟合優度判斷變量的顯著性及模型組合的優劣。

2）計算過程

F檢驗計算過程：

上圖為假設其中一個點所在的平面，由以上點計算出ESS(誤差平方和)，RSS（回歸離差平方和）,TSS（總的離差平方和）。

計算公式為：

其中ESS和RSS都會随着模型的變化而發生變化（估計值變動）。而TSS衡量的是因變量和均值之間的離差平方和，不會随着模型的變化而變化。

由以上公式構造F統計量：

（n為資料集向量的行數，p為列數（自變量的個數），n為離差平方和RSS的自由度，n-p-1為誤差平方和ESS的自由度），模型拟合越好，ESS越小，RSS越大，F值越大。

3）python中計算F值得方法：可以直接通過model.fvalue得到f值或者通過計算得到。

from sklearn import model_selection
import statsmodels.api as sm
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt
from sklearn import preprocessing

Profit = pd.read_excel(r\'Predict to Profit.xlsx\')
Profit.head()

dummies = pd.get_dummies(Profit.State,prefix = \'State\')
Profit_New = pd.concat([Profit,dummies],axis=1)
Profit_New.drop(labels = [\'State\',\'State_New York\'],axis =1,inplace = True)

train , test = model_selection.train_test_split(Profit_New,test_size = 0.2,random_state=1234)
model = sm.formula.ols(\'Profit~RD_Spend+Administration+Marketing_Spend+State_California+State_Florida\',data = train).fit()

ybar = train.Profit.mean()
p = model.df_model  #自變量個數
n= train.shape[0]    #行數，觀測個數
RSS = np.sum((model.fittedvalues - ybar)**2)  #計算離差平方和 估計值model.fittedvalues
ESS = np.sum(model.resid**2)  #計算誤差平方和，誤差表示model.resid
F = RSS/p/(ESS/(n-p-1))
print( F, model.fvalue)

　　結果如下：

求理論F值，使用Scipy庫子子產品stats中f.ppf

from scipy.stats import f
F_Theroy = f.ppf(q=0.95,dfn = p,dfd = n-p-1)
F_Theroy

由于計算的F值遠遠大于理論F值，是以拒絕原假設，證明多元線性回歸方程是顯著的，偏回歸系數不全為0，即所有自變量聯合起來的組合确實對利潤有顯著性影響。

4) t檢驗手工計算比較複雜，套用python的驗計算方式，model.summary()

參數：R-squared為拟合優度（反映模型對樣本的拟合程度）,Adj.R-squared為修正的可決系數。

通過t檢驗得出，隻有RD_Spend的P值小于0.05，其餘變量均大于0.05，說明其餘變量不是影響利潤Profit的重要因素。

三、回歸模型診斷

實際應用中，因變量若為數值型變量，可以考慮線性回歸模型。模型需滿足如下假設：

　　因變量（殘差）服從正态分布，自變量之間不存在多重共線性，自變量和因變量之間存線上性關系，用于模組化的資料不存在異常值，殘差項滿足方差異性和獨立性。

正态性檢驗（直方圖，pp圖或qq圖，shapiro檢驗和K-S檢驗）
多重共線性檢驗（VIF>10，存在多重共線性，>100則變量間存在嚴重的多重共線性）
線性相關性（資料框的corrwith方法或者seaborn子產品的pairplot函數作圖觀察）|p|~>=0.8 高度相關，0.5《 |p|《0.8 中度相關，0.3《 |p|《0.5 弱相關，|p|《0.3 幾乎不相關
異常值檢驗（帽子矩陣、DFFITS準則、學生化殘差、Cook距離）
殘差獨立性檢驗（summary()方法觀察Durbin-Watson值），DW值在2左右，則殘差項之間不相關，偏離2較遠，則不滿足殘差獨立性假設
方差齊性檢驗（圖形法和BP檢驗）

具體方法過程：

1）正态性檢測（模型的前提假設是殘差服從（0，1）正态分布，實質上由方程y=Xβ+ε，因變量也應該服從正态分布），檢測的是因變量是否符合正态分布

直方圖 scipy.stats seaborn.distplot

import seaborn as sns
import scipy.stats as stats
from pylab import *
mpl.rcParams[\'font.sans-serif\'] = [\'SimHei\']
plt.rcParams[\'axes.unicode_minus\'] = False

sns.distplot(a = Profit_New.Profit,
            bins =10,
            fit = stats.norm,
            norm_hist = True,
            hist_kws = {\'color\':\'green\',\'edgecolor\':\'black\'},
            kde_kws = {\'color\':\'black\',\'linestyle\':\'--\',\'label\':\'核密度曲線\'},
            fit_kws = {\'color\':\'red\',\'linestyle\':\':\',\'label\':\'正态密度曲線\'}
           )
plt.legend()
plt.show()

通過觀察核密度曲線和理論正态密度曲線的拟合程度，2條曲線近似吻合，可認為因變量Profit服從正态分布。

PP圖和QQ圖 ppplot qqplot

pq_plot = sm.ProbPlot(Profit_New.Profit)
pq_plot.ppplot(line=\'45\')
plt.title(\'PP圖\')
pq_plot.qqplot(line=\'q\')
plt.title(\'QQ圖\')
plt.show()

觀察到散點均均勻的分布在直線上，說明因變量近似服從正态分布。

Shapiro檢驗和K-S檢驗（因變量資料量>5000 use K-S檢驗法，<5000 use Shapiro檢驗）

stats.shapiro(Profit_New.Profit)

　　輸出為

(0.9793398380279541, 0.537902295589447)，第一個值為shapiro檢驗的統計量值，第二個值為對應機率p值,p值大于0.05置信水準，利潤變量服從正态分布假設。

K-S檢驗 KSValue = stats.kstest(rvs = datay,args = (datay.mean(),datay.std(),cdf = \'norm\'))
需要注意的是KS檢驗，必須指定args參數，來傳遞檢驗變量的均值和标準差。

2）多重共線性檢驗（自變量之間的檢測）

檢測模型中自變量之間存在較高的線性相關關系，若存在會給模型帶來嚴重的後果。其檢驗可以通過方差膨脹因子VIF來鑒定。（VIF>10，存在多重共線性，>100則變量間存在嚴重的多重共線性）

消除多重共線性，可以通過不同自變量之間的組合觀察vif值，确定最佳組合。R語言中消除多重共線性，确定最優組合使用模型的step方法（逐漸回歸）。

計算方式：

X1為第一個自變量。構造x1與其餘自變量的線性回歸模型，得到回歸模型的判決系數R2，得到第一個自變量的膨脹因子，依次類推計算剩餘變量的VIF。

Python 的statsmodels子產品可直接計算VIF值

from statsmodels.stats.outliers_influence import variance_inflation_factor
X = sm.add_constant(Profit_New.ix[:,[\'RD_Spend\',\'Marketing_Spend\']])
vif = pd.DataFrame()
vif[\'features\'] = X.columns
vif["VIF Factor"] = [variance_inflation_factor(X.values,i) for i in range(X.shape[1])]

得到vif的值：

計算所有的變量:

X = sm.add_constant(Profit_New.drop(\'Profit\',axis =1).ix[:])
vif = pd.DataFrame()
vif[\'features\'] = X.columns
vif["VIF Factor"] = [variance_inflation_factor(X.values,i) for i in range(X.shape[1])]
vif

vif值:

第二種模型得到的const的值大于10，說明構模組化型的變量之間存在多重共線性，需要删除變量重新選擇模型。第一種模式選擇2個自變量（\'RD_Spend\',\'Marketing_Spend\'）是符合自變量之間無線性相關要求的。

3）線性相關性檢驗（各變量之間線性相關性檢測）

圖形法 seaborn pariplot函數

sns.pairplot(Profit_New.ix[:,[\'RD_Spend\',\'Administration\',\'Marketing_Spend\',\'Profit\']]) #去除啞變量
plt.show()

　　散點圖矩陣圖示：

觀察散點圖分布得出，RD_Spend，Marketing_Spend兩個變量和因變量Profit之間存線上性關系，Administration與Profit之間散點圖呈水準趨勢，說明2者之間不相關。

資料框的corrwith方法(該方式優點是可以計算任意指定變量之間的相關系數)

判斷條件：

# Profit 和各變量之間的相關系數
Profit_New.drop([\'Profit\',\'State_California\',\'State_Florida\'],axis =1).corrwith(Profit_New.Profit)

結果看出：自變量RD_Spend和Marketing_Spend與因變量之間存在較高的線性相關性，而其他變量與利潤之間線性相關性比較弱。可忽略其線性相關關系。

需要注意的是通過corrwith檢測變量之間不存線上性相關關系，結合seaborn.pariplot觀察，可能存在其他關系如2次方，3次方，倒數關系等等（在變量通過t檢驗被确認為對因變量影響是顯著的情況下，還需要對模型再次檢查）。

通過如下方式重新确定模型為： Profit = 51902.112471 + 0.785116RD_Spend + 0.019402Marketing_Spend

model3 = sm.formula.ols(\'Profit~RD_Spend + Marketing_Spend \',data = train).fit()
model3.params

4）異常值檢測（帽子矩陣、DFFITS準則、學生化殘差、Cook距離）

确認方式：|學生化殘差| > 2，則對應的資料點為異常值。

對異常值得處理：異常值檢測 model.get_influence() ，學生化殘內插補點 model.get_influence().resid_studentized_external

确認異常樣本的比例，比例<5%，考慮可以直接删除異常值，>5%，可以生成啞變量，對于異常點，設定啞變量為1，否則為0。

# 異常值檢測
outliers = model3.get_influence()   

# 帽子矩陣
leverage = outliers.hat_matrix_diag
# dffits值
dffits= outliers.dffits[0]
# 學生化殘差
resid_stu = outliers.resid_studentized_external
# cook距離
cook = outliers.cooks_distance[0]


# 合并各種異常值檢驗的統計量值
contatl = pd.concat([pd.Series(leverage, name = \'leverage\'),
                     pd.Series(dffits, name = \'dffits\'),
                     pd.Series(resid_stu, name = \'resid_stu\'),
                     pd.Series(cook, name = \'cook\')
                    ],axis =1 )

train.index = range(train.shape[0])
profit_outliers = pd.concat([train,contatl ],axis =1)
profit_outliers.head()

使用非異常值的資料點進行模組化：

# 計算異常值比例
outliers_ratio = np.sum(np.where((np.abs(profit_outliers.resid_stu)>2),1,0))/profit_outliers.shape[0]
outliers_ratio  # 0.02564

none_outliers = profit_outliers.ix[np.abs(profit_outliers.resid_stu)<=2,]
model4 = sm.formula.ols(\'Profit~RD_Spend+Marketing_Spend\',data = none_outliers).fit()
model4.params

再次更新模型為：Profit = 51827.416821 + 0.797038RD_Spend + 0.01774Marketing_Spend

5）殘差獨立性檢驗

通過mdoel.summary函數觀察Durbin-Watson的值，在2左右說明殘差之間是不相關的，偏離2較遠，說明殘差之間是不獨立的。

model4.summary()，DW的值為2.065，說明模型殘差滿足獨立性假設。

6）殘差方差齊性檢驗（圖形法和BP法）

目的：殘差項的方差不随自變量的變動而呈現某種趨勢，即殘差項的方差不受自變量變化而影響

圖形法：繪制殘差和自變量之間的散點圖

ax1 = plt.subplot2grid(shape = (2,1) , loc = (0,0)) # 設定第一張子圖位置
# 散點圖繪制
# ax1.scatter(none_outliers.RD_Spend, none_outliers.resid_stu )  #學生化殘差與自變量散點圖
ax1.scatter(none_outliers.RD_Spend, (model4.resid - model4.resid.mean())/model4.resid.std()) #标準化殘差和自變量散點圖
# 添加水準參考線
ax1.hlines(y = 0 ,
          xmin = none_outliers.RD_Spend.min(),
           xmax = none_outliers.RD_Spend.max(),
           color = \'red\',
           linestyle = \'--\'
          )
ax1.set_xlabel(\'RD_Spend\')
ax1.set_ylabel(\'Std_Residual\')
ax2 = plt.subplot2grid(shape = (2,1) , loc = (1,0))
# ax1.scatter(none_outliers.Marketing_Spend, none_outliers.resid_stu )
ax2.scatter(none_outliers.Marketing_Spend, (model4.resid - model4.resid.mean())/model4.resid.std())
ax2.hlines(y = 0 ,
          xmin = none_outliers.Marketing_Spend.min(),
           xmax = none_outliers.Marketing_Spend.max(),
           color = \'magenta\',
           linestyle = \'--\'
          )
ax2.set_xlabel(\'Marketing_Spend\')
ax2.set_ylabel(\'Std_Residual\')

# 調整2子圖之間距離
plt.subplots_adjust(hspace = 0.6,wspace = 0.3)
plt.show()

　注意的是采用學生化殘差或标準化殘差繪制和自變量的散點圖均可以，2者圖形類似。

原因是學生化殘差計算公式為：

圖形結果如下：

根據結果可知标準化殘差并沒有随着自變量變化而呈現某種趨勢，圖形中所有點均勻的分布在y=0兩側，可以認為殘差項滿足方差齊性假設。

方差齊性檢測完整代碼：

from sklearn import model_selection
import statsmodels.api as sm
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt

Profit = pd.read_excel(r\'Predict to Profit.xlsx\')

dummies = pd.get_dummies(Profit.State,prefix = \'State\')
Profit_New = pd.concat([Profit,dummies],axis=1)
Profit_New.drop(labels = [\'State\',\'State_New York\'],axis =1,inplace = True)
train , test = model_selection.train_test_split(Profit_New,test_size = 0.2,random_state=1234)
model3 = sm.formula.ols(\'Profit~RD_Spend + Marketing_Spend \',data = train).fit()
# 異常值檢測
outliers = model3.get_influence()   

# 帽子矩陣
leverage = outliers.hat_matrix_diag
# dffits值
dffits= outliers.dffits[0]
# 學生化殘差
resid_stu = outliers.resid_studentized_external
# cook距離
cook = outliers.cooks_distance[0]
contatl = pd.concat([pd.Series(leverage, name = \'leverage\'),
                     pd.Series(dffits, name = \'dffits\'),
                     pd.Series(resid_stu, name = \'resid_stu\'),
                     pd.Series(cook, name = \'cook\')
                    ],axis =1 )

train.index = range(train.shape[0])
profit_outliers = pd.concat([train,contatl ],axis =1)

outliers_ratio = np.sum(np.where((np.abs(profit_outliers.resid_stu)>2),1,0))/profit_outliers.shape[0]

none_outliers = profit_outliers.ix[np.abs(profit_outliers.resid_stu)<=2,]
model4 = sm.formula.ols(\'Profit~RD_Spend+Marketing_Spend\',data = none_outliers).fit()

ax1 = plt.subplot2grid(shape = (2,1) , loc = (0,0)) # 設定第一張子圖位置
# 散點圖繪制
# ax1.scatter(none_outliers.RD_Spend, none_outliers.resid_stu )
ax1.scatter(none_outliers.RD_Spend, (model4.resid - model4.resid.mean())/model4.resid.std())
# 添加水準參考線
ax1.hlines(y = 0 ,
          xmin = none_outliers.RD_Spend.min(),
           xmax = none_outliers.RD_Spend.max(),
           color = \'red\',
           linestyle = \'--\'
          )
ax1.set_xlabel(\'RD_Spend\')
ax1.set_ylabel(\'Std_Residual\')
ax2 = plt.subplot2grid(shape = (2,1) , loc = (1,0))
# ax1.scatter(none_outliers.Marketing_Spend, none_outliers.resid_stu )
ax2.scatter(none_outliers.Marketing_Spend, (model4.resid - model4.resid.mean())/model4.resid.std())
ax2.hlines(y = 0 ,
          xmin = none_outliers.Marketing_Spend.min(),
           xmax = none_outliers.Marketing_Spend.max(),
           color = \'magenta\',
           linestyle = \'--\'
          )
ax2.set_xlabel(\'Marketing_Spend\')
ax2.set_ylabel(\'Std_Residual\')

# 調整2子圖之間距離
plt.subplots_adjust(hspace = 0.6,wspace = 0.3)
plt.show()

BP法（拉格朗日乘子檢驗）：statsmodels子產品het_breuschpagan函數 https://programtalk.com/python-examples/statsmodels.stats/

sm.stats.diagnostic.het_breuschpagan(model4.resid, exog_het = model4.model.exog)

結果得到4個值

(1.4675103668308342, 0.48010272699006384, 0.7029751237162462, 0.5019659740962872)

第一個值為LM統計量，第二個為p值，第三個為F統計量，第四個為F統計量的p值，通過觀察p值均大于0.05，證明接受殘差方差齊性原假設，即殘差不受自變量的影響而變化。

若果殘差不滿足方差齊性假設，觀察殘差和自變量的關系，通過模型變換或權重最小二乘法對模型加以處理。

四、回歸模型預測

pred4 = model4.predict(exog = test.ix[:,[\'RD_Spend\',\'Marketing_Spend\']])
plt.scatter(x = test.Profit ,y = pred4)
plt.plot([ test.Profit.min(),test.Profit.max()],
         [test.Profit.min(),test.Profit.max()],
        color = \'red\',
         linestyle = \'--\'
)
plt.xlabel(\'實際值\')
plt.ylabel(\'預測值\')
plt.show()

　真實值和預測值之間差異：

print(pd.DataFrame({"prediction":pred4,\'real\':test.Profit}))

圖形展示效果：