Introduction to and Application of the Simple Linear Regression Model

The Simple Linear Regression Model

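Form of the regression equation:

$y_i = a + b x_i + \varepsilon_i$, $i = 1, 2, \ldots, n$,

where $a$ is the intercept, $b$ is the slope, and the error term $\varepsilon_i$ must satisfy the following four assumptions: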

a. Normality: $\varepsilon_i$ is a normally distributed random variable.

b. Unbiasedness: $E(\varepsilon_i) = 0$.

c. Homoscedasticity: all $\varepsilon_i$ have the same variance $\sigma^2$; this also means that the error term does not depend on the independent variable.

d. Independence: the $\varepsilon_i$ are mutually independent, so that $\mathrm{Cov}(\varepsilon_i, \varepsilon_j) = 0$ for $i \neq j$.

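Following the principle of minimizing the sum of squared errors, $\sum_{i=1}^{n} (y_i - a - b x_i)^2$, the parameters $a$ and $b$ are estimated by ordinary least squares:

$\hat{b} = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad \hat{a} = \bar{y} - \hat{b}\,\bar{x}$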
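The closed-form least-squares estimates for a and b can be sketched directly with NumPy. The helper name fit_simple_ols and the synthetic data are assumptions made for illustration; they are not the values in Salary_Data.csv.

```python
import numpy as np

def fit_simple_ols(x, y):
    """Return (a, b) that minimize sum((y - a - b*x)**2)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_bar, y_bar = x.mean(), y.mean()
    # Slope: centered cross-products divided by centered sum of squares of x
    b = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    # Intercept: the fitted line passes through (x_bar, y_bar)
    a = y_bar - b * x_bar
    return a, b

# Synthetic data (hypothetical): a linear trend plus normal noise
rng = np.random.default_rng(0)
x = rng.uniform(1, 10, size=50)
y = 25000 + 9000 * x + rng.normal(0, 3000, size=50)

a, b = fit_simple_ols(x, y)
print("intercept:", a, "slope:", b)
```

On real data the same two numbers should match what fit.params reports for the Intercept and the slope.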

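Implementing the simple regression model in Python:

import pandas as pd
import statsmodels.api as sm
income = pd.read_csv(r'D:\download\python資料分析與挖掘\第11章 線性回歸模型\Salary_Data.csv')
# Build the regression model: Salary explained by YearsExperience
fit = sm.formula.ols('Salary~YearsExperience',data = income).fit()
fit.params  # returns the fitted parameters (Intercept and the YearsExperience slope)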


'''
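sm.formula.ols(formula, data, subset=None, drop_cols=None)

formula: the model formula given as a string; for example 'y~x' denotes a simple linear regression of y on x
data: the data set used for modeling
subset: a boolean array-like that selects the subset of data used for modeling
drop_cols: the variables to drop from data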
'''

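Dataset:

[Figure: preview of Salary_Data.csv, with the columns YearsExperience and Salary]

Result:

[Figure: output of fit.params, i.e. the Intercept and the YearsExperience coefficient]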

Check the correlation between the two variables:

income.Salary.corr(income.YearsExperience) # 0.98
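For reference, the Pearson correlation that Series.corr computes by default can be written out from its definition. This is a minimal sketch on a small made-up sample; pearson_r is a hypothetical helper name.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: centered cross-products over the product of norms."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx, dy = x - x.mean(), y - y.mean()
    return np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2))

# Made-up sample with a nearly linear relationship
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

r = pearson_r(x, y)
print("r:", r, "r squared:", r ** 2)
```

In a simple linear regression with an intercept, the squared correlation equals the model's R-squared, which is why a correlation close to 1 suggests a good linear fit.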

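Deriving the Coefficients of the Multiple Linear Regression Model

Definition of the multiple regression model

$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} + \varepsilon_i$, $i = 1, 2, \ldots, n$

In matrix form, $y = X\beta + \varepsilon$, and minimizing the sum of squared errors gives the least-squares estimate $\hat{\beta} = (X^{\top}X)^{-1}X^{\top}y$.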
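The least-squares solution of the multiple regression model can be sketched with NumPy by solving the normal equations $(X^{\top}X)\beta = X^{\top}y$. The data below are synthetic and the coefficient values are assumptions chosen for illustration.

```python
import numpy as np

# Synthetic data (hypothetical): n observations, p predictors
rng = np.random.default_rng(42)
n, p = 200, 3
X = rng.normal(size=(n, p))
beta_true = np.array([1.5, -2.0, 0.5, 3.0])  # intercept first, then p slopes

# Design matrix with a leading column of ones for the intercept
X_design = np.column_stack([np.ones(n), X])
y = X_design @ beta_true + rng.normal(scale=0.1, size=n)

# Solve the normal equations (X'X) beta = X'y
beta_hat = np.linalg.solve(X_design.T @ X_design, X_design.T @ y)

# Numerically more robust alternative that avoids forming X'X
beta_lstsq, *_ = np.linalg.lstsq(X_design, y, rcond=None)

print(beta_hat)
```

Solving the linear system is generally preferable to explicitly inverting $X^{\top}X$, and lstsq is the safer choice when the predictors are strongly correlated.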

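Implementing the multiple regression model in Python:

Dataset:

[Figure: preview of Predict to Profit.xlsx, with the columns RD_Spend, Administration, Marketing_Spend, State and Profit]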

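Check the correlations (profit is the DataFrame loaded by the pd.read_excel call in the modeling code below):

profit.drop('State',axis=1).corrwith(profit.Profit) # correlation between Profit and the other three continuous variables

profit.drop('State',axis=1).corr() # pairwise correlations among the four continuous variables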

from sklearn import model_selection
profit = pd.read_excel(r'D:\download\python資料分析與挖掘\第11章 線性回歸模型\Predict to Profit.xlsx')
# profit.head(1)

# Split the data into a training set and a test set
train,test = model_selection.train_test_split(profit,test_size = 0.2,random_state=1234)
# Fit the model on the training data
model = sm.formula.ols('Profit~RD_Spend+Administration+Marketing_Spend+C(State)',data=train).fit() # C() marks State as a categorical variable; State has three levels
print("Partial regression coefficients of the model:\n",model.params)

# Drop the Profit variable from the test set and predict from the remaining independent variables
test_X = test.drop(labels='Profit',axis=1)
pred=model.predict(exog=test_X)
print("Predicted versus actual values:\n",pd.DataFrame({'Prediction':pred,'Real':test.Profit}))

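The results are shown below. By default, C(State) takes California, one level of the categorical variable State, as the reference group (one level has to be left out to avoid perfect collinearity with the intercept). The next block of code instead builds the dummy variables for State by hand and drops New York, so that New York becomes the reference group.

[Figure: output of the code above, i.e. the partial regression coefficients and the table of predicted versus actual Profit]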

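# Generate the dummy variables (dtype=int keeps them numeric; recent pandas versions
# default to bool, which the formula interface would treat as categorical)
dummies = pd.get_dummies(profit.State, dtype=int)
# Append the dummy columns to form a new data set
profit_new = pd.concat([profit,dummies],axis=1)
# Drop State and New York (State has been expanded into dummies, and New York serves as the reference group)
profit_new.drop(labels = ['State','New York'],axis = 1,inplace = True)
train,test=model_selection.train_test_split(profit_new,test_size=0.2,random_state=1234)
model = sm.formula.ols('Profit~RD_Spend+Administration+Marketing_Spend+Florida+California',data = train).fit()
print('Partial regression coefficients of the model:\n',model.params)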
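To make the role of the reference group concrete, here is a minimal sketch of the same dummy-variable encoding on a tiny made-up frame (the values are hypothetical and unrelated to Predict to Profit.xlsx).

```python
import pandas as pd

# Hypothetical toy data with the same three State levels
df = pd.DataFrame({
    'State': ['New York', 'California', 'Florida', 'New York'],
    'Profit': [10.0, 12.0, 11.0, 9.0],
})

# One 0/1 column per level; columns come out in sorted order of the levels
dummies = pd.get_dummies(df['State'], dtype=int)

# Dropping one level makes it the reference group: its rows are all zeros
dummies = dummies.drop(columns='New York')

df_new = pd.concat([df.drop(columns='State'), dummies], axis=1)
print(df_new)
```

A New York row is encoded as California = 0 and Florida = 0, so the coefficients of the two remaining dummies measure each state's difference from New York.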


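Hypothesis Testing for the Linear Regression Model

The F-test of the model

1. State the null hypothesis and the alternative hypothesis: $H_0: \beta_1 = \beta_2 = \cdots = \beta_p = 0$ versus $H_1$: the coefficients are not all zero.
2. Under the null hypothesis, construct the F statistic.
3. Compute the value of the statistic from the sample.
4. Compare the value of the statistic with the theoretical value of the F distribution: reject the null hypothesis when the statistic exceeds the theoretical value, otherwise do not reject it.

$F = \dfrac{RSS / p}{ESS / (n - p - 1)} \sim F(p,\, n - p - 1)$

Here $RSS = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$ is the regression sum of squares, $ESS = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$ is the error (residual) sum of squares, $p$ is the number of independent variables and $n$ is the number of observations.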

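# F-test of the model

# 1. Compute the F statistic
import numpy as np
# Mean of the dependent variable in the training data
ybar = train.Profit.mean()
# Number of independent variables in model and number of training observations
p = model.df_model
n = train.shape[0]
# Regression sum of squares (note: RSS here means "regression", not "residual")
RSS = np.sum((model.fittedvalues-ybar)**2)
# Error (residual) sum of squares
ESS = np.sum(model.resid**2)
# F statistic
F = (RSS/p)/(ESS/(n-p-1))
print(F,"F reported by statsmodels:",model.fvalue) #174.63721716844725 174.6372171570355

# 2. Compare with the theoretical value of the F distribution
# Import the module
from scipy.stats import f
# Theoretical (critical) value of the F distribution at the 95% quantile
F_theory = f.ppf(q=0.95, dfn = p,dfd = n-p-1)
print('Theoretical value of the F distribution:',F_theory) # Theoretical value of the F distribution: 2.502635007415366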

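Conclusion:

The computed F statistic, 174.64, is far greater than the theoretical value of the F distribution, 2.50, so the null hypothesis should be rejected. In other words, the multiple linear regression model is significant: the partial regression coefficients are not all zero.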
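An equivalent way to reach the same decision is to compare the p-value of the F statistic with the significance level. The sketch below assumes p = 5 independent variables and n = 39 training observations, and plugs in the rounded F value quoted above rather than refitting the model.

```python
from scipy.stats import f

F_value = 174.637   # F statistic quoted in the text (rounded)
p, n = 5, 39        # assumed: number of predictors and training observations
alpha = 0.05

# Critical value: the (1 - alpha) quantile of F(p, n - p - 1)
F_crit = f.ppf(1 - alpha, dfn=p, dfd=n - p - 1)

# p-value: probability of an F at least this large under the null hypothesis
p_value = f.sf(F_value, dfn=p, dfd=n - p - 1)

print("critical value:", F_crit)
print("p-value:", p_value)
print("reject H0:", p_value < alpha)
```

Rejecting when F exceeds the critical value and rejecting when the p-value is below alpha are the same rule, so the two checks always agree.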

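t-Tests of the Parameters

For each partial regression coefficient the hypotheses are $H_0: \beta_j = 0$ versus $H_1: \beta_j \neq 0$, and the test statistic is

$t = \dfrac{\hat{\beta}_j}{\mathrm{se}(\hat{\beta}_j)} \sim t(n - p - 1)$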

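View the model summary

The results show that only the intercept term Intercept and the R&D cost RD_Spend pass the significance test: their p-values are far below 0.05, while the p-values of all the other variables are well above it.

model.summary()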

OLS Regression Results

Dep. Variable:        Profit              R-squared:            0.964
Model:                OLS                 Adj. R-squared:       0.958
Method:               Least Squares       F-statistic:          174.6
Date:                 Thu, 22 Apr 2021    Prob (F-statistic):   9.74e-23
Time:                 23:34:37            Log-Likelihood:       -401.20
No. Observations:     39                  AIC:                  814.4
Df Residuals:         33                  BIC:                  824.4
Df Model:             5
Covariance Type:      nonrobust

                   coef         std err     t         P>|t|    [0.025      0.975]
Intercept          5.807e+04    6846.305    8.482     0.000    4.41e+04    7.2e+04
RD_Spend           0.8035       0.040       19.988    0.000    0.722       0.885
Administration     -0.0578      0.051       -1.133    0.265    -0.162      0.046
Marketing_Spend    0.0138       0.015       0.930     0.359    -0.016      0.044
Florida            1440.8627    3059.931    0.471     0.641    -4784.615   7666.340
California         513.4683     3043.160    0.169     0.867    -5677.887   6704.824

Omnibus:              1.721               Durbin-Watson:        1.896
Prob(Omnibus):        0.423               Jarque-Bera (JB):     1.148
Skew:                 0.096               Prob(JB):             0.563
Kurtosis:             2.182               Cond. No.             1.60e+06

Warnings:

[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

[2] The condition number is large, 1.6e+06. This might indicate that there are strong multicollinearity or other numerical problems.
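The P>|t| column can be reproduced from the t column and the residual degrees of freedom. The following sketch copies the rounded t values and Df Residuals = 33 from the summary above and computes two-sided p-values with scipy.stats.t; it does not refit the model.

```python
from scipy.stats import t

df_resid = 33  # Df Residuals from the summary
alpha = 0.05

# t statistics copied from the summary (rounded to three decimals)
t_values = {
    'Administration': -1.133,
    'Marketing_Spend': 0.930,
    'Florida': 0.471,
    'California': 0.169,
}

# Two-sided p-value: probability of |T| at least as large as the observed |t|
p_values = {name: 2 * t.sf(abs(tv), df_resid) for name, tv in t_values.items()}

# Two-sided critical value: |t| must exceed this to be significant at alpha
t_crit = t.ppf(1 - alpha / 2, df_resid)

for name, pv in p_values.items():
    print(f"{name}: p = {pv:.3f}, significant = {pv < alpha}")
print("critical |t|:", t_crit)
```

None of these four variables reaches the critical value, which matches the reading that only Intercept and RD_Spend are significant.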
