Significance testing for logistic regression in Python: binary logistic regression analysis in Python 3 (LogisticRegression)...

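Overview

The boss asked us to add the following analysis methods to the project platform:

t-test (independent-samples t-test), linear regression, binary logistic regression, factor analysis, reliability analysis

I had no idea what any of this was and was completely lost. The analytics department really does have talented people; I, on the other hand, was baffled.

So first, let me explain what binary logistic regression analysis is.

Binary logistic regression can be used for classification: it models the probability of a two-valued outcome, whereas ordinary (linear) regression is used more for predicting numeric values.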
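To make the classification idea concrete, here is a minimal sketch of how a one-predictor logistic model turns a linear score (the log odds) into a probability and then into a 0/1 label. The coefficients `B0` and `B1` and the function names are made up for illustration; they are not fitted values from this post.

```python
import math


def predict_probability(x, intercept, slope):
    """Probability that the outcome is 1 under a one-predictor logistic model."""
    log_odds = intercept + slope * x           # linear predictor = log odds
    return 1.0 / (1.0 + math.exp(-log_odds))   # logistic (sigmoid) function


def classify(x, intercept, slope, threshold=0.5):
    """Turn the predicted probability into a 0/1 class label."""
    return 1 if predict_probability(x, intercept, slope) >= threshold else 0


# Hypothetical coefficients, chosen only for illustration.
B0, B1 = -4.0, 0.5

for x in (2, 8, 14):
    print(x, round(predict_probability(x, B0, B1), 3), classify(x, B0, B1))
```

With these assumed coefficients the log odds are zero at x = 8, so that is where the predicted probability crosses 0.5 and the label flips from 0 to 1.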

Official introduction:

Link: https://pythonfordatascience.org/logistic-regression-python/

Logistic regression models are used to analyze the relationship between a dependent variable (DV) and independent variable(s) (IV) when the DV is dichotomous. The DV is the outcome variable, a.k.a. the predicted variable, and the IV(s) are the variables that are believed to have an influence on the outcome, a.k.a. predictor variables. If the model contains 1 IV, then it is a simple logistic regression model, and if the model contains 2+ IVs, then it is a multiple logistic regression model.

Assumptions for logistic regression models:

The DV is categorical (binary)

If there are more than 2 categories in terms of types of outcome, a multinomial logistic regression should be used

Independence of observations

Cannot be a repeated measures design, i.e. collecting outcomes at two different time points.

Independent variables are linearly related to the log odds

Absence of multicollinearity

Lack of outliers

Original text

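Once I understood what binary logistic regression is, I started looking for libraries.

Packages needed

A special note here: on the first evening I used logit, but the results were wrong. Then I tried the machine-learning approach and the results were still wrong. I was checking the values against SPSS.

Strange. In the end I had no choice but to ask for help, because people were hung up on the difference between Logit and Logistic. I then asked in a chat group, and an expert cleared it up for me.

The articles below also helped.

1. The Logit module in statsmodels

2. The LogisticRegression module in sklearn.linear_model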

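First, the first method

Reference article: https://blog.csdn.net/zj360202/article/details/78688070?utm_source=blogxgwz0

It explains things fairly clearly, but there is one point you must watch out for: the intercept term. This is exactly where I went wrong, because I thought it was unimportant and left it out.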
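Why the intercept matters can be seen without fitting anything. The sketch below uses made-up coefficients and hypothetical function names: without an intercept the log odds are `slope * x`, so the predicted probability at x = 0 is pinned to 0.5 whatever the slope is, while with an intercept the 50% crossing point can move to wherever the data put it.

```python
import math


def sigmoid(z):
    """Logistic function: maps log odds to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def prob_without_intercept(x, slope):
    # log odds = slope * x, so x = 0 always gives log odds 0, i.e. p = 0.5
    return sigmoid(slope * x)


def prob_with_intercept(x, intercept, slope):
    # log odds = intercept + slope * x
    return sigmoid(intercept + slope * x)


def crossing_point(intercept, slope):
    """Value of x at which the predicted probability is exactly 0.5."""
    return -intercept / slope


# Illustrative values only (not fitted from the data in this post).
for slope in (0.1, 1.0, 5.0):
    print("no intercept, slope", slope, "-> p(x=0) =", prob_without_intercept(0, slope))

print("with intercept -6.0 and slope 0.5, p = 0.5 at x =", crossing_point(-6.0, 0.5))
```

statsmodels does not add this constant column for you when you call `sm.Logit` on arrays or DataFrames, which is why the code in this post adds an "intercept" column of 1.0 by hand; `sm.add_constant` is the library's helper for the same job.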

#!/usr/bin/env python
# -*- coding:utf-8 -*-

from collections import OrderedDict

import statsmodels.api as sm
from pandas import DataFrame

data = {
    'y': [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1],
    'x': [i for i in range(1, 21)],
}

df = DataFrame(OrderedDict(data))
df["intercept"] = 1.0  # intercept term; very important, this is where I went wrong

print(df)
print("==================")
print(len(df))
print(df.columns.values)
print(df[df.columns[1:]])  # predictor columns: 'x' and 'intercept'

# dependent variable first, then the predictors (including the intercept column)
logit = sm.Logit(df['y'], df[df.columns[1:]])
result = logit.fit()
res = result.summary2()
print(res)

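I think the following version is better, because with the version above, running it a second time always raised this error:

statsmodels.tools.sm_exceptions.PerfectSeparationError: Perfect separation detected, results not available

After I changed it so that the x and y variables are each built separately, it somehow just worked.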

obj = TwoDimensionalLogisticRegressionModel()
data_x = obj.SelectVariableSql(UserID, ProjID, QuesID, xVariable, DatabaseName, TableName, CasesCondition)
data_y = obj.SelectVariableSql(UserID, ProjID, QuesID, yVariable, DatabaseName, TableName, CasesCondition)
if len(data_x) != len(data_y):
    raise MyCustomError(retcode=4011)
obj.close()

df_X = DataFrame(OrderedDict(data_x))
df_Y = DataFrame(OrderedDict(data_y))
df_X["intercept"] = 1.0  # intercept term; very important, this is where I went wrong

logit = sm.Logit(df_Y, df_X)
result = logit.fit()
res = result.summary()

# take the third-to-last line of the printed summary (the row just above the
# intercept row), split it on spaces, and drop the first field (the variable name)
data = [j for j in [i for i in str(res).split('\n')][-3].split(' ') if j != ''][1:]
return data
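About the PerfectSeparationError quoted earlier: in general it means the predictors split the two outcome classes with no overlap, in which case the maximum-likelihood estimate does not exist because the coefficients keep growing without bound. Below is a rough sketch of that idea for a single numeric predictor. The function name is hypothetical, and this is not how statsmodels itself performs the check; it also does not catch near-separation or ties at the boundary.

```python
def is_perfectly_separated(x_values, y_values):
    """True if a single threshold on x splits the 0s from the 1s with no overlap."""
    x0 = [x for x, y in zip(x_values, y_values) if y == 0]
    x1 = [x for x, y in zip(x_values, y_values) if y == 1]
    if not x0 or not x1:
        raise ValueError("both classes (0 and 1) must be present")
    # separated if every 0 lies strictly below every 1, or the other way round
    return max(x0) < min(x1) or max(x1) < min(x0)


# Toy data from the first script in this post: the 0s and 1s overlap in x.
y_overlap = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1]
x = list(range(1, 21))
print(is_perfectly_separated(x, y_overlap))   # overlapping classes

# A made-up outcome that flips cleanly at x = 10.
y_clean = [0] * 10 + [1] * 10
print(is_perfectly_separated(x, y_clean))     # cleanly separated classes
```

If a check like this reports separation on real data, the usual options are to collect more data, drop or combine the offending predictor, or use a penalized fit.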

Modified version, which allows two-valued numeric dummy variables:

obj = TwoDimensionalLogisticRegressionModel()
data_x = obj.SelectVariableSql(UserID, ProjID, QuesID, xVariable, DatabaseName, TableName, CasesCondition)
data_y = obj.SelectVariableSql(UserID, ProjID, QuesID, yVariable, DatabaseName, TableName, CasesCondition)
if len(data_x) != len(data_y):
    raise MyCustomError(retcode=4011)
obj.close()

df_X = DataFrame(data_x)
df_Y = DataFrame(data_y)  # dependent variable, 0/1
df_X["intercept"] = 1.0  # intercept term; very important, this is where I went wrong

YColumnList = list(df_Y[yVariable].values)
# sorted() makes the order of the two levels deterministic (a bare set has no guaranteed order)
setYColumnList = sorted(set(YColumnList))
if len(setYColumnList) > 2 or len(setYColumnList) < 2:
    raise MyCustomError(retcode=4015)
else:
    # two values that are not already 0/1: recode the smaller one as 0, the other as 1
    if len(setYColumnList) == 2 and [0, 1] != [int(i) for i in setYColumnList]:
        newYcolumnsList = []
        for i in YColumnList:
            if i == setYColumnList[0]:
                newYcolumnsList.append(0)
            else:
                newYcolumnsList.append(1)
        df_Y = DataFrame({yVariable: newYcolumnsList})

logit = sm.Logit(df_Y, df_X)
result = logit.fit()
res = result.summary()

# third-to-last line of the printed summary, split on spaces; drop the variable name
data = [j for j in [i for i in str(res).split('\n')][-3].split(' ') if j != '']
return data[1:]
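The recoding step in the code above can also be written as a small standalone helper. This is only a sketch with a hypothetical name, `encode_binary`, and a plain `ValueError` standing in for the author's `MyCustomError`. Sorting the distinct values keeps the mapping deterministic: the smaller value always becomes 0 and the larger becomes 1.

```python
def encode_binary(values):
    """Map a two-valued sequence onto 0/1 (smaller value -> 0, larger value -> 1).

    Returns the recoded list together with the mapping that was applied.
    """
    levels = sorted(set(values))
    if len(levels) != 2:
        raise ValueError("expected exactly 2 distinct values, got %d" % len(levels))
    mapping = {levels[0]: 0, levels[1]: 1}
    return [mapping[v] for v in values], mapping


# Example: a dummy variable stored as 1/2 instead of 0/1.
encoded, mapping = encode_binary([1, 2, 2, 1, 2])
print(encoded)   # [0, 1, 1, 0, 1]
print(mapping)   # {1: 0, 2: 1}
```

Returning the mapping alongside the recoded values makes it possible to report which original value was treated as the "1" outcome when interpreting the coefficient.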

After another update:

import math

import statsmodels.api as sm
from pandas import DataFrame


def TwoDimensionalLogisticRegressionDetail(UserID, ProjID, QuesID, xVariableID, yVariableID, CasesCondition):
    two_obj = TwoDimensionalLogisticModel()
    sql_data, xVarName, yVarName = two_obj.showdatas(UserID, ProjID, QuesID, xVariableID, yVariableID, CasesCondition)
    two_obj.close()

    df_dropna = DataFrame(sql_data).dropna()
    df_X = DataFrame()
    df_Y = DataFrame()  # dependent variable, 0/1
    df_X[xVarName] = df_dropna[xVarName]
    df_Y[yVarName] = df_dropna[yVarName]
    df_X["intercept"] = 1.0  # intercept term; very important, this is where I went wrong

    YColumnList = list(df_Y[yVarName].values)
    # sorted() makes the order of the two levels deterministic
    setYColumnList = sorted(set(YColumnList))
    # print(setYColumnList)
    if len(setYColumnList) > 2 or len(setYColumnList) < 2:
        raise MyCustomError(retcode=4015)
    # else:
    if len(setYColumnList) == 2 and [0, 1] != [int(i) for i in setYColumnList]:
        newYcolumnsList = []
        for i in YColumnList:
            if i == setYColumnList[0]:
                newYcolumnsList.append(0)
            else:
                newYcolumnsList.append(1)
        # reuse df_X's index so the two frames stay aligned after dropna()
        df_Y = DataFrame({yVarName: newYcolumnsList}, index=df_X.index)

    logit = sm.Logit(df_Y, df_X)
    res = logit.fit()
    res_all = res.summary()

    # couldn't find a dedicated attribute, so the printed summary is split instead
    LogLikelihood = [i.strip() for i in str(res_all).split("\n")[6].split(" ") if i][3]
    index_var = [i.strip() for i in str(res_all).split("\n")[12].split(" ") if i]  # row for x
    intercept = [i.strip() for i in str(res_all).split("\n")[13].split(" ") if i]  # row for the intercept
    std_err = [index_var[2], intercept[2]]
    z = [index_var[3], intercept[3]]
    P_z = [index_var[4], intercept[4]]  # significance (p-value)
    interval_25 = [index_var[5], intercept[5]]
    interval_975 = [index_var[6], intercept[6]]
    Odds_Ratio = [math.e ** i for i in list(res.params)]

    return {
        "No_Observations": res.nobs,  # No. Observations
        "Pseudo_R": res.prsquared,  # Pseudo R^2
        "Log_Likelihood": LogLikelihood,  # Log-Likelihood
        "LLNull": res.llnull,
        "llr_pvalue": res.llr_pvalue,  # significance of the likelihood-ratio test
        "coef": list(res.params),  # coefficients
        "std_err": std_err,
        "Odds_Ratio": Odds_Ratio,
        "z": z,
        "P": P_z,  # significance (p-value)
        "interval_25": interval_25,  # confidence interval bound at 0.025
        "interval_975": interval_975,
    }

Second method: machine learning

Reference link: https://zhuanlan.zhihu.com/p/34217858

#!/usr/bin/env python
# -*- coding:utf-8 -*-

from collections import OrderedDict

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# '學習時間' = study time, '通過考試' = passed the exam
examDict = {
    '學習時間': [i for i in range(1, 20)],
    '通過考試': [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1],
}
examOrderDict = OrderedDict(examDict)
examDF = pd.DataFrame(examOrderDict)
# print(examDF.head())

exam_X = examDF.loc[:, "學習時間"]
exam_Y = examDF.loc[:, "通過考試"]
print(exam_X)
# print(exam_Y)

X_train, X_test, y_train, y_test = train_test_split(exam_X, exam_Y, train_size=0.8)
# print(X_train.values)
print(len(X_train.values))

# scikit-learn expects a 2D feature array: one row per sample, one column per feature
X_train = X_train.values.reshape(-1, 1)
print(len(X_train))
print(X_train)
X_test = X_test.values.reshape(-1, 1)

module_1 = LogisticRegression()
module_1.fit(X_train, y_train)
print("coef:", module_1.coef_)

front = module_1.score(X_test, y_test)  # accuracy on the held-out rows
print(front)
print("coef:", module_1.coef_)
print("intercept_:", module_1.intercept_)

# prediction (inputs must also be 2D: a list of rows)
pred1 = module_1.predict_proba([[3]])
print("predicted probability [N/Y]", pred1)
pred2 = module_1.predict([[5]])
print(pred2)

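However, the machine-learning version has a problem: the model is fitted on only 15 values, because train_size=0.8 keeps 15 of the 19 rows for training and holds the other 4 back for testing.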
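If the goal is to compare coefficients with SPSS or statsmodels rather than to measure predictive accuracy, one option is to skip the split and fit on all 19 rows. The sketch below does that. It also sets a very large `C`, on the assumption that the mismatch comes partly from scikit-learn's default L2 penalty (`C=1.0`), which statsmodels' `Logit` does not apply; the variable names `hours` and `passed` are mine.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Same data as the script above: 19 rows, one feature column.
hours = np.arange(1, 20).reshape(-1, 1)
passed = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1])

# A very large C makes the default L2 penalty negligible, which should bring the
# estimates close to an unpenalized maximum-likelihood fit.
model = LogisticRegression(C=1e9, max_iter=1000)
model.fit(hours, passed)  # all 19 rows, no train/test split

print("rows used:", hours.shape[0])
print("coef:", model.coef_)
print("intercept_:", model.intercept_)
```

scikit-learn fits an intercept by default (`fit_intercept=True`), so no constant column is needed here. It also does not report standard errors or p-values, so for significance testing the statsmodels route is still the one to use.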

Link to the statsmodels library

Statsmodels: http://www.statsmodels.org/stable/index.html
