Python Machine Learning (Getting Started)

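Machine learning steps (using sklearn, the Python machine learning package):

1. Define the question

2. Understand the data

3. Clean the data

4. Build the model

5. Evaluate the model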

Part 1: Simple linear regression

1. Dataset

from collections import OrderedDict
import pandas as pd
# '學習時間' = study time in hours, '分數' = exam score
examDict={'學習時間':[0.5,0.75,1.00,1.25,1.50,1.75,1.75,2.00,2.25,2.50,2.75,3.00,3.25,3.50,4.00,4.25,4.50,4.75,5.00,5.50],
          '分數':[10,22,13,43,20,22,33,50,62,48,55,75,62,73,81,76,64,82,90,93]
}
examOrderDict=OrderedDict(examDict)
examDf=pd.DataFrame(examOrderDict)

examDf.head()

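# Extract the feature: '學習時間' (study time)
exam_X=examDf.loc[:,'學習時間']
# Extract the label: '分數' (score)
exam_y=examDf.loc[:,'分數']

2. Plot the data to inspect its distribution

# Scatter plot with matplotlib
import matplotlib.pyplot as plt
plt.scatter(exam_X,exam_y,color="b",label="exam data")
# Add axis labels
plt.xlabel("Hours")
plt.ylabel("Score")
# Show the figure
plt.show()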

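3. Create the training and test sets

from sklearn.model_selection import train_test_split
# Keep 80% of the rows for training; the split is random, so results vary between runs
X_train,X_test,y_train,y_test=train_test_split(exam_X,
                                              exam_y,
                                              train_size= .8)
# Print the shapes of the features and labels
print('Original features:',exam_X.shape,
     'training features:',X_train.shape,
     'test features:',X_test.shape)
print('Original labels:',exam_y.shape,
     'training labels:',y_train.shape,
     'test labels:',y_test.shape)
print(type(X_train))

Original features: (20,) training features: (16,) test features: (4,)
Original labels: (20,) training labels: (16,) test labels: (4,)
<class 'pandas.core.series.Series'>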
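Because the split is random, the score reported later changes from run to run. A minimal sketch of making the split repeatable, assuming a fixed `random_state` (the value 42 is an arbitrary choice, not something the original code uses):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Study hours and scores from the dataset above, as plain NumPy arrays
hours = np.array([0.5, 0.75, 1.00, 1.25, 1.50, 1.75, 1.75, 2.00, 2.25, 2.50,
                  2.75, 3.00, 3.25, 3.50, 4.00, 4.25, 4.50, 4.75, 5.00, 5.50])
scores = np.array([10, 22, 13, 43, 20, 22, 33, 50, 62, 48,
                   55, 75, 62, 73, 81, 76, 64, 82, 90, 93])

# Two splits with the same random_state select exactly the same rows
split_a = train_test_split(hours, scores, train_size=0.8, random_state=42)
split_b = train_test_split(hours, scores, train_size=0.8, random_state=42)

same_split = all(np.array_equal(p, q) for p, q in zip(split_a, split_b))
print(same_split, split_a[0].shape, split_a[1].shape)
```

Without `random_state`, each call draws a different 16/4 partition, which is why the R² value shown in this article will not be reproduced exactly.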

# Correlation coefficient matrix
rDf=examDf.corr()
rDf
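`DataFrame.corr()` computes the Pearson correlation coefficient by default. As a rough check, the same number can be computed by hand from its definition; the English column names below are stand-ins chosen for this sketch, not the names used in the article's DataFrame.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'hours': [0.5, 0.75, 1.00, 1.25, 1.50, 1.75, 1.75, 2.00, 2.25, 2.50,
              2.75, 3.00, 3.25, 3.50, 4.00, 4.25, 4.50, 4.75, 5.00, 5.50],
    'score': [10, 22, 13, 43, 20, 22, 33, 50, 62, 48,
              55, 75, 62, 73, 81, 76, 64, 82, 90, 93],
})

x = df['hours'].to_numpy()
y = df['score'].to_numpy()

# Pearson r = sum((x - mean_x)(y - mean_y)) / sqrt(sum((x - mean_x)^2) * sum((y - mean_y)^2))
dx = x - x.mean()
dy = y - y.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# The off-diagonal entry of the correlation matrix is the same quantity
r_pandas = df.corr().loc['hours', 'score']
print(r_manual, r_pandas)
```

A value close to 1 indicates a strong positive linear relationship, which is what makes a straight-line model a reasonable first choice here.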

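# Convert the training features to a 2-D array with n rows and 1 column.
# Current pandas has no Series.reshape, so reshape the underlying NumPy array via .values
X_train=X_train.values.reshape(-1,1)
# Convert the test features to a 2-D array with n rows and 1 column
X_test=X_test.values.reshape(-1,1)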
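sklearn estimators expect the feature matrix to be two-dimensional (samples by features), which is why a single feature column has to be reshaped. A small sketch with made-up values showing what `reshape(-1, 1)` does to the shape:

```python
import pandas as pd

s = pd.Series([0.5, 0.75, 1.0, 1.25])   # one feature, 1-D: shape (4,)

# -1 lets NumPy infer the number of rows; 1 fixes a single column
col = s.values.reshape(-1, 1)            # 2-D: shape (4, 1)

print(s.shape, col.shape)
```

An alternative with the same effect is to select the column as a DataFrame in the first place (for example `s.to_frame()`), which is already two-dimensional.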

4. Train the model

from sklearn.linear_model import LinearRegression
model=LinearRegression()
model.fit(X_train,y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

# Best-fit line: score = a + b * hours
a=model.intercept_
b=model.coef_
print('Best-fit line: intercept',a,', regression coefficient',b)

Best-fit line: intercept 5.83626318433 , regression coefficient [ 17.02461075]
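The intercept and coefficient fully describe the fitted line, so a prediction is just `a + b * hours`. The sketch below checks that against `model.predict`; to stay deterministic it fits on all 20 rows rather than on a random training split, which is a simplification for illustration and will give slightly different numbers from the run above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

hours = np.array([0.5, 0.75, 1.00, 1.25, 1.50, 1.75, 1.75, 2.00, 2.25, 2.50,
                  2.75, 3.00, 3.25, 3.50, 4.00, 4.25, 4.50, 4.75, 5.00, 5.50]).reshape(-1, 1)
scores = np.array([10, 22, 13, 43, 20, 22, 33, 50, 62, 48,
                   55, 75, 62, 73, 81, 76, 64, 82, 90, 93])

model = LinearRegression().fit(hours, scores)
a = model.intercept_      # scalar intercept
b = model.coef_[0]        # slope for the single feature

new_hours = np.array([[3.0]])            # one student, 3 hours of study
manual = a + b * new_hours[:, 0]         # prediction from the line equation
from_model = model.predict(new_hours)    # prediction from sklearn

print(manual, from_model)
```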

5. Evaluate model accuracy: the coefficient of determination R² shows how well the model fits

model.score(X_test,y_test)

0.81147243948428172
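For a regressor, `model.score` returns R², defined as 1 minus the ratio of the residual sum of squares to the total sum of squares on the data passed in. A minimal sketch that recomputes it by hand, assuming `random_state=0` so the split is repeatable (the article's own split is unseeded, so its 0.81 will not be reproduced):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

hours = np.array([0.5, 0.75, 1.00, 1.25, 1.50, 1.75, 1.75, 2.00, 2.25, 2.50,
                  2.75, 3.00, 3.25, 3.50, 4.00, 4.25, 4.50, 4.75, 5.00, 5.50]).reshape(-1, 1)
scores = np.array([10, 22, 13, 43, 20, 22, 33, 50, 62, 48,
                   55, 75, 62, 73, 81, 76, 64, 82, 90, 93])

X_train, X_test, y_train, y_test = train_test_split(
    hours, scores, train_size=0.8, random_state=0)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

ss_res = ((y_test - y_pred) ** 2).sum()          # error left after fitting
ss_tot = ((y_test - y_test.mean()) ** 2).sum()   # error of always predicting the mean
r2_manual = 1 - ss_res / ss_tot

r2_sklearn = model.score(X_test, y_test)
print(r2_manual, r2_sklearn)
```

With only four test rows the value is quite sensitive to which rows land in the test set.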

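Part 2: Logistic regression (LogisticRegression) for classification problems

1. Import the data

# Logistic regression
# Dataset: '學習時間' = study time in hours, '通過考試' = passed the exam (1 = pass, 0 = fail)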
from collections import OrderedDict
import pandas as pd
examDict={'學習時間':[0.5,0.75,1.00,1.25,1.50,1.75,1.75,2.00,2.25,2.50,2.75,3.00,3.25,3.50,4.00,4.25,4.50,4.75,5.00,5.50],
         '通過考試':[0,0,0,0,0,0,1,0,1,0,1,0,1,0,1,1,1,1,1,1]}
examOrderDict=OrderedDict(examDict)
examDf=pd.DataFrame(examOrderDict)
examDf.head()

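2. Draw a scatter plot

import matplotlib.pyplot as plt
# Plot straight from the new examDf: at this point exam_X and exam_y still hold the Part 1 data
plt.scatter(examDf.loc[:,'學習時間'],examDf.loc[:,'通過考試'],color="b",label="exam data")
plt.xlabel("Hours")
plt.ylabel("Pass")
plt.show()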

3. Extract the feature and the label

# Extract the feature: '學習時間' (study time)
exam_X=examDf.loc[:,'學習時間']
# Extract the label: '通過考試' (passed the exam)
exam_y=examDf.loc[:,'通過考試']

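4. Create the training and test sets

from sklearn.model_selection import train_test_split
# train_test_split returns X_train, X_test, y_train, y_test, in that order
X_train,X_test,y_train,y_test=train_test_split(exam_X,
                                              exam_y,
                                              train_size=0.8)
# Print the shapes of the features and labels
print('Original features:',exam_X.shape,
     'training features:',X_train.shape,
     'test features:',X_test.shape)
print('Original labels:',exam_y.shape,
     'training labels:',y_train.shape,
     'test labels:',y_test.shape)
print(type(X_train))

Original features: (20,) training features: (16,) test features: (4,)

Original labels: (20,) training labels: (16,) test labels: (4,)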

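# Convert the training features to a 2-D array with n rows and 1 column.
# Current pandas has no Series.reshape, so reshape the underlying NumPy array via .values
X_train=X_train.values.reshape(-1,1)
# Convert the test features to a 2-D array with n rows and 1 column
X_test=X_test.values.reshape(-1,1)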

5. Train the model

# Step 1: import logistic regression
from sklearn.linear_model import LogisticRegression
# Create the model
model=LogisticRegression()
# Train the model
model.fit(X_train,y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

6. Evaluate the model

model.score(X_test,y_test)

0.75
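For a classifier, `model.score` returns accuracy: the number of correctly classified samples divided by the total. The sketch below recomputes it by hand, assuming `random_state=0` for a repeatable split (so the value may differ from the 0.75 above, which came from an unseeded split).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

hours = np.array([0.5, 0.75, 1.00, 1.25, 1.50, 1.75, 1.75, 2.00, 2.25, 2.50,
                  2.75, 3.00, 3.25, 3.50, 4.00, 4.25, 4.50, 4.75, 5.00, 5.50]).reshape(-1, 1)
passed = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1])

X_train, X_test, y_train, y_test = train_test_split(
    hours, passed, train_size=0.8, random_state=0)

model = LogisticRegression().fit(X_train, y_train)

# accuracy = correctly classified / total
predictions = model.predict(X_test)
accuracy_manual = (predictions == y_test).sum() / len(y_test)

accuracy_sklearn = model.score(X_test, y_test)
print(accuracy_manual, accuracy_sklearn)
```

With four test samples the accuracy can only take the values 0, 0.25, 0.5, 0.75 or 1, so it is a coarse estimate.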

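7. Predict the probability of passing the exam

(1) The predict_proba function:

# Apply the model. predict_proba expects a 2-D array (one row per sample), so 3 hours is passed as [[3]].
# The result holds the probability of class 0 (fail) and of class 1 (pass).
model.predict_proba([[3]])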

array([[ 0.33884591, 0.66115409]])

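(2) The predict function:

# Prediction: the model's predict method returns a class label. Here the input feature is
# a study time of 3 hours, and the returned label is 1, meaning the student is predicted to pass.
pred=model.predict([[3]])
print(pred)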

[1]

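(3) The logistic function:

import numpy as np
a=model.intercept_
b=model.coef_
x=3
# Linear part z = a + b*x, then the logistic (sigmoid) function maps z to a probability
z=a+b*x
pred_Y=1/(1+np.exp(-z))
print('Predicted probability:',pred_Y)

Predicted probability: [[ 0.66115409]]

This shows that with 3 hours of study, the probability of passing the exam is 66.1%, the same value predict_proba returned for class 1.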
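Since the logistic function equals 0.5 exactly when z = 0, the study time at which the model switches from predicting fail to predicting pass is x = -a/b. The following sketch computes that threshold; it fits on all 20 rows to stay deterministic, which is an assumption made for illustration, so the coefficients differ from the run above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

hours = np.array([0.5, 0.75, 1.00, 1.25, 1.50, 1.75, 1.75, 2.00, 2.25, 2.50,
                  2.75, 3.00, 3.25, 3.50, 4.00, 4.25, 4.50, 4.75, 5.00, 5.50]).reshape(-1, 1)
passed = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1])

model = LogisticRegression().fit(hours, passed)
a = model.intercept_[0]
b = model.coef_[0, 0]

# z = a + b*x is zero at x = -a/b, where the sigmoid gives probability 0.5
boundary_hours = -a / b
p_at_boundary = model.predict_proba([[boundary_hours]])[0, 1]

print('decision boundary (hours):', boundary_hours)
print('P(pass) at the boundary:', p_at_boundary)
```

Students with more study time than `boundary_hours` are predicted to pass (label 1), those with less to fail (label 0).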

Three types of data:

Numerical data (quantitative data)
Categorical data (qualitative data)
Time-series data (sequential data)

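Summary:

1. Features are the attributes of the data; labels are the outcomes to be predicted.

2. Training data and test data: training data is used to build the machine learning model; test data is used to verify the model's accuracy.

3. Classification problems:
   1) In essence: finding a decision surface (D.S.)
   2) Metric for evaluating a classification algorithm: accuracy = number classified correctly / total number

4. Logistic regression:
   1) Logistic regression is used for binary classification problems
   2) The logistic function (sigmoid function)

5. Logistic regression in Python:
   1) Create the training and test sets: sklearn's train_test_split
   2) Logistic regression: sklearn's LogisticRegression

6. Three types of data:
   1) Numerical data
   2) Categorical data
   3) Time-series data

7. The difference between classification and regression: classification predicts a discrete category (pass or fail), while regression predicts a continuous value (an exam score).

8. The difference between a machine learning algorithm and a machine learning model: machine learning model = machine learning algorithm + training data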
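The last point can be illustrated with a small sketch: the same algorithm (`LinearRegression`) trained on two different subsets of the exam data yields two different models. Splitting the data into a first and second half is an arbitrary choice made only for this illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

hours = np.array([0.5, 0.75, 1.00, 1.25, 1.50, 1.75, 1.75, 2.00, 2.25, 2.50,
                  2.75, 3.00, 3.25, 3.50, 4.00, 4.25, 4.50, 4.75, 5.00, 5.50]).reshape(-1, 1)
scores = np.array([10, 22, 13, 43, 20, 22, 33, 50, 62, 48,
                   55, 75, 62, 73, 81, 76, 64, 82, 90, 93])

# Same algorithm, two different training sets
model_low = LinearRegression().fit(hours[:10], scores[:10])    # students with less study time
model_high = LinearRegression().fit(hours[10:], scores[10:])   # students with more study time

slope_low = model_low.coef_[0]
slope_high = model_high.coef_[0]
print('slope from first half:', slope_low)
print('slope from second half:', slope_high)
```

The algorithm is identical in both cases; only the training data differs, and that alone changes the fitted parameters.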

Part 3: Machine learning algorithms

1. Random forest

from sklearn.ensemble import RandomForestClassifier
model=RandomForestClassifier(n_estimators=100)

2. Support vector machine (SVM)

from sklearn.svm import SVC,LinearSVC
model=SVC()

3.Gradient Boosting Classifier

from sklearn.ensemble import GradientBoostingClassifier
model= GradientBoostingClassifier()

4. K-nearest neighbors

from sklearn.neighbors import KNeighborsClassifier
model=KNeighborsClassifier(n_neighbors=3)

5.Gaussian Naive Bayes

from sklearn.naive_bayes import GaussianNB
model=GaussianNB()
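All five classifiers share the same `fit` / `score` interface as `LogisticRegression`, so they can be swapped in without changing the rest of the workflow. Here is a minimal sketch that tries each one on the pass/fail data from Part 2; the `random_state` values and the dictionary of models are choices made for this example, not part of the original article.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split

hours = np.array([0.5, 0.75, 1.00, 1.25, 1.50, 1.75, 1.75, 2.00, 2.25, 2.50,
                  2.75, 3.00, 3.25, 3.50, 4.00, 4.25, 4.50, 4.75, 5.00, 5.50]).reshape(-1, 1)
passed = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1])

X_train, X_test, y_train, y_test = train_test_split(
    hours, passed, train_size=0.8, random_state=0)

models = {
    'Random forest': RandomForestClassifier(n_estimators=100, random_state=0),
    'SVM': SVC(),
    'Gradient boosting': GradientBoostingClassifier(random_state=0),
    'K-nearest neighbors': KNeighborsClassifier(n_neighbors=3),
    'Gaussian naive Bayes': GaussianNB(),
}

# Every estimator is trained and scored the same way
accuracies = {}
for name, clf in models.items():
    clf.fit(X_train, y_train)
    accuracies[name] = clf.score(X_test, y_test)
    print(name, accuracies[name])
```

With a dataset this small the accuracies say little about which algorithm is better; the point is only that the calling pattern stays the same.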
