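The [Axu Machine Learning in Practice] series introduces machine learning algorithms and models together with hands-on case studies. Likes and follows are welcome; let's learn and exchange ideas together.

[Axu Machine Learning in Practice] [36] Diabetes Prediction: Decision Tree Modeling and Visualization

1. Importing and inspecting the data

Follow the WeChat official account (GZH) 阿旭算法與機器學習 and reply "ML36" to get the dataset, source code, and project documentation for this article.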

# Import the required packages
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics
import matplotlib.pyplot as plt
import seaborn as sns
# Jupyter magic: render plots inline in the notebook
%matplotlib inline

# The CSV file has no header row, so the column names are supplied explicitly
col_names = ['pregnant', 'glucose', 'bp', 'skin', 'insulin', 'bmi', 'pedigree', 'age', 'label']
df = pd.read_csv("pima-indians-diabetes.csv", header=None, names=col_names)

# Preview the first five rows
df.head()

   pregnant  glucose  bp  skin  insulin   bmi  pedigree  age  label
0         6      148  72    35        0  33.6     0.627   50      1
1         1       85  66    29        0  26.6     0.351   31      0
2         8      183  64     0        0  23.3     0.672   32      1
3         1       89  66    23       94  28.1     0.167   21      0
4         0      137  40    35      168  43.1     2.288   33      1

# Correlation matrix of the features (the label column is excluded)
corr = df.iloc[:, :-1].corr()
sns.heatmap(corr,
            xticklabels=corr.columns.values,
            yticklabels=corr.columns.values)

# Display the numeric values behind the heatmap
corr

          pregnant   glucose        bp      skin   insulin       bmi  pedigree       age
pregnant  1.000000  0.129459  0.141282 -0.081672 -0.073535  0.017683 -0.033523  0.544341
glucose   0.129459  1.000000  0.152590  0.057328  0.331357  0.221071  0.137337  0.263514
bp        0.141282  0.152590  1.000000  0.207371  0.088933  0.281805  0.041265  0.239528
skin     -0.081672  0.057328  0.207371  1.000000  0.436783  0.392573  0.183928 -0.113970
insulin  -0.073535  0.331357  0.088933  0.436783  1.000000  0.197859  0.185071 -0.042163
bmi       0.017683  0.221071  0.281805  0.392573  0.197859  1.000000  0.140647  0.036242
pedigree -0.033523  0.137337  0.041265  0.183928  0.185071  0.140647  1.000000  0.033561
age       0.544341  0.263514  0.239528 -0.113970 -0.042163  0.036242  0.033561  1.000000
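A heatmap is easy to glance at but hard to rank by eye. The sketch below is one possible way to list the most strongly correlated feature pairs. It uses only the standard library and the values printed in the matrix above; the helper name `strongest_pairs` is made up for this illustration.

```python
# Correlation values copied from the matrix printed above.
names = ["pregnant", "glucose", "bp", "skin", "insulin", "bmi", "pedigree", "age"]
corr_rows = [
    [1.000000, 0.129459, 0.141282, -0.081672, -0.073535, 0.017683, -0.033523, 0.544341],
    [0.129459, 1.000000, 0.152590, 0.057328, 0.331357, 0.221071, 0.137337, 0.263514],
    [0.141282, 0.152590, 1.000000, 0.207371, 0.088933, 0.281805, 0.041265, 0.239528],
    [-0.081672, 0.057328, 0.207371, 1.000000, 0.436783, 0.392573, 0.183928, -0.113970],
    [-0.073535, 0.331357, 0.088933, 0.436783, 1.000000, 0.197859, 0.185071, -0.042163],
    [0.017683, 0.221071, 0.281805, 0.392573, 0.197859, 1.000000, 0.140647, 0.036242],
    [-0.033523, 0.137337, 0.041265, 0.183928, 0.185071, 0.140647, 1.000000, 0.033561],
    [0.544341, 0.263514, 0.239528, -0.113970, -0.042163, 0.036242, 0.033561, 1.000000],
]


def strongest_pairs(names, matrix, top_n=3):
    """Return the top_n feature pairs ranked by absolute correlation."""
    pairs = []
    # Walk the upper triangle only, so each pair appears once
    # and the diagonal (always 1.0) is skipped.
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            pairs.append((names[i], names[j], matrix[i][j]))
    pairs.sort(key=lambda p: abs(p[2]), reverse=True)
    return pairs[:top_n]


for a, b, r in strongest_pairs(names, corr_rows):
    print(f"{a:>8} ~ {b:<8} r = {r:+.3f}")
```

In this matrix the strongest pair is pregnant and age, followed by skin and insulin, then skin and bmi. None of the off-diagonal values is close to 1, so no feature is an obvious duplicate of another.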


2. Training and visualizing the decision tree model

# Select the features used for prediction
feature_cols = ['pregnant', 'insulin', 'bmi', 'age', 'glucose', 'bp', 'pedigree']
X = df[feature_cols]  # features
y = df.label  # class label

# Split the data into training and test sets (70% training, 30% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
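To make the 70/30 split concrete, here is a small sketch of how a fractional `test_size` becomes row counts. It assumes scikit-learn's usual rounding (test rows rounded up, the remainder used for training). The figure of 768 rows is also an assumption: it is the size of the commonly distributed version of this dataset, and the article itself does not state the row count. `split_sizes` is a hypothetical helper, not a scikit-learn function.

```python
import math


def split_sizes(n_samples, test_size):
    """Approximate the train/test row counts for a fractional test_size."""
    # Test rows are rounded up; every remaining row goes to training.
    n_test = math.ceil(test_size * n_samples)
    n_train = n_samples - n_test
    return n_train, n_test


# Assumed dataset size (not stated in the article).
n_train, n_test = split_sizes(768, 0.3)
print("train rows:", n_train, "test rows:", n_test)
```

Under these assumptions the test set has 231 rows, so each accuracy figure reported later is a fraction with 231 in the denominator.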

2.1 Decision tree model

# Create the decision tree classifier
clf = DecisionTreeClassifier(criterion='entropy')

# Train the model
clf = clf.fit(X_train, y_train)

# Predict on the test set with the trained model
y_pred = clf.predict(X_test)

# Accuracy of the model
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))

Accuracy: 0.7489177489177489
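With `criterion='entropy'`, each split is chosen to reduce the entropy of the class labels as much as possible. The sketch below shows the idea for a single threshold on a single feature. The six glucose values and labels are toy numbers invented for illustration, and the function names are made up as well; this is a simplified picture, not scikit-learn's actual implementation.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def information_gain(values, labels, threshold):
    """Entropy reduction from splitting on `value <= threshold`."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    if not left or not right:
        return 0.0  # a split that leaves one side empty separates nothing
    n = len(labels)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - weighted


# Hypothetical toy data: low glucose -> class 0, high glucose -> class 1.
glucose = [85, 89, 99, 137, 148, 183]
label = [0, 0, 0, 1, 1, 1]

for t in (86, 100, 140):
    print(f"threshold {t}: gain = {information_gain(glucose, label, t):.3f}")
```

On this toy data the threshold 100 separates the two classes perfectly, so it yields the largest gain and is the split a tree would prefer.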

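2.2 Visualizing the trained decision tree

Note: drawing the tree requires two extra packages, which can be installed with the following commands: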

conda install python-graphviz

conda install pydotplus

from io import StringIO
from sklearn.tree import export_graphviz
from IPython.display import Image
import pydotplus
from sklearn import tree

# Export the fitted tree in Graphviz DOT format into an in-memory buffer
dot_data = StringIO()
export_graphviz(clf, out_file=dot_data,
                filled=True, rounded=True,
                special_characters=True, feature_names=feature_cols, class_names=['0', '1'])
# Render the DOT source, save it as a PNG, and display it in the notebook
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
graph.write_png('diabetes.png')
Image(graph.create_png())

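# Build a new decision tree with a limited maximum depth to reduce overfitting
clf = tree.DecisionTreeClassifier(
    criterion='entropy',
    max_depth=4,  # maximum depth of the tree; limits overfitting
    min_weight_fraction_leaf=0.01  # minimum share of the total sample weight each leaf must hold (a fraction, not a count); limits overfitting
    )

# Train the model
clf.fit(X_train, y_train)

# Predict
y_pred = clf.predict(X_test)

# Performance of the model
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))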

Accuracy: 0.7705627705627706
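The depth-limited tree scores higher on the test set than the unrestricted one, which is the usual sign that the first tree overfit the training data. One way to check for this is to compare training and test accuracy side by side. The sketch below does so on synthetic data from `make_classification`, used here only as a stand-in for the diabetes CSV; all variable names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption, not the article's dataset).
X_demo, y_demo = make_classification(n_samples=768, n_features=7,
                                     n_informative=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.3, random_state=1)

# Same criterion as in the article; only the depth limit differs.
full = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X_tr, y_tr)
pruned = DecisionTreeClassifier(criterion="entropy", max_depth=4,
                                random_state=0).fit(X_tr, y_tr)

for name, model in [("unrestricted", full), ("max_depth=4", pruned)]:
    print(f"{name:>12}: depth={model.get_depth():2d}  "
          f"train={model.score(X_tr, y_tr):.3f}  test={model.score(X_te, y_te):.3f}")
```

A large gap between training and test accuracy for the unrestricted tree suggests that it has memorized the training rows. The exact numbers depend on the data, so it is worth running the same comparison on the real `X_train` and `X_test`.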

# Visualize the depth-limited tree (imports repeated so this cell can run on its own)
from io import StringIO
from IPython.display import Image
from sklearn.tree import export_graphviz
import pydotplus

dot_data = StringIO()
export_graphviz(clf, out_file=dot_data,
                filled=True, rounded=True,
                special_characters=True, feature_names=feature_cols, class_names=['0', '1'])
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())

graph.write_png('diabetes2.png')
Image(graph.create_png())
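If Graphviz is not available, scikit-learn can also print a fitted tree as plain-text rules with `sklearn.tree.export_text`. The sketch below is a possible fallback, not part of the original workflow. It trains a very small tree on synthetic data and reuses the article's feature names purely as labels, so the thresholds it prints say nothing about diabetes.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Feature names borrowed from the article; the data itself is synthetic.
feature_names = ['pregnant', 'insulin', 'bmi', 'age', 'glucose', 'bp', 'pedigree']
X_demo, y_demo = make_classification(n_samples=200, n_features=7,
                                     n_informative=4, random_state=0)

# A shallow tree keeps the printed rules short.
small_tree = DecisionTreeClassifier(criterion="entropy", max_depth=2,
                                    random_state=0).fit(X_demo, y_demo)

# One line per node: split conditions on the way down, predicted class at the leaves.
rules = export_text(small_tree, feature_names=feature_names)
print(rules)
```

With the article's own model, the equivalent call would be `export_text(clf, feature_names=feature_cols)`.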

2.3 Using a random forest model

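from sklearn.ensemble import RandomForestClassifier

# Random forest; tune the parameters to look for better results
rf = RandomForestClassifier(
    criterion='entropy',
    n_estimators=1,  # number of trees; with 1 the forest is a single tree
    max_depth=5,  # maximum depth of each tree; limits overfitting
    min_samples_split=10,  # minimum number of samples a node needs before it can be split further
    #min_weight_fraction_leaf=0.02 # minimum share of the total sample weight each leaf must hold (a fraction); limits overfitting
    )

# Train the model
rf.fit(X_train, y_train)

# Predict
y_pred = rf.predict(X_test)

# Accuracy of the model
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))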

Accuracy: 0.7402597402597403
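The run above uses `n_estimators=1`, so the "forest" contains a single tree trained on a bootstrap sample. Its score also changes from run to run because no `random_state` is set. The sketch below is a hedged illustration of how one might compare several forest sizes and inspect feature importances. It runs on synthetic data rather than the diabetes CSV, and the feature names are attached only for readability.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption, not the article's dataset).
feature_names = ['pregnant', 'insulin', 'bmi', 'age', 'glucose', 'bp', 'pedigree']
X_demo, y_demo = make_classification(n_samples=768, n_features=7,
                                     n_informative=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.3, random_state=1)

scores = {}
for n_trees in (1, 10, 100):
    forest = RandomForestClassifier(criterion="entropy", n_estimators=n_trees,
                                    max_depth=5, min_samples_split=10,
                                    random_state=0)  # fixed seed for repeatable runs
    forest.fit(X_tr, y_tr)
    scores[n_trees] = forest.score(X_te, y_te)
    print(f"n_estimators={n_trees:3d}  test accuracy={scores[n_trees]:.3f}")

# Importances of the last (100-tree) forest, largest first.
importances = forest.feature_importances_
for name, value in sorted(zip(feature_names, importances), key=lambda p: -p[1]):
    print(f"{name:>8}: {value:.3f}")
```

Averaging over more trees usually makes the predictions more stable, but whether it improves accuracy on the diabetes data has to be checked on the real split.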

If this article helped you, a like and a follow are appreciated!
