Python Decision Trees: Pruning Decision Trees with Python + sklearn

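DecisionTreeClassifier provides parameters such as min_samples_leaf and max_depth to prevent a tree from overfitting. Cost complexity pruning provides another option to control the size of a tree. In DecisionTreeClassifier, this pruning technique is parameterized by the cost complexity parameter, ccp_alpha. Greater values of ccp_alpha increase the number of nodes pruned. Here we only show the effect of ccp_alpha on regularizing the trees and how to choose a ccp_alpha based on validation scores.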

See also the section on minimal cost complexity pruning in the scikit-learn documentation for details on pruning.

print(__doc__)
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier
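As a quick illustration of the parameters named in the introduction, the sketch below fits the same classifier under a few size controls and reports the resulting tree sizes. The specific values (max_depth=3, min_samples_leaf=10, ccp_alpha=0.01) are arbitrary choices made for illustration and do not come from the original example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Illustrative settings only; the numeric values are assumptions.
settings = {
    "unconstrained": {},
    "max_depth=3": {"max_depth": 3},
    "min_samples_leaf=10": {"min_samples_leaf": 10},
    "ccp_alpha=0.01": {"ccp_alpha": 0.01},
}

node_counts = {}
for name, params in settings.items():
    tree = DecisionTreeClassifier(random_state=0, **params).fit(X_train, y_train)
    node_counts[name] = tree.tree_.node_count
    print(f"{name:>20}: {tree.tree_.node_count} nodes, depth {tree.get_depth()}")
```

max_depth and min_samples_leaf constrain the tree while it is being grown, whereas ccp_alpha removes branches from the fully grown tree afterwards, so the pruned tree is never larger than the unconstrained one.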

Total impurity of leaves vs effective alphas of the pruned tree

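Minimal cost complexity pruning recursively finds the node with the "weakest link". The weakest link is characterized by an effective alpha, and the nodes with the smallest effective alpha are pruned first. To get an idea of which values of ccp_alpha could be appropriate, scikit-learn provides DecisionTreeClassifier.cost_complexity_pruning_path, which returns the effective alphas and the corresponding total leaf impurities at each step of the pruning process. As alpha increases, more of the tree is pruned, which increases the total impurity of its leaves.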
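For reference, the quantities behind the "effective alpha" can be written out as follows. This follows the definition given in the scikit-learn user guide on minimal cost complexity pruning; the symbols are introduced here for illustration and do not appear in the text above. $R(T)$ is the total sample-weighted impurity of the leaves of tree $T$, $|\widetilde{T}|$ is the number of leaves, $T_t$ is the branch rooted at node $t$, and $R(t)$ is the impurity of $t$ if it were turned into a leaf.

```latex
% Cost complexity measure of a tree T for a given alpha >= 0
R_\alpha(T) = R(T) + \alpha \, |\widetilde{T}|

% Effective alpha of a non-terminal node t: the value of alpha at which
% collapsing the branch T_t into the single leaf t leaves R_alpha unchanged
\alpha_{\mathrm{eff}}(t) = \frac{R(t) - R(T_t)}{|\widetilde{T}_t| - 1}
```

The node with the smallest $\alpha_{\mathrm{eff}}(t)$ is the weakest link and is pruned first; pruning stops once the smallest remaining effective alpha exceeds ccp_alpha.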

# Load the data and hold out a test set
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Compute the pruning path: effective alphas and total leaf impurities
clf = DecisionTreeClassifier(random_state=0)
path = clf.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas, impurities = path.ccp_alphas, path.impurities
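To make the shape of the returned path more concrete, the following sketch inspects it directly. It repeats the data loading so that it runs on its own; the tolerance `tol` is an assumption added to absorb floating-point noise, and the monotonicity checks reflect what the description above leads one to expect rather than a documented guarantee.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(
    X_train, y_train
)

# One (alpha, impurity) pair per pruning step
print("pruning steps:", len(path.ccp_alphas))
print("smallest / largest effective alpha:", path.ccp_alphas[0], path.ccp_alphas[-1])

# Both sequences are expected to be non-decreasing along the path
tol = 1e-12  # assumed tolerance for floating-point noise
print("alphas non-decreasing:", bool(np.all(np.diff(path.ccp_alphas) >= -tol)))
print("impurities non-decreasing:", bool(np.all(np.diff(path.impurities) >= -tol)))
```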

In the following plot, the maximum effective alpha value is removed, because it corresponds to the trivial tree with only one node.

fig, ax = plt.subplots()
ax.plot(ccp_alphas[:-1], impurities[:-1], marker='o', drawstyle="steps-post")
ax.set_xlabel("effective alpha")
ax.set_ylabel("total impurity of leaves")
ax.set_title("Total Impurity vs effective alpha for training set")

[Figure: Total Impurity vs effective alpha for training set]

Output:

Text(0.5, 1.0, 'Total Impurity vs effective alpha for training set')

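Next, we train one decision tree for each effective alpha. The last value in ccp_alphas is the alpha value that prunes the whole tree, leaving the last tree, clfs[-1], with a single node.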

# Fit one tree per effective alpha
clfs = []
for ccp_alpha in ccp_alphas:
    clf = DecisionTreeClassifier(random_state=0, ccp_alpha=ccp_alpha)
    clf.fit(X_train, y_train)
    clfs.append(clf)
print("Number of nodes in the last tree is: {} with ccp_alpha: {}".format(
    clfs[-1].tree_.node_count, ccp_alphas[-1]))

Output:

Number of nodes in the last tree is: 1 with ccp_alpha: 0.3272984419327777

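For the remainder of this example, we remove the last element of clfs and ccp_alphas, because it is the trivial tree with only one node. Here we show that the number of nodes and the depth of the tree decrease as alpha increases.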

# Drop the trivial single-node tree
clfs = clfs[:-1]
ccp_alphas = ccp_alphas[:-1]

node_counts = [clf.tree_.node_count for clf in clfs]
depth = [clf.tree_.max_depth for clf in clfs]
fig, ax = plt.subplots(2, 1)
ax[0].plot(ccp_alphas, node_counts, marker='o', drawstyle="steps-post")
ax[0].set_xlabel("alpha")
ax[0].set_ylabel("number of nodes")
ax[0].set_title("Number of nodes vs alpha")
ax[1].plot(ccp_alphas, depth, marker='o', drawstyle="steps-post")
ax[1].set_xlabel("alpha")
ax[1].set_ylabel("depth of tree")
ax[1].set_title("Depth vs alpha")
fig.tight_layout()

[Figure: Number of nodes vs alpha (top); Depth vs alpha (bottom)]

Accuracy vs alpha for training and testing sets

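When ccp_alpha is set to zero and the other parameters of DecisionTreeClassifier are left at their defaults, the tree overfits, leading to 100% training accuracy and 88% testing accuracy. As alpha increases, more of the tree is pruned, producing a decision tree that generalizes better. In this example, setting ccp_alpha=0.015 maximizes the testing accuracy.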

# Accuracy of every pruned tree on the training and testing sets
train_scores = [clf.score(X_train, y_train) for clf in clfs]
test_scores = [clf.score(X_test, y_test) for clf in clfs]

fig, ax = plt.subplots()
ax.set_xlabel("alpha")
ax.set_ylabel("accuracy")
ax.set_title("Accuracy vs alpha for training and testing sets")
ax.plot(ccp_alphas, train_scores, marker='o', label="train",
        drawstyle="steps-post")
ax.plot(ccp_alphas, test_scores, marker='o', label="test",
        drawstyle="steps-post")
ax.legend()
plt.show()

[Figure: Accuracy vs alpha for training and testing sets]
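The plot above reads the best alpha off the test set. One common alternative, sketched below, is to choose ccp_alpha by cross-validation on the training data alone and only then score on the held-out test set. This is not part of the original example: the use of GridSearchCV, the 5-fold setting, and the clipping and deduplication of the candidate alphas are all assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Candidate alphas come from the pruning path of the training data.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(
    X_train, y_train
)
# Drop the last alpha (single-node tree), guard against tiny negative
# values from floating-point error, and remove duplicates.
candidate_alphas = np.unique(np.clip(path.ccp_alphas[:-1], 0.0, None))

# 5-fold cross-validation over the candidate alphas (assumed setting)
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"ccp_alpha": candidate_alphas},
    cv=5,
)
search.fit(X_train, y_train)

best_alpha = search.best_params_["ccp_alpha"]
test_accuracy = search.score(X_test, y_test)
print("ccp_alpha selected by cross-validation:", best_alpha)
print("test accuracy of the refit tree:", test_accuracy)
```

Because the test set plays no part in the selection, the reported test accuracy is a less optimistic estimate than the maximum read from the plot, and the selected alpha need not equal 0.015.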
