Author: xiaoyu, a mid-career convert to data mining

Originally published by: Python Data Science

Data exploration a hassle? Meet yellowbrick, one of the most powerful visualization tools for feature analysis

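Preface

Anyone who has built models knows that a long stretch of feature engineering comes before the model itself, and that exploratory data analysis is an indispensable part of it: analyzing each feature in detail inevitably calls for visualizations to support our choices and judgments.

There are many visualization tools, but few are designed specifically for exploratory analysis of features. This article introduces a very capable one, yellowbrick, in the hope that it saves exploration time and helps you get a grip on your features quickly.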

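Features

Radar Plot: RadViz

RadViz is a multivariate data visualization algorithm that spaces the features evenly around the circumference of a circle and normalizes each feature's values. Data scientists typically use it to check how separable the classes are: is there something to learn from the feature set, or is there too much noise?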

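# Load the classification data set
# (load_data is assumed to be the helper from the Yellowbrick documentation
# examples; it is not defined in this article)
data = load_data("occupancy")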

# Specify the features of interest and the classes of the target
features = ["temperature", "relative humidity", "light", "C02", "humidity"]
classes = ["unoccupied", "occupied"]

# Extract the instances and target
X = data[features]
y = data.occupancy

# Import the visualizer
from yellowbrick.features import RadViz

# Instantiate the visualizer
visualizer = RadViz(classes=classes, features=features)

visualizer.fit(X, y)      # Fit the data to the visualizer
visualizer.transform(X)   # Transform the data
visualizer.poof()         # Draw/show/poof the data (renamed to show() in Yellowbrick 1.0+)


The radar plot above suggests that, of the five dimensions, temperature has a comparatively large influence on the target class.
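To make the layout described above concrete, here is a minimal sketch of the usual RadViz placement rule, written in plain Python rather than taken from Yellowbrick's source. The function name radviz_point and its arguments are hypothetical. It assumes each feature is min-max normalized to [0, 1] (with max greater than min) and that feature j is anchored on the unit circle at angle 2*pi*j/d.

```python
import math

def radviz_point(row, mins, maxs):
    """Project one instance to 2D with the RadViz rule (illustrative sketch)."""
    d = len(row)
    # Min-max normalize each feature value to [0, 1]
    norm = [(v - lo) / (hi - lo) for v, lo, hi in zip(row, mins, maxs)]
    total = sum(norm)
    if total == 0:
        return (0.0, 0.0)  # no feature pulls the point, so it stays at the center
    # The point is the average of the anchors, weighted by the normalized values
    x = sum(w * math.cos(2 * math.pi * j / d) for j, w in enumerate(norm)) / total
    y = sum(w * math.sin(2 * math.pi * j / d) for j, w in enumerate(norm)) / total
    return (x, y)

mins, maxs = [0.0] * 4, [1.0] * 4
only_first = radviz_point([1.0, 0.0, 0.0, 0.0], mins, maxs)  # pulled to anchor 0
balanced = radviz_point([0.5, 0.5, 0.5, 0.5], mins, maxs)    # pulled equally by all
```

An instance dominated by one feature lands next to that feature's anchor, while an instance with equal normalized values sits at the center, which is why well-separated classes form distinct clouds in the plot.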

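One-Dimensional Ranking: Rank 1D

Rank 1D applies a ranking algorithm that considers one feature at a time. By default it uses the Shapiro-Wilk algorithm to assess how normally the instances are distributed with respect to each feature, then draws a bar chart showing the relative rank of each feature.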

from yellowbrick.features import Rank1D

# Instantiate the 1D visualizer with the Shapiro ranking algorithm
visualizer = Rank1D(features=features, algorithm='shapiro')

visualizer.fit(X, y)                # Fit the data to the visualizer
visualizer.transform(X)             # Transform the data
visualizer.poof()                   # Draw/show/poof the data


PCA Projection

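The PCA decomposition visualizer uses principal component analysis to project high-dimensional data down to two or three dimensions so that each instance can be drawn in a scatter plot. Using PCA means the projected data set can be analyzed along its axes of principal variation and interpreted to determine whether spherical distance metrics can be used.

Biplot

The PCA projection can be enhanced into a biplot, whose points are the projected instances and whose vectors represent the structure of the data in the high-dimensional space. With the proj_features=True flag, a vector for each feature in the data set is drawn on the scatter plot in the direction of maximum variance for that feature. These structures can be used to analyze the importance of a feature to the decomposition, or to find features with related variance for further analysis.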

# Load the regression data set
data = load_data('concrete')

# Specify the features of interest and the target
target = "strength"
features = [
    'cement', 'slag', 'ash', 'water', 'splast', 'coarse', 'fine', 'age'
]

# Extract the instance data and the target
X = data[features]
y = data[target]

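# Import the visualizer (module path as used by the Yellowbrick 0.9 docs)
from yellowbrick.features.pca import PCADecomposition

visualizer = PCADecomposition(scale=True, proj_features=True)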
visualizer.fit_transform(X, y)
visualizer.poof()

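Feature Importance

Feature engineering involves selecting the minimum set of features needed to produce a valid model, because the more features a model contains, the more complex it is (and the sparser the data), and so the more sensitive it is to errors due to variance. A common approach to eliminating features is to describe their relative importance to a model, then eliminate weak features or combinations of features and re-evaluate to see whether the model fares better during cross-validation.

In scikit-learn, Decision Tree models and ensembles of trees such as Random Forest, Gradient Boosting, and AdaBoost provide a feature_importances_ attribute when fitted. The Yellowbrick FeatureImportances visualizer uses this attribute to rank and plot relative importances.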

import matplotlib.pyplot as plt

from sklearn.ensemble import GradientBoostingClassifier

from yellowbrick.features.importances import FeatureImportances

# Create a new matplotlib figure
fig = plt.figure()
ax = fig.add_subplot()

viz = FeatureImportances(GradientBoostingClassifier(), ax=ax)
viz.fit(X, y)
viz.poof()

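Recursive Feature Elimination

Recursive feature elimination (RFE) is a feature selection method that fits a model and removes the weakest feature (or features) until the specified number of features is reached. Features are ranked by the model's coef_ or feature_importances_ attribute, and by recursively eliminating a small number of features per loop, RFE attempts to eliminate dependencies and collinearity that may exist in the model.

RFE requires a specified number of features to keep, but it is usually not known in advance how many features are valid. To find the optimal number, cross-validation is used with RFE to score different feature subsets and select the best-scoring collection. The RFECV visualizer plots the number of features in the model against their cross-validated test score and variability, and marks the selected number of features.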

from sklearn.svm import SVC
from sklearn.datasets import make_classification

from yellowbrick.features import RFECV

# Create a dataset with only 3 informative features
X, y = make_classification(
    n_samples=1000, n_features=25, n_informative=3, n_redundant=2,
    n_repeated=0, n_classes=8, n_clusters_per_class=1, random_state=0
)

# Create RFECV visualizer with linear SVM classifier
viz = RFECV(SVC(kernel='linear', C=1))
viz.fit(X, y)
viz.poof()

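The figure shows an ideal RFECV curve: the curve jumps to excellent accuracy once the three informative features are captured, then accuracy gradually declines as non-informative features are added to the model. The shaded area represents the variability of cross-validation, one standard deviation above and below the mean accuracy score drawn by the curve.

Below is a real data set, where we can see the effect of RFECV on a credit default binary classifier.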

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold

data = load_data('credit')

target = 'default'
features = [col for col in data.columns if col != target]

X = data[features]
y = data[target]

cv = StratifiedKFold(5)
oz = RFECV(RandomForestClassifier(), cv=cv, scoring='f1_weighted')

oz.fit(X, y)
oz.poof()

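In this example, 19 features were selected, although the model's f1 score does not seem to improve much after about 5 features. The choice of which features to eliminate plays a large role in the outcome of each recursion; modifying the step parameter to eliminate more than one feature at each step may help remove the worst features early, strengthening the remaining features (and can also speed up feature elimination for data sets with many features).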
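The elimination loop described in this section can be sketched in plain Python. This is only an illustration, not scikit-learn's or Yellowbrick's implementation: the helper names fit_linear and rfe are hypothetical, the stand-in model is a least-squares fit by batch gradient descent, the absolute weight plays the role of coef_, and the data is synthetic with two informative features out of four.

```python
import random

def fit_linear(X, y, lr=0.1, epochs=500):
    """Least-squares weights (no intercept) via batch gradient descent."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(epochs):
        grad = [0.0] * d
        for row, target in zip(X, y):
            err = sum(wi * xi for wi, xi in zip(w, row)) - target
            for j in range(d):
                grad[j] += err * row[j]
        w = [wi - lr * g / n for wi, g in zip(w, grad)]
    return w

def rfe(X, y, names, n_keep, step=1):
    """Refit, drop the `step` weakest features, repeat until n_keep remain."""
    active = list(range(len(names)))
    while len(active) > n_keep:
        sub = [[row[j] for j in active] for row in X]
        w = fit_linear(sub, y)
        n_remove = min(step, len(active) - n_keep)
        # Rank the remaining features by absolute weight, weakest first
        ranked = sorted(range(len(active)), key=lambda i: abs(w[i]))
        drop = set(ranked[:n_remove])
        active = [j for i, j in enumerate(active) if i not in drop]
    return [names[j] for j in active]

# Synthetic data: only f0 and f2 influence the target
rng = random.Random(0)
X = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(200)]
y = [3 * row[0] - 2 * row[2] + 0.1 * rng.gauss(0, 1) for row in X]
selected = rfe(X, y, ["f0", "f1", "f2", "f3"], n_keep=2)
```

Because the model is refit after every removal, the ranking of the remaining features can change between rounds. A larger step trades that re-ranking for speed, which is the trade-off the step parameter controls.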

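Residuals Plot

In the context of regression models, a residual is the difference between the observed value of the target variable (y) and the predicted value (ŷ), i.e. the error of the prediction. The residuals plot shows the residuals on the vertical axis against the dependent variable on the horizontal axis, making it possible to detect regions of the target that may be susceptible to more or less error.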

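from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from yellowbrick.regressor import ResidualsPlot

# Split the data into training and test sets
# (assumes X, y hold a regression data set, e.g. the concrete data above)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Instantiate the linear model and visualizer
ridge = Ridge()
visualizer = ResidualsPlot(ridge)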

visualizer.fit(X_train, y_train)  # Fit the training data to the model
visualizer.score(X_test, y_test)  # Evaluate the model on the test data
visualizer.poof()                 # Draw/show/poof the data
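As a small worked example of the definition above, the snippet below computes residuals by hand for a few made-up observed and predicted values. The numbers are purely illustrative, and pairing each residual with its predicted value on the horizontal axis is an assumption about how the plot is laid out.

```python
# Illustrative values only, not taken from any data set in this article
observed = [3.0, 5.0, 7.5, 9.0]
predicted = [2.5, 5.5, 7.0, 9.0]

# Residual = observed value minus predicted value
residuals = [y - y_hat for y, y_hat in zip(observed, predicted)]

# Points of a residuals plot: (horizontal position, residual)
points = list(zip(predicted, residuals))
```

A positive residual means the model under-predicted that instance and a negative one means it over-predicted. For a well-behaved model the points should scatter around zero with no visible pattern.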

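Regularization: Alpha Selection

Regularization is designed to penalize model complexity: the higher the alpha, the less complex the model, which reduces the error due to variance (overfitting). An alpha that is too high, on the other hand, increases the error due to bias (underfitting). It is therefore important to choose an optimal alpha that minimizes the error in both directions.

The AlphaSelection visualizer demonstrates how different values of alpha influence model selection during the regularization of linear models. Generally speaking, alpha increases the effect of regularization: if alpha is zero there is no regularization, and the higher the alpha, the more the regularization parameter influences the final model.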

import numpy as np

from sklearn.linear_model import LassoCV
from yellowbrick.regressor import AlphaSelection

# Create a list of alphas to cross-validate against
alphas = np.logspace(-10, 1, 400)

# Instantiate the linear model and visualizer
model = LassoCV(alphas=alphas)
visualizer = AlphaSelection(model)

visualizer.fit(X, y)
g = visualizer.poof()
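To see why a larger alpha yields a simpler model, consider a one-feature example with a closed-form solution. Note that this sketch uses ridge (L2) regularization rather than the Lasso of the example above, because ridge has a simple formula: minimizing sum((y - w*x)^2) + alpha*w^2 gives w = sum(x*y) / (sum(x^2) + alpha). The function name ridge_slope and the data are hypothetical.

```python
def ridge_slope(xs, ys, alpha):
    """Closed-form ridge weight for a one-feature model without intercept."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + alpha)

# Illustrative data lying exactly on the line y = 2x
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

# alpha = 0 recovers the unregularized slope; larger alphas shrink it toward 0
slopes = {alpha: ridge_slope(xs, ys, alpha) for alpha in (0.0, 30.0, 270.0)}
```

With no regularization the slope is 2, and it shrinks toward zero as alpha grows. A small alpha leaves the model free to follow the data (risking variance), while a very large alpha flattens it (introducing bias).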

Class Prediction Error

The class prediction error plot provides a quick way to understand how good a classifier is at predicting the right classes.

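from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

from yellowbrick.classifier import ClassPredictionError

# Split the data into training and test sets
# (assumes X, y and classes hold a classification data set, e.g. the occupancy data above)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Instantiate the classification model and visualizer
visualizer = ClassPredictionError(
    RandomForestClassifier(), classes=classes
)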

# Fit the training data to the visualizer
visualizer.fit(X_train, y_train)

# Evaluate the model on the test data
visualizer.score(X_test, y_test)

# Draw visualization
g = visualizer.poof()

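Of course, visualizations of classification evaluation metrics are available too, including the confusion matrix, ROC/AUC, precision/recall, and so on.

Discrimination Threshold for Binary Classifiers

This visualizer plots precision, recall, f1 score, and queue rate against the discrimination threshold of a binary classifier. The discrimination threshold is the probability or score at which the positive class is chosen over the negative class. It is generally set to 50%, but the threshold can be adjusted to increase or decrease the sensitivity to false positives or to other application factors.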

from sklearn.linear_model import LogisticRegression
from yellowbrick.classifier import DiscriminationThreshold

# Instantiate the classification model and visualizer
logistic = LogisticRegression()
visualizer = DiscriminationThreshold(logistic)

visualizer.fit(X, y)  # Fit the training data to the visualizer
visualizer.poof()     # Draw/show/poof the data
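The four quantities in this plot are easy to compute by hand for a single threshold. The sketch below is a plain-Python illustration, not Yellowbrick's implementation. The function name threshold_metrics and the scores are hypothetical, and queue rate is taken to mean the fraction of instances flagged as positive.

```python
def threshold_metrics(scores, labels, threshold):
    """Precision, recall, f1 and queue rate at one discrimination threshold."""
    predicted = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for p, l in zip(predicted, labels) if p == 1 and l == 1)
    fp = sum(1 for p, l in zip(predicted, labels) if p == 1 and l == 0)
    fn = sum(1 for p, l in zip(predicted, labels) if p == 0 and l == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    queue_rate = sum(predicted) / len(predicted)  # share of instances flagged positive
    return {"precision": precision, "recall": recall, "f1": f1, "queue_rate": queue_rate}

# Illustrative classifier scores and true labels (1 = positive class)
scores = [0.1, 0.4, 0.35, 0.8, 0.65, 0.9]
labels = [0, 0, 1, 1, 0, 1]

at_half = threshold_metrics(scores, labels, 0.5)
at_low = threshold_metrics(scores, labels, 0.3)
```

Lowering the threshold from 0.5 to 0.3 catches every positive in this toy data (recall rises) at the cost of more false alarms (precision falls) and a larger share of instances to review (queue rate rises). The plot shows this trade-off across all thresholds at once.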

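Clustering: Elbow Method

KElbowVisualizer implements the "elbow" method, which helps data scientists select the optimal number of clusters by fitting the model with a range of values for K. If the line chart resembles an arm, then the "elbow" (the point of inflection on the curve) is a good indication that the underlying model fits best at that point.

In the example below, KElbowVisualizer fits a KMeans model for K values from 4 to 11 on a 12-feature sample data set with 8 random clusters of points. When the model is fit with 8 clusters we can see an "elbow" in the graph, which in this case we know to be the optimal number.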

from sklearn.datasets import make_blobs

# Create synthetic dataset with 8 random clusters
X, y = make_blobs(centers=8, n_features=12, shuffle=True, random_state=42)

from sklearn.cluster import KMeans
from yellowbrick.cluster import KElbowVisualizer

# Instantiate the clustering model and visualizer
model = KMeans()
visualizer = KElbowVisualizer(model, k=(4,12))

visualizer.fit(X)    # Fit the data to the visualizer
visualizer.poof()    # Draw/show/poof the data

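Intercluster Distance Maps

Intercluster distance maps display an embedding of the cluster centers in 2 dimensions with the distance to other centers preserved: the closer two centers are in the visualization, the closer they are in the original feature space. The clusters are sized according to a scoring metric. By default they are sized by membership, i.e. the number of instances that belong to each center, which gives a sense of the relative importance of the clusters. Note, however, that two clusters overlapping in the 2D space does not imply that they overlap in the original feature space.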

from sklearn.datasets import make_blobs

# Make 12 blobs dataset
X, y = make_blobs(centers=12, n_samples=1000, n_features=16, shuffle=True)

from sklearn.cluster import KMeans
from yellowbrick.cluster import InterclusterDistance

# Instantiate the clustering model and visualizer
visualizer = InterclusterDistance(KMeans(9))

visualizer.fit(X) # Fit the training data to the visualizer
visualizer.poof() # Draw/show/poof the data

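Model Selection: Learning Curve

A learning curve shows the relationship between a model's training score and its cross-validated test score for varying numbers of training samples. This visualization is typically used to show two things:

1. Whether the model improves as the amount of data grows

2. Whether the model is more sensitive to error due to bias or to error due to variance

Below is a learning curve visualization generated with yellowbrick. Learning curves can be applied to classification, regression, and clustering alike.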
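The idea behind a learning curve can be sketched without any library. The snippet below is a simplified illustration under stated assumptions: it uses a single hold-out split instead of cross-validation, mean squared error instead of a score, a one-feature least-squares line through the origin as the model, and synthetic data. The helper names fit_slope and mse are hypothetical.

```python
import random

def fit_slope(xs, ys):
    """Least-squares slope of a line through the origin."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def mse(slope, xs, ys):
    """Mean squared error of the fitted line on the given points."""
    return sum((y - slope * x) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Synthetic data: y = 2x plus Gaussian noise
rng = random.Random(0)
xs = [rng.uniform(1, 10) for _ in range(120)]
ys = [2.0 * x + rng.gauss(0, 1) for x in xs]
train_x, train_y = xs[:80], ys[:80]
valid_x, valid_y = xs[80:], ys[80:]

# One curve point per training size: (size, training error, validation error)
curve = []
for n in (5, 20, 80):
    slope = fit_slope(train_x[:n], train_y[:n])
    curve.append((n, mse(slope, train_x[:n], train_y[:n]), mse(slope, valid_x, valid_y)))
```

Plotting the second and third entries of each tuple against the first gives the two lines of a learning curve. A persistent gap between them points to variance, while two lines that converge at a high error point to bias.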

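Model Selection: Validation Curve

Model validation is used to determine how effective a model is on the data it was trained on and how well it generalizes to new input. To measure a model's performance, we first split the data set into training and test sets, fit the model on the training data, and score it on the held-out test data.

To maximize the score, the hyperparameters must be chosen so that the model operates as well as possible in the specified feature space. Most models have multiple hyperparameters, and the best way to choose a combination of them is a grid search. However, it is sometimes useful to plot the influence of a single hyperparameter on the training and test data to determine whether the model is underfitting or overfitting for some hyperparameter values.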

import numpy as np

from sklearn.tree import DecisionTreeRegressor
from yellowbrick.model_selection import ValidationCurve

# Load a regression dataset
data = load_data('energy')

# Specify features of interest and the target
targets = ["heating load", "cooling load"]
features = [col for col in data.columns if col not in targets]

# Extract the instances and target
X = data[features]
y = data[targets[0]]

viz = ValidationCurve(
    DecisionTreeRegressor(), param_name="max_depth",
    param_range=np.arange(1, 11), cv=10, scoring="r2"
)

# Fit and poof the visualizer
viz.fit(X, y)
viz.poof()

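Summary

In my view yellowbrick is an excellent tool: first, it solves the visualization problem in feature engineering and modeling and greatly simplifies the work; second, its various visualizations can help fill blind spots in one's own understanding of modeling.

This article shows only part of the visualization functionality for modeling; for the complete feature set, see: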

https://www.scikit-yb.org/en/latest/index.html
