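💡 Author: 韓信子@ShowMeAI

📘 Machine Learning in Practice series: https://www.showmeai.tech/tutorials/41

📘 Article URL: https://www.showmeai.tech/article-detail/287

📢 Notice: All rights reserved. To reprint, please contact the platform and the author and cite the source.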

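Introduction to machine learning and pipelines

A machine learning application involves many steps. As the figure "Standard machine learning application workflow" shows, these include data preprocessing, feature engineering, model training, iterative model tuning, and deployment for prediction.

For simple analysis and modeling, each of these blocks can be built and applied on its own. In enterprise applications, however, we want the stages of a machine learning project chained in order into a workflow (a pipeline). This makes the individual steps easier to understand and reproduce, and it helps prevent problems such as data leakage.

Common machine learning toolkits such as Scikit-Learn support pipelines as an advanced feature; a pipeline can contain transformers, models, and other modules.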
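
As a minimal sketch of the idea, the snippet below chains a scaler and a model into one Scikit-Learn pipeline and cross-validates it. The synthetic data and the step names `scale` and `model` are assumptions made for illustration; they are not part of the churn example that follows.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic data standing in for a real dataset (an assumption for illustration).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Each step is a (name, estimator) tuple; the last step is the model.
pipe = Pipeline([
    ("scale", MinMaxScaler()),
    ("model", LogisticRegression(max_iter=200)),
])

# Inside cross-validation the scaler is refit on the training part of each fold,
# so statistics from the held-out part never leak into preprocessing.
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())
```

Because the whole pipeline is passed to `cross_val_score`, the preprocessing is repeated per fold rather than once on the full dataset, which is how a pipeline helps avoid leakage.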

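For more on using Scikit-Learn, see the article 📘 Complete Guide to SKLearn in the ShowMeAI 📘 Machine Learning in Practice tutorials, or go to the Scikit-Learn cheat sheet for a dense list of key points.

However, with SKLearn's basic usage, adding an external library to a pipeline, such as imblearn for handling imbalanced samples, can lead to incompatibilities, for example the following error: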

TypeError: All intermediate steps should be transformers and implement fit and transform or be the string 'passthrough' 'SMOTE()' (type <class 'imblearn.over_sampling._smote.base.SMOTE'>) doesn't
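
The cause can be illustrated without imblearn installed. The sketch below uses a hypothetical `ToySampler` class (not a real library class) that, like a sampler, offers only `fit_resample` and no `transform`, and places it in a plain Scikit-Learn `Pipeline`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


class ToySampler:
    """Hypothetical stand-in for a sampler such as SMOTE: it only offers fit_resample."""

    def fit_resample(self, X, y):
        return X, y


X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])

caught = None
try:
    # Depending on the scikit-learn version, the steps are validated
    # when the pipeline is constructed or when it is fitted.
    pipe = Pipeline([("sampler", ToySampler()), ("model", LogisticRegression())])
    pipe.fit(X, y)
except TypeError as err:
    caught = err

print(type(caught).__name__)
```

Scikit-Learn requires every intermediate step to implement `transform`, which a sampler does not, so a pipeline class that understands `fit_resample` is needed instead.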

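Using customer churn as an example, this article explains how to build an SKLearn pipeline. Specifically, it covers:

Building a pipeline that combines tools from Scikit-Learn, imblearn, and feature-engine
Extracting feature names after encoding steps (such as one-hot encoding)
Plotting feature importance

The final solution is shown in the figure below: multiple modules from different packages combined in a single pipeline.

The workflow below covers the stages mentioned above:

Step ①: Data preprocessing: data cleaning
Step ②: Feature engineering: handling numerical and categorical features
Step ③: Sample handling: correcting class imbalance
Step ④: Logistic regression, xgboost, random forest, and a voting ensemble
Step ⑤: Hyperparameter tuning and feature importance analysis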

💡 Step 0: Preparing and loading the data

We start by importing the libraries we need.

# Data handling and plotting
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Sklearn utilities
from sklearn.model_selection import train_test_split, RandomizedSearchCV, RepeatedStratifiedKFold, cross_validate

# Pipeline-related tools
from sklearn import set_config
from sklearn.pipeline import make_pipeline, Pipeline
from imblearn.pipeline import Pipeline as imbPipeline
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler

# Handling of constant, missing, and duplicated columns
from feature_engine.selection import DropFeatures, DropConstantFeatures, DropDuplicateFeatures

# Class-imbalance handling and resampling
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

# Models
from xgboost import XGBClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.inspection import permutation_importance
from scipy.stats import loguniform

# Pipeline visualization
set_config(display="diagram")

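If you have not come across the imblearn and feature-engine packages before, here is a brief introduction:

📘Imblearn handles classification problems with imbalanced classes and ships with a variety of sampling strategies

📘feature-engine handles feature columns (constant columns, missing columns, duplicated columns, and so on)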

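Dataset: newspaper subscriber churn

The dataset used here comes from the Kaggle competition Newspaper churn. It contains 15,856 records of individuals who subscribe, or once subscribed, to the newspaper.

🏆 Dataset download (Baidu Netdisk): reply 『實戰』 to the official account 『ShowMeAI研究中心』 to get the 『Newspaper churn dataset』 for this article, [14] Machine learning modeling pipeline

⭐ ShowMeAI official GitHub: https://github.com/ShowMeAI-Hub

The dataset contains demographic information, such as HH income (household income), home ownership, presence of children, ethnicity, years of residence, age range, and language, as well as geographic information such as address, state, city, county, and zip code. It also records the subscription term each user chose and the associated charges, along with the source channel through which the user was acquired. Finally, one field indicates whether the customer is still a subscriber (that is, whether they have churned).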

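Data preprocessing and splitting

We first load the data and preprocess it (for example, lowercasing all column names and converting the target variable to boolean).

# Load the data
data = pd.read_excel("NewspaperChurn new version.xlsx")

# Preprocessing: normalize column names and build the boolean target
data.columns = [k.lower().replace(" ", "_") for k in data.columns]
data = data.rename(columns={'subscriber': 'churn'})
# Assign the result back: replace(..., inplace=True) on a selected column may not update `data`
data['churn'] = data['churn'].map({'NO': False, 'YES': True})

# Type conversion: object columns become pandas categoricals
data[data.select_dtypes(['object']).columns] = data.select_dtypes(['object']).apply(lambda x: x.astype('category'))

# Separate the feature columns from the label column
X = data.drop("churn", axis=1)
y = data["churn"]

# Split into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)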
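
To make the preprocessing concrete, here is a small sketch on a toy frame. The column names and values are invented for illustration, and the `stratify` and `random_state` arguments are an optional variation, not what the code above uses.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame; the column names and values are illustrative assumptions.
toy = pd.DataFrame({
    "SubscriptionID": range(10),
    "HH Income": ["low", "high"] * 5,
    "Subscriber": ["NO"] * 8 + ["YES"] * 2,
})

# Same normalization as above: lowercase names, spaces to underscores, boolean target.
toy.columns = [c.lower().replace(" ", "_") for c in toy.columns]
toy = toy.rename(columns={"subscriber": "churn"})
toy["churn"] = toy["churn"].map({"NO": False, "YES": True})

X_toy = toy.drop(columns="churn")
y_toy = toy["churn"]

# stratify keeps the class ratio the same in both parts;
# random_state makes the split repeatable.
X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.5, stratify=y_toy, random_state=0
)
print(list(toy.columns), int(y_tr.sum()), int(y_va.sum()))
```

With an imbalanced target, a stratified split can be worth considering so that the validation set does not end up with too few positive examples by chance.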

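After preprocessing, the data should look like this:

💡 Step 1: Data cleaning

The first step of our pipeline is data cleaning: dropping columns that do not help prediction (such as id-like fields, constant-valued fields, or duplicated fields).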

# Step 1: data cleaning and column handling
ppl = Pipeline([
    ('drop_columns', DropFeatures(['subscriptionid'])),
    ('drop_constant_values', DropConstantFeatures(tol=1, missing_values='ignore')),
    ('drop_duplicates', DropDuplicateFeatures())
])

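The code above creates a pipeline object with 3 steps: drop_columns, drop_constant_values, and drop_duplicates.

Each step is a tuple: the first element defines the step's name (such as drop_columns) and the second defines the transformer (such as DropFeatures()).

These simple steps could just as easily be done with an external tool such as pandas. When assembling a pipeline, however, the idea is to integrate as much of the processing as possible into the pipeline itself.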
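
For comparison, a rough pandas-only equivalent of the three cleaning steps might look like the sketch below. The toy columns are assumptions for illustration, and this is not how feature-engine implements its transformers.

```python
import pandas as pd

# Toy data; column names are illustrative assumptions.
df = pd.DataFrame({
    "subscriptionid": [101, 102, 103, 104],
    "state": ["CA", "CA", "CA", "CA"],   # constant column
    "weekly_fee": [1.0, 2.0, 3.0, 4.0],
    "fee_copy": [1.0, 2.0, 3.0, 4.0],    # duplicate of weekly_fee
})

# 1) Drop the identifier column.
df = df.drop(columns=["subscriptionid"])

# 2) Drop columns that hold a single distinct value.
constant_cols = [c for c in df.columns if df[c].nunique(dropna=False) <= 1]
df = df.drop(columns=constant_cols)

# 3) Drop columns whose values duplicate an earlier column.
df = df.loc[:, ~df.T.duplicated()]

print(list(df.columns))
```

The difference is that this code decides what to drop from whatever frame it is given, whereas pipeline transformers learn the columns to drop during `fit` on the training data and reuse that decision at prediction time.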

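💡 Step 2: Feature engineering and data transformation

Having removed the irrelevant columns, we next handle missing values and do feature engineering. The dataset contains columns of different types (numerical and categorical), so we define two separate workflows, one for each type.

For more on feature engineering, see the article 📘 Complete Guide to Machine Learning Feature Engineering in the ShowMeAI 📘 Machine Learning in Practice tutorials.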

# Data processing and feature engineering pipeline

ppl = Pipeline([
    # ① Drop irrelevant columns
    ('drop_columns', DropFeatures(['subscriptionid'])),
    ('drop_constant_values', DropConstantFeatures(tol=1, missing_values='ignore')),
    ('drop_duplicates', DropDuplicateFeatures()),

    # ② Missing-value imputation and numerical/categorical feature handling
    ('cleaning', ColumnTransformer([
        # 2.1: numerical columns: impute missing values, then scale
        ('num',make_pipeline(
            SimpleImputer(strategy='mean'),
            MinMaxScaler()),
         make_column_selector(dtype_include='int64')
        ),
        # 2.2: categorical columns: impute missing values, then one-hot encode
        # (sparse_output requires scikit-learn >= 1.2; older versions use sparse=False)
        ('cat',make_pipeline(
            SimpleImputer(strategy='most_frequent'),
            OneHotEncoder(sparse_output=False, handle_unknown='ignore')),
         make_column_selector(dtype_include='category')
        )])
    )
])

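This adds a step named cleaning, which holds a ColumnTransformer object.

Inside the ColumnTransformer we define two new pipelines: one for numerical columns and one for categorical columns. The make_column_selector function ensures that each pipeline receives columns of the right type.

Here the dtype_include parameter selects columns by type. make_column_selector can also select columns by regular expression through its pattern parameter, and a ColumnTransformer accepts an explicit list of column names in place of a selector.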

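💡 Step 3: Handling class imbalance (data sampling)

In problems such as customer churn and fraud detection, a major challenge is class imbalance: churned users are relatively few compared with non-churned users.

Here we use a library called imblearn to address class imbalance. It provides a range of data generation and sampling methods that mitigate the problem. In this article we use SMOTE to oversample the minority class.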
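
The core idea of SMOTE is to create synthetic minority samples by interpolating between a minority sample and one of its nearest minority neighbours. The NumPy sketch below is a simplified illustration of that idea only; the function `smote_like` is hypothetical and is not imblearn's implementation.

```python
import numpy as np


def smote_like(X_min, n_new, k=2, seed=0):
    """Simplified SMOTE-style oversampling: interpolate between a minority
    sample and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    # Pairwise distances between minority samples.
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)          # a point is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]

    base = rng.integers(0, len(X_min), size=n_new)           # sample to start from
    pick = neighbours[base, rng.integers(0, k, size=n_new)]  # neighbour to move toward
    gap = rng.random((n_new, 1))                             # position along the segment
    return X_min[base] + gap * (X_min[pick] - X_min[base])


X_minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
X_new = smote_like(X_minority, n_new=6)
print(X_new.shape)
```

Every synthetic point lies on a line segment between two existing minority points, so the new rows stay inside the region the minority class already occupies.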

SMOTE for class imbalance

With the SMOTE step added, the pipeline looks like this:

# Full processing pipeline

ppl = Pipeline([
    # ① Drop irrelevant columns
    ('drop_columns', DropFeatures(['subscriptionid'])),
    ('drop_constant_values', DropConstantFeatures(tol=1, missing_values='ignore')),
    ('drop_duplicates', DropDuplicateFeatures()),

    # ② Missing-value imputation and numerical/categorical feature handling
    ('cleaning', ColumnTransformer([
        # 2.1: numerical columns: impute missing values, then scale
        ('num',make_pipeline(
            SimpleImputer(strategy='mean'),
            MinMaxScaler()),
         make_column_selector(dtype_include='int64')
        ),
        # 2.2: categorical columns: impute missing values, then one-hot encode
        # (sparse_output requires scikit-learn >= 1.2; older versions use sparse=False)
        ('cat',make_pipeline(
            SimpleImputer(strategy='most_frequent'),
            OneHotEncoder(sparse_output=False, handle_unknown='ignore')),
         make_column_selector(dtype_include='category')
        )])
    ),
    # ③ Class-imbalance handling: resampling
    ('smote', SMOTE())
])

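Inspecting the pipeline's features

Before building the final ensemble classifier, let's look at the feature names and other information produced by the pipeline.

A pipeline object provides a function named get_feature_names_out() that returns the feature names. Before we can use it, the pipeline must be fitted on the dataset. Step ③, SMOTE, only adds synthetic rows and does not change the feature columns, so we leave it out for now and focus on steps ① and ②.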

# Fit on the data and get the names of the features the pipeline builds
ppl_fts = ppl[0:4]
ppl_fts.fit(X_train, y_train)
features = ppl_fts.get_feature_names_out()
pd.Series(features)

The result looks like this:

0                    num__year_of_residence
1                             num__zip_code
2                       num__reward_program
3        cat__hh_income_$  20,000 - $29,999
4        cat__hh_income_$  30,000 - $39,999
                        ...                
12122               cat__source_channel_TMC
12123            cat__source_channel_TeleIn
12124           cat__source_channel_TeleOut
12125               cat__source_channel_VRU
12126          cat__source_channel_iSrvices
Length: 12127, dtype: object

Because of one-hot encoding, many features whose names start with cat__ (for category) have been created.

To get a pipeline visualization like the flowchart above, only a small change is needed: add set_config(display="diagram") to your code before displaying the pipeline object.

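💡 Step 4: Building the ensemble classifier

Next we train several models and combine them with a powerful ensemble model (a voting classifier) to solve the problem at hand.

For detailed explanations of the principles behind the logistic regression, random forest, and xgboost models used here, see ShowMeAI's 📘 Illustrated Machine Learning Algorithms tutorials.

# Logistic regression model
lr = LogisticRegression(warm_start=True, max_iter=400)
# Random forest model
rf = RandomForestClassifier()
# xgboost (verbosity=0 silences logging; the older `silent` flag is deprecated)
xgb = XGBClassifier(tree_method="hist", verbosity=0)
# Combine the models with a soft-voting classifier
lr_xgb_rf = VotingClassifier(estimators=[('lr', lr), ('xgb', xgb), ('rf', rf)], 
                             voting='soft')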
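
With `voting='soft'`, the ensemble averages the class probabilities predicted by its members and picks the class with the highest average. The sketch below checks this on synthetic data with two Scikit-Learn models only (xgboost is left out to keep the example dependency-free); the data and model settings are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression

# Synthetic data (an assumption for illustration).
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

vote = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=400)),
        ("rf", RandomForestClassifier(n_estimators=20, random_state=0)),
    ],
    voting="soft",
)
vote.fit(X, y)

# Soft voting: average the members' class probabilities, then take the argmax.
member_probas = [est.predict_proba(X) for est in vote.estimators_]
manual_average = np.mean(member_probas, axis=0)
print(np.allclose(manual_average, vote.predict_proba(X)))
```

Soft voting requires every member to implement `predict_proba`, which all three models used in this article do.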

With the ensemble model defined, we add it to our pipeline as well.

# Full processing pipeline

ppl = imbPipeline([
    # ① Drop irrelevant columns
    ('drop_columns', DropFeatures(['subscriptionid'])),
    ('drop_constant_values', DropConstantFeatures(tol=1, missing_values='ignore')),
    ('drop_duplicates', DropDuplicateFeatures()),

    # ② Missing-value imputation and numerical/categorical feature handling
    ('cleaning', ColumnTransformer([
        # 2.1: numerical columns: impute missing values, then scale
        ('num',make_pipeline(
            SimpleImputer(strategy='mean'),
            MinMaxScaler()),
         make_column_selector(dtype_include='int64')
        ),
        # 2.2: categorical columns: impute missing values, then one-hot encode
        # (sparse_output requires scikit-learn >= 1.2; older versions use sparse=False)
        ('cat',make_pipeline(
            SimpleImputer(strategy='most_frequent'),
            OneHotEncoder(sparse_output=False, handle_unknown='ignore')),
         make_column_selector(dtype_include='category')
        )])
    ),
    # ③ Class-imbalance handling: resampling
    ('smote', SMOTE()),
    # ④ Voting ensemble
    ('ensemble', lr_xgb_rf)
])

You may have noticed that the Pipeline used in the first line has been replaced by imblearn's imbPipeline. This is a key change: with SKLearn's Pipeline, fitting raises the error mentioned at the start of the article:

TypeError: All intermediate steps should be transformers and implement fit and transform or be the string 'passthrough' 'SMOTE()' (type <class 'imblearn.over_sampling._smote.base.SMOTE'>) doesn't

At this point the basic pipeline is in place.

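💡 Step 5: Hyperparameter tuning and feature importance

Hyperparameter tuning

Many components of the modeling pipeline we built have hyperparameters that can be tuned, and these affect the final model's performance. To tune hyperparameters across the pipeline we use randomized search (RandomizedSearchCV), as in the code below.

For the principles behind search-based tuning, see ShowMeAI's article 📘 Network Optimization: Hyperparameter Tuning, Regularization, Batch Normalization, and Programming Frameworks.

Pay particular attention to the naming rules in the code.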

# Hyperparameter search space
params = {
    'ensemble__lr__solver': ['newton-cg', 'lbfgs', 'liblinear'],
    # Not every solver supports every penalty; incompatible draws fail to fit
    # and are scored as NaN by RandomizedSearchCV (error_score=np.nan by default).
    # scikit-learn >= 1.2 expects None rather than the string 'none'.
    'ensemble__lr__penalty': [None, 'l1', 'l2', 'elasticnet'],
    'ensemble__lr__C': loguniform(1e-5, 100),
    'ensemble__xgb__learning_rate': [0.1],
    'ensemble__xgb__max_depth': [7, 10, 15, 20],
    'ensemble__xgb__min_child_weight': [10, 15, 20, 25],
    'ensemble__xgb__colsample_bytree': [0.8, 0.9, 1],
    'ensemble__xgb__n_estimators': [300, 400, 500, 600],
    'ensemble__xgb__reg_alpha': [0.5, 0.2, 1],
    'ensemble__xgb__reg_lambda': [2, 3, 5],
    'ensemble__xgb__gamma': [1, 2, 3],
    'ensemble__rf__max_depth': [7, 10, 15, 20],
    'ensemble__rf__min_samples_leaf': [1, 2, 4],
    'ensemble__rf__min_samples_split': [2, 5, 10],
    'ensemble__rf__n_estimators': [300, 400, 500, 600],
}

# Randomized search
rsf = RepeatedStratifiedKFold(random_state=42)
clf = RandomizedSearchCV(ppl, params, scoring='roc_auc', verbose=2, cv=rsf)
clf.fit(X_train, y_train)

# Report the results
print("Best Score: ", clf.best_score_)
print("Best Params: ", clf.best_params_)
# AUC is computed from the predicted probability of the positive class, not from hard labels
print("AUC:", roc_auc_score(y_val, clf.predict_proba(X_val)[:, 1]))
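
ROC AUC measures how well a model ranks positives above negatives, so it should be computed from scores or probabilities rather than from thresholded labels. The hand-made numbers below (hypothetical probabilities, not results from the churn model) show how much the two can differ.

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 0, 1, 1, 1]
# Hypothetical predicted probabilities: every positive ranks above every negative,
# yet none crosses the 0.5 decision threshold.
proba = [0.10, 0.20, 0.30, 0.35, 0.40, 0.45]
hard = [int(p >= 0.5) for p in proba]

auc_from_proba = roc_auc_score(y_true, proba)
auc_from_labels = roc_auc_score(y_true, hard)
print(auc_from_proba, auc_from_labels)
```

The ranking is perfect, so the probability-based AUC is 1.0, while the thresholded labels are all zero and carry no ranking information, giving 0.5.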

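The hyperparameter names in the code above break down as follows:

First part ( ensemble__ ): the name of the pipeline step that holds our VotingClassifier
Second part ( lr__ ): the name of the model inside the ensemble
Third part ( solver ): the name of the model's hyperparameter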
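
The naming rule can be explored directly with `get_params()` and `set_params()`. The sketch below builds a small pipeline with Scikit-Learn models only; the step names mirror the article's, but the pipeline itself is a simplified assumption.

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

pipe = Pipeline([
    ("scale", MinMaxScaler()),
    ("ensemble", VotingClassifier(
        estimators=[("lr", LogisticRegression()), ("rf", RandomForestClassifier())],
        voting="soft",
    )),
])

# get_params() lists every tunable name; nested levels are joined with "__".
lr_params = sorted(k for k in pipe.get_params() if k.startswith("ensemble__lr__"))
print(lr_params[:3])

# The same names work with set_params, which is how a search applies each candidate.
pipe.set_params(ensemble__lr__C=0.5, ensemble__rf__max_depth=7)
print(pipe.get_params()["ensemble__lr__C"])
```

Listing the keys of `get_params()` is a quick way to check a name before putting it into a search space, since a misspelled key makes the search fail.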

Because this is a class-imbalance scenario, we use repeated stratified k-fold (RepeatedStratifiedKFold).
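
Stratification keeps the class ratio of the full dataset in every fold, and repetition reruns the split with different shuffles. The sketch below uses synthetic labels with 10% positives (an assumption for illustration) and explicit `n_splits` and `n_repeats` values.

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold

# 100 samples, 10% positives (synthetic labels for illustration).
y = np.array([1] * 10 + [0] * 90)
X = np.zeros((100, 1))

cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=42)

# Each test fold keeps the 10% positive rate: 2 positives out of 20 samples.
positives_per_fold = [int(y[test_idx].sum()) for _, test_idx in cv.split(X, y)]
print(cv.get_n_splits(), positives_per_fold)
```

With the defaults used in the article's call (5 splits, 10 repeats), every sampled hyperparameter combination is fitted 50 times, so reducing `n_repeats` is one way to shorten the search.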

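Hyperparameter tuning is not strictly necessary. In simple scenarios you can use the default parameters, or fix the hyperparameters when defining the models.

Feature importance plot

To keep our model from being a black box, we want to explain it to some degree. The most important part is attribution analysis: understanding which features matter. Here we plot the feature importances.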

# https://inria.github.io/scikit-learn-mooc/python_scripts/dev_features_importance.html
# Plot feature importances
def plot_feature_importances(perm_importance_result, feat_name):
    """Horizontal bar plot of permutation importances, sorted ascending."""
    fig, ax = plt.subplots()

    indices = perm_importance_result['importances_mean'].argsort()
    ax.barh(range(len(indices)),
            perm_importance_result['importances_mean'][indices],
            xerr=perm_importance_result['importances_std'][indices])
    ax.set_yticks(range(len(indices)))
    ax.set_title("Permutation importance")

    tmp = np.array(feat_name)
    _ = ax.set_yticklabels(tmp[indices])


# Permutation importance shuffles the raw input columns of X_train one at a time,
# so the bars are labeled with the input column names, not the encoded feature names
perm_importance_result_train = permutation_importance(clf, X_train, y_train, random_state=42)
plot_feature_importances(perm_importance_result_train, X_train.columns)

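Running the code above produces the plot below. We can see that the feature hh_income dominates the prediction. Because this feature is actually ordinal (for example, 30-40k is less than 150-175k), we could try a different encoding for it, such as label encoding, which maps the categories to ordered integers.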
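
When permutation importance is computed on a whole pipeline, each raw input column is shuffled in turn, so there is one importance value per input column regardless of how many encoded features the pipeline creates internally. The sketch below demonstrates this on synthetic data; the column names `income_band` and `noise` are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic data: the label depends only on `income_band`; `noise` is irrelevant.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "income_band": rng.choice(["low", "mid", "high"], size=300),
    "noise": rng.normal(size=300),
})
y = (X["income_band"] == "high").astype(int)

pipe = Pipeline([
    ("prep", ColumnTransformer(
        [("cat", OneHotEncoder(handle_unknown="ignore"), ["income_band"])],
        remainder="passthrough",
    )),
    ("model", LogisticRegression(max_iter=400)),
])
pipe.fit(X, y)

# One importance value per raw input column, even though the model sees 4 features.
result = permutation_importance(pipe, X, y, n_repeats=5, random_state=42)
for name, mean in zip(X.columns, result.importances_mean):
    print(f"{name}: {mean:.3f}")
```

This is why a single entry such as hh_income can appear in the plot even though one-hot encoding expands it into many columns inside the pipeline.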

References

📘 Machine Learning in Practice tutorials: https://www.showmeai.tech/tutorials/41
📘 Complete Guide to SKLearn: https://www.showmeai.tech/article-detail/203
📘 Imblearn, for classification with imbalanced classes: https://imbalanced-learn.org/stable/
📘 feature-engine, for handling feature columns (constant, missing, duplicated columns, and so on): https://feature-engine.readthedocs.io/en/latest/
📘 Complete Guide to Machine Learning Feature Engineering: https://www.showmeai.tech/article-detail/208
📘 Illustrated Machine Learning Algorithms tutorials: http://showmeai.tech/tutorials/34
📘 Network Optimization: Hyperparameter Tuning, Regularization, Batch Normalization, and Programming Frameworks: https://www.showmeai.tech/article-detail/218
📘 Scikit-Learn cheat sheet: https://www.showmeai.tech/article-detail/108
