Ensemble Methods Series: Bagging, with a scikit-learn Example

This is part 1 of the ensemble methods series: bagging.

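First, a brief introduction to scikit-learn, a machine learning library implemented in Python. Its main features are:

Simple and efficient, suitable for data mining and data analysis;

Accessible to everybody and reusable in many contexts;

Built on NumPy, SciPy, and matplotlib, where NumPy and SciPy are scientific computing libraries for Python and matplotlib is a plotting library;

Open source and commercially usable (BSD license).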

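The goal of ensemble methods is to combine the predictions of several base estimators built with a given learning algorithm, so that the ensemble generalizes better or is more robust than any single estimator.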

Ensemble methods fall into two main families:

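Averaging methods train several estimators independently and then average their predictions to obtain the final prediction. The combined estimator is usually better than any single base estimator, because combining several estimators reduces the variance.

This family includes bagging and random forests, among others.

Boosting methods build a sequence of estimators, each step aiming to reduce the bias of the combined estimator. The idea is to combine several weak models into one strong ensemble.

This family includes AdaBoost and GBDT, among others.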
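To make the variance argument for averaging methods concrete, here is a small simulation using only the standard library. It is a sketch under an idealized assumption: each "predictor" is an independent, unbiased, noisy guess of the same target (the helper `averaged_prediction` and all constants are made up for illustration). Real bagged estimators are correlated with each other, so the reduction seen in practice is smaller.

```python
import random
import statistics


def averaged_prediction(rng, n_predictors, truth=1.0, noise_sd=1.0):
    """Average n_predictors independent noisy guesses of the same target."""
    guesses = [rng.gauss(truth, noise_sd) for _ in range(n_predictors)]
    return sum(guesses) / n_predictors


rng = random.Random(0)
n_trials = 20000

# Repeat the experiment many times to estimate the variance of the output.
single = [averaged_prediction(rng, 1) for _ in range(n_trials)]
combined = [averaged_prediction(rng, 10) for _ in range(n_trials)]

var_single = statistics.pvariance(single)
var_combined = statistics.pvariance(combined)
print(f"single: {var_single:.3f}  average of 10: {var_combined:.3f}")
```

Under the independence assumption, the variance of the average of 10 guesses should come out close to one tenth of the variance of a single guess.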

Bagging

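Among ensemble algorithms, bagging methods form a class of algorithms that build several instances of a black-box estimator on random subsets of the original training set and then aggregate their individual predictions into a final prediction. Because each base estimator is built on a random subset, aggregating them reduces the variance of the base estimator (a decision tree, for example). In many cases bagging is a fairly simple way to improve on the performance of a single model. Bagging helps guard against overfitting, so it works best with strong, complex models (for example, fully developed decision trees), whereas boosting usually works best with weak models.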
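The procedure described above (draw random subsets, fit one estimator per subset, aggregate) can be sketched by hand in a few lines. This is only an illustration, not how scikit-learn implements it: the helpers `fit_bagged_trees` and `predict_bagged` and the toy data are made up here, and the sketch assumes bootstrap samples of the same size as the training set with averaging as the aggregation rule.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def fit_bagged_trees(X, y, n_estimators=10, seed=0):
    """Fit one tree per bootstrap sample (rows drawn with replacement)."""
    rng = np.random.default_rng(seed)
    n_samples = X.shape[0]
    trees = []
    for _ in range(n_estimators):
        idx = rng.integers(0, n_samples, size=n_samples)  # bootstrap indices
        trees.append(DecisionTreeRegressor().fit(X[idx], y[idx]))
    return trees


def predict_bagged(trees, X):
    """Aggregate by averaging the predictions of the individual trees."""
    return np.mean([tree.predict(X) for tree in trees], axis=0)


# Toy data (made up for illustration): a noisy sine curve.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(80, 1))
y = np.sin(X).ravel() + rng.normal(0.0, 0.2, size=80)

trees = fit_bagged_trees(X, y)
X_new = np.linspace(-3, 3, 5).reshape(-1, 1)
y_new = predict_bagged(trees, X_new)
print(y_new.shape)  # (5,)
```

For classification, the averaging step would typically be replaced by a vote or by averaging predicted class probabilities.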

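Bagging methods are all based on drawing random subsets of the original training set, and they differ in how those subsets are drawn. Four variants are commonly distinguished:

When the random subsets are random subsets of the samples, drawn without replacement, the method is known as Pasting [B1999].
When the samples are drawn with replacement, the method is known as Bagging [B1996].
When the random subsets are random subsets of the features, the method is known as Random Subspaces [H1998].
When base estimators are built on random subsets of both the samples and the features, the method is known as Random Patches [LG2012].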
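The four variants differ only in which indices each base estimator gets to see. The sketch below draws row and column indices for one base estimator under each scheme, using only the standard library. The helper `draw_subset` is hypothetical; its argument names are borrowed from scikit-learn's bagging estimators for readability. Whether Random Patches draws rows with or without replacement is a choice made here for illustration, not something the text above specifies.

```python
import random


def draw_subset(rng, n_samples, n_features, max_samples=1.0,
                max_features=1.0, bootstrap=True, bootstrap_features=False):
    """Return (row indices, column indices) for one base estimator."""
    def draw(n_total, fraction, with_replacement):
        k = max(1, int(fraction * n_total))
        if with_replacement:
            return [rng.randrange(n_total) for _ in range(k)]
        return rng.sample(range(n_total), k)

    rows = draw(n_samples, max_samples, bootstrap)
    cols = draw(n_features, max_features, bootstrap_features)
    return rows, cols


rng = random.Random(0)
schemes = {
    "pasting": dict(max_samples=0.5, bootstrap=False),
    "bagging": dict(max_samples=1.0, bootstrap=True),
    "random subspaces": dict(max_features=0.5, bootstrap=False),
    "random patches": dict(max_samples=0.5, max_features=0.5, bootstrap=False),
}
for name, kwargs in schemes.items():
    rows, cols = draw_subset(rng, n_samples=10, n_features=4, **kwargs)
    print(f"{name:17s} rows={sorted(rows)} cols={sorted(cols)}")
```

Only the "bagging" row can contain repeated row indices, and only the last two schemes drop columns.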

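In scikit-learn, bagging is provided by BaggingClassifier and BaggingRegressor, which take a user-specified base estimator together with parameters that define how the random subsets are drawn. max_samples controls the size of each sample subset and max_features controls the number of features in each feature subset; bootstrap controls whether samples are drawn with replacement, and bootstrap_features controls whether features are drawn with replacement. The generalization error can be estimated from the out-of-bag samples by setting oob_score=True.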
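As a rough illustration of how these parameters fit together, the sketch below sets each of them explicitly and reads back the out-of-bag estimate. The synthetic dataset, the choice of decision trees as base estimator, and the specific values (50 estimators, 80% of the rows) are assumptions made for the example, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, made up for illustration only.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

bagging = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=50,
    max_samples=0.8,           # each tree is fit on 80% as many rows as X has
    max_features=1.0,          # ... and on all of the features
    bootstrap=True,            # rows are drawn with replacement
    bootstrap_features=False,  # features are drawn without replacement
    oob_score=True,            # score each row with trees that never saw it
    random_state=0,
)
bagging.fit(X, y)
print(f"out-of-bag accuracy estimate: {bagging.oob_score_:.3f}")
```

The out-of-bag estimate is only available when bootstrap=True, since it relies on some rows being left out of each bootstrap sample.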

The snippet below shows how to instantiate a bagging ensemble of KNeighborsClassifier base estimators, each built on a random 50% of the samples and 50% of the features.

>>> from sklearn.ensemble import BaggingClassifier

>>> from sklearn.neighbors import KNeighborsClassifier

>>> bagging = BaggingClassifier(KNeighborsClassifier(),
...                             max_samples=0.5, max_features=0.5)
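The snippet above only constructs the ensemble. As a sketch of how it might then be used, the code below fits it on the iris dataset and compares it with a single KNeighborsClassifier using 5-fold cross-validation. The dataset, the cross-validation setup, and random_state=0 are choices made here for illustration. KNN is already a fairly stable estimator, so bagging is not guaranteed to help on this data.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

single = KNeighborsClassifier()
bagging = BaggingClassifier(KNeighborsClassifier(),
                            max_samples=0.5, max_features=0.5,
                            random_state=0)

# 5-fold cross-validated accuracy for each model.
for name, model in [("KNN", single), ("Bagging(KNN)", bagging)]:
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean accuracy {scores.mean():.3f}")
```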

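A complete example follows. It compares the bias-variance decomposition of a single decision tree with that of a bagged ensemble of trees: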

print(__doc__)

# Author: Gilles Louppe <[email protected]>

# License: BSD 3 clause

import numpy as np    # NumPy: scientific computing library for Python, similar to MATLAB

import matplotlib.pyplot as plt    # plotting library

from sklearn.ensemble import BaggingRegressor

from sklearn.tree import DecisionTreeRegressor

# Settings

n_repeat = 50       # Number of iterations for computing expectations

n_train = 50        # Size of the training set

n_test = 1000       # Size of the test set

noise = 0.1         # Standard deviation of the noise

np.random.seed(0)

# Change this for exploring the bias-variance decomposition of other

# estimators. This should work well for estimators with high variance (e.g.,

# decision trees or KNN), but poorly for estimators with low variance (e.g.,

# linear models).

estimators = [("Tree", DecisionTreeRegressor()),
             ("Bagging(Tree)", BaggingRegressor(DecisionTreeRegressor()))]

n_estimators = len(estimators)

# Generate data

def f(x):
    x = x.ravel()

    return np.exp(-x ** 2) + 1.5 * np.exp(-(x - 2) ** 2)

def generate(n_samples, noise, n_repeat=1):
    X = np.random.rand(n_samples) * 10 - 5
    X = np.sort(X)

    if n_repeat == 1:
        y = f(X) + np.random.normal(0.0, noise, n_samples)
    else:
        y = np.zeros((n_samples, n_repeat))

        for i in range(n_repeat):
            # column i of y

            y[:, i] = f(X) + np.random.normal(0.0, noise, n_samples)



    X = X.reshape((n_samples, 1))

    return X, y

X_train = []

y_train = []

for i in range(n_repeat):
    X, y = generate(n_samples=n_train, noise=noise)
    X_train.append(X)
    y_train.append(y)

X_test, y_test = generate(n_samples=n_test, noise=noise, 
                          n_repeat=n_repeat)

# Loop over estimators to compare

for n, (name, estimator) in enumerate(estimators):
    # Compute predictions
    y_predict = np.zeros((n_test, n_repeat))

    for i in range(n_repeat):
        estimator.fit(X_train[i], y_train[i])
        y_predict[:, i] = estimator.predict(X_test)

    # Bias^2 + Variance + Noise decomposition of the mean squared error
    y_error = np.zeros(n_test)

    for i in range(n_repeat):
        for j in range(n_repeat):
            y_error += (y_test[:, j] - y_predict[:, i]) ** 2

    y_error /= (n_repeat * n_repeat)

    y_noise = np.var(y_test, axis=1)
    y_bias = (f(X_test) - np.mean(y_predict, axis=1)) ** 2
    y_var = np.var(y_predict, axis=1)

    print("{0}: {1:.4f} (error) = {2:.4f} (bias^2) "
          " + {3:.4f} (var) + {4:.4f} (noise)".format(name,
                                                      np.mean(y_error),
                                                      np.mean(y_bias),
                                                      np.mean(y_var),
                                                      np.mean(y_noise)))

    # Plot figures
    plt.subplot(2, n_estimators, n + 1)
    plt.plot(X_test, f(X_test), "b", label="$f(x)$")
    plt.plot(X_train[0], y_train[0], ".b", label="LS ~ $y = f(x)+noise$")

    for i in range(n_repeat):
        if i == 0:
            plt.plot(X_test, y_predict[:, i], "r", label=r"$\^y(x)$")
        else:
            plt.plot(X_test, y_predict[:, i], "r", alpha=0.05)

    plt.plot(X_test, np.mean(y_predict, axis=1), "c",
             label=r"$\mathbb{E}_{LS} \^y(x)$")

    plt.xlim([-5, 5])
    plt.title(name)

    if n == 0:
        plt.legend(loc="upper left", prop={"size": 11})

    plt.subplot(2, n_estimators, n_estimators + n + 1)
    plt.plot(X_test, y_error, "r", label="$error(x)$")
    plt.plot(X_test, y_bias, "b", label="$bias^2(x)$"),
    plt.plot(X_test, y_var, "g", label="$variance(x)$"),
    plt.plot(X_test, y_noise, "c", label="$noise(x)$")

    plt.xlim([-5, 5])
    plt.ylim([0, 0.1])

    if n == 0:
        plt.legend(loc="upper left", prop={"size": 11})

plt.show()

The output is:

Tree: 0.0255 (error) = 0.0003 (bias^2)  + 0.0152 (var) + 0.0098 (noise)
Bagging(Tree): 0.0196 (error) = 0.0004 (bias^2)  + 0.0092 (var) + 0.0098 (noise)
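The printed terms correspond to the usual bias-variance decomposition of the expected squared error at a point x. As a sketch, write y = f(x) + ε with zero-mean noise ε that is independent of the learning set LS, and let ŷ_LS be the estimator trained on LS (this notation is chosen here to match the plot labels in the script):

```latex
\mathbb{E}_{LS,\varepsilon}\!\left[\bigl(y - \hat{y}_{LS}(x)\bigr)^2\right]
  = \underbrace{\bigl(f(x) - \mathbb{E}_{LS}[\hat{y}_{LS}(x)]\bigr)^2}_{\text{bias}^2(x)}
  + \underbrace{\operatorname{Var}_{LS}\!\bigl[\hat{y}_{LS}(x)\bigr]}_{\text{variance}(x)}
  + \underbrace{\operatorname{Var}[\varepsilon]}_{\text{noise}(x)}
```

Plugging in the reported averages gives 0.0003 + 0.0152 + 0.0098 = 0.0253 for the single tree (reported error 0.0255) and 0.0004 + 0.0092 + 0.0098 = 0.0194 for the bagged trees (reported error 0.0196). The small gaps are plausible because every term is a finite-sample estimate printed to four decimals. The noise term is close to the configured 0.1 ** 2 = 0.01, the bias is almost unchanged, and nearly all of the improvement comes from the variance dropping from 0.0152 to 0.0092.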

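The script also displays a figure (not reproduced here) with one column per estimator. The top row plots f(x), one training set, the individual predictions, and their mean; the bottom row plots the pointwise error, bias^2, variance, and noise.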

References

http://scikit-learn.org/stable/modules/ensemble.html#bagging-meta-estimator

http://scikit-learn.org/stable/auto_examples/ensemble/plot_bias_variance.html#example-ensemble-plot-bias-variance-py

[B1999] L. Breiman, "Pasting small votes for classification in large databases and on-line", Machine Learning, 36(1), 85-103, 1999.

[B1996] L. Breiman, "Bagging predictors", Machine Learning, 24(2), 123-140, 1996.

[H1998] T. Ho, "The random subspace method for constructing decision forests", Pattern Analysis and Machine Intelligence, 20(8), 832-844, 1998.

[LG2012] G. Louppe and P. Geurts, "Ensembles on Random Patches", Machine Learning and Knowledge Discovery in Databases, 346-361, 2012.