【Python資料科學手冊】專題：線性回歸

線性回歸模型是解決回歸任務的好起點。

你可能對線性回歸模型最簡單的形式（即對資料拟合一條直線）已經很熟悉了，不過經過擴充，這些模型可以對更複雜的資料行為進行模組化。

首先導入常用的程式庫：

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
import numpy as np

1、簡單線性回歸

首先來介紹最廣為人知的線性回歸模型——将資料拟合成一條直線。直線拟合的模型方程為y=ax+by=ax+b，其中aa是直線斜率，bb是直線截距。

下面的資料，它們是從斜率為2、截距為-5 的直線中抽取的散點

rng· = np.random.RandomState(1)
x = 10 * rng.rand(50)
y = 2 * x - 5 + rng.randn(50)
plt.scatter(x, y);

用Scikit-Learn 的LinearRegression 評估器來拟合資料，并獲得最佳拟合直線

from sklearn.linear_model import LinearRegression
model = LinearRegression(fit_intercept=True)

model.fit(x[:, np.newaxis], y)

xfit = np.linspace(0, 10, 1000)
yfit = model.predict(xfit[:, np.newaxis])

plt.scatter(x, y)
plt.plot(xfit, yfit);

資料的斜率和截距都在模型的拟合參數中，Scikit-Learn 通常會在參數後面加一條下劃線，即coef_ 和intercept_：

print("Model slope:    ", model.coef_[0])
print("Model intercept:", model.intercept_)

可以看到，拟合結果與真實值非常接近。

LinearRegression 評估器能做的可遠不止這些——除了簡單的直線拟合，它還可以處理多元度的線性回歸模型：

y=a0+a1x1+a2x2+⋯

裡面有多個x 變量。從幾何學的角度看，這個模型是拟合三維空間中的一個平面，或者是為更高次元的資料點拟合一個超平面。

rng = np.random.RandomState(1)
X = 10 * rng.rand(100, 3)
y = 0.5 + np.dot(X, [1.5, -2., 1.])

model.fit(X, y)
print(model.intercept_)
print(model.coef_)
0.5
[ 1.5 -2.   1. ]

其中y 變量是由3個随機的x變量線性組合而成，線性回歸模型還原了方程的系數。

通過這種方式，就可以用一個LinearRegression 評估器拟合資料的回歸直線、平面和超平面了。

局限性：變量限制線上性關系上

2、基函數回歸

可以通過基函數對原始資料進行變換，進而将變量間的線性回歸模型轉換為非線性回歸模型。

這個方法的多元模型是：y=a0+a1x1+a2x2+a3x3+⋯

其中一維的輸入變量xx轉換成了三維變量x₁,x₂,x₃。讓xn=fn(x)，這裡的fn(x)是轉換資料的函數。

假如fn(x)=xn，那麼模型就會變成多項式回歸：y=a0+a1x+a2x2+a3x3+⋯

需要注意的是，這個模型仍然是一個線性模型，也就是說系數an彼此不會相乘或相除。

1、多項式函數

多項式投影非常有用，是以Scikit-Learn 内置了PolynomialFeatures 轉換器實作這個功能：

from sklearn.preprocessing import PolynomialFeatures
x = np.array([2, 3, 4])
poly = PolynomialFeatures(3, include_bias=False)
poly.fit_transform(x[:, None])

轉換器通過指數函數，将一維數組轉換成了三維數組。這個新的高維數組之後可以放在多項式回歸模型中。

我們建立一個7 次多項式回歸模型：

from sklearn.pipeline import make_pipeline
poly_model = make_pipeline(PolynomialFeatures(7),
                    LinearRegression())

資料經過轉換之後，我們就可以用線性模型來拟合x 和y 之間更複雜的關系了。

rng = np.random.RandomState(1)
x = 10 * rng.rand(50)
y = np.sin(x) + 0.1 * rng.randn(50)

poly_model.fit(x[:, np.newaxis], y)
yfit = poly_model.predict(xfit[:, np.newaxis])

plt.scatter(x, y)
plt.plot(xfit, yfit);

通過運用7 次多項式基函數，這個線性模型可以對非線性資料拟合出極好的效果！

2. 高斯基函數

例如，有一種常用的拟合模型方法使用的并不是一組多項式基函數，而是一組高斯基函數。

圖中的陰影部分代表不同規模基函數，把它們放在一起時就會産生平滑的曲線。

from sklearn.base import BaseEstimator, TransformerMixin

class GaussianFeatures(BaseEstimator, TransformerMixin):
    """Uniformly spaced Gaussian features for one-dimensional input"""
    
    def __init__(self, N, width_factor=2.0):
        self.N = N
        self.width_factor = width_factor
    
    @staticmethod
    def _gauss_basis(x, y, width, axis=None):
        arg = (x - y) / width
        return np.exp(-0.5 * np.sum(arg ** 2, axis))
        
    def fit(self, X, y=None):
        # create N centers spread along the data range
        self.centers_ = np.linspace(X.min(), X.max(), self.N)
        self.width_ = self.width_factor * (self.centers_[1] - self.centers_[0])
        return self
        
    def transform(self, X):
        return self._gauss_basis(X[:, :, np.newaxis], self.centers_,
                                 self.width_, axis=1)
    
gauss_model = make_pipeline(GaussianFeatures(20),
                            LinearRegression())
gauss_model.fit(x[:, np.newaxis], y)
yfit = gauss_model.predict(xfit[:, np.newaxis])

plt.scatter(x, y)
plt.plot(xfit, yfit)
plt.xlim(0, 10);

3、正則化

雖然線上性回歸模型中引入基函數會讓模型變得更加靈活，但是也很容易造成過拟合。例如，如果選擇了太多高斯基函數，那麼最終的拟合結果看起來可能并不好。

model = make_pipeline(GaussianFeatures(30),
                      LinearRegression())
model.fit(x[:, np.newaxis], y)

plt.scatter(x, y)
plt.plot(xfit, model.predict(xfit[:, np.newaxis]))

plt.xlim(0, 10)
plt.ylim(-1.5, 1.5);

如果将資料投影到30 維的基函數上，模型就會變得過于靈活，進而能夠适應資料中不同位置的異常值。如果将高斯基函數的系數畫出來，就可以看到過拟合的原因。

def basis_plot(model, title=None):
    fig, ax = plt.subplots(2, sharex=True)
    model.fit(x[:, np.newaxis], y)
    ax[0].scatter(x, y)
    ax[0].plot(xfit, model.predict(xfit[:, np.newaxis]))
    ax[0].set(xlabel='x', ylabel='y', ylim=(-1.5, 1.5))
    
    if title:
        ax[0].set_title(title)

    ax[1].plot(model.steps[0][1].centers_,
               model.steps[1][1].coef_)
    ax[1].set(xlabel='basis location',
              ylabel='coefficient',
              xlim=(0, 10))
    
model = make_pipeline(GaussianFeatures(30), LinearRegression())
basis_plot(model)

下面那幅圖顯示了每個位置上基函數的振幅。當基函數重疊的時候，通常就表明出現了過拟合：相鄰基函數的系數互相抵消。這顯然是有問題的，如果對較大的模型參數進行懲罰（penalize），進而抑制模型劇烈波動，應該就可以解決這個問題了。這個懲罰機制被稱為正則化（regularization），有幾種不同的表現形式。

1. 嶺回歸（L2範數正則化）

正則化最常見的形式可能就是嶺回歸（ridge regression，或者L2 範數正則化），有時也被稱為吉洪諾夫正則化（Tikhonov regularization）。其處理方法是對模型系數平方和（L2 範數）進行懲罰，模型拟合的懲罰項為：

α 是一個自由參數，用來控制懲罰的力度。這種帶懲罰項的模型内置在Scikit-Learn的Ridge 評估器中

from sklearn.linear_model import Ridge
model = make_pipeline(GaussianFeatures(30), Ridge(alpha=0.1))
basis_plot(model, title='Ridge Regression')

參數α 是控制最終模型複雜度的關鍵。如果α → 0，那麼模型就恢複到标準線性回歸結果；如果α → ∞，那麼所有模型響應都會被壓制。

2. Lasso正則化（L1範數）

另一種常用的正則化被稱為Lasso，其處理方法是對模型系數絕對值的和（L1 範數）進行懲罰:

雖然它在形式上非常接近嶺回歸，但是其結果與嶺回歸差别很大。例如，由于其幾何特性，Lasso 正則化傾向于建構稀疏模型；也就是說，它更喜歡将模型系數設定為0。

模型系數的L1- 範數正則化實作的

from sklearn.linear_model import Lasso
model = make_pipeline(GaussianFeatures(30), Lasso(alpha=0.001))
basis_plot(model, title='Lasso Regression')

4、案例：預測自行車流量

首先加載兩個資料集，用日期作索引:

import pandas as pd
counts = pd.read_csv('datalab/5666/FremontBridge.csv', index_col='Date', parse_dates=True)
weather = pd.read_csv('datalab/5666/BicycleWeather.csv', index_col='DATE', parse_dates=True)

計算每一天的自行車流量，将結果放到一個新的DataFrame中

daily = counts.resample('d').sum()
daily['Total'] = daily.sum(axis=1)
daily = daily[['Total']] # remove other columns

我們發現同一周内每一天的模式都是不一樣的。是以，我們在資料中加上7 列0~1 值表示星期幾:

days = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
for i in range(7):
    daily[days[i]] = (daily.index.dayofweek == i).astype(float)

我們覺得騎車人數在節假日也有所變化。是以，再增加一清單示當天是否為節假日:

from pandas.tseries.holiday import USFederalHolidayCalendar
cal = USFederalHolidayCalendar()
holidays = cal.holidays('2012', '2016')
daily = daily.join(pd.Series(1, index=holidays, name='holiday'))
daily['holiday'].fillna(0, inplace=True)

我們還認為白晝時間也會影響騎車人數。是以，用标準的天文計算來添加這列資訊:

def hours_of_daylight(date, axis=23.44, latitude=47.61):
    """Compute the hours of daylight for the given date"""
    days = (date - pd.datetime(2000, 12, 21)).days
    m = (1. - np.tan(np.radians(latitude))
         * np.tan(np.radians(axis) * np.cos(days * 2 * np.pi / 365.25)))
    return 24. * np.degrees(np.arccos(1 - np.clip(m, 0, 2))) / 180.

daily['daylight_hrs'] = list(map(hours_of_daylight, daily.index))
daily[['daylight_hrs']].plot()
plt.ylim(8, 17)

我們還可以增加每一天的平均氣溫和總降雨量。除了降雨量的數值之外，再增加一個标記表示是否下雨（是否降雨量為0）

# 溫度是按照1/10攝氏度統計的，首先轉換為攝氏度
weather['TMIN'] /= 10
weather['TMAX'] /= 10
weather['Temp (C)'] = 0.5 * (weather['TMIN'] + weather['TMAX'])

# precip is in 1/10 mm; convert to inches
weather['PRCP'] /= 254
weather['dry day'] = (weather['PRCP'] == 0).astype(int)

daily = daily.join(weather[['PRCP', 'Temp (C)', 'dry day']])

最後，增加一個從1 開始遞增的計數器，表示一年已經過去了多少天。這個特征可以讓我們看到每一年自行車流量的增長或減少：

daily['annual'] = (daily.index - daily.index[0]).days / 365.

資料已經準備就緒，來看看前幾行：

daily.head()

有了這些資料之後，就可以選擇需要使用的列，然後對資料建立線性回歸模型。我們不在模型中使用截距，而是設定fit_intercept = False，因為每一天的總流量（Total 字段）基本上可以作為當天的截距。

# Drop any rows with null values
daily.dropna(axis=0, how='any', inplace=True)

column_names = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun', 'holiday',
                'daylight_hrs', 'PRCP', 'dry day', 'Temp (C)', 'annual']
X = daily[column_names]
y = daily['Total']

model = LinearRegression(fit_intercept=False)
model.fit(X, y)
daily['predicted'] = model.predict(X)

最後，對比自行車真實流量（Total 字段）與預測流量（predicted 字段）

daily[['Total', 'predicted']].plot(alpha=0.5);

顯然，我們丢失了一些關鍵特征，尤其是夏天的預測資料。要麼是由于特征沒有收集全（即可能還有其他因素會影響人們是否騎車），要麼是有一些非線性關系我們沒有考慮到（例如，可能人們在溫度過高或過低時都不願意騎車）

評估各個特征對每日自行車流量的影響：

params = pd.Series(model.coef_, index=X.columns)
params

如果不對這些資料的不确定性進行評估，那麼它們很難具有解釋力。可以用自舉重采樣方法快速計算資料的不确定性：

from sklearn.utils import resample
np.random.seed(1)
err = np.std([model.fit(*resample(X, y)).coef_
              for i in range(1000)], 0)
有了估計誤差之後，再來看這些結果：

print(pd.DataFrame({'effect': params.round(0),
                    'error': err.round(0)}))

首先，星期特征是比較穩定的，工作日騎車的人數顯然比周末和節假日要多。其次，白晝時間每增加1 小時，就平均增加129 ± 9 個騎車的人；而溫度每上升1 度，則增加65 ± 4 個騎車的人；如果那天沒下雨，那麼騎車人數增加546 ± 33 人；降雨量每增加1 英寸，騎車人數減少665 ± 62 人。當所有影響因素都生效之後，一年中每多一天騎車人數增加（日環比增幅）28 ± 18 人。

我們的模型的确丢失了一些重要資訊。例如，變量的非線性影響因素（例如降雨和寒冷天氣的影響）和非線性趨勢（例如人們在溫度過高或過低時可能都不願意騎車）在模型中都沒有展現。另外，我們丢掉了一些細顆粒度的資料（例如下雨天的早晨和下雨天的傍晚之間的差異），還忽略了相鄰日期彼此間的相關性（例如下雨的星期二對星期三騎車人數的影響，或者滂沱大雨之後意外的雨過天晴對騎車人數的影響），這些都可能對騎車人數産生影響。現在你手上已經有了工具，如果願意，可以進一步進行分析。

【Python資料科學手冊】專題：線性回歸

1、簡單線性回歸

2、基函數回歸

3、正則化

4、案例：預測自行車流量

繼續閱讀

來自python的【條件控制/語句循環/break/continue/else/pass】一、條件控制二、語句循環

無法解析的外部符号 wmain，該符号在函數 "void cdecl mainCRTStartupHelper(struct HINSTANCE *,unsigned short con......

TestLink導出用例轉換工具(XML2Excel)

YAML簡介和PyYAML安全操作YAML支援的類型YAML的優點：yaml的基本文法python操作

Small tricks

libsvm for python 安裝

學習軟體測試基礎測試第七天

Zeppelin 配置通路 REST APIApache Zeppelin Configuration REST API

【Torch】最簡潔logging使用指南

27. Remove Element(清單)題目代碼

Cloud Studio初體驗

使用 ctypes 進行 Python 和 C 的混合程式設計

【python】【資料處理】畫多元資料分布圖

【python】netconf協定對接管理裝置

「Python 網絡自動化」NETCONF —— Python 使用 NETCONF 管理配置 H3C 網絡裝置

在python中建立excel并寫入