Preface: I am writing these study notes for three reasons. First, I have recently been working systematically through Li Hang's *Statistical Learning Methods*, and taking notes consolidates what I learn. Second, there are many similar notes online, but some are incomplete or abandoned halfway, while others are overly convoluted; I will do my best to finish this set of notes and keep it concise and practical. Third, as with my other set of notes, [Andrew Ng's Machine Learning course notes + code implementations], algorithmic theory is only a tool; what matters more is how to use these tools, so I will show concrete code implementations here.
Example 1.1: Fitting a polynomial to data by least squares, with regularization
Training data set: $T = \{(x_i, y_i)\},\; i = 1, 2, 3, \ldots, m$

Polynomial of degree $n$: $H(x) = w_0 + w_1 x + w_2 x^2 + \cdots + w_n x^n$

where $w = (w_0, w_1, w_2, \ldots, w_n)$ are the parameters.

Following the strategy of empirical risk minimization, we solve for the parameters $w = (w_0, w_1, w_2, \ldots, w_n)$, i.e. the polynomial coefficients. Concretely, we minimize the following empirical risk (the residual sum of squares):

$$L(w) = \frac{1}{2}\sum_{i=1}^{m} \bigl(h(x_i) - y_i\bigr)^2$$

That is, we solve $\min_w L(w)$.
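Before turning to the iterative solver used below, note that for a polynomial model this minimization also has a closed form via the normal equations, $w = (X^\top X)^{-1} X^\top y$, where $X$ is the Vandermonde matrix of the inputs. A minimal sketch (the function name `poly_least_squares` is ours; `np.linalg.lstsq` is used instead of an explicit matrix inverse for numerical stability):

```python
import numpy as np

# Closed-form least squares for polynomial fitting: build the Vandermonde
# design matrix with rows (1, x_i, x_i^2, ..., x_i^n), then solve
# min ||Xw - y||^2. lstsq does this without forming (X^T X)^{-1}.
def poly_least_squares(x, y, n):
    X = np.vander(x, n + 1, increasing=True)
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

# Quick check: points on the line y = 1 + 2x are recovered exactly
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x
w = poly_least_squares(x, y, 1)  # w ≈ [1.0, 2.0]
```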
%matplotlib inline
# IPython built-in magic that renders plots inline, removing the need for
# plt.show(); it is not supported outside IPython/Jupyter.
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.optimize import leastsq
sns.set(style="whitegrid",color_codes=True)
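`scipy.optimize.leastsq`, imported above, minimizes the sum of squares of whatever residual vector the callback returns, so the callback should return raw residuals, not squared or summed errors. A minimal sketch on a hypothetical linear model (names `residuals`, `a`, `b` are ours):

```python
import numpy as np
from scipy.optimize import leastsq

# leastsq minimizes sum(residuals(p)**2) over p, so the callback returns
# the raw residual vector h(x; p) - y.
def residuals(p, x, y):
    a, b = p
    return (a * x + b) - y

x = np.array([0.0, 1.0, 2.0, 3.0])
y = 3.0 * x + 1.0                          # noise-free line y = 3x + 1
p_opt, _ = leastsq(residuals, np.zeros(2), args=(x, y))
# p_opt ≈ [3.0, 1.0]
```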
The target function in Example 1.1 of the book is $y = \sin 2\pi x$; the sampled data contain some noise.
# Target function
def obj_func(x):
    return np.sin(2 * np.pi * x)

# Polynomial fitting function
def h_func(p, x):
    f = np.poly1d(p)
    return f(x)

# Loss function: return the raw residual vector expected by leastsq
def cost_func(p, x, y):
    return h_func(p, x) - y
x = np.linspace(0,1,11)
y0 = obj_func(x)
# Add Gaussian noise to the samples
y = [np.random.normal(0,0.1)+y1 for y1 in y0]
def fit_func(M=0):
    # Randomly initialize the M+1 parameters
    p = np.random.rand(M + 1)
    final_p = leastsq(cost_func, p, args=(x, y))
    print('parameters:', final_p[0])
    # Visualization
    xPoints = np.linspace(0, 1, 1000)
    plt.plot(xPoints, obj_func(xPoints), label='obj_function')
    plt.plot(xPoints, h_func(final_p[0], xPoints), color='red', label='fit_function')
    plt.scatter(x, y, color='black', label='Sample Point', linewidth=1)  # plot the sample points
    plt.legend()
    return final_p
fit_func(M=0)
parameters: [0.02573826]
fit_func(M=1)
parameters: [-1.24431401  0.64789527]
fit_func(M=3)
parameters: [ 22.74841778 -34.14333587  11.78849017  -0.17415415]
M_10 = fit_func(M=10)
# Overfitting occurs
parameters: [-1.46280858e+05 7.48133454e+05 -1.64164833e+06 2.01857571e+06
-1.52326836e+06 7.25781151e+05 -2.16042144e+05 3.81998309e+04
-3.59036850e+03 1.40170459e+02 -1.04509370e-01]
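To see numerically what goes wrong at M = 10, here is a sketch using `np.polyfit` in place of the `leastsq` fit above (the seed and variable names are ours): with only 11 samples, a degree-10 polynomial has 11 free coefficients and can interpolate every noisy point, so the training error collapses while the coefficients blow up.

```python
import numpy as np

# Compare a moderate fit (degree 3) with an interpolating fit (degree 10)
# on 11 noisy samples of sin(2*pi*x).
rng = np.random.default_rng(0)   # assumed seed for reproducibility
x_demo = np.linspace(0, 1, 11)
y_demo = np.sin(2 * np.pi * x_demo) + rng.normal(0, 0.1, x_demo.shape)

train_err, coef_max = {}, {}
for deg in (3, 10):
    w = np.polyfit(x_demo, y_demo, deg)
    train_err[deg] = np.sum((np.polyval(w, x_demo) - y_demo) ** 2)
    coef_max[deg] = np.abs(w).max()
# train_err[10] is numerically near zero, but coef_max[10] >> coef_max[3]:
# the curve threads every noisy point and oscillates between samples.
```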
Regularization
The result shows overfitting, so we introduce a regularization term, the L2 norm of the parameter vector, to reduce it:

$$L(w) = \frac{1}{2}\sum_{i=1}^{m} \bigl(h(x_i) - y_i\bigr)^2 + \frac{\lambda}{2}\|w\|^2$$

- L1 norm: $\lambda \|w\|_1$
- L2 norm: $\frac{\lambda}{2}\|w\|^2$
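With the L2 penalty the minimizer again has a closed form, the ridge-regression solution $w = (X^\top X + \lambda I)^{-1} X^\top y$. A sketch with an assumed $\lambda = 10^{-3}$ (matching the strength used in the code below), comparing the penalized and unpenalized coefficients:

```python
import numpy as np

# Ridge regression: w = (X^T X + lam*I)^{-1} X^T y. The penalty shrinks
# the coefficient vector relative to plain least squares.
def ridge_poly_fit(x, y, n, lam=1e-3):
    X = np.vander(x, n + 1, increasing=True)
    A = X.T @ X + lam * np.eye(n + 1)
    return np.linalg.solve(A, X.T @ y)

x_demo = np.linspace(0, 1, 11)
y_demo = np.sin(2 * np.pi * x_demo)
w_ridge = ridge_poly_fit(x_demo, y_demo, 10)

# Unpenalized degree-10 least squares for comparison
X = np.vander(x_demo, 11, increasing=True)
w_ls, *_ = np.linalg.lstsq(X, y_demo, rcond=None)
# ||w_ridge|| is strictly smaller than ||w_ls||
```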
def regularization_func(p, x, y):
    re = 0.001  # regularization strength (lambda)
    # leastsq squares and sums the residual vector, so append sqrt(lambda/2)*p
    # to the residuals; its squared contribution is then (lambda/2)*||p||^2
    return np.append(cost_func(p, x, y), np.sqrt(0.5 * re) * p)
def fit_re_func(M=0):
    # Randomly initialize the M+1 parameters
    p = np.random.rand(M + 1)
    final_p = leastsq(regularization_func, p, args=(x, y))
    print('parameters:', final_p[0])
    return final_p
# Visualization
# M = 10
regularization = fit_re_func(M=10)
xPoints = np.linspace(0,1,1000)
plt.plot(xPoints,h_func(M_10[0],xPoints),label = 'fit_function')
plt.plot(xPoints,h_func(regularization[0],xPoints),color = 'red',label = 'regularization')
plt.scatter(x, y, color='black', label='Sample Point', linewidth=1)  # plot the sample points
plt.legend()
parameters: [-28.18008089 24.94652739 28.50036324 -28.36469244 -5.37687798
15.43657785 7.18333567 -7.43031457 -14.75873712 8.69741938
-0.50156785]