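Table of Contents

1. Introduction to PCA
- 1.1 Concept
- 1.2 Dimensionality reduction and recovery diagram
2. PCA algorithm simulation
- 2.1 NumPy implementation
- 2.2 sklearn implementation
3. Example: face dimensionality reduction with PCA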

%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

1. Introduction to PCA

1.1 Concept

Idea:

[Figure: Machine Learning Principles and Practice | PCA Dimensionality Reduction in Practice]

dots = np.array([[1, 1.5], [2, 1.5], [3, 3.6], [4, 3.2], [5, 5.5]])

def cross_point(x0, y0):
    """
    1. line1: y = x
    2. line2: y = -x + b => x = b/2
    3. [x0, y0] is in line2 => b = x0 + y0

    => x1 = b/2 = (x0 + y0) / 2
    => y1 = x1
    """
    x1 = (x0 + y0) / 2
    return x1, x1


plt.figure(figsize=(8, 6), dpi=144)
plt.title('2-dimension to 1-dimension')

plt.xlim(0, 8)
plt.ylim(0, 6)
ax = plt.gca()                                  # gca means 'get current axes'
ax.spines['right'].set_color('none')            # hide the right and top spines
ax.spines['top'].set_color('none')

plt.scatter(dots[:, 0], dots[:, 1], marker='s', c='b')
plt.plot([0.5, 6], [0.5, 6], '-r')
for d in dots:
    x1, y1 = cross_point(d[0], d[1])
    plt.plot([d[0], x1], [d[1], y1], '--b')
    plt.scatter(x1, y1, marker='o', c='r')
plt.annotate(r'projection point',
             xy=(x1, y1), xycoords='data',
             xytext=(x1 + 0.5, y1 - 0.5), fontsize=10,
             arrowprops=dict(arrowstyle="->", connectionstyle="arc3,rad=.2"))
plt.annotate(r'vector $u^{(1)}$',
             xy=(4.5, 4.5), xycoords='data',
             xytext=(5, 4), fontsize=10,
             arrowprops=dict(arrowstyle="->", connectionstyle="arc3,rad=.2"))
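The closed-form cross_point used for the plot is a special case of orthogonal projection onto a unit vector. The sketch below is a minimal check of that, assuming the line y = x is spanned by u = (1, 1)/sqrt(2). The helper name project_onto is hypothetical and not part of the original code.

```python
import numpy as np

dots = np.array([[1, 1.5], [2, 1.5], [3, 3.6], [4, 3.2], [5, 5.5]])

def cross_point(x0, y0):
    # Foot of the perpendicular from (x0, y0) to the line y = x
    x1 = (x0 + y0) / 2
    return x1, x1

def project_onto(points, u):
    # Orthogonal projection of each row of `points` onto the line spanned by u
    u = u / np.linalg.norm(u)
    coords = points @ u              # 1-D coordinate of each point along u
    return np.outer(coords, u)       # the same points expressed back in 2-D

u1 = np.array([1.0, 1.0])
projected = project_onto(dots, u1)
expected = np.array([cross_point(x, y) for x, y in dots])
print(np.allclose(projected, expected))
```

The intermediate `coords` array plays the role of the reduced one-dimensional data, and the `np.outer` step corresponds to mapping it back into two dimensions.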


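In the figure, the square points are the original data after preprocessing (mean normalization and scaling), and the round points are the data recovered from one dimension back to two dimensions. The principal component vectors u^(1) and u^(2) are drawn as well. The figure illustrates several interesting conclusions. First, the round points are exactly the projections of the square points onto the line along u^(1). So-called PCA data recovery is not a true recovery: it only converts the reduced coordinates back into coordinates in the original coordinate system. In our example, it converts coordinates in the one-dimensional coordinate system defined by u^(1) into coordinates in the original two-dimensional coordinate system. Second, the principal component vectors u^(1) and u^(2) are perpendicular to each other. Third, the distance between each square point and its round point is the error introduced by PCA dimensionality reduction.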
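These three observations can be checked numerically. The following is a minimal sketch that reuses the sample matrix A from section 2.1. The variable names are illustrative, and the checks are a sanity test under those assumptions rather than a proof.

```python
import numpy as np

A = np.array([[3, 2000], [2, 3000], [4, 5000], [5, 8000], [1, 2000]], dtype=float)
norm = A - A.mean(axis=0)
norm = norm / (norm.max(axis=0) - norm.min(axis=0))

U, S, _ = np.linalg.svd(norm.T @ norm)
u1, u2 = U[:, 0], U[:, 1]
Z = np.outer(norm @ u1, u1)          # reduce to 1-D, then map back to 2-D

# The principal directions are perpendicular
dot_u1_u2 = float(u1 @ u2)
# The residual (square minus circle) has no component along u1,
# which is what makes Z an orthogonal projection onto the u1 line
residual = norm - Z
along_u1 = residual @ u1
# The distance between each square and its circle is the per-point error
errors = np.linalg.norm(residual, axis=1)
total_sq_error = np.sum(errors ** 2)  # matches the second singular value S[1]

print(abs(dot_u1_u2) < 1e-10, np.allclose(along_u1, 0), errors.round(4))
```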

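1.2 Dimensionality reduction and recovery diagram

# Note: norm, U, U_reduce and Z used below are computed in section 2.1
plt.figure(figsize=(8, 8), dpi=144)

plt.title('Physical meanings of PCA')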

ymin = xmin = -1
ymax = xmax = 1
plt.xlim(xmin, xmax)
plt.ylim(ymin, ymax)
ax = plt.gca()                                  # gca means 'get current axes'
ax.spines['right'].set_color('none')            # hide the right and top spines
ax.spines['top'].set_color('none')

plt.scatter(norm[:, 0], norm[:, 1], marker='s', c='b')
plt.scatter(Z[:, 0], Z[:, 1], marker='o', c='r')
plt.arrow(0, 0, U[0][0], U[1][0], color='r', linestyle='-')
plt.arrow(0, 0, U[0][1], U[1][1], color='r', linestyle='--')
plt.annotate(r'$U_{reduce} = u^{(1)}$',
             xy=(U[0][0], U[1][0]), xycoords='data',
             xytext=(U_reduce[0][0] + 0.2, U_reduce[1][0] - 0.1), fontsize=10,
             arrowprops=dict(arrowstyle="->", connectionstyle="arc3,rad=.2"))
plt.annotate(r'$u^{(2)}$',
             xy=(U[0][1], U[1][1]), xycoords='data',
             xytext=(U[0][1] + 0.2, U[1][1] - 0.1), fontsize=10,
             arrowprops=dict(arrowstyle="->", connectionstyle="arc3,rad=.2"))
plt.annotate(r'raw data',
             xy=(norm[0][0], norm[0][1]), xycoords='data',
             xytext=(norm[0][0] + 0.2, norm[0][1] - 0.2), fontsize=10,
             arrowprops=dict(arrowstyle="->", connectionstyle="arc3,rad=.2"))
plt.annotate(r'projected data',
             xy=(Z[0][0], Z[0][1]), xycoords='data',
             xytext=(Z[0][0] + 0.2, Z[0][1] - 0.1), fontsize=10,
             arrowprops=dict(arrowstyle="->", connectionstyle="arc3,rad=.2"))

Text(0.03390904029252009, -0.28050757997562326, 'projected data')

2. PCA algorithm simulation

2.1 NumPy implementation

A = np.array([[3, 2000], 
              [2, 3000], 
              [4, 5000], 
              [5, 8000], 
              [1, 2000]], dtype='float')

# Mean normalization: center each feature
mean = np.mean(A, axis=0)
norm = A - mean
# Feature scaling: divide by each feature's range
scope = np.max(norm, axis=0) - np.min(norm, axis=0)
norm = norm / scope
norm

array([[ 0.        , -0.33333333],
       [-0.25      , -0.16666667],
       [ 0.25      ,  0.16666667],
       [ 0.5       ,  0.66666667],
       [-0.5       , -0.33333333]])

U, S, V = np.linalg.svd(np.dot(norm.T, norm))
U

array([[-0.67710949, -0.73588229],
       [-0.73588229,  0.67710949]])
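The SVD call above is applied to the 2x2 matrix norm.T @ norm. As a cross-check, the sketch below compares it with two other routes that should give the same principal directions: an eigendecomposition of the same matrix, and an SVD of the data matrix itself. It hard-codes the normalized values printed above, and it is only an illustration of the equivalence, not part of the original walkthrough.

```python
import numpy as np

norm = np.array([[ 0.  , -1/3],
                 [-0.25, -1/6],
                 [ 0.25,  1/6],
                 [ 0.5 ,  2/3],
                 [-0.5 , -1/3]])

# Route 1 (as in the text): SVD of the 2x2 matrix norm.T @ norm
U, S, _ = np.linalg.svd(norm.T @ norm)

# Route 2: eigendecomposition of the same symmetric matrix;
# eigh returns eigenvalues in ascending order, so reverse them
eigvals, eigvecs = np.linalg.eigh(norm.T @ norm)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]

# Route 3: SVD of the data matrix itself; its right singular vectors are the
# same directions and its singular values are the square roots of S
_, s_data, Vt = np.linalg.svd(norm, full_matrices=False)

same_values = np.allclose(S, eigvals) and np.allclose(S, s_data ** 2)
# Directions agree only up to sign, so compare absolute values
same_directions = (np.allclose(np.abs(U), np.abs(eigvecs))
                   and np.allclose(np.abs(U), np.abs(Vt.T)))
print(same_values, same_directions)
```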

U_reduce = U[:, 0].reshape(2,1)
U_reduce

array([[-0.67710949],
       [-0.73588229]])

R = np.dot(norm, U_reduce)
R

array([[ 0.2452941 ],
       [ 0.29192442],
       [-0.29192442],
       [-0.82914294],
       [ 0.58384884]])

Z = np.dot(R, U_reduce.T)
Z

array([[-0.16609096, -0.18050758],
       [-0.19766479, -0.21482201],
       [ 0.19766479,  0.21482201],
       [ 0.56142055,  0.6101516 ],
       [-0.39532959, -0.42964402]])

np.multiply(Z, scope) + mean

array([[2.33563616e+00, 2.91695452e+03],
       [2.20934082e+00, 2.71106794e+03],
       [3.79065918e+00, 5.28893206e+03],
       [5.24568220e+00, 7.66090960e+03],
       [1.41868164e+00, 1.42213588e+03]])
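The recovered values differ from A because one dimension was discarded. One way to quantify the loss, sketched here under the assumption that the values in S are the eigenvalues of norm.T @ norm, is the share of total variance carried by the first component. This should equal one minus the relative squared reconstruction error.

```python
import numpy as np

A = np.array([[3, 2000], [2, 3000], [4, 5000], [5, 8000], [1, 2000]], dtype=float)
mean = A.mean(axis=0)
scope = A.max(axis=0) - A.min(axis=0)
norm = (A - mean) / scope

U, S, _ = np.linalg.svd(norm.T @ norm)
U_reduce = U[:, :1]                  # keep k = 1 principal direction
Z = norm @ U_reduce @ U_reduce.T     # reduce, then map back

# Fraction of the total variance kept by the first component
retained = S[0] / S.sum()
# The same quantity derived from the reconstruction error in normalized space
error_ratio = np.sum((norm - Z) ** 2) / np.sum(norm ** 2)

# Recovered data in the original units, and its per-feature RMS error
A_approx = Z * scope + mean
rmse = np.sqrt(np.mean((A - A_approx) ** 2, axis=0))
print(retained, 1 - error_ratio, rmse)
```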

2.2 sklearn implementation

from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

def std_PCA(**argv):
    # Preprocess the data with MinMaxScaler
    scaler = MinMaxScaler()
    # PCA algorithm
    pca = PCA(**argv)
    pipeline = Pipeline([('scaler', scaler),
                         ('pca', pca)])
    return pipeline

pca = std_PCA(n_components=1)
R2 = pca.fit_transform(A)
R2

array([[-0.2452941 ],
       [-0.29192442],
       [ 0.29192442],
       [ 0.82914294],
       [-0.58384884]])

pca.inverse_transform(R2)

array([[2.33563616e+00, 2.91695452e+03],
       [2.20934082e+00, 2.71106794e+03],
       [3.79065918e+00, 5.28893206e+03],
       [5.24568220e+00, 7.66090960e+03],
       [1.41868164e+00, 1.42213588e+03]])
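Note that R2 has the opposite sign of the NumPy result R from section 2.1, while the recovered data is identical. The sign of a principal direction is arbitrary, and which sign comes out may depend on the library version. The following sketch compares the two routes under that assumption.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

A = np.array([[3, 2000], [2, 3000], [4, 5000], [5, 8000], [1, 2000]], dtype=float)
mean = A.mean(axis=0)
scope = A.max(axis=0) - A.min(axis=0)

# NumPy route from section 2.1
norm = (A - mean) / scope
U, S, _ = np.linalg.svd(norm.T @ norm)
U_reduce = U[:, :1]
R = norm @ U_reduce

# sklearn route from this section
pipeline = Pipeline([('scaler', MinMaxScaler()), ('pca', PCA(n_components=1))])
R2 = pipeline.fit_transform(A)

# The two projections may differ by an overall sign ...
same_up_to_sign = np.allclose(R2, R) or np.allclose(R2, -R)
# ... but the recovered data does not depend on that sign
recovered_np = (R @ U_reduce.T) * scope + mean
recovered_sk = pipeline.inverse_transform(R2)
print(same_up_to_sign, np.allclose(recovered_np, recovered_sk))
```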

3. Example: face dimensionality reduction with PCA

%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

from sklearn.datasets import fetch_olivetti_faces

# fetch_olivetti_faces loads the Olivetti faces dataset; the images are cropped to the central region so that only facial features remain
faces = fetch_olivetti_faces(data_home='datasets/')

X = faces.data
y = faces.target
image = faces.images
print("data:{}, label:{}, image:{}".format(X.shape, y.shape, image.shape))

data:(400, 4096), label:(400,), image:(400, 64, 64)

View some of the images

target_names = np.array(["c%d" % i for i in np.unique(y)])
target_names

array(['c0', 'c1', 'c2', 'c3', 'c4', 'c5', 'c6', 'c7', 'c8', 'c9', 'c10',
       'c11', 'c12', 'c13', 'c14', 'c15', 'c16', 'c17', 'c18', 'c19',
       'c20', 'c21', 'c22', 'c23', 'c24', 'c25', 'c26', 'c27', 'c28',
       'c29', 'c30', 'c31', 'c32', 'c33', 'c34', 'c35', 'c36', 'c37',
       'c38', 'c39'], dtype='<U3')

plt.figure(figsize=(12, 11), dpi=100)

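# Show 5 images for each of the first shownum/10 = 4 people
shownum = 40
# Names of the first shownum/10 people
title = target_names[:int(shownum/10)]
j = 1

# Of each person's 10 images, show only the first 5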
for i in range(shownum):
    if i%10 < 5:
        plt.subplot(int(shownum/10),5,j)
        plt.title("people:"+title[int(i/10)])
        plt.imshow(image[i],cmap=plt.cm.gray)
        j+=1


Extract the first image of each of the 40 people and display them

subimage = None

for i in range(len(image)):
    if i%10 == 0:
        if subimage is not None:
#             print("subimage.shape:{},image[i].shape:{}",subimage.shape, image[i].shape)
            subimage = np.concatenate((subimage, image[i].reshape(1,64,64)), axis=0)
        else:
            subimage = image[i].reshape(1,64,64)
            
plt.figure(figsize=(12,6), dpi=100)

for i in range(subimage.shape[0]):
    plt.subplot(int(subimage.shape[0]/10), 10, i+1)
    plt.imshow(subimage[i], cmap=plt.cm.gray)
    plt.title("name:"+target_names[i])
    plt.axis('off')


Split the dataset

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=4)

X_train.shape, X_test.shape, y_train.shape, y_test.shape

((320, 4096), (80, 4096), (320,), (80,))
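With only 10 images per person, a plain random split can leave some people under-represented in the test set. One option, not used in the original walkthrough, is the stratify argument of train_test_split. The sketch below uses synthetic labels that mimic the 40 x 10 layout of the dataset, so it runs without downloading the faces.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in for the Olivetti labels: 40 people, 10 images each
y = np.repeat(np.arange(40), 10)
X = np.arange(len(y)).reshape(-1, 1)     # dummy features, one row per image

# Plain random split, as in the text
_, _, _, y_test_plain = train_test_split(X, y, test_size=0.2, random_state=4)
# Stratified split: class proportions are preserved in both parts
_, _, _, y_test_strat = train_test_split(
    X, y, test_size=0.2, random_state=4, stratify=y)

counts_plain = np.bincount(y_test_plain, minlength=40)
counts_strat = np.bincount(y_test_strat, minlength=40)
print("plain split, test images per person:", counts_plain.min(), "to", counts_plain.max())
print("stratified split, test images per person:", counts_strat.min(), "to", counts_strat.max())
```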

Face recognition with an SVM

from sklearn.svm import SVC

# Set class_weight so that SVC adjusts class weights according to the number of training samples in each class
clf = SVC(class_weight='balanced')
# Train
clf.fit(X_train, y_train)
# Compute the scores
trainscore = clf.score(X_train,y_train)
testscore = clf.score(X_test,y_test)
print("trainscore:{},testscore:{}".format(trainscore, testscore))
# Predict
y_pred = clf.predict(X_test)

trainscore:1.0,testscore:0.975

Display an image from the test set

# plt.figure(figsize=(12,6), dpi=100)
plt.subplot(1,1,1)
plt.imshow(X_test[1].reshape(64,64), cmap=plt.cm.gray)

<matplotlib.image.AxesImage at 0x21fb6d83688>


The prediction is correct, and the SVM performs very well on this data.

True

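The PCA model's explained_variance_ratio_ attribute gives the fraction of the total variance retained by each component; the sum over all kept components is the share of the data's variance preserved after PCA (the data recovery rate).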

from sklearn.decomposition import PCA

pca = PCA(n_components=140)
X_pca = pca.fit_transform(X)
np.sum(pca.explained_variance_ratio_)

0.9585573
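Instead of guessing n_components and then checking the retained variance, the number of components can be derived from a variance target. The sketch below assumes a 95% target and uses the digits dataset bundled with sklearn as a stand-in for the faces, so that it runs without a download. Applying the same calls to X should work the same way.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# Stand-in data: 1797 digit images with 8x8 = 64 features (ships with sklearn)
X_demo = load_digits().data

# Fit with all components once, then read off the cumulative retained variance
full = PCA(svd_solver='full').fit(X_demo)
cumulative = np.cumsum(full.explained_variance_ratio_)
k_95 = int(np.argmax(cumulative >= 0.95)) + 1    # smallest k reaching 95%

# Alternative: a float n_components asks PCA to pick k for that variance target
auto = PCA(n_components=0.95, svd_solver='full').fit(X_demo)
print(k_95, auto.n_components_, cumulative[k_95 - 1])
```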

The data currently has 4096 features. Next, use PCA to reduce the number of features and see how the images change.

from sklearn.decomposition import PCA

# Show the original images
plt.figure(figsize=(12,8), dpi=100)
subimage = faces.images[:5]
for i in range(5):
    plt.subplot(1, 5, i+1)
    plt.imshow(subimage[i], cmap=plt.cm.gray)
    plt.axis('off')
    
# Show the images after dimensionality reduction
k = [140, 75, 37, 19, 8]
plt.figure(figsize=(12,12), dpi=100)

for index in range(len(k)):
    pca = PCA(n_components=k[index])

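    # Reduce the dimensionality
    X_pca = pca.fit_transform(X)
    # Map back to the original dimensionality; the round trip is lossy
    X_invert_pca = pca.inverse_transform(X_pca)
    image = X_invert_pca.reshape(-1,64,64)
    subimage = image[:5]
    
    # Row `index` of the grid shows the first 5 faces rebuilt from k[index] components
    for i in range(5):
        plt.subplot(len(k), 5, (i+1)+5*index)
        plt.imshow(subimage[i], cmap=plt.cm.gray)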
    #     plt.title("name:"+target_names[i])
        plt.axis('off')


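The faces become progressively blurrier as the number of components decreases. Reducing from 4096 dimensions to 140 still preserves most of the facial features.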
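The visual blurring can also be expressed as a number. The sketch below measures the mean squared reconstruction error for a decreasing number of components. It uses the bundled digits dataset as a stand-in for the faces and a hypothetical list of k values, so the absolute numbers are not comparable with the face images.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# Stand-in data: 8x8 digit images (64 features), bundled with sklearn
X_demo = load_digits().data

ks = [32, 16, 8, 4, 2]
mse = []
for k in ks:
    pca = PCA(n_components=k, svd_solver='full')
    # Reduce to k dimensions, then map back to the original 64
    X_back = pca.inverse_transform(pca.fit_transform(X_demo))
    mse.append(np.mean((X_demo - X_back) ** 2))   # mean squared pixel error

for k, err in zip(ks, mse):
    print("k={:2d}  reconstruction MSE={:.3f}".format(k, err))
```

The error grows as k shrinks, which is the numeric counterpart of the faces losing detail.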

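For the differences between fit(), transform() and fit_transform(), see this blog post: https://zhuanlan.zhihu.com/p/271969151

You must call fit_transform(trainData) first and then transform(testData). Calling transform(testData) directly on an estimator that has not been fitted raises an error.

If you call fit_transform(testData) instead of transform(testData) after fit_transform(trainData), the test data is still transformed, but the two results are not on the same "standard" and differ noticeably. In other words, the test set must be processed with the parameters learned from the training set so that both receive the same treatment.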
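A small illustration of the point above, using MinMaxScaler on toy one-column data. The arrays are made up for the example.

```python
import numpy as np
from sklearn.exceptions import NotFittedError
from sklearn.preprocessing import MinMaxScaler

train = np.array([[0.0], [5.0], [10.0]])
test = np.array([[2.0], [4.0]])

# transform() before any fit raises NotFittedError
try:
    MinMaxScaler().transform(test)
    raised = False
except NotFittedError:
    raised = True

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train)        # learns min=0, max=10 from train
test_ok = scaler.transform(test)                  # reuses the training min/max
test_refit = MinMaxScaler().fit_transform(test)   # relearns min=2, max=4 from test

print(raised)                # True
print(test_ok.ravel())       # [0.2 0.4]
print(test_refit.ravel())    # [0. 1.]
```

The refitted result maps the test values to 0 and 1, which no longer lines up with the scale the training data was mapped to.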

from sklearn.svm import SVC

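# Number of dimensions to reduce to
pca = PCA(n_components=140)

# Fit PCA on the training set and project the training set
X_train_pca = pca.fit_transform(X_train)
# Project the test set with the components learned from the training set
X_test_pca = pca.transform(X_test)

# Set class_weight so that SVC adjusts class weights according to the number of training samples in each class
clf = SVC(class_weight='balanced')
# Train the SVM on the PCA-reduced data
clf.fit(X_train_pca, y_train)

# Compute the scores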
trainscore = clf.score(X_train_pca,y_train)
testscore = clf.score(X_test_pca,y_test)
print("trainscore:{},testscore:{}".format(trainscore, testscore))

trainscore:1.0,testscore:0.975

Use GridSearchCV to tune the hyperparameters further

from sklearn.model_selection import GridSearchCV

# print("Searching the best parameters for SVC ...")
param_grid = {'C': [1, 5, 10, 50, 100],
              'gamma': [0.0001, 0.0005, 0.001, 0.005, 0.01]}
clf = GridSearchCV(SVC(kernel='rbf', class_weight='balanced'), param_grid, verbose=2, n_jobs=4)
clf = clf.fit(X_train_pca, y_train)
print("Best parameters found by grid search:",clf.best_params_)

# Compute the scores
trainscore = clf.score(X_train_pca,y_train)
testscore = clf.score(X_test_pca,y_test)
print("trainscore:{},testscore:{}".format(trainscore, testscore))

Fitting 5 folds for each of 25 candidates, totalling 125 fits
Best parameters found by grid search: {'C': 5, 'gamma': 0.005}
trainscore:1.0,testscore:0.9625
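In the search above, PCA was fitted once on the whole training set before cross-validation. A possible variation is to put PCA inside a Pipeline so that each fold fits its own projection. The sketch below shows this under stated assumptions: it uses the bundled digits dataset as a stand-in for the faces, and the component count and grid values are illustrative rather than taken from the original.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Stand-in data: 8x8 digit images bundled with sklearn
X_demo, y_demo = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=4)

# PCA sits inside the pipeline, so every CV fold fits it on its own training part
pipe = Pipeline([('pca', PCA(n_components=20)),
                 ('svc', SVC(kernel='rbf', class_weight='balanced'))])
# Step parameters are addressed as <step name>__<parameter>
param_grid = {'svc__C': [1, 10],
              'svc__gamma': [0.0005, 0.001]}

search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(X_tr, y_tr)
test_score = search.score(X_te, y_te)
print(search.best_params_, test_score)
```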

The results are still very good.

import pandas as pd

result = pd.DataFrame()
result['pred'] = y_pred
result['true'] = y_test
result['compares'] = y_pred==y_test
result.head(10)

	pred	true	compares
0	18	18	True
1			True
2	6	6	True
3	31	31	True
4	10	10	True
5	27	27	True
6	36	36	True
7	32	32	True
8	29	29	True
9	33	33	True
