Programming Exercise 7:K-means Clustering and Principal Component Analysis 第一部分

大家好，我是Mac Jiang，今天和大家分享Coursera-Stanford University-Machine Learning-Programming Exercise 7:K-means Clustering and Principal Principal Component Analysis的第一部分的編碼。第一部分講的是K-means Clustering,即K均值算法的實作過程，雖然我寫的代碼是正确的，但不一定是最好的，如果有更好的實作方法，請留言指正。當然，歡迎大家轉載我的部落格，不過在轉載之前請标明出處，謝謝。第二部分的位址為：http://blog.csdn.net/a1015553840/article/details/50879343

好的，我們開始講解第一部分K-means Clustering的具體實作過程。

這部分的主要有兩大塊内容：

(1)主要是訓練PCA算法，并在OpenGL上繪制出K均值算法的具體計算過程，繪制出每次分類情況和中心變換情況。

Programming Exercise 7:K-means Clustering and Principal Component Analysis 第一部分

(2)利用K均值算法對一幅圖像進行壓縮，此圖像為128*128，每個像素由RGB三種顔色辨別，而每種顔色用1BYTE（8bit）表示，範圍為0-255。如果不采取壓縮，那麼圖像所占存儲空間大小為128*128*3BYTE=128*128*24bits = 393,216bits。我們要進行的是利用K均值算法聚類出最常用的16中顔色，這16中顔色隻要用4bit辨別，加上這十六種顔色與RGB的映射關系共128*128*4 + 16*24 = 65,920bit。可以看到，壓縮後存儲隻占壓縮前存儲量的1/6左右。

Programming Exercise 7:K-means Clustering and Principal Component Analysis 第一部分

資料集：ex7data2.mat---用于訓練K均值算法的訓練樣本

bird_small.png---用于做壓縮測試的圖像

函數：displayData.m---把訓練樣本X的資料可視化

drawLine.m---畫出2D降為1D的直線 plotDataPoints.m---k均值算法的點，當屬于不同中心時用不同顔色畫出

plotProgresskMeans.m---做出k均值算法的中心 runMeans.m---運作k均值算法

ex7.m---K均值算法的主要制函數，控制算法的進行過程

kMeansInitCentroid.m---初始化k均值算法的中心，需要完善代碼！

findClosestCentroids.m---将每個樣本歸為離他最近的中心的那一類，需要完善代碼！

computeCentroids.m---将上面求得的類，計算每一類的新的中心，需要完善代碼！

這部分作業共三個檔案需要完善代碼

K均值算法的計算為：

初始化中心；（kMeansInitCentroids.m實作）

Repeat{

from 1 to m:計算每個樣本離各類中心的距離，将每個樣本分别歸類（findClosestCentroids.m實作）

from 1 to K:z在歸類後，計算各類的中心（compureCentroids.m實作）

}

這我們需要完成的任務就是編寫初始化，樣本分類，求新分類中心三個操作

1.ex7的控制過程

%% Machine Learning Online Class
%  Exercise 7 | Principle Component Analysis and K-Means Clustering
%
%  Instructions
%  ------------
%
%  This file contains code that helps you get started on the
%  exercise. You will need to complete the following functions:
%
%     pca.m
%     projectData.m
%     recoverData.m
%     computeCentroids.m
%     findClosestCentroids.m
%     kMeansInitCentroids.m
%
%  For this exercise, you will not need to change any code in this file,
%  or any other files other than those mentioned above.
%

%% Initialization
clear ; close all; clc

%% ================= Part 1: Find Closest Centroids ====================
%  To help you implement K-Means, we have divided the learning algorithm 
%  into two functions -- findClosestCentroids and computeCentroids. In this
%  part, you shoudl complete the code in the findClosestCentroids function. 
%
fprintf('Finding closest centroids.\n\n');

% Load an example dataset that we will be using
load('ex7data2.mat');

% Select an initial set of centroids
K = 3; % 3 Centroids
initial_centroids = [3 3; 6 2; 8 5];

% Find the closest centroids for the examples using the
% initial_centroids
idx = findClosestCentroids(X, initial_centroids);

fprintf('Closest centroids for the first 3 examples: \n')
fprintf(' %d', idx(1:3));
fprintf('\n(the closest centroids should be 1, 3, 2 respectively)\n');

fprintf('Program paused. Press enter to continue.\n');
pause;

%% ===================== Part 2: Compute Means =========================
%  After implementing the closest centroids function, you should now
%  complete the computeCentroids function.
%
fprintf('\nComputing centroids means.\n\n');

%  Compute means based on the closest centroids found in the previous part.
centroids = computeCentroids(X, idx, K);

fprintf('Centroids computed after initial finding of closest centroids: \n')
fprintf(' %f %f \n' , centroids');
fprintf('\n(the centroids should be\n');
fprintf('   [ 2.428301 3.157924 ]\n');
fprintf('   [ 5.813503 2.633656 ]\n');
fprintf('   [ 7.119387 3.616684 ]\n\n');

fprintf('Program paused. Press enter to continue.\n');
pause;


%% =================== Part 3: K-Means Clustering ======================
%  After you have completed the two functions computeCentroids and
%  findClosestCentroids, you have all the necessary pieces to run the
%  kMeans algorithm. In this part, you will run the K-Means algorithm on
%  the example dataset we have provided. 
%
fprintf('\nRunning K-Means clustering on example dataset.\n\n');

% Load an example dataset
load('ex7data2.mat');

% Settings for running K-Means
K = 3;
max_iters = 10;

% For consistency, here we set centroids to specific values
% but in practice you want to generate them automatically, such as by
% settings them to be random examples (as can be seen in
% kMeansInitCentroids).
initial_centroids = [3 3; 6 2; 8 5];

% Run K-Means algorithm. The 'true' at the end tells our function to plot
% the progress of K-Means
[centroids, idx] = runkMeans(X, initial_centroids, max_iters, true);
fprintf('\nK-Means Done.\n\n');

fprintf('Program paused. Press enter to continue.\n');
pause;

%% ============= Part 4: K-Means Clustering on Pixels ===============
fprintf('\nRunning K-Means clustering on pixels from an image.\n\n');
%  Load an image of a bird
A = double(imread('bird_small.png'));
A = A / 255; % Divide by 255 so that all values are in the range 0 - 1

% 圖檔為128行，128列，每個像素RGB三種顔色，每個顔色1Byte = 8bit，共128*128*24bits
img_size = size(A);

%原圖A為img_size(1)行，img_size(2)列，每個像素點的顔色由RGB三種表示，每種8bit共3位元組，故為img_size(1)*img_size(2)*3
%由于我們要使用K-means，是以我們要把行和列鋪平，成為一個響亮，成為一個長度為img_size(1)*img_size(2)的向量，每個元素有RGB三種共3位元組
%把圖像鋪平，這樣每個元素即為一個輸入x，他有RGB三個次元
X = reshape(A, img_size(1) * img_size(2), 3);

%我們的目的是把RGB共256*256*256種顔色壓縮成16種顔色，這16種顔色是通過K均值算法計算出來的
%假如不壓縮，原圖為128*128*3Byte = 128*128*24bit = 393216bits;
%如果壓縮成16種顔色，那麼隻要4bit表示顔色的種類，然後再記錄用到的這16種顔色的RGB表示16*24bits...
%共16*24 +%128*128*4 = 65920bit8,圖像壓縮了将近6倍
K = 16; 
max_iters = 10;
%初始化中心
initial_centroids = kMeansInitCentroids(X, K);
%運作K均值算法
[centroids, idx] = runkMeans(X, initial_centroids, max_iters);

fprintf('Program paused. Press enter to continue.\n');
pause;


%% ================= Part 5: Image Compression ======================
fprintf('\nApplying K-Means to compress an image.\n\n');
% Find closest cluster members
idx = findClosestCentroids(X, centroids);

% Essentially, now we have represented the image X as in terms of the
% indices in idx. 

% We can now recover the image from the indices (idx) by mapping each pixel
% (specified by it's index in idx) to the centroid value
X_recovered = centroids(idx,:);

% Reshape the recovered image into proper dimensions
%本實驗本身并未壓圖檔，最後隻是把各點顔色用那16種代替了而已，但是提供的是一種壓縮圖檔的思想
X_recovered = reshape(X_recovered, img_size(1), img_size(2), 3);

% Display the original image 
subplot(1, 2, 1);
imagesc(A); 
title('Original');

% Display compressed image side by side
subplot(1, 2, 2);
imagesc(X_recovered)
title(sprintf('Compressed, with %d colors.', K));

fprintf('Program paused. Press enter to continue.\n');
pause;

Part1:Find Closest Centroids---利用ex7data2.mat和目前中心，計算的每個樣本離每個中心的距離，将他們分為最近的中心的類别中

Part2:Compute Means---利用第一部分的到的新的分類，計算每個新的分類的中心

Part3:K-Means Clustering---利用K均值算法進行聚類，并畫出每次聚類的類别變化過程和新中心的轉變過程

Part4:K-Means Clustering Pixels---利用K均值算法對圖像進行聚類分析，找到16中使用最多的顔色】

Park5:Image Compressing---在part4得到的16種顔色的基礎上，對圖像進行壓縮。這裡實際上并未對圖像進行壓縮，而是把圖檔各顔色換成16中顔色内與之相近的顔色。這裡隻是給我們提供這種圖檔壓縮的方法，并未最終實作

2.kMeansInitCentroids.m的實作

function centroids = kMeansInitCentroids(X, K)
%KMEANSINITCENTROIDS This function initializes K centroids that are to be 
%used in K-Means on the dataset X
%   centroids = KMEANSINITCENTROIDS(X, K) returns K initial centroids to be
%   used with the K-Means on the dataset X
%
% You should return this values correctly
centroids = zeros(K, size(X, 2));
% ====================== YOUR CODE HERE ======================
% Instructions: You should set centroids to randomly chosen examples from
%               the dataset X
%
%初始化中心centroids，從X中随機取K行作為初始化中心
randidx = randperm(size(X,1));            %打亂X的行，列不變
centroids = X(randidx(1:K),:);            %從打亂的X中取前K個作為初始化中心
% =============================================================
end

初始化中心時，是随機選取訓練樣本X中的K個作為初始化中心，是以先打亂X，然後取前K個即可。

3.findcloestCentroids.m的實作

function idx = findClosestCentroids(X, centroids)
%FINDCLOSESTCENTROIDS computes the centroid memberships for every example
%   idx = FINDCLOSESTCENTROIDS (X, centroids) returns the closest centroids
%   in idx for a dataset X where each row is a single example. idx = m x 1 
%   vector of centroid assignments (i.e. each entry in range [1..K])
%
% Set K
K = size(centroids, 1);
% You need to return the following variables correctly.
idx = zeros(size(X,1), 1);
% ====================== YOUR CODE HERE ======================
% Instructions: Go over every example, find its closest centroid, and store
%               the index inside idx at the appropriate location.
%               Concretely, idx(i) should contain the index of the centroid
%               closest to example i. Hence, it should be a value in the 
%               range 1..K
%
% Note: You can use a for-loop over the examples to compute this.
%
temp = zeros(K,1);                                           %存儲樣本x離各個中心距離的距離，友善求解該x離哪個點最近
for i = 1:size(X,1),                                         %對X的每個樣本進行周遊
    for j = 1:K,                                             %在進行x(i)時候，計算他離每個中心的距離，存儲在temp中
        temp(j) = sum((X(i,:) -  centroids(j,:)).^2);        
        [value,idx(i)] = min(temp,[],1);                     %計算temp中最小值的行号，就是x(i)距離最近的中心标号
    end
end
% =============================================================
end

4.computeCentroids.m的實作

function centroids = computeCentroids(X, idx, K)
%COMPUTECENTROIDS returs the new centroids by computing the means of the 
%data points assigned to each centroid.
%   centroids = COMPUTECENTROIDS(X, idx, K) returns the new centroids by 
%   computing the means of the data points assigned to each centroid. It is
%   given a dataset X where each row is a single data point, a vector
%   idx of centroid assignments (i.e. each entry in range [1..K]) for each
%   example, and K, the number of centroids. You should return a matrix
%   centroids, where each row of centroids is the mean of the data points
%   assigned to it.
%
% Useful variables
[m n] = size(X);
% You need to return the following variables correctly.
centroids = zeros(K, n);
% ====================== YOUR CODE HERE ======================
% Instructions: Go over every centroid and compute mean of all points that
%               belong to it. Concretely, the row vector centroids(i, :)
%               should contain the mean of the data points assigned to
%               centroid i.
%
% Note: You can use a for-loop over the centroids to compute this.
%
for i = 1:K,                                              %對每個中心周遊，一個一個計算
    centroids(i,:) = (X' * (idx == i)) / sum(idx == i);   %矩陣的方法，idx == i的意思是是idx向量的元素為i的位置置1，不為i的置0；
                                                          %然後乘以X’就是把對應中心i的X值加起來，最後除以sum及求平均
                                                          %這裡實際上可以再采用一個for循環計算centroids，但是向量的方法更快，故采取向量的方法
end
% =============================================================
end

這裡對每個類别進行周遊（共K類），然後對每類計算他的中心。對每個類别進行周遊時候需要一個FOR循環；在計算每個類别的中心時也可以采取一個for循環，但是這樣太慢，可以采取向量的方法加快計算速度。

向量的方法即利用（idx==i）得到一個m*1的向量，當idx對應的位置為i時，此向量對應位置為1，否則為0。

FROM:http://blog.csdn.net/a1015553840/article/details/50877623

Programming Exercise 7:K-means Clustering and Principal Component Analysis 第一部分

繼續閱讀

Coursera Machine Learning 學習筆記（二）

Coursera Machine Learning 學習筆記（十一）

Coursera Machine Learning 學習筆記（四）

Coursera Machine Learning 學習筆記（九）

誤差逆傳播計算

Seq2Seq模型中的貪心搜尋(Greedy Search)1. 引入2. 貪心算法：Greedy Search參考

機器學習的應用場景

機器學習 Machine Learning

Machine Learning Basis

R與bioconductor--ExpressionSet SummarizedExperiment GEOquery biomaRt S4-Classes S4-Methods

Coursera Machine Learning如何送出MATLAB Online作業

Text Recognition with ML KitText Recognition with ML Kit

【吳恩達機器學習筆記】7支援向量機12支援向量機（Support Vector Machines）

scikit-learn中的SVM

ML - 貸款使用者逾期情況分析6 - Final思路

SVM支援向量機二（Lagrange Duality）SVM支援向量機二（Lagrange Duality）