天天看點

Programming Exercise 7:K-means Clustering and Principal Component Analysis 第一部分

        大家好,我是Mac Jiang,今天和大家分享Coursera-Stanford University-Machine Learning-Programming Exercise 7:K-means Clustering and Principal Principal Component Analysis的第一部分的編碼。第一部分講的是K-means Clustering,即K均值算法的實作過程,雖然我寫的代碼是正确的,但不一定是最好的,如果有更好的實作方法,請留言指正。當然,歡迎大家轉載我的部落格,不過在轉載之前請标明出處,謝謝。第二部分的位址為:http://blog.csdn.net/a1015553840/article/details/50879343

        好的,我們開始講解第一部分K-means Clustering的具體實作過程。

       這部分的主要有兩大塊内容:

       (1)主要是訓練PCA算法,并在OpenGL上繪制出K均值算法的具體計算過程,繪制出每次分類情況和中心變換情況。

Programming Exercise 7:K-means Clustering and Principal Component Analysis 第一部分

       (2)利用K均值算法對一幅圖像進行壓縮,此圖像為128*128,每個像素由RGB三種顔色辨別,而每種顔色用1BYTE(8bit)表示,範圍為0-255。如果不采取壓縮,那麼圖像所占存儲空間大小為128*128*3BYTE=128*128*24bits = 393,216bits。我們要進行的是利用K均值算法聚類出最常用的16中顔色,這16中顔色隻要用4bit辨別,加上這十六種顔色與RGB的映射關系共128*128*4 + 16*24 = 65,920bit。可以看到,壓縮後存儲隻占壓縮前存儲量的1/6左右。

Programming Exercise 7:K-means Clustering and Principal Component Analysis 第一部分

資料集:ex7data2.mat---用于訓練K均值算法的訓練樣本

              bird_small.png---用于做壓縮測試的圖像

   函數:displayData.m---把訓練樣本X的資料可視化

              drawLine.m---畫出2D降為1D的直線  plotDataPoints.m---k均值算法的點,當屬于不同中心時用不同顔色畫出

              plotProgresskMeans.m---做出k均值算法的中心    runMeans.m---運作k均值算法

              ex7.m---K均值算法的主要制函數,控制算法的進行過程

              kMeansInitCentroid.m---初始化k均值算法的中心,需要完善代碼!

              findClosestCentroids.m---将每個樣本歸為離他最近的中心的那一類,需要完善代碼!

              computeCentroids.m---将上面求得的類,計算每一類的新的中心,需要完善代碼!

              這部分作業共三個檔案需要完善代碼

K均值算法的計算為:

        初始化中心;(kMeansInitCentroids.m實作)

        Repeat{

                    from 1 to m:計算每個樣本離各類中心的距離,将每個樣本分别歸類(findClosestCentroids.m實作)

                    from 1 to K:z在歸類後,計算各類的中心(compureCentroids.m實作)

         }

這我們需要完成的任務就是編寫初始化,樣本分類,求新分類中心三個操作

1.ex7的控制過程

%% Machine Learning Online Class
%  Exercise 7 | Principle Component Analysis and K-Means Clustering
%
%  Instructions
%  ------------
%
%  This file contains code that helps you get started on the
%  exercise. You will need to complete the following functions:
%
%     pca.m
%     projectData.m
%     recoverData.m
%     computeCentroids.m
%     findClosestCentroids.m
%     kMeansInitCentroids.m
%
%  For this exercise, you will not need to change any code in this file,
%  or any other files other than those mentioned above.
%

%% Initialization
clear ; close all; clc

%% ================= Part 1: Find Closest Centroids ====================
%  To help you implement K-Means, we have divided the learning algorithm 
%  into two functions -- findClosestCentroids and computeCentroids. In this
%  part, you shoudl complete the code in the findClosestCentroids function. 
%
fprintf('Finding closest centroids.\n\n');

% Load an example dataset that we will be using
load('ex7data2.mat');

% Select an initial set of centroids
K = 3; % 3 Centroids
initial_centroids = [3 3; 6 2; 8 5];

% Find the closest centroids for the examples using the
% initial_centroids
idx = findClosestCentroids(X, initial_centroids);

fprintf('Closest centroids for the first 3 examples: \n')
fprintf(' %d', idx(1:3));
fprintf('\n(the closest centroids should be 1, 3, 2 respectively)\n');

fprintf('Program paused. Press enter to continue.\n');
pause;

%% ===================== Part 2: Compute Means =========================
%  After implementing the closest centroids function, you should now
%  complete the computeCentroids function.
%
fprintf('\nComputing centroids means.\n\n');

%  Compute means based on the closest centroids found in the previous part.
centroids = computeCentroids(X, idx, K);

fprintf('Centroids computed after initial finding of closest centroids: \n')
fprintf(' %f %f \n' , centroids');
fprintf('\n(the centroids should be\n');
fprintf('   [ 2.428301 3.157924 ]\n');
fprintf('   [ 5.813503 2.633656 ]\n');
fprintf('   [ 7.119387 3.616684 ]\n\n');

fprintf('Program paused. Press enter to continue.\n');
pause;


%% =================== Part 3: K-Means Clustering ======================
%  After you have completed the two functions computeCentroids and
%  findClosestCentroids, you have all the necessary pieces to run the
%  kMeans algorithm. In this part, you will run the K-Means algorithm on
%  the example dataset we have provided. 
%
fprintf('\nRunning K-Means clustering on example dataset.\n\n');

% Load an example dataset
load('ex7data2.mat');

% Settings for running K-Means
K = 3;
max_iters = 10;

% For consistency, here we set centroids to specific values
% but in practice you want to generate them automatically, such as by
% settings them to be random examples (as can be seen in
% kMeansInitCentroids).
initial_centroids = [3 3; 6 2; 8 5];

% Run K-Means algorithm. The 'true' at the end tells our function to plot
% the progress of K-Means
[centroids, idx] = runkMeans(X, initial_centroids, max_iters, true);
fprintf('\nK-Means Done.\n\n');

fprintf('Program paused. Press enter to continue.\n');
pause;

%% ============= Part 4: K-Means Clustering on Pixels ===============
fprintf('\nRunning K-Means clustering on pixels from an image.\n\n');
%  Load an image of a bird
A = double(imread('bird_small.png'));
A = A / 255; % Divide by 255 so that all values are in the range 0 - 1

% 圖檔為128行,128列,每個像素RGB三種顔色,每個顔色1Byte = 8bit,共128*128*24bits
img_size = size(A);

%原圖A為img_size(1)行,img_size(2)列,每個像素點的顔色由RGB三種表示,每種8bit共3位元組,故為img_size(1)*img_size(2)*3
%由于我們要使用K-means,是以我們要把行和列鋪平,成為一個響亮,成為一個長度為img_size(1)*img_size(2)的向量,每個元素有RGB三種共3位元組
%把圖像鋪平,這樣每個元素即為一個輸入x,他有RGB三個次元
X = reshape(A, img_size(1) * img_size(2), 3);

%我們的目的是把RGB共256*256*256種顔色壓縮成16種顔色,這16種顔色是通過K均值算法計算出來的
%假如不壓縮,原圖為128*128*3Byte = 128*128*24bit = 393216bits;
%如果壓縮成16種顔色,那麼隻要4bit表示顔色的種類,然後再記錄用到的這16種顔色的RGB表示16*24bits...
%共16*24 +%128*128*4 = 65920bit8,圖像壓縮了将近6倍
K = 16; 
max_iters = 10;
%初始化中心
initial_centroids = kMeansInitCentroids(X, K);
%運作K均值算法
[centroids, idx] = runkMeans(X, initial_centroids, max_iters);

fprintf('Program paused. Press enter to continue.\n');
pause;


%% ================= Part 5: Image Compression ======================
fprintf('\nApplying K-Means to compress an image.\n\n');
% Find closest cluster members
idx = findClosestCentroids(X, centroids);

% Essentially, now we have represented the image X as in terms of the
% indices in idx. 

% We can now recover the image from the indices (idx) by mapping each pixel
% (specified by it's index in idx) to the centroid value
X_recovered = centroids(idx,:);

% Reshape the recovered image into proper dimensions
%本實驗本身并未壓圖檔,最後隻是把各點顔色用那16種代替了而已,但是提供的是一種壓縮圖檔的思想
X_recovered = reshape(X_recovered, img_size(1), img_size(2), 3);

% Display the original image 
subplot(1, 2, 1);
imagesc(A); 
title('Original');

% Display compressed image side by side
subplot(1, 2, 2);
imagesc(X_recovered)
title(sprintf('Compressed, with %d colors.', K));

fprintf('Program paused. Press enter to continue.\n');
pause;

           

        Part1:Find Closest Centroids---利用ex7data2.mat和目前中心,計算的每個樣本離每個中心的距離,将他們分為最近的中心的類别中

        Part2:Compute Means---利用第一部分的到的新的分類,計算每個新的分類的中心

        Part3:K-Means Clustering---利用K均值算法進行聚類,并畫出每次聚類的類别變化過程和新中心的轉變過程

        Part4:K-Means Clustering Pixels---利用K均值算法對圖像進行聚類分析,找到16中使用最多的顔色】

        Park5:Image Compressing---在part4得到的16種顔色的基礎上,對圖像進行壓縮。這裡實際上并未對圖像進行壓縮,而是把圖檔各顔色換成16中顔色内與之相近的顔色。這裡隻是給我們提供這種圖檔壓縮的方法,并未最終實作

2.kMeansInitCentroids.m的實作

function centroids = kMeansInitCentroids(X, K)
%KMEANSINITCENTROIDS This function initializes K centroids that are to be 
%used in K-Means on the dataset X
%   centroids = KMEANSINITCENTROIDS(X, K) returns K initial centroids to be
%   used with the K-Means on the dataset X
%
% You should return this values correctly
centroids = zeros(K, size(X, 2));
% ====================== YOUR CODE HERE ======================
% Instructions: You should set centroids to randomly chosen examples from
%               the dataset X
%
%初始化中心centroids,從X中随機取K行作為初始化中心
randidx = randperm(size(X,1));            %打亂X的行,列不變
centroids = X(randidx(1:K),:);            %從打亂的X中取前K個作為初始化中心
% =============================================================
end

           

         初始化中心時,是随機選取訓練樣本X中的K個作為初始化中心,是以先打亂X,然後取前K個即可。

3.findcloestCentroids.m的實作

function idx = findClosestCentroids(X, centroids)
%FINDCLOSESTCENTROIDS computes the centroid memberships for every example
%   idx = FINDCLOSESTCENTROIDS (X, centroids) returns the closest centroids
%   in idx for a dataset X where each row is a single example. idx = m x 1 
%   vector of centroid assignments (i.e. each entry in range [1..K])
%
% Set K
K = size(centroids, 1);
% You need to return the following variables correctly.
idx = zeros(size(X,1), 1);
% ====================== YOUR CODE HERE ======================
% Instructions: Go over every example, find its closest centroid, and store
%               the index inside idx at the appropriate location.
%               Concretely, idx(i) should contain the index of the centroid
%               closest to example i. Hence, it should be a value in the 
%               range 1..K
%
% Note: You can use a for-loop over the examples to compute this.
%
temp = zeros(K,1);                                           %存儲樣本x離各個中心距離的距離,友善求解該x離哪個點最近
for i = 1:size(X,1),                                         %對X的每個樣本進行周遊
    for j = 1:K,                                             %在進行x(i)時候,計算他離每個中心的距離,存儲在temp中
        temp(j) = sum((X(i,:) -  centroids(j,:)).^2);        
        [value,idx(i)] = min(temp,[],1);                     %計算temp中最小值的行号,就是x(i)距離最近的中心标号
    end
end
% =============================================================
end
           

4.computeCentroids.m的實作

function centroids = computeCentroids(X, idx, K)
%COMPUTECENTROIDS returs the new centroids by computing the means of the 
%data points assigned to each centroid.
%   centroids = COMPUTECENTROIDS(X, idx, K) returns the new centroids by 
%   computing the means of the data points assigned to each centroid. It is
%   given a dataset X where each row is a single data point, a vector
%   idx of centroid assignments (i.e. each entry in range [1..K]) for each
%   example, and K, the number of centroids. You should return a matrix
%   centroids, where each row of centroids is the mean of the data points
%   assigned to it.
%
% Useful variables
[m n] = size(X);
% You need to return the following variables correctly.
centroids = zeros(K, n);
% ====================== YOUR CODE HERE ======================
% Instructions: Go over every centroid and compute mean of all points that
%               belong to it. Concretely, the row vector centroids(i, :)
%               should contain the mean of the data points assigned to
%               centroid i.
%
% Note: You can use a for-loop over the centroids to compute this.
%
for i = 1:K,                                              %對每個中心周遊,一個一個計算
    centroids(i,:) = (X' * (idx == i)) / sum(idx == i);   %矩陣的方法,idx == i的意思是是idx向量的元素為i的位置置1,不為i的置0;
                                                          %然後乘以X’就是把對應中心i的X值加起來,最後除以sum及求平均
                                                          %這裡實際上可以再采用一個for循環計算centroids,但是向量的方法更快,故采取向量的方法
end
% =============================================================
end

           

        這裡對每個類别進行周遊(共K類),然後對每類計算他的中心。對每個類别進行周遊時候需要一個FOR循環;在計算每個類别的中心時也可以采取一個for循環,但是這樣太慢,可以采取向量的方法加快計算速度。

        向量的方法即利用(idx==i)得到一個m*1的向量,當idx對應的位置為i時,此向量對應位置為1,否則為0。

FROM:http://blog.csdn.net/a1015553840/article/details/50877623

繼續閱讀