
【Machine Learning】The K-means Algorithm and a C Implementation

Clustering is an unsupervised classification method: the samples carry no predefined class labels, so they must be grouped automatically according to the distances or similarities between them. Clustering algorithms can be divided into partition-based methods, connectivity-based methods, density-based methods, probabilistic-model-based methods, and so on; K-means belongs to the partition-based family.

1. Basic Principle

A partition-based clustering method divides the vector space formed by the sample set into k regions \{S_i\}_{i=1}^{k}, each of which has an associated representative \{c_i\}_{i=1}^{k}, the region center. For every sample x, a mapping from the sample to a region center can be defined:

q(x) = \sum_{i=1}^{k} c_i \, l(x \in S_i)

where l(\cdot) is the indicator function. According to the mapping q(x), each sample is assigned to its corresponding center c_i, which yields the final partition.

Partition-based clustering algorithms differ mainly in how they construct the mapping q(x). In the classical K-means algorithm, the mapping is determined by the criterion of minimizing the sum of squared distances between the samples and the cluster centers.

Suppose we have a sample set X = \{x_1, x_2, \ldots, x_n\}. The goal of the K-means algorithm is to partition the data set into k (k < n) classes S = \{S_1, S_2, \ldots, S_k\} such that the within-class sum of squared distances of the resulting k subsets is minimal:

\arg\min_{S} J(S) = \arg\min_{S} \sum_{i=1}^{k} \sum_{x_j \in S_i} \lVert x_j - c_i \rVert^2

where

c_i = \frac{1}{|S_i|} \sum_{x_j \in S_i} x_j

Minimizing the objective J(S) is an NP-hard problem, and no stable globally optimal solution can be guaranteed. The classical algorithm therefore adopts an iterative optimization strategy, which efficiently finds a locally optimal solution of the objective.

The algorithm proceeds as follows:

Step 1: Initialize the cluster centers c_1^{(1)}, \ldots, c_k^{(1)}, for example by taking the first k samples of the set or by choosing k samples at random.

Step 2: Assign every sample x_p to the nearest cluster, according to

S_i^{(t)} = \{ x_p : \lVert x_p - c_i^{(t)} \rVert^2 \le \lVert x_p - c_j^{(t)} \rVert^2 \}

where i = 1, 2, \ldots, k and j \ne i.

Step 3: Using the assignment from Step 2, update each cluster center:

c_i^{(t+1)} = \frac{1}{|S_i^{(t)}|} \sum_{x_j \in S_i^{(t)}} x_j

Step 4: If the maximum number of iterations has been reached, or the change between two successive iterations falls below a preset threshold \delta, i.e.

\lVert c_i^{(t+1)} - c_i^{(t)} \rVert < \delta

then the iteration terminates; otherwise repeat from Step 2.

Steps 2 and 3 respectively reassign the samples and recompute the cluster centers; by optimizing the objective J(S) across the iterations, the within-class sum of squared distances is driven to a (local) minimum.

2. Optimizations of the K-means Algorithm

2.1 Better initialization of the cluster centers

K-means is sensitive to the initialization of the cluster centers: different initial values can produce different clustering results. This is because K-means only finds an approximate local optimum of the objective J(S) and cannot guarantee a global optimum, so under certain data distributions the result can deviate strongly depending on the initialization.

The K-means++ algorithm is an improved variant that generates effective initial cluster centers. First, one cluster center c_1 is initialized at random. Then the following step is iterated: for every sample x, compute the probability

p(x) = \frac{D(x)^2}{\sum_{x' \in X} D(x')^2}

where D(x) is the distance from x to the nearest already-chosen center, and add the sample with the largest value as the next cluster center:

c_{m+1} = \arg\max_{x \in X} p(x)

This repeats until k centers have been selected.

The seeding step of K-means++ has computational complexity O(knd), so it adds little overhead, while making the algorithm approximate the optimal solution far more reliably.
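The seeding rule above can be sketched as follows. Note this is the deterministic "largest p(x)" variant described in the text; the standard K-means++ instead samples the next center randomly with probability proportional to D(x)^2. The function name `seed_centers` is my own:

```c
#include <math.h>

#define sqr(x) ((x)*(x))

/* K-means++-style seeding: start from one center, then repeatedly add
 * the sample whose squared distance D(x)^2 to its nearest already-chosen
 * center is largest, until k centers are chosen. */
void seed_centers(int dim, int n, int k, const double *X, double *centers)
  {
   // first center: take sample 0 (the text allows a random choice)
    for (int d = 0; d < dim; d++)
      centers[d] = X[d];

    for (int c = 1; c < k; c++)
      {
        int best = 0;
        double best_D = -1.0;
        for (int i = 0; i < n; i++)
          {
           // D(x_i)^2 = squared distance to the nearest chosen center
            double D = INFINITY;
            for (int j = 0; j < c; j++)
              {
                double d2 = 0;
                for (int d = 0; d < dim; d++)
                  d2 += sqr(X[i*dim + d] - centers[j*dim + d]);
                if (d2 < D) D = d2;
              }
            if (D > best_D) { best_D = D; best = i; }
          }
        for (int d = 0; d < dim; d++)
          centers[c*dim + d] = X[best*dim + d];
      }
  }
```

For example, on the 1-D samples {0, 1, 9, 10} with k = 2, the first center is 0 and the second becomes the farthest sample, 10, so the two natural groups each start with a nearby center.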

2.2 Adaptive selection of the number of clusters

In the classical K-means algorithm, the number of clusters k is fixed in advance; the algorithm cannot adapt the number of classes on its own, yet this setting largely determines the quality of the clustering.

The ISODATA algorithm follows the same basic principle as K-means, clustering by minimizing the sum of squared distances, but introduces a merge-and-split mechanism during the iteration. In each iteration, ISODATA first clusters with the number of classes held fixed, then merges clusters according to a preset distance threshold between them, and decides whether to split a class S_i based on the covariance matrix of its samples. The price is that ISODATA runs considerably slower than K-means.

Appendix: C implementation of the K-means algorithm

#include <math.h>
#include <stdlib.h>
#include <stdio.h>

#define sqr(x) ((x)*(x))

#define MAX_CLUSTERS 16

#define MAX_ITERATIONS 100

#define BIG_double (INFINITY)

void fail(char *str)
  {
    printf("%s\n", str);  // never pass an arbitrary string as the format argument
    exit(-1);
  }
  
double calc_distance(int dim, double *p1, double *p2)
  {
    double distance_sq_sum = 0;
    
    for (int ii = 0; ii < dim; ii++)
      distance_sq_sum += sqr(p1[ii] - p2[ii]);

    return distance_sq_sum;
    
  }

void calc_all_distances(int dim, int n, int k, double *X, double *centroid, double *distance_output)
  {
    for (int ii = 0; ii < n; ii++) // for each point
      for (int jj = 0; jj < k; jj++) // for each cluster
        {
         // calculate distance between point and cluster centroid
          distance_output[ii*k + jj] = calc_distance(dim, &X[ii*dim], &centroid[jj*dim]);
        }
  }
  
double calc_total_distance(int dim, int n, int k, double *X, double *centroids, int *cluster_assignment_index)
 // NOTE: a point with cluster assignment -1 is ignored
  {
    double tot_D = 0;
    
   // for every point
    for (int ii = 0; ii < n; ii++)
      {
       // which cluster is it in?
        int active_cluster = cluster_assignment_index[ii];
        
       // sum distance
        if (active_cluster != -1)
          tot_D += calc_distance(dim, &X[ii*dim], &centroids[active_cluster*dim]);
      }
      
    return tot_D;
  }

void choose_all_clusters_from_distances(int dim, int n, int k, double *distance_array, int *cluster_assignment_index)
  {
   // for each point
    for (int ii = 0; ii < n; ii++)
      {
        int best_index = -1;
        double closest_distance = BIG_double;
        
       // for each cluster
        for (int jj = 0; jj < k; jj++)
          {
           // distance between point and cluster centroid
           
            double cur_distance = distance_array[ii*k + jj];
            if (cur_distance < closest_distance)
              {
                best_index = jj;
                closest_distance = cur_distance;
              }
          }

       // record in array
        cluster_assignment_index[ii] = best_index;
      }
  }

void calc_cluster_centroids(int dim, int n, int k, double *X, int *cluster_assignment_index, double *new_cluster_centroid)
  {
    int cluster_member_count[MAX_CLUSTERS];
  
   // initialize cluster centroid coordinate sums to zero
    for (int ii = 0; ii < k; ii++) 
      {
        cluster_member_count[ii] = 0;
        
        for (int jj = 0; jj < dim; jj++)
          new_cluster_centroid[ii*dim + jj] = 0;
     }

   // sum all points
   // for every point
    for (int ii = 0; ii < n; ii++)
      {
       // which cluster is it in?
        int active_cluster = cluster_assignment_index[ii];

       // update count of members in that cluster
        cluster_member_count[active_cluster]++;
        
       // sum point coordinates for finding centroid
        for (int jj = 0; jj < dim; jj++)
          new_cluster_centroid[active_cluster*dim + jj] += X[ii*dim + jj];
      }
      
      
   // now divide each coordinate sum by number of members to find mean/centroid
   // for each cluster
    for (int ii = 0; ii < k; ii++) 
      {
        if (cluster_member_count[ii] == 0)
          {
            printf("WARNING: Empty cluster %d! \n", ii);
            continue;  // skip the division: an empty cluster keeps its zeroed centroid
          }

       // for each dimension
        for (int jj = 0; jj < dim; jj++)
          new_cluster_centroid[ii*dim + jj] /= cluster_member_count[ii];
      }
  }

void get_cluster_member_count(int n, int k, int *cluster_assignment_index, int *cluster_member_count)
  {
   // initialize cluster member counts
    for (int ii = 0; ii < k; ii++) 
      cluster_member_count[ii] = 0;
  
   // count members of each cluster    
    for (int ii = 0; ii < n; ii++)
      cluster_member_count[cluster_assignment_index[ii]]++;
  }

void update_delta_score_table(int dim, int n, int k, double *X, int *cluster_assignment_cur, double *cluster_centroid, int *cluster_member_count, double *point_move_score_table, int cc)
  {
   // for every point (both in and not in the cluster)
    for (int ii = 0; ii < n; ii++)
      {
        double dist_sum = 0;
        for (int kk = 0; kk < dim; kk++)
          {
            double axis_dist = X[ii*dim + kk] - cluster_centroid[cc*dim + kk]; 
            dist_sum += sqr(axis_dist);
          }
          
        double mult = ((double)cluster_member_count[cc] / (cluster_member_count[cc] + ((cluster_assignment_cur[ii]==cc) ? -1 : +1)));

        point_move_score_table[ii*k + cc] = dist_sum * mult;  // table is n x k
      }
  }
  
  
void  perform_move(int dim, int n, int k, double *X, int *cluster_assignment, double *cluster_centroid, int *cluster_member_count, int move_point, int move_target_cluster)
  {
    int cluster_old = cluster_assignment[move_point];
    int cluster_new = move_target_cluster;
  
   // update cluster assignment array
    cluster_assignment[move_point] = cluster_new;
    
   // update cluster count array
    cluster_member_count[cluster_old]--;
    cluster_member_count[cluster_new]++;
    
    if (cluster_member_count[cluster_old] <= 1)
      printf("WARNING: Can't handle single-member clusters! \n");
    
   // update centroid array
    for (int ii = 0; ii < dim; ii++)
      {
        cluster_centroid[cluster_old*dim + ii] -= (X[move_point*dim + ii] - cluster_centroid[cluster_old*dim + ii]) / cluster_member_count[cluster_old];
        cluster_centroid[cluster_new*dim + ii] += (X[move_point*dim + ii] - cluster_centroid[cluster_new*dim + ii]) / cluster_member_count[cluster_new];
      }
  }  
  
void cluster_diag(int dim, int n, int k, double *X, int *cluster_assignment_index, double *cluster_centroid)
  {
    int cluster_member_count[MAX_CLUSTERS];
    
    get_cluster_member_count(n, k, cluster_assignment_index, cluster_member_count);
     
    printf("  Final clusters \n");
    for (int ii = 0; ii < k; ii++) 
      printf("    cluster %d:     members: %8d, centroid (%.1f %.1f) \n", ii, cluster_member_count[ii], cluster_centroid[ii*dim + 0], cluster_centroid[ii*dim + 1]);
  }

void copy_assignment_array(int n, int *src, int *tgt)
  {
    for (int ii = 0; ii < n; ii++)
      tgt[ii] = src[ii];
  }
  
int assignment_change_count(int n, int a[], int b[])
  {
    int change_count = 0;

    for (int ii = 0; ii < n; ii++)
      if (a[ii] != b[ii])
        change_count++;
        
    return change_count;
  }

void kmeans(
            int    dim,                       // dimension of data
            double *X,                        // pointer to data
            int    n,                         // number of elements
            int    k,                         // number of clusters
            double *cluster_centroid,         // initial cluster centroids
            int    *cluster_assignment_final  // output
           )
  {
    double *dist                    = (double *)malloc(sizeof(double) * n * k);
    int   *cluster_assignment_cur  = (int *)malloc(sizeof(int) * n);
    int   *cluster_assignment_prev = (int *)malloc(sizeof(int) * n);
    double *point_move_score        = (double *)malloc(sizeof(double) * n * k);
    
    
    if (!dist || !cluster_assignment_cur || !cluster_assignment_prev || !point_move_score)
      fail("Error allocating dist arrays");
    
   // initial setup  
    calc_all_distances(dim, n, k, X, cluster_centroid, dist);
    choose_all_clusters_from_distances(dim, n, k, dist, cluster_assignment_cur);
    copy_assignment_array(n, cluster_assignment_cur, cluster_assignment_prev);

   // BATCH UPDATE
    double prev_totD = BIG_double;
    int batch_iteration = 0;
    while (batch_iteration < MAX_ITERATIONS)
      {
//        printf("batch iteration %d \n", batch_iteration);
//        cluster_diag(dim, n, k, X, cluster_assignment_cur, cluster_centroid);
        
        // update cluster centroids
         calc_cluster_centroids(dim, n, k, X, cluster_assignment_cur, cluster_centroid);

        // deal with empty clusters
        // XXXXXXXXXXXXXX

        // see if we've failed to improve
         double totD = calc_total_distance(dim, n, k, X, cluster_centroid, cluster_assignment_cur);
         if (totD > prev_totD)
          // failed to improve - currently solution worse than previous
           {
            // restore old assignments
             copy_assignment_array(n, cluster_assignment_prev, cluster_assignment_cur);
             
            // recalc centroids
             calc_cluster_centroids(dim, n, k, X, cluster_assignment_cur, cluster_centroid);
             
             printf("  negative progress made on this step - iteration completed (%.2f) \n", totD - prev_totD);
             
            // done with this phase
             break;
           }
           
        // save previous step
         copy_assignment_array(n, cluster_assignment_cur, cluster_assignment_prev);
         
        // move all points to nearest cluster
         calc_all_distances(dim, n, k, X, cluster_centroid, dist);
         choose_all_clusters_from_distances(dim, n, k, dist, cluster_assignment_cur);
         
         int change_count = assignment_change_count(n, cluster_assignment_cur, cluster_assignment_prev);
         
         printf("%3d   %u   %9d  %16.2f %17.2f\n", batch_iteration, 1, change_count, totD, totD - prev_totD);
         fflush(stdout);
         
        // done with this phase if nothing has changed
         if (change_count == 0)
           {
             printf("  no change made on this step - iteration completed \n");
             break;
           }

         prev_totD = totD;
                        
         batch_iteration++;
      }

   // write to output array
    copy_assignment_array(n, cluster_assignment_cur, cluster_assignment_final);    
    
    free(dist);
    free(cluster_assignment_cur);
    free(cluster_assignment_prev);
    free(point_move_score);
  }  
           

2017.11.17
