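Spark MLlib KMeans Clustering Algorithm

1.1 The KMeans Clustering Algorithm

1.1.1 Basic Theory

The basic idea of KMeans is to start from K randomly chosen cluster centers and assign each sample point to its nearest center. The centroid of each cluster is then recomputed as the mean of its points, which gives the new cluster center. This is repeated until the centers move by less than a given threshold.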
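In symbols, one iteration can be sketched as follows. The notation is introduced here for illustration only and does not come from the original text: $x_i$ are the sample points, $\mu_j$ the K cluster centers, $c_i$ the index of the center assigned to $x_i$, $C_j = \{ i : c_i = j \}$ the set of points assigned to center $j$, and $\varepsilon$ the movement threshold.

```latex
c_i = \arg\min_{1 \le j \le K} \lVert x_i - \mu_j \rVert^2
\qquad
\mu_j^{\text{new}} = \frac{1}{|C_j|} \sum_{i \in C_j} x_i
\qquad
\text{stop when } \max_j \lVert \mu_j^{\text{new}} - \mu_j \rVert < \varepsilon
```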

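The K-Means clustering algorithm consists of three main steps:

(1) Choose initial cluster centers for the points to be clustered;

(2) Compute the distance from each point to every cluster center and assign each point to the cluster whose center is nearest;

(3) Compute the mean of the coordinates of all points in each cluster and use that mean as the new cluster center.

Repeat (2) and (3) until the cluster centers no longer move significantly or the maximum number of iterations is reached.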
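The three steps can be sketched in plain Scala on an in-memory collection. This is a minimal single-machine illustration, not the MLlib implementation; the names LloydSketch, closest, mean and lloyd are hypothetical, and the initial centers are simply passed in by the caller instead of being chosen at random.

```scala
// Minimal single-machine sketch of steps (1) to (3); not the MLlib code.
object LloydSketch {
  type Point = Array[Double]

  // Squared Euclidean distance between two points of equal dimension.
  def sqDist(a: Point, b: Point): Double =
    a.indices.map(i => (a(i) - b(i)) * (a(i) - b(i))).sum

  // Step 2: index of the center nearest to p.
  def closest(centers: Seq[Point], p: Point): Int =
    centers.indices.minBy(j => sqDist(centers(j), p))

  // Step 3: coordinate-wise mean of a non-empty group of points.
  def mean(points: Seq[Point]): Point =
    points.head.indices.map(i => points.map(_(i)).sum / points.size).toArray

  // `initial` plays the role of step 1. Steps 2 and 3 repeat until no center
  // moves farther than epsilon or maxIterations is reached.
  def lloyd(points: Seq[Point], initial: Seq[Point],
            maxIterations: Int = 20, epsilon: Double = 1e-4): Seq[Point] = {
    var centers = initial
    var iteration = 0
    var changed = true
    while (iteration < maxIterations && changed) {
      val groups = points.groupBy(p => closest(centers, p))
      // A center that attracts no points keeps its old position.
      val updated = centers.indices.map { j =>
        groups.get(j).map(mean).getOrElse(centers(j))
      }
      changed = centers.indices.exists { j =>
        sqDist(centers(j), updated(j)) > epsilon * epsilon
      }
      centers = updated
      iteration += 1
    }
    centers
  }
}
```

For instance, with the six points (x, x, x) for x in 0.0, 0.1, 0.2, 9.0, 9.1, 9.2 and the first and fourth point as initial centers, LloydSketch.lloyd should return centers close to (0.1, 0.1, 0.1) and (9.1, 9.1, 9.1).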

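1.1.2 Worked Demonstration

The figure below shows K-means clustering of n sample points, with k = 2:

(a) The initial, unclustered set of points;

(b) Two points are chosen at random as cluster centers;

(c) The distance from each point to each cluster center is computed, and each point is assigned to the nearest cluster;

(d) The mean of the coordinates of all points in each cluster is computed and becomes the new cluster center;

(e) Step (c) is repeated: distances are recomputed and each point is assigned to the nearest cluster;

(f) Step (d) is repeated: the mean of each cluster is recomputed and becomes the new cluster center.

[Figure: K-means clustering of n sample points with k = 2, panels (a) to (f)]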

See the following document for reference:

http://blog.sina.com.cn/s/blog_62186b46010145ne.html

1.2 Spark MLlib KMeans Source Code Analysis

class KMeans private (

    private var k: Int,

    private var maxIterations: Int,

    private var runs: Int,

    private var initializationMode: String,

    private var initializationSteps: Int,

    private var epsilon: Double,

    private var seed: Long) extends Serializable with Logging {

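// KMeans class parameters:

// k: number of clusters, default 2; maxIterations: maximum number of iterations, default 20; runs: number of runs executed in parallel, default 1;

// initializationMode: initialization algorithm, default "k-means||"; initializationSteps: number of steps in the k-means|| initialization, default 5; epsilon: convergence threshold on center movement, default 1e-4; seed: random seed.

  def this() = this(2, 20, 1, KMeans.K_MEANS_PARALLEL, 5, 1e-4, Utils.random.nextLong())

// Parameter setters

  def setK(k: Int): this.type = {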

    this.k = k

    this

  }

// (setters for the remaining parameters omitted)

// run method: the main entry point of KMeans

  def run(data: RDD[Vector]): KMeansModel = {

    if (data.getStorageLevel == StorageLevel.NONE) {

      logWarning("The input data is not directly cached, which may hurt performance if its"

        + " parent RDDs are also uncached.")

    }

// Compute squared norms and cache them.

// Compute the L2 norm of each row and convert data[Vector] => data[(Vector, norm)], where norm is the L2 norm of the vector: norm = sqrt(x_1^2 + x_2^2 + ... + x_n^2).

    val norms = data.map(Vectors.norm(_, 2.0))

    norms.persist()

    val zippedData = data.zip(norms).map { case (v, norm) =>

      new VectorWithNorm(v, norm)

    }

    val model = runAlgorithm(zippedData)

    norms.unpersist()

    // Warn at the end of the run as well, for increased visibility.

    if (data.getStorageLevel == StorageLevel.NONE) {

      logWarning("The input data was not directly cached, which may hurt performance if its"

        + " parent RDDs are also uncached.")

    }

    model

  }

// runAlgorithm method: the actual KMeans implementation.

  private def runAlgorithm(data: RDD[VectorWithNorm]): KMeansModel = {

    val sc = data.sparkContext

    val initStartTime = System.nanoTime()

    val centers = if (initializationMode == KMeans.RANDOM) {

      initRandom(data)

    } else {

      initKMeansParallel(data)

    }

    val initTimeInSeconds = (System.nanoTime() - initStartTime) / 1e9

    logInfo(s"Initialization with $initializationMode took " + "%.3f".format(initTimeInSeconds) +

      " seconds.")

    val active = Array.fill(runs)(true)

    val costs = Array.fill(runs)(0.0)

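    var activeRuns = new ArrayBuffer[Int] ++ (0 until runs)

    var iteration = 0

    val iterationStartTime = System.nanoTime()

// KMeans iterations: find which center each sample belongs to, accumulate the sample values and counts per center, then update each center from all of its samples and compare it with its previous value to decide whether the run has converged. Here runs is the number of runs executed in parallel.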

    // Execute iterations of Lloyd's algorithm until all runs have converged

    while (iteration < maxIterations && !activeRuns.isEmpty) {

      type WeightedPoint = (Vector, Long)

      def mergeContribs(x: WeightedPoint, y: WeightedPoint): WeightedPoint = {

        axpy(1.0, x._1, y._1)

        (y._1, x._2 + y._2)

      }

      val activeCenters = activeRuns.map(r => centers(r)).toArray

      val costAccums = activeRuns.map(_ => sc.accumulator(0.0))

      val bcActiveCenters = sc.broadcast(activeCenters)

      // Find the sum and count of points mapping to each center

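// For each center, find the samples that belong to it and accumulate their sum and count;

// runs is the number of parallel runs, k is the number of centers, sums holds the accumulated sample vectors per center, counts holds the number of samples per center;

// contribs has the form ((run i, center j), (sum of the samples of center j, count of the samples of center j));

// findClosest: finds the cluster center nearest to a point;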

      val totalContribs = data.mapPartitions { points =>

        val thisActiveCenters = bcActiveCenters.value

        val runs = thisActiveCenters.length

        val k = thisActiveCenters(0).length

        val dims = thisActiveCenters(0)(0).vector.size

        val sums = Array.fill(runs, k)(Vectors.zeros(dims))

        val counts = Array.fill(runs, k)(0L)

        points.foreach { point =>

          (0 until runs).foreach { i =>

           val (bestCenter, cost) = KMeans.findClosest(thisActiveCenters(i), point)

           costAccums(i) += cost

           val sum = sums(i)(bestCenter)

           axpy(1.0, point.vector, sum)

           counts(i)(bestCenter) += 1

          }

        }

        val contribs = for (i <- 0 until runs; j <- 0 until k) yield {

          ((i, j), (sums(i)(j), counts(i)(j)))

        }

        contribs.iterator

      }.reduceByKey(mergeContribs).collectAsMap()

// Update each center: new center = sum / count;

// then check whether the squared distance between newCenter and the old center exceeds epsilon * epsilon;

      // Update the cluster centers and costs for each active run

      for ((run, i) <- activeRuns.zipWithIndex) {

        var changed = false

        var j = 0

        while (j < k) {

          val (sum, count) = totalContribs((i, j))

          if (count != 0) {

           scal(1.0 / count, sum)

           val newCenter = new VectorWithNorm(sum)

           if (KMeans.fastSquaredDistance(newCenter, centers(run)(j)) > epsilon * epsilon) {

             changed = true

           }

           centers(run)(j) = newCenter

          }

          j += 1

        }

        if (!changed) {

          active(run) = false

          logInfo("Run " + run + " finished in " + (iteration + 1) + " iterations")

        }

        costs(run) = costAccums(i).value

      }

      activeRuns = activeRuns.filter(active(_))

      iteration += 1

    }

    val iterationTimeInSeconds = (System.nanoTime() - iterationStartTime) / 1e9

    logInfo(s"Iterations took " + "%.3f".format(iterationTimeInSeconds) + " seconds.")

    if (iteration == maxIterations) {

      logInfo(s"KMeans reached the max number of iterations: $maxIterations.")

    } else {

      logInfo(s"KMeans converged in $iteration iterations.")

    }

    val (minCost, bestRun) = costs.zipWithIndex.min

    logInfo(s"The cost for the best run is $minCost.")

    new KMeansModel(centers(bestRun).map(_.vector))

  }
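The accumulate-and-merge pattern used in runAlgorithm (per-partition sums and counts keyed by center, merged across partitions, then divided to get the new centers) can be imitated without Spark. The following is a hypothetical single-machine sketch for one run only: partitions are plain nested sequences, the names ContribSketch, partitionContribs and newCenters are made up for illustration, and merge builds new arrays instead of updating one in place with axpy as mergeContribs does.

```scala
// Hypothetical imitation of the (sum, count) accumulation and the
// reduceByKey-style merge; plain Scala collections, no Spark involved.
object ContribSketch {
  type Contrib = (Array[Double], Long) // (coordinate sum, point count)

  // Merge two contributions for the same center (immutable variant).
  def merge(x: Contrib, y: Contrib): Contrib =
    (x._1.zip(y._1).map { case (a, b) => a + b }, x._2 + y._2)

  private def sqDist(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // One partition: assign each point to its nearest center and accumulate
  // a (sum, count) pair per center index.
  def partitionContribs(points: Seq[Array[Double]],
                        centers: Seq[Array[Double]]): Map[Int, Contrib] =
    points
      .map(p => (centers.indices.minBy(j => sqDist(centers(j), p)), (p, 1L)))
      .groupBy(_._1)
      .map { case (j, pairs) => j -> pairs.map(_._2).reduce(merge) }

  // Merge the contributions of all partitions by center index,
  // then divide each sum by its count to obtain the new center.
  def newCenters(partitions: Seq[Seq[Array[Double]]],
                 centers: Seq[Array[Double]]): Map[Int, Array[Double]] =
    partitions
      .flatMap(part => partitionContribs(part, centers).toSeq)
      .groupBy(_._1)
      .map { case (j, kv) =>
        val (sum, count) = kv.map(_._2).reduce(merge)
        j -> sum.map(_ / count)
      }
}
```

Centers that receive no points are simply absent from the resulting map, which mirrors the count != 0 check in the listing above.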

// findClosest method: finds the cluster center nearest to a point;

  private[mllib] def findClosest(

      centers: TraversableOnce[VectorWithNorm],

      point: VectorWithNorm): (Int, Double) = {

    var bestDistance = Double.PositiveInfinity

    var bestIndex = 0

    var i = 0

    centers.foreach { center =>

      // Since `\|a - b\| \geq |\|a\| - \|b\||`, we can use this lower bound to avoid unnecessary

      // distance computation.

      var lowerBoundOfSqDist = center.norm - point.norm

      lowerBoundOfSqDist = lowerBoundOfSqDist * lowerBoundOfSqDist

      if (lowerBoundOfSqDist < bestDistance) {

        val distance: Double = fastSquaredDistance(center, point)

        if (distance < bestDistance) {

          bestDistance = distance

          bestIndex = i

        }

      }

      i += 1

    }

    (bestIndex, bestDistance)

  }

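In findClosest, the two lines

var lowerBoundOfSqDist = center.norm - point.norm

lowerBoundOfSqDist = lowerBoundOfSqDist * lowerBoundOfSqDist

compute a lower bound on the squared distance. If the center is (a1, b1) and the point is (a2, b2), then lowerBoundOfSqDist is:

$$\left(\sqrt{a_1^2 + b_1^2} - \sqrt{a_2^2 + b_2^2}\right)^2$$

Its expansion is the first expression below; the second is the Euclidean distance formula with the square root dropped. (When searching for the nearest center there is no need to take the square root, because comparing the expressions under the root is enough; MLlib does the same.)

$$a_1^2 + b_1^2 + a_2^2 + b_2^2 - 2\sqrt{(a_1^2 + b_1^2)(a_2^2 + b_2^2)}$$

$$(a_1 - a_2)^2 + (b_1 - b_2)^2 = a_1^2 + b_1^2 + a_2^2 + b_2^2 - 2(a_1 a_2 + b_1 b_2)$$

By the Cauchy-Schwarz inequality, $a_1 a_2 + b_1 b_2 \le \sqrt{(a_1^2 + b_1^2)(a_2^2 + b_2^2)}$, so the first expression is always less than or equal to the second. When comparing distances, the cheap lowerBoundOfSqDist is therefore computed first. If even lowerBoundOfSqDist is not smaller than the smallest distance found so far, bestDistance, the true squared Euclidean distance cannot be smaller than bestDistance either, so the full distance computation is skipped, which saves a lot of work.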
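The effect of the lower bound can be observed by counting how many full distance computations are actually performed. The sketch below re-implements the idea in plain Scala under simplifying assumptions: VecWithNorm and withNorm are hypothetical stand-ins for MLlib's VectorWithNorm, the distance is computed directly instead of through fastSquaredDistance, and the extra counter in the result is added only for demonstration.

```scala
// Plain-Scala sketch of the norm-based pruning; not the MLlib code.
object LowerBoundSketch {
  // Hypothetical stand-in for VectorWithNorm: a vector plus its cached L2 norm.
  final case class VecWithNorm(vector: Array[Double], norm: Double)

  def withNorm(v: Array[Double]): VecWithNorm =
    VecWithNorm(v, math.sqrt(v.map(x => x * x).sum))

  def sqDist(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Returns (bestIndex, bestDistance, number of full distance computations).
  def findClosest(centers: Seq[VecWithNorm], point: VecWithNorm): (Int, Double, Int) = {
    var bestDistance = Double.PositiveInfinity
    var bestIndex = 0
    var evaluated = 0
    centers.zipWithIndex.foreach { case (center, i) =>
      val diff = center.norm - point.norm
      val lowerBound = diff * diff // (|c| - |p|)^2 <= |c - p|^2
      if (lowerBound < bestDistance) {
        // Only now pay for the full squared distance.
        evaluated += 1
        val d = sqDist(center.vector, point.vector)
        if (d < bestDistance) {
          bestDistance = d
          bestIndex = i
        }
      }
    }
    (bestIndex, bestDistance, evaluated)
  }
}
```

With centers (0.1, 0.1) and (9.1, 9.1) and the point (0.2, 0.2), only the first center needs a full distance computation: the lower bound for the second center already exceeds the best distance found so far.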

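If lowerBoundOfSqDist is smaller than bestDistance, the distance is computed by calling fastSquaredDistance, which delegates to the fastSquaredDistance method in MLUtils.scala to compute the actual squared Euclidean distance. The code is as follows: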

  private[mllib] def fastSquaredDistance(

      v1: Vector,

      norm1: Double,

      v2: Vector,

      norm2: Double,

      precision: Double = 1e-6): Double = {

    val n = v1.size

    require(v2.size == n)

    require(norm1 >= 0.0 && norm2 >= 0.0)

    val sumSquaredNorm = norm1 * norm1 + norm2 * norm2

    val normDiff = norm1 - norm2

    var sqDist = 0.0

    val precisionBound1 = 2.0 * EPSILON * sumSquaredNorm / (normDiff * normDiff + EPSILON)

    if (precisionBound1 < precision) {

      sqDist = sumSquaredNorm - 2.0 * dot(v1, v2)

    } else if (v1.isInstanceOf[SparseVector] || v2.isInstanceOf[SparseVector]) {

      val dotValue = dot(v1, v2)

      sqDist = math.max(sumSquaredNorm - 2.0 * dotValue, 0.0)

      val precisionBound2 = EPSILON * (sumSquaredNorm + 2.0 * math.abs(dotValue)) /

        (sqDist + EPSILON)

      if (precisionBound2 > precision) {

        sqDist = Vectors.sqdist(v1, v2)

      }

    } else {

      sqDist = Vectors.sqdist(v1, v2)

    }

    sqDist

  }

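fastSquaredDistance first computes a precision bound, val precisionBound1 = 2.0 * EPSILON * sumSquaredNorm / (normDiff * normDiff + EPSILON). If this bound is below the required precision, the squared Euclidean distance is computed as sqDist = sumSquaredNorm - 2.0 * dot(v1, v2). For a center (a1, b1) and a point (a2, b2), sumSquaredNorm is

$$a_1^2 + b_1^2 + a_2^2 + b_2^2$$

and 2.0 * dot(v1, v2) is

$$2(a_1 a_2 + b_1 b_2)$$

This is the benefit of having computed the norms in advance. For sparse vectors, the same expansion is tried first and then checked against a second bound, precisionBound2. If the precision requirement is not met, the method falls back to the original distance formula

$$(a_1 - a_2)^2 + (b_1 - b_2)^2$$

that is, it calls Vectors.sqdist(v1, v2).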
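To see why the precision check matters, the sketch below compares the direct formula with the norm-based expansion in plain Scala. It is an illustration under stated assumptions, not MLlib code: the object and method names are hypothetical, and Epsilon is assumed to be the double-precision machine epsilon (2^-52, obtained here with math.ulp(1.0)).

```scala
// Plain-Scala sketch of the two ways to compute a squared distance
// and of the first precision bound; not the MLlib code.
object FastSqDistSketch {
  // Assumption: EPSILON is the double-precision machine epsilon (2^-52).
  val Epsilon: Double = math.ulp(1.0)

  def dot(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => x * y }.sum

  def norm(a: Array[Double]): Double = math.sqrt(dot(a, a))

  // Direct formula: sum of squared coordinate differences.
  def sqDistDirect(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Expansion |a|^2 + |b|^2 - 2 a.b, cheap when the norms are already cached.
  def sqDistExpanded(a: Array[Double], b: Array[Double]): Double = {
    val (na, nb) = (norm(a), norm(b))
    na * na + nb * nb - 2.0 * dot(a, b)
  }

  // Same expression as precisionBound1 in the listing above.
  def precisionBound1(a: Array[Double], b: Array[Double]): Double = {
    val (na, nb) = (norm(a), norm(b))
    val normDiff = na - nb
    2.0 * Epsilon * (na * na + nb * nb) / (normDiff * normDiff + Epsilon)
  }
}
```

For well-separated vectors such as (1, 0) and (0, 9), the bound is far below the default precision of 1e-6 and both formulas give 82. For two large, nearly identical vectors such as (1e8, 1e8) and (1e8 + 1, 1e8), the bound is far above 1e-6: the expansion subtracts two numbers of magnitude around 4e16 and loses the small difference, whereas the direct formula returns exactly 1. This is the situation in which falling back to the direct formula is worthwhile.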

1.3 MLlib KMeans Example

1. Data

Data format: feature1 feature2 feature3

0.0 0.0 0.0

0.1 0.1 0.1

0.2 0.2 0.2

9.0 9.0 9.0

9.1 9.1 9.1

9.2 9.2 9.2

2. Code

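  // Imports needed by this snippet; sc is assumed to be an existing SparkContext (for example, the one provided by spark-shell)

  import org.apache.spark.mllib.clustering.KMeans

  import org.apache.spark.mllib.linalg.Vectors

  // 1. Read the sample data

  val data_path = "/home/jb-huangmeiling/kmeans_data.txt"

  val data = sc.textFile(data_path)

  val examples = data.map { line =>

    Vectors.dense(line.split(' ').map(_.toDouble))

  }.cache()

  val numExamples = examples.count()

  println(s"numExamples = $numExamples.")

  // 2. Build the model

  val k = 2

  val maxIterations = 20

  val runs = 2

  val initializationMode = "k-means||"

  val model = KMeans.train(examples, k, maxIterations, runs, initializationMode)

  // 3. Compute the cost: the sum of squared distances from each point to its nearest center

  val cost = model.computeCost(examples)

  println(s"Total cost = $cost.")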
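As a sanity check on the reported cost, the expected value for this data set can be worked out without Spark. The sketch below assumes the model converges to the two obvious centers (0.1, 0.1, 0.1) and (9.1, 9.1, 9.1); that is an assumption about the outcome, not something guaranteed by the code above, and the object name ExpectedCost is hypothetical.

```scala
// Plain-Scala check of the cost for the six sample rows, assuming the
// clustering ends with the means of the two visible groups as centers.
object ExpectedCost {
  val points: Seq[Array[Double]] =
    Seq(0.0, 0.1, 0.2, 9.0, 9.1, 9.2).map(x => Array(x, x, x))

  // Assumed final centers (not read from a trained model).
  val centers: Seq[Array[Double]] =
    Seq(Array(0.1, 0.1, 0.1), Array(9.1, 9.1, 9.1))

  def sqDist(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Cost = sum over all points of the squared distance to the nearest center.
  def cost: Double =
    points.map(p => centers.map(c => sqDist(c, p)).min).sum
}
```

Four of the six points lie 0.1 away from their center in each of the three coordinates, contributing 3 * 0.01 = 0.03 each, and the other two coincide with a center, so under this assumption the total cost is about 0.12.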
