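Spark MLlib KMeans Clustering Algorithm

1.1 KMeans Clustering Algorithm

1.1.1 Basic Theory

The basic idea of the KMeans algorithm is to start from K randomly chosen cluster centers and assign each sample point to the cluster of its nearest center. The centroid of each cluster is then recomputed as the mean of its points, which gives the new cluster center. This is iterated until the distance each center moves falls below a given threshold.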

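The K-Means clustering algorithm consists of three main steps:

(1) Choose initial cluster centers for the points to be clustered;

(2) Compute the distance from each point to every cluster center and assign each point to the cluster whose center is nearest;

(3) Compute the coordinate-wise mean of all points in each cluster and use that mean as the new cluster center.

Steps (2) and (3) are repeated until the cluster centers no longer move significantly or the number of iterations reaches the configured limit.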
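The three steps above can be sketched in plain Scala without Spark. This is a minimal illustration and not MLlib's implementation: the object name LloydSketch, its helper names, and the choice of passing the initial centers in explicitly (rather than picking them at random) are assumptions made for the example.

```scala
object LloydSketch {
  type Point = Array[Double]

  // Squared Euclidean distance between two points of equal dimension.
  def sqDist(a: Point, b: Point): Double =
    a.indices.map(i => (a(i) - b(i)) * (a(i) - b(i))).sum

  // Step (2): index of the center closest to p.
  def closest(centers: Array[Point], p: Point): Int =
    centers.indices.minBy(j => sqDist(centers(j), p))

  // Step (3): coordinate-wise mean of a non-empty group of points.
  def mean(points: Seq[Point]): Point = {
    val dims = points.head.length
    Array.tabulate(dims)(d => points.map(_(d)).sum / points.size)
  }

  // Repeats steps (2) and (3) until no center moves by more than epsilon
  // or maxIterations is reached. init plays the role of step (1).
  def run(data: Seq[Point], init: Array[Point], maxIterations: Int, epsilon: Double): Array[Point] = {
    var centers = init.map(_.clone)
    var iteration = 0
    var moved = true
    while (iteration < maxIterations && moved) {
      val groups = data.groupBy(p => closest(centers, p))
      // A center that attracted no points keeps its previous position.
      val updated = centers.indices.map { j =>
        groups.get(j).map(mean).getOrElse(centers(j))
      }.toArray
      moved = centers.indices.exists(j => sqDist(centers(j), updated(j)) > epsilon * epsilon)
      centers = updated
      iteration += 1
    }
    centers
  }

  def main(args: Array[String]): Unit = {
    val data = Seq(Array(0.0, 0.0), Array(0.2, 0.2), Array(9.0, 9.0), Array(9.2, 9.2))
    val centers = run(data, Array(data(0), data(2)), maxIterations = 20, epsilon = 1e-4)
    centers.foreach(c => println(c.mkString("(", ", ", ")")))
  }
}
```

The stopping test compares squared distances against epsilon * epsilon, which mirrors the comparison made in the MLlib source discussed in section 1.2 and avoids taking a square root.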

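1.1.2 Process Demonstration

The figure below shows K-means clustering applied to n sample points, with k = 2:

(a) The initial, unclustered point set;

(b) Two points are chosen at random as cluster centers;

(c) Compute the distance from each point to the cluster centers and assign each point to the cluster whose center is nearest;

(d) Compute the coordinate-wise mean of all points in each cluster and use that mean as the new cluster center;

(e) Repeat (c): compute the distance from each point to the cluster centers and assign each point to the cluster whose center is nearest;

(f) Repeat (d): compute the coordinate-wise mean of all points in each cluster and use that mean as the new cluster center.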

[Figure: panels (a) to (f) of K-means clustering on n sample points with k = 2]

See the following document:

http://blog.sina.com.cn/s/blog_62186b46010145ne.html

1.2 Spark MLlib KMeans Source Code Analysis

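class KMeans private (
    private var k: Int,
    private var maxIterations: Int,
    private var runs: Int,
    private var initializationMode: String,
    private var initializationSteps: Int,
    private var epsilon: Double,
    private var seed: Long) extends Serializable with Logging {

  // KMeans class parameters:
  // k: number of clusters, default 2; maxIterations: maximum number of iterations, default 20;
  // runs: number of parallel runs of the algorithm, default 1;
  // initializationMode: initialization algorithm, default "k-means||";
  // initializationSteps: number of steps in the k-means|| initialization, default 5;
  // epsilon: threshold on the distance a center moves, default 1e-4; seed: random seed.
  def this() = this(2, 20, 1, KMeans.K_MEANS_PARALLEL, 5, 1e-4, Utils.random.nextLong())

  // Parameter setters
  def setK(k: Int): this.type = {
    this.k = k
    this
  }

  // ** Setter code for the remaining parameters omitted **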

  // run method: the main entry point of KMeans
  def run(data: RDD[Vector]): KMeansModel = {

    if (data.getStorageLevel == StorageLevel.NONE) {
      logWarning("The input data is not directly cached, which may hurt performance if its"
        + " parent RDDs are also uncached.")
    }

    // Compute squared norms and cache them.
    // Compute the L2 norm of each row, transforming data[Vector] => data[(Vector, norm)],
    // where norm is the L2 norm of the vector, i.e. sqrt(x_1^2 + x_2^2 + ... + x_n^2).
    val norms = data.map(Vectors.norm(_, 2.0))
    norms.persist()
    val zippedData = data.zip(norms).map { case (v, norm) =>
      new VectorWithNorm(v, norm)
    }
    val model = runAlgorithm(zippedData)
    norms.unpersist()

    // Warn at the end of the run as well, for increased visibility.
    if (data.getStorageLevel == StorageLevel.NONE) {
      logWarning("The input data was not directly cached, which may hurt performance if its"
        + " parent RDDs are also uncached.")
    }
    model
  }

  // runAlgorithm method: the implementation of KMeans.
  private def runAlgorithm(data: RDD[VectorWithNorm]): KMeansModel = {

    val sc = data.sparkContext

    val initStartTime = System.nanoTime()

    val centers = if (initializationMode == KMeans.RANDOM) {
      initRandom(data)
    } else {
      initKMeansParallel(data)
    }

    val initTimeInSeconds = (System.nanoTime() - initStartTime) / 1e9
    logInfo(s"Initialization with $initializationMode took " + "%.3f".format(initTimeInSeconds) +
      " seconds.")

    val active = Array.fill(runs)(true)
    val costs = Array.fill(runs)(0.0)

    var activeRuns = new ArrayBuffer[Int] ++ (0 until runs)
    var iteration = 0

    val iterationStartTime = System.nanoTime()

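    // KMeans iteration: determine which center each sample belongs to, accumulate the sample
    // values and counts per center, then recompute each center from all of its samples and
    // compare it with its previous value to decide whether the run has finished.
    // runs is the number of parallel runs.
    // Execute iterations of Lloyd's algorithm until all runs have converged
    while (iteration < maxIterations && !activeRuns.isEmpty) {
      type WeightedPoint = (Vector, Long)
      def mergeContribs(x: WeightedPoint, y: WeightedPoint): WeightedPoint = {
        axpy(1.0, x._1, y._1)
        (y._1, x._2 + y._2)
      }

      val activeCenters = activeRuns.map(r => centers(r)).toArray
      val costAccums = activeRuns.map(_ => sc.accumulator(0.0))

      val bcActiveCenters = sc.broadcast(activeCenters)

      // Find the sum and count of points mapping to each center
      // Collect the samples belonging to each center and accumulate their sum per center.
      // runs is the number of parallel runs, k the number of centers, sums the accumulated
      // sample values per center, and counts the number of samples per center.
      // contribs is ((run i, center j), (sum of the samples of center j, count of the samples of center j)).
      // findClosest finds, among all cluster centers, the one nearest to the point.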

      val totalContribs = data.mapPartitions { points =>
        val thisActiveCenters = bcActiveCenters.value
        val runs = thisActiveCenters.length
        val k = thisActiveCenters(0).length
        val dims = thisActiveCenters(0)(0).vector.size

        val sums = Array.fill(runs, k)(Vectors.zeros(dims))
        val counts = Array.fill(runs, k)(0L)

        points.foreach { point =>
          (0 until runs).foreach { i =>
            val (bestCenter, cost) = KMeans.findClosest(thisActiveCenters(i), point)
            costAccums(i) += cost
            val sum = sums(i)(bestCenter)
            axpy(1.0, point.vector, sum)
            counts(i)(bestCenter) += 1
          }
        }

        val contribs = for (i <- 0 until runs; j <- 0 until k) yield {
          ((i, j), (sums(i)(j), counts(i)(j)))
        }
        contribs.iterator
      }.reduceByKey(mergeContribs).collectAsMap()

      // Update the centers: new center = sum / count.
      // Check whether the squared distance between newCenter and the old center exceeds
      // epsilon * epsilon.
      // Update the cluster centers and costs for each active run
      for ((run, i) <- activeRuns.zipWithIndex) {
        var changed = false
        var j = 0
        while (j < k) {
          val (sum, count) = totalContribs((i, j))
          if (count != 0) {
            scal(1.0 / count, sum)
            val newCenter = new VectorWithNorm(sum)
            if (KMeans.fastSquaredDistance(newCenter, centers(run)(j)) > epsilon * epsilon) {
              changed = true
            }
            centers(run)(j) = newCenter
          }
          j += 1
        }
        if (!changed) {
          active(run) = false
          logInfo("Run " + run + " finished in " + (iteration + 1) + " iterations")
        }
        costs(run) = costAccums(i).value
      }

      activeRuns = activeRuns.filter(active(_))
      iteration += 1
    }

    val iterationTimeInSeconds = (System.nanoTime() - iterationStartTime) / 1e9
    logInfo(s"Iterations took " + "%.3f".format(iterationTimeInSeconds) + " seconds.")

    if (iteration == maxIterations) {
      logInfo(s"KMeans reached the max number of iterations: $maxIterations.")
    } else {
      logInfo(s"KMeans converged in $iteration iterations.")
    }

    val (minCost, bestRun) = costs.zipWithIndex.min

    logInfo(s"The cost for the best run is $minCost.")

    new KMeansModel(centers(bestRun).map(_.vector))
  }
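The accumulate-and-merge pattern inside runAlgorithm can be imitated with ordinary Scala collections. The sketch below is a simplification under stated assumptions: it handles a single run, models partitions as a Seq of Seq, uses Array[Double] in place of Vector, and assumes at least one partition. The names ContribSketch, partitionContribs, merge and updateCenters are hypothetical and do not appear in MLlib.

```scala
object ContribSketch {
  type Point = Array[Double]
  type WeightedPoint = (Array[Double], Long)

  def sqDist(a: Point, b: Point): Double =
    a.indices.map(i => (a(i) - b(i)) * (a(i) - b(i))).sum

  // Per-partition pass (the role of mapPartitions): one (sum, count) pair per center index.
  def partitionContribs(points: Seq[Point], centers: Array[Point]): Seq[(Int, WeightedPoint)] = {
    val dims = centers(0).length
    val sums = Array.fill(centers.length)(new Array[Double](dims))
    val counts = new Array[Long](centers.length)
    points.foreach { p =>
      val best = centers.indices.minBy(j => sqDist(centers(j), p))
      for (d <- 0 until dims) sums(best)(d) += p(d)
      counts(best) += 1
    }
    centers.indices.map(j => (j, (sums(j), counts(j))))
  }

  // Counterpart of mergeContribs: add the vector sums and the counts.
  // Unlike the axpy-based original, this allocates a new array instead of mutating y.
  def merge(x: WeightedPoint, y: WeightedPoint): WeightedPoint =
    (x._1.zip(y._1).map { case (a, b) => a + b }, x._2 + y._2)

  // Merge across partitions (the role of reduceByKey), then divide each sum by its count.
  def updateCenters(partitions: Seq[Seq[Point]], centers: Array[Point]): Array[Point] = {
    val total = partitions
      .flatMap(partitionContribs(_, centers))
      .groupBy(_._1)
      .map { case (j, contribs) => j -> contribs.map(_._2).reduce(merge) }
    centers.indices.map { j =>
      val (sum, count) = total(j)
      // A center with no assigned points is left unchanged, as in the source above.
      if (count != 0) sum.map(_ / count) else centers(j)
    }.toArray
  }
}
```

Sending only one (sum, count) pair per center out of each partition keeps the amount of data merged across partitions proportional to the number of centers, not to the number of points.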

  // findClosest method: finds, among all cluster centers, the one nearest to the point.
  private[mllib] def findClosest(
      centers: TraversableOnce[VectorWithNorm],
      point: VectorWithNorm): (Int, Double) = {
    var bestDistance = Double.PositiveInfinity
    var bestIndex = 0
    var i = 0
    centers.foreach { center =>
      // Since `\|a - b\| \geq |\|a\| - \|b\||`, we can use this lower bound to avoid unnecessary
      // distance computation.
      var lowerBoundOfSqDist = center.norm - point.norm
      lowerBoundOfSqDist = lowerBoundOfSqDist * lowerBoundOfSqDist
      if (lowerBoundOfSqDist < bestDistance) {
        val distance: Double = fastSquaredDistance(center, point)
        if (distance < bestDistance) {
          bestDistance = distance
          bestIndex = i
        }
      }
      i += 1
    }
    (bestIndex, bestDistance)
  }

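In the findClosest method:

    var lowerBoundOfSqDist = center.norm - point.norm
    lowerBoundOfSqDist = lowerBoundOfSqDist * lowerBoundOfSqDist

If the center is (a1, b1) and the point to be assigned is (a2, b2), then lowerBoundOfSqDist is:

$$\text{lowerBoundOfSqDist} = \left(\sqrt{a_1^2 + b_1^2} - \sqrt{a_2^2 + b_2^2}\right)^2$$

The expansions are given below. The first is lowerBoundOfSqDist; the second is the real Euclidean distance with the square root left out. (When searching for the shortest distance there is no need to take the square root, because comparing the expressions under the root is enough, and that is what MLlib does.)

$$\left(\sqrt{a_1^2 + b_1^2} - \sqrt{a_2^2 + b_2^2}\right)^2 = a_1^2 + b_1^2 + a_2^2 + b_2^2 - 2\sqrt{(a_1^2 + b_1^2)(a_2^2 + b_2^2)}$$

$$(a_1 - a_2)^2 + (b_1 - b_2)^2 = a_1^2 + b_1^2 + a_2^2 + b_2^2 - 2(a_1 a_2 + b_1 b_2)$$

It is easy to show that the first expression is less than or equal to the second, since by the Cauchy-Schwarz inequality $a_1 a_2 + b_1 b_2 \le \sqrt{(a_1^2 + b_1^2)(a_2^2 + b_2^2)}$. So when comparing distances, the cheap lowerBoundOfSqDist is computed first. If lowerBoundOfSqDist is already not smaller than the smallest distance found so far, bestDistance, the real squared Euclidean distance cannot be smaller than bestDistance either, so the full distance does not need to be computed in that case, which saves a lot of work.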
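The pruning effect can be observed with a small stand-alone sketch. It assumes plain Array[Double] vectors and a local case class named VectorWithNorm that only imitates MLlib's private class of the same name; the third element of the returned tuple (the number of full distance computations) is an addition for illustration and is not part of MLlib's findClosest.

```scala
object LowerBoundSketch {
  // Local stand-in for MLlib's VectorWithNorm: a vector paired with its precomputed L2 norm.
  final case class VectorWithNorm(vector: Array[Double], norm: Double)

  def withNorm(v: Array[Double]): VectorWithNorm =
    VectorWithNorm(v, math.sqrt(v.map(x => x * x).sum))

  def sqDist(a: Array[Double], b: Array[Double]): Double =
    a.indices.map(i => (a(i) - b(i)) * (a(i) - b(i))).sum

  // Returns (bestIndex, bestDistance, number of full distance computations).
  def findClosest(centers: Seq[VectorWithNorm], point: VectorWithNorm): (Int, Double, Int) = {
    var bestDistance = Double.PositiveInfinity
    var bestIndex = 0
    var computed = 0
    centers.zipWithIndex.foreach { case (center, i) =>
      // (|a| - |b|)^2 is a lower bound on |a - b|^2, so it can rule a center out cheaply.
      val normDiff = center.norm - point.norm
      val lowerBound = normDiff * normDiff
      if (lowerBound < bestDistance) {
        computed += 1
        val d = sqDist(center.vector, point.vector)
        if (d < bestDistance) { bestDistance = d; bestIndex = i }
      }
    }
    (bestIndex, bestDistance, computed)
  }

  def main(args: Array[String]): Unit = {
    val centers = Seq(Array(1.0, 1.0), Array(9.0, 9.0), Array(20.0, 20.0)).map(withNorm)
    val (idx, dist, computed) = findClosest(centers, withNorm(Array(1.2, 0.9)))
    println(s"closest = $idx, squared distance = $dist, full computations = $computed")
  }
}
```

In this example the first center is close to the point, so the norm-based bound rules out the other two centers without a full distance computation. How much is saved in practice depends on the order in which centers are visited and on how much the norms differ.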

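If lowerBoundOfSqDist is smaller than bestDistance, the distance is computed by calling fastSquaredDistance, which in turn calls the fastSquaredDistance method in MLUtils.scala to compute the real squared Euclidean distance. The code is as follows: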

  private[mllib] def fastSquaredDistance(
      v1: Vector,
      norm1: Double,
      v2: Vector,
      norm2: Double,
      precision: Double = 1e-6): Double = {
    val n = v1.size
    require(v2.size == n)
    require(norm1 >= 0.0 && norm2 >= 0.0)
    val sumSquaredNorm = norm1 * norm1 + norm2 * norm2
    val normDiff = norm1 - norm2
    var sqDist = 0.0
    val precisionBound1 = 2.0 * EPSILON * sumSquaredNorm / (normDiff * normDiff + EPSILON)
    if (precisionBound1 < precision) {
      sqDist = sumSquaredNorm - 2.0 * dot(v1, v2)
    } else if (v1.isInstanceOf[SparseVector] || v2.isInstanceOf[SparseVector]) {
      val dotValue = dot(v1, v2)
      sqDist = math.max(sumSquaredNorm - 2.0 * dotValue, 0.0)
      val precisionBound2 = EPSILON * (sumSquaredNorm + 2.0 * math.abs(dotValue)) /
        (sqDist + EPSILON)
      if (precisionBound2 > precision) {
        sqDist = Vectors.sqdist(v1, v2)
      }
    } else {
      sqDist = Vectors.sqdist(v1, v2)
    }
    sqDist
  }

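The fastSquaredDistance method first computes a precision bound, val precisionBound1 = 2.0 * EPSILON * sumSquaredNorm / (normDiff * normDiff + EPSILON). If the precision requirement is met, the squared Euclidean distance is sqDist = sumSquaredNorm - 2.0 * v1.dot(v2), where sumSquaredNorm is

$$a_1^2 + b_1^2 + a_2^2 + b_2^2$$

and 2.0 * v1.dot(v2) is

$$2(a_1 a_2 + b_1 b_2)$$

This is the benefit of having computed the norms in advance. If the precision requirement is not met, the original distance formula

$$(a_1 - a_2)^2 + (b_1 - b_2)^2$$

is used instead, i.e. Vectors.sqdist(v1, v2) is called. When either vector is sparse, the dot-product result is tried first and checked against a second bound, precisionBound2, before falling back.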
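The identity behind the fast path, ‖v1 − v2‖² = ‖v1‖² + ‖v2‖² − 2 v1·v2, and the fallback to the direct formula can be tried out in a few lines of plain Scala. This is a sketch of the dense case only, under stated assumptions: Array[Double] stands in for Vector, the sparse branch is left out, and Epsilon is taken to be math.ulp(1.0) as a stand-in for the machine epsilon constant EPSILON used in the listing. The object and helper names are hypothetical.

```scala
object FastSqDistSketch {
  // Assumed stand-in for EPSILON in the listing above: machine epsilon for Double.
  val Epsilon: Double = math.ulp(1.0)

  def dot(a: Array[Double], b: Array[Double]): Double =
    a.indices.map(i => a(i) * b(i)).sum

  def norm(a: Array[Double]): Double = math.sqrt(dot(a, a))

  // Direct formula: sum of squared coordinate differences.
  def directSqDist(a: Array[Double], b: Array[Double]): Double =
    a.indices.map(i => (a(i) - b(i)) * (a(i) - b(i))).sum

  // Uses the precomputed norms when the estimated relative error is small enough,
  // otherwise falls back to the direct formula.
  def fastSqDist(
      v1: Array[Double],
      norm1: Double,
      v2: Array[Double],
      norm2: Double,
      precision: Double = 1e-6): Double = {
    require(v1.length == v2.length)
    require(norm1 >= 0.0 && norm2 >= 0.0)
    val sumSquaredNorm = norm1 * norm1 + norm2 * norm2
    val normDiff = norm1 - norm2
    val precisionBound = 2.0 * Epsilon * sumSquaredNorm / (normDiff * normDiff + Epsilon)
    if (precisionBound < precision) sumSquaredNorm - 2.0 * dot(v1, v2)
    else directSqDist(v1, v2)
  }

  def main(args: Array[String]): Unit = {
    val a = Array(9.0, 9.0, 9.0)
    val b = Array(9.1, 9.1, 9.1)
    println(fastSqDist(a, norm(a), b, norm(b))) // norms differ: fast path
    val c = Array(1.0, 0.0)
    val d = Array(0.0, 1.0)
    println(fastSqDist(c, norm(c), d, norm(d))) // equal norms: falls back to the direct formula
  }
}
```

When the two norms are nearly equal, the bound becomes large and the subtraction in the fast path could lose precision, which is why the direct formula is used in that case.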

1.3 MLlib KMeans Example

1. Data

The data format is: feature1 feature2 feature3

0.0 0.0 0.0

0.1 0.1 0.1

0.2 0.2 0.2

9.0 9.0 9.0

9.1 9.1 9.1

9.2 9.2 9.2

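2. Code

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// 1. Read the sample data (sc is an existing SparkContext, e.g. the one provided by spark-shell)
val data_path = "/home/jb-huangmeiling/kmeans_data.txt"
val data = sc.textFile(data_path)
val examples = data.map { line =>
  Vectors.dense(line.split(' ').map(_.toDouble))
}.cache()
val numExamples = examples.count()
println(s"numExamples = $numExamples.")

// 2. Build the model
val k = 2
val maxIterations = 20
val runs = 2
val initializationMode = "k-means||"
val model = KMeans.train(examples, k, maxIterations, runs, initializationMode)

// 3. Compute the cost (sum of squared distances from each point to its nearest center)
val cost = model.computeCost(examples)
println(s"Total cost = $cost.")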
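To see what the cost measures, the same quantity can be computed by hand for the six sample points. The sketch below does not call Spark. It assumes the two centers end up at the means of the two obvious groups, (0.1, 0.1, 0.1) and (9.1, 9.1, 9.1), which is an assumption about the trained model and not output taken from an actual run. The object name CostSketch is hypothetical.

```scala
object CostSketch {
  def sqDist(a: Array[Double], b: Array[Double]): Double =
    a.indices.map(i => (a(i) - b(i)) * (a(i) - b(i))).sum

  // Sum over all points of the squared distance to the nearest center.
  def computeCost(data: Seq[Array[Double]], centers: Seq[Array[Double]]): Double =
    data.map(p => centers.map(c => sqDist(c, p)).min).sum

  // The six sample rows from the data section.
  val data: Seq[Array[Double]] =
    Seq(0.0, 0.1, 0.2, 9.0, 9.1, 9.2).map(x => Array(x, x, x))

  // Assumed centers: the means of the two groups of three points.
  val assumedCenters: Seq[Array[Double]] =
    Seq(Array(0.1, 0.1, 0.1), Array(9.1, 9.1, 9.1))

  def main(args: Array[String]): Unit =
    println(s"Total cost = ${computeCost(data, assumedCenters)}")
}
```

With these assumed centers, each group contributes two points at a squared distance of about 0.03 and one point at distance 0, so the total is roughly 0.12, up to floating-point rounding.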
