
Spark MLlib Isotonic Regression



Original sequence: <9, 10, 14>

Result sequence: <9, 10, 14>


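Original sequence: <9, 14, 10>

Result sequence: <9, 12, 12>

Analysis: scanning forward from 9, an order violation appears at 14 (14 > 10). Scanning stops and the procedure switches to absorbing elements. Absorbing 10 gives the subsequence <14, 10>, whose mean is 12, so <14, 10> is replaced by <12, 12>. Since 10 was the last element, processing is complete.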

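Original sequence: <14, 9, 10, 15>

Result sequence: <11, 11, 11, 15>

Analysis: scanning forward from 14, an order violation appears at 9 (14 > 9). Scanning stops and the procedure switches to absorbing elements. Absorbing 9 gives the subsequence <14, 9>, whose mean is 11.5. Because 11.5 is greater than the next element to absorb, 10, that element is absorbed as well, giving <14, 9, 10>, whose mean is 11. Because 11 is less than the next element, 15, absorption stops and <14, 9, 10> is replaced by <11, 11, 11>.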
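The pooling steps walked through above can be sketched in plain Scala, without Spark. This is only a minimal illustration: the names `PavSketch` and `pav` are hypothetical, all weights are assumed to be 1, and the code uses a stack-based variant that pools backwards whenever a new block's mean drops below the previous block's mean, rather than the forward absorption described in the text. For the three sequences above it yields the same results.

```scala
// Minimal sketch of pooling adjacent violators with unit weights.
// `PavSketch` and `pav` are illustrative names, not part of Spark MLlib.
object PavSketch {
  // A block remembers the sum and the number of the elements pooled into it.
  private final case class Block(sum: Double, count: Int) {
    def mean: Double = sum / count
    def merge(other: Block): Block = Block(sum + other.sum, count + other.count)
  }

  def pav(values: Seq[Double]): Seq[Double] = {
    // Blocks are kept in reverse order: the head is the most recent block.
    val blocks = values.foldLeft(List.empty[Block]) { (acc, v) =>
      var current = Block(v, 1)
      var rest = acc
      // Pool backwards while the previous block's mean exceeds the current mean.
      while (rest.nonEmpty && rest.head.mean > current.mean) {
        current = rest.head.merge(current)
        rest = rest.tail
      }
      current :: rest
    }
    // Expand every block back into `count` copies of its mean.
    blocks.reverse.flatMap(b => Seq.fill(b.count)(b.mean))
  }
}
```

For example, `PavSketch.pav(Seq(14.0, 9.0, 10.0, 15.0))` pools <14, 9> into a block with mean 11.5, then absorbs 10 to reach mean 11, and leaves 15 untouched.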

package com.immooc.spark

import org.apache.log4j.{Level, Logger}
import org.apache.spark.mllib.regression.IsotonicRegression
import org.apache.spark.{SparkConf, SparkContext}

object Isotonic_Regression {

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Isotonic_Regression").setMaster("local[2]")
    val sc = new SparkContext(conf)

    // Each input line is "label,feature"; build (label, feature, weight) tuples with weight 1.0.
    val data = sc.textFile("file:///Users/walle/Documents/D3/sparkmlib/sample_isotonic_regression_data.txt")
    val parsedData = data.map { line =>
      val parts = line.split(',').map(_.toDouble)
      (parts(0), parts(1), 1.0)
    }

    // Split the data into training (60%) and test (40%) sets.
    val splits = parsedData.randomSplit(Array(0.6, 0.4), seed = 11L)
    val training = splits(0)
    val test = splits(1)

    // Fit a non-decreasing (isotonic) model on the training set.
    val model = new IsotonicRegression().setIsotonic(true).run(training)

    // Print the fitted boundaries and the prediction at each boundary.
    val x = model.boundaries
    val y = model.predictions
    println("boundaries" + "\t" + "predictions")
    for (i <- x.indices) {
      println(s"${x(i)}\t${y(i)}")
    }

    // Predict from the feature (point._2) and pair it with the true label (point._1).
    val predictionAndLabel = test.map { point =>
      val predictedLabel = model.predict(point._2)
      (predictedLabel, point._1)
    }

    val print_predict = predictionAndLabel.collect
    println("prediction" + "\t" + "label")
    for (i <- print_predict.indices) {
      println(s"${print_predict(i)._1}\t${print_predict(i)._2}")
    }

    // Mean squared error between predictions and labels on the test set.
    val meanSquaredError = predictionAndLabel.map { case (p, l) => math.pow(p - l, 2) }.mean()
    println("Mean Squared Error = " + meanSquaredError)

    sc.stop()
  }
}

3. Result

boundaries predictions

0.01 0.168689444

0.17 0.168689444

0.18 0.19545421571428573

0.27 0.19545421571428573

0.28 0.20040796

0.3 0.43396226

0.31 0.5275369700000001

0.32 0.5275369700000001

0.35 0.54156043

0.36 0.5602243760000001

0.41 0.5602243760000001

0.44 0.5936596972222222

0.76 0.5936596972222222

0.77 0.64762876

0.79 0.6807751133333334

0.82 0.6807751133333334

0.83 0.73890872

0.84 0.73992861

0.86 0.89673636

0.87 0.89673636

0.9 0.93115757

0.95 0.93472718

1.0 0.93472718

prediction label

0.168689444 0.31208567

0.168689444 0.35900051

0.168689444 0.03926568

0.168689444 0.12952575

0.168689444 0.0

0.168689444 0.01376849

0.168689444 0.13105558

0.19545421571428573 0.13717491

0.19545421571428573 0.19020908

0.19545421571428573 0.19581846

0.31718510999999966 0.29576747

0.5322114566666667 0.4854666

0.5368859433333334 0.49209587

0.5602243760000001 0.5017848

0.5713694830740741 0.58286588

0.5825145901481482 0.64660887

0.5936596972222222 0.65782764

0.5936596972222222 0.63029067

0.5936596972222222 0.65323814

0.5936596972222222 0.67006629

0.5936596972222222 0.51555329

0.5936596972222222 0.33299337

0.5936596972222222 0.36206017

0.5936596972222222 0.4309026

0.5936596972222222 0.48393677

0.5936596972222222 0.48495665

0.5936596972222222 0.4518103

0.5936596972222222 0.47118817

0.5936596972222222 0.58031617

0.5936596972222222 0.55481897

0.5936596972222222 0.56603774

0.6642019366666667 0.66241713

0.6807751133333334 0.65119837

0.818332485 0.84242733

0.9082100966666666 0.90719021

0.9196838333333334 0.93115757

0.931871492 0.91942886

0.932585414 0.9291178

0.933299336 0.95665477

0.934013258 0.9500255

0.93472718 0.89801122

0.93472718 0.90311066

0.93472718 0.9036206

Mean Squared Error = 0.008860256490591361
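Several values in the prediction column, such as 0.31718510999999966, do not appear in the predictions list. Spark's documentation states that a feature falling between two boundaries is predicted by linear interpolation, while features outside the boundary range get the first or last prediction. The sketch below reproduces that rule in plain Scala. `PredictSketch` and `predict` are hypothetical names, and this is an illustration of the documented rule, not Spark's own implementation. The test feature 0.29 is an inference from the output, since the test features are not printed.

```scala
import java.util.Arrays

// Illustrative sketch of the prediction rule; not Spark's implementation.
object PredictSketch {
  // `boundaries` must be sorted ascending and have the same length as `predictions`.
  def predict(boundaries: Array[Double], predictions: Array[Double], x: Double): Double = {
    val idx = Arrays.binarySearch(boundaries, x)
    if (idx >= 0) {
      predictions(idx) // x is exactly a boundary
    } else {
      val insertAt = -idx - 1 // index of the first boundary greater than x
      if (insertAt == 0) predictions.head // below the first boundary
      else if (insertAt == boundaries.length) predictions.last // above the last boundary
      else {
        // Linear interpolation between the two neighbouring boundaries.
        val (x1, x2) = (boundaries(insertAt - 1), boundaries(insertAt))
        val (y1, y2) = (predictions(insertAt - 1), predictions(insertAt))
        y1 + (y2 - y1) * (x - x1) / (x2 - x1)
      }
    }
  }
}
```

With the boundaries 0.28 and 0.3 and their predictions 0.20040796 and 0.43396226 from the output above, `PredictSketch.predict(Array(0.28, 0.3), Array(0.20040796, 0.43396226), 0.29)` returns approximately 0.31718511, the midpoint of the two predictions.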


