
Common Spark Transformation Operators (Part 1)

parallelize

Converts an existing collection into an RDD.

/** Distribute a local Scala collection to form an RDD.
 *
 * @note Parallelize acts lazily. If `seq` is a mutable collection and is altered after the call
 * to parallelize and before the first action on the RDD, the resultant RDD will reflect the
 * modified collection. Pass a copy of the argument to avoid this.
 * @note avoid using `parallelize(Seq())` to create an empty `RDD`. Consider `emptyRDD` for an
 * RDD with no partitions, or `parallelize(Seq[T]())` for an RDD of `T` with empty partitions.
 * @param seq Scala collection to distribute
 * @param numSlices number of partitions to divide the collection into
 * @return RDD representing distributed collection
 */
// The number of partitions can be specified; if omitted, defaultParallelism is used (in local mode this defaults to the number of cores on the machine)
def parallelize[T: ClassTag](
    seq: Seq[T],
    numSlices: Int = defaultParallelism): RDD[T] = withScope {
  assertNotStopped()
  new ParallelCollectionRDD[T](this, seq, numSlices, Map[Int, Seq[String]]())
}
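The first @note above is worth emphasizing: parallelize is lazy and does not copy the data up front, so mutating a mutable collection before the first action changes what the RDD sees. A minimal sketch of that behaviour (assuming a SparkContext named sc is already in scope):

import scala.collection.mutable.ArrayBuffer

val buf = ArrayBuffer(1, 2, 3)
val lazyRdd = sc.parallelize(buf)         // no job runs and no copy is taken yet
buf += 4                                  // mutate before the first action
println(lazyRdd.collect.toBuffer)         // per the note, expected to show ArrayBuffer(1, 2, 3, 4)
val frozen = sc.parallelize(buf.toList)   // pass a copy to freeze the contents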
           

Scala version

println("======================= parallelize-1 ===========================")
val data: RDD[Int] = sc.parallelize(1 to 10)
println(s"分區數量為:${data.getNumPartitions}")
println(s"原始資料為:${data.collect.toBuffer}")

println("======================= parallelize-2 ===========================")
val message: RDD[String] = sc.parallelize(List("hello world", "hello spark", "hello scala"), 2)
println(s"分區數量為:${message.getNumPartitions}")
println(s"原始資料為:${message.collect.toBuffer}")
           

Output

(run result shown as a screenshot)
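One more detail from the scaladoc above: for an empty dataset, emptyRDD is preferred over parallelize(Seq()). A minimal sketch (sc assumed in scope) of the difference:

val empty1 = sc.emptyRDD[Int]              // an RDD with no partitions at all
val empty2 = sc.parallelize(Seq[Int]())    // an RDD[Int] with defaultParallelism empty partitions
println(empty1.getNumPartitions)           // 0
println(empty2.getNumPartitions)           // defaultParallelism, every partition empty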

makeRDD

Converts an existing collection into an RDD.

/** Distribute a local Scala collection to form an RDD.
 *
 * This method is identical to `parallelize`.
 * @param seq Scala collection to distribute
 * @param numSlices number of partitions to divide the collection into
 * @return RDD representing distributed collection
 */
// The first makeRDD overload: it simply delegates to parallelize
def makeRDD[T: ClassTag](
    seq: Seq[T],
    numSlices: Int = defaultParallelism): RDD[T] = withScope {
  parallelize(seq, numSlices)
}

/**
 * Distribute a local Scala collection to form an RDD, with one or more
 * location preferences (hostnames of Spark nodes) for each object.
 * Create a new partition for each collection item.
 * @param seq list of tuples of data and location preferences (hostnames of Spark nodes)
 * @return RDD representing data partitioned according to location preferences
 */
// The second makeRDD overload: attaches location preferences to the data; the partition count cannot be chosen and is fixed at math.max(seq.size, 1)
def makeRDD[T: ClassTag](seq: Seq[(T, Seq[String])]): RDD[T] = withScope {
  assertNotStopped()
  val indexToPrefs = seq.zipWithIndex.map(t => (t._2, t._1._2)).toMap
  new ParallelCollectionRDD[T](this, seq.map(_._1), math.max(seq.size, 1), indexToPrefs)
}
           

Scala version

println("======================= makeRDD-1 ===========================")
val data: RDD[Int] = sc.makeRDD(1 to 10)
println(s"分區數量為:${data.getNumPartitions}")
println(s"原始資料為:${data.collect.toBuffer}")

println("======================= makeRDD-2 ===========================")
val message: RDD[String] = sc.makeRDD(List("hello world", "hello spark", "hello scala"), 2)
println(s"分區數量為:${message.getNumPartitions}")
println(s"原始資料為:${message.collect.toBuffer}")

println("======================= makeRDD-3 ===========================")
val info: RDD[Int] = sc.makeRDD(List((1, List("aa", "bb")), (2, List("cc", "dd")), (3, List("ee", "ff"))))
println(s"分區數量為:${info.getNumPartitions}")
println(s"原始資料為:${info.collect.toBuffer}")
println(s"第一分區的資料為:${info.preferredLocations(info.partitions(0))}")
           

Output

(run result shown as a screenshot)
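To spell out the behaviour of the second overload: it creates one partition per element of seq, and each element's Seq[String] becomes that partition's preferred locations. A minimal sketch (sc assumed in scope; the hostnames are made up for illustration):

val located = sc.makeRDD(Seq((1, Seq("host-a")), (2, Seq("host-b"))))
println(located.getNumPartitions)          // 2, i.e. math.max(seq.size, 1)
located.partitions.foreach { p =>
  println(s"partition ${p.index} -> ${located.preferredLocations(p)}")
}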

textFile

Creates an RDD by reading data from an external source.

/**
 * Read a text file from HDFS, a local file system (available on all nodes), or any
 * Hadoop-supported file system URI, and return it as an RDD of Strings.
 * @param path path to the text file on a supported file system
 * @param minPartitions suggested minimum number of partitions for the resulting RDD
 * @return RDD of lines of the text file
 */
// Two parameters, one of which has a default value: minPartitions defaults to defaultMinPartitions = math.min(defaultParallelism, 2), so the default is at most 2
def textFile(
    path: String,
    minPartitions: Int = defaultMinPartitions): RDD[String] = withScope {
  assertNotStopped()
  hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
    minPartitions).map(pair => pair._2.toString).setName(path)
}
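For reference, the default value of minPartitions comes from SparkContext.defaultMinPartitions, which is capped at 2 rather than using the full defaultParallelism; it is also only a suggestion, since the actual number of partitions is decided by the Hadoop InputFormat's splits:

def defaultMinPartitions: Int = math.min(defaultParallelism, 2)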
           

Scala version

println("======================= textFile-1 ===========================")
val data: RDD[String] = sc.textFile("src/main/data/textFile.txt")
println(s"分區數量為:${data.getNumPartitions}")
println(s"原始資料為:${data.collect.toBuffer}")

println("======================= textFile-2 ===========================")
val message: RDD[String] = sc.textFile("src/main/data/textFile.txt", 3)
println(s"分區數量為:${message.getNumPartitions}")
println(s"原始資料為:${message.collect.toBuffer}")
           

Output

(run result shown as a screenshot)
