
[Spark Advanced]--A Brief Note on map and flatMap

1. A Worked Example

First, an example. Create an RDD from two lines of input:

val rdd = sc.parallelize(Seq("Roses are red", "Violets are blue"))  // lines

rdd.collect

    res0: Array[String] = Array("Roses are red", "Violets are blue")      

Now, map transforms an RDD of length N into another RDD of the same length N.

For example, map can compute the length of each of the two lines:

rdd.map(_.length).collect

    res1: Array[Int] = Array(13, 16)      

flatMap, by contrast, maps an RDD of length N to N collections, and then flattens those collections into a single resulting RDD:

rdd.flatMap(_.split(" ")).collect

    res2: Array[String] = Array("Roses", "are", "red", "Violets", "are", "blue")      

Each line contains several words, and there are several lines, yet we end up with one flat output array of words.
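To make the contrast concrete, the same two calls can be run on plain Scala collections, which share map/flatMap semantics with RDDs. This is a sketch that needs no SparkContext; `Seq` stands in for the RDD:

```scala
object MapVsFlatMap {
  def main(args: Array[String]): Unit = {
    val lines = Seq("Roses are red", "Violets are blue")

    // map: one output element per input element, so splitting
    // each line yields a nested structure (a Seq of Seqs)
    val nested: Seq[Seq[String]] = lines.map(_.split(" ").toSeq)
    println(nested) // List(List(Roses, are, red), List(Violets, are, blue))

    // flatMap: the per-line collections are flattened into one Seq
    val words: Seq[String] = lines.flatMap(_.split(" "))
    println(words)  // List(Roses, are, red, Violets, are, blue)
  }
}
```

Note that `lines.map(_.split(" "))` alone would give you a nested result; flatMap is what collapses the per-line arrays into one sequence of words.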

2. Summary

(1) The difference between the two transformations map and flatMap, as seen in Spark's RDD source:

/**
 * Return a new RDD by applying a function to all elements of this RDD.
 */
def map[U: ClassTag](f: T => U): RDD[U] = withScope {
  val cleanF = sc.clean(f)
  new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.map(cleanF))
}      
/**
 *  Return a new RDD by first applying a function to all elements of this
 *  RDD, and then flattening the results.
 */
def flatMap[U: ClassTag](f: T => TraversableOnce[U]): RDD[U] = withScope {
  val cleanF = sc.clean(f)
  new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.flatMap(cleanF))
}      

(2) In short, as transformation functions:

map: One element in -> one element out.

flatMap: One element in -> 0 or more elements out (a collection).
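Because flatMap may emit zero elements for a given input, it can also act as a combined map-and-filter. A small sketch using Scala's `toIntOption` (available since Scala 2.13) on plain collections:

```scala
object FlatMapAsFilter {
  def main(args: Array[String]): Unit = {
    val tokens = Seq("1", "two", "3")

    // toIntOption yields Some(n) for parseable strings and None otherwise;
    // flatMap drops the Nones and unwraps the Somes in one pass
    val parsed: Seq[Int] = tokens.flatMap(_.toIntOption)
    println(parsed)  // List(1, 3)

    // map cannot drop elements: it produces exactly one Option per input
    val options: Seq[Option[Int]] = tokens.map(_.toIntOption)
    println(options) // List(Some(1), None, Some(3))
  }
}
```

The same pattern works on an RDD, since `Option` is accepted where `flatMap` expects a `TraversableOnce[U]`.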

Reference:

https://stackoverflow.com/questions/22350722/what-is-the-difference-between-map-and-flatmap-and-a-good-use-case-for-each