1、举例说明
先看一下例子,输入2行数据:
val rdd = sc.parallelize(Seq("Roses are red", "Violets are blue")) // lines
rdd.collect
res0: Array[String] = Array("Roses are red", "Violets are blue")
现在,使用map将一个长度为N的RDD转换成另一个长度为N的RDD。
例如,map操作计算这两行的长度:
rdd.map(_.length).collect
res1: Array[Int] = Array(13, 16)
再看flatMap将一个RDD的长度N转化为N collection的集合,然后趋于平缓的到一个RDD的结果。
rdd.flatMap(_.split(" ")).collect
res2: Array[String] = Array("Roses", "are", "red", "Violets", "are", "blue")
每行,我们有多个单词和多行,但是我们最终得到一个输出数组的单词。
2、总结
(1)flatMap 和 map 两个 transforms 操作集合的区别
/**
* Return a new RDD by applying a function to all elements of this RDD.
*/
def map[U: ClassTag](f: T => U): RDD[U] = withScope {
val cleanF = sc.clean(f)
new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.map(cleanF))
}
/**
* Return a new RDD by first applying a function to all elements of this
* RDD, and then flattening the results.
*/
def flatMap[U: ClassTag](f: T => TraversableOnce[U]): RDD[U] = withScope {
val cleanF = sc.clean(f)
new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.flatMap(cleanF))
}
(2)转换函数:
map: One element in -> one element out.
flatMap: One element in -> 0 or more elements out (a collection).
参考:
https://stackoverflow.com/questions/22350722/what-is-the-difference-between-map-and-flatmap-and-a-good-use-case-for-each