
[Spark Basics] -- Official explanation of the Spark RDD collect operation

The official documentation reads as follows:

Printing elements of an RDD

Another common idiom is attempting to print out the elements of an RDD using rdd.foreach(println) or rdd.map(println). On a single machine, this will generate the expected output and print all the RDD's elements. However, in cluster mode, the output to stdout being called by the executors is now writing to the executor's stdout instead, not the one on the driver, so stdout on the driver won't show these! To print all elements on the driver, one can use the collect() method to first bring the RDD to the driver node thus: rdd.collect().foreach(println). This can cause the driver to run out of memory, though, because collect() fetches the entire RDD to a single machine; if you only need to print a few elements of the RDD, a safer approach is to use the take(): rdd.take(100).foreach(println).
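As a rough illustration, here is a minimal Scala sketch of the pattern described above (the local SparkSession setup and the sample data are assumptions for demonstration, not part of the original post):

```scala
import org.apache.spark.sql.SparkSession

object PrintRddElements {
  def main(args: Array[String]): Unit = {
    // Assumed setup: a local SparkSession, just for demonstration.
    val spark = SparkSession.builder()
      .appName("print-rdd-elements")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // Assumed sample data.
    val rdd = sc.parallelize(1 to 1000000)

    // In cluster mode this prints to the executors' stdout, not the driver's:
    // rdd.foreach(println)

    // Pulls the whole RDD to the driver; may cause OutOfMemoryError for large RDDs:
    // rdd.collect().foreach(println)

    // Safer: fetch only the first 100 elements to the driver and print them.
    rdd.take(100).foreach(println)

    spark.stop()
  }
}
```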

The main point:

When printing the elements of a resilient distributed dataset (RDD), be careful not to cause an out-of-memory error!

Prefer take(), i.e. rdd.take(100).foreach(println), over rdd.collect().foreach(println), because the latter fetches the entire RDD to the driver and can make it run out of memory.
