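Spark Operators: Operations and Summary, Part 3

Study notes for the Developer Academy course "Big Data Real-Time Computing Framework Spark Quick Start: Spark Operators, Operations and Summary_3". The notes follow the course closely so that readers can pick up the material quickly.

Course URL: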

https://developer.aliyun.com/learning/course/100/detail/1693

Contents:

1. JoinOperator code

2. Choosing a storage level

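// Needs java.util.Arrays, java.util.List, scala.Tuple2 and
// org.apache.spark.api.java.JavaPairRDD; sc is a JavaSparkContext.

// Simulated collections: (id, name) pairs; id 3 appears twice on purpose
List<Tuple2<Integer, String>> nameList = Arrays.asList(
    new Tuple2<>(1, "xuruyun"),
    new Tuple2<>(2, "liangyongqi"),
    new Tuple2<>(3, "wangfei"),
    new Tuple2<>(3, "annie"));

// (id, score) pairs; id 3 appears twice here as well
List<Tuple2<Integer, Integer>> scoreList = Arrays.asList(
    new Tuple2<>(1, 150),
    new Tuple2<>(2, 100),
    new Tuple2<>(3, 80),
    new Tuple2<>(3, 90));

// Turn both lists into pair RDDs keyed by id
JavaPairRDD<Integer, String> nameRDD = sc.parallelizePairs(nameList);
JavaPairRDD<Integer, Integer> scoreRDD = sc.parallelizePairs(scoreList);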
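The snippet above stops before the join itself. Assuming the lesson goes on to call nameRDD.join(scoreRDD), the result pairs every name with every score that shares its key. The following is a minimal plain-Java sketch of that inner-join behaviour, not the course's code: it runs without Spark, uses Map.Entry in place of Tuple2, and the names JoinSketch, pair, innerJoin, NAMES and SCORES are hypothetical.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

class JoinSketch {
    // Hypothetical helper: builds a (key, value) pair, standing in for Tuple2.
    static <K, V> Map.Entry<K, V> pair(K key, V value) {
        return new SimpleEntry<>(key, value);
    }

    // Same sample data as the notes.
    static final List<Map.Entry<Integer, String>> NAMES = Arrays.asList(
            pair(1, "xuruyun"), pair(2, "liangyongqi"),
            pair(3, "wangfei"), pair(3, "annie"));

    static final List<Map.Entry<Integer, Integer>> SCORES = Arrays.asList(
            pair(1, 150), pair(2, 100), pair(3, 80), pair(3, 90));

    // Inner join by key: one output row per matching (left, right) combination.
    static <K, V, W> List<Map.Entry<K, Map.Entry<V, W>>> innerJoin(
            List<Map.Entry<K, V>> left, List<Map.Entry<K, W>> right) {
        List<Map.Entry<K, Map.Entry<V, W>>> out = new ArrayList<>();
        for (Map.Entry<K, V> l : left) {
            for (Map.Entry<K, W> r : right) {
                if (l.getKey().equals(r.getKey())) {
                    out.add(pair(l.getKey(), pair(l.getValue(), r.getValue())));
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Key 3 has two names and two scores, so it contributes 2 x 2 = 4 rows.
        for (Map.Entry<Integer, Map.Entry<String, Integer>> row : innerJoin(NAMES, SCORES)) {
            System.out.println(row.getKey() + " -> " + row.getValue());
        }
    }
}
```

With this data the join yields six rows: one each for keys 1 and 2, and four for key 3. The nested loop is only for illustration; Spark performs the join across partitions, and the order of rows in a real RDD is not guaranteed.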

Which Storage Level to Choose?

Spark's storage levels are meant to provide different trade-offs between memory usage and CPU efficiency. We recommend going through the following process to select one:

If your RDDs fit comfortably with the default storage level (MEMORY_ONLY), leave them that way. This is the most CPU-efficient option, allowing operations on the RDDs to run as fast as possible.

If not, try using MEMORY_ONLY_SER and selecting a fast serialization library to make the objects much more space-efficient, but still reasonably fast to access.

Don't spill to disk unless the functions that computed your datasets are expensive, or they filter a large amount of the data. Otherwise,recomputing a partition may be as fast as reading it from disk.

Use the replicated storage levels if you want fast fault recovery (e.g. if using Spark to serve requests from a web application). All the storage levels provide full fault tolerance by recomputing lost data, but the replicated ones let you continue running tasks on the RDD without waiting to recompute a lost partition.

In environments with high amounts of memory or multiple applications, the experimental OFF_HEAP mode has several advantages:

It allows multiple executors to share the same pool of memory in Tachyon.

It significantly reduces garbage collection costs.

Cached data is not lost if individual executors crash.

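Translation: Which storage level should you choose?

Spark's storage levels are designed to offer different trade-offs between memory usage and CPU efficiency. We suggest choosing one as follows: if your RDDs fit comfortably under the default storage level (MEMORY_ONLY), leave them as they are.

That is the most CPU-efficient option and lets operations on the RDDs run as fast as possible. If they do not fit, try MEMORY_ONLY_SER and pick a fast serialization library, so that objects take much less space but can still be accessed reasonably quickly.

Do not spill to disk unless the functions that compute your datasets are expensive, or they filter out a large amount of the data. Otherwise, recomputing a partition may be as fast as reading it back from disk.

If you want fast fault recovery, use the replicated storage levels (for example, when Spark serves requests from a web application). Every storage level is fully fault tolerant, because lost data can be recomputed.

The replicated levels, however, let you keep running tasks on the RDD without waiting for a lost partition to be recomputed.

In environments with large amounts of memory or multiple applications,

the experimental OFF_HEAP mode has several advantages:

It lets multiple executors share the same pool of memory in Tachyon.

It significantly reduces garbage collection costs.

Cached data is not lost if individual executors crash.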
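The selection process above can be sketched as a small decision function. This is an illustration only, not a Spark API: the class StorageLevelSketch, the method choose and its four boolean parameters are hypothetical. The returned strings are names of real StorageLevel constants, but mapping "spill to disk" to MEMORY_AND_DISK_SER is an assumption, and the experimental OFF_HEAP mode is left out.

```java
class StorageLevelSketch {
    // Returns the name of a storage level following the steps in the passage.
    // All four parameters are hypothetical inputs the caller has to estimate.
    static String choose(boolean fitsDeserialized,
                         boolean fitsSerialized,
                         boolean recomputeIsExpensive,
                         boolean needFastRecovery) {
        String level;
        if (fitsDeserialized) {
            // Step 1: the default is the most CPU-efficient choice.
            level = "MEMORY_ONLY";
        } else if (fitsSerialized) {
            // Step 2: serialized objects take less space but cost CPU to read.
            level = "MEMORY_ONLY_SER";
        } else if (recomputeIsExpensive) {
            // Step 3: spill to disk only when recomputing costs more than disk I/O
            // (assumed mapping to MEMORY_AND_DISK_SER).
            level = "MEMORY_AND_DISK_SER";
        } else {
            // Cheap to recompute: partitions that do not fit are recomputed on demand.
            level = "MEMORY_ONLY_SER";
        }
        // Step 4: the replicated variants carry a "_2" suffix.
        return needFastRecovery ? level + "_2" : level;
    }

    public static void main(String[] args) {
        System.out.println(choose(true, true, false, false));
        System.out.println(choose(false, true, false, false));
        System.out.println(choose(false, false, true, true));
    }
}
```

In a Spark Java program the chosen level would then be passed to persist, for example rdd.persist(StorageLevel.MEMORY_ONLY_SER()). Whether the data fits and whether recomputation is expensive have to be measured for the job at hand.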
