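Study notes for the Developer Academy course [Big Data Real-Time Computing Architecture, Spark Quick Start: Spark Operators and Summary_3]. The notes follow the course closely so that readers can pick up the material quickly.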
Course URL:
https://developer.aliyun.com/learning/course/100/detail/1693
Spark Operators and Summary_3
Contents:
1. JoinOperator code
2. Choosing a storage level
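// Requires java.util.Arrays, java.util.List, scala.Tuple2 and
// org.apache.spark.api.java.JavaPairRDD; sc is the JavaSparkContext
// created earlier in the program.

// Simulated collections: (id, name) pairs
List<Tuple2<Integer, String>> nameList = Arrays.asList(
    new Tuple2<>(1, "xuruyun"),
    new Tuple2<>(2, "liangyongqi"),
    new Tuple2<>(3, "wangfei"),
    new Tuple2<>(3, "annie"));

// (id, score) pairs; note that key 3 appears twice in both lists
List<Tuple2<Integer, Integer>> scoreList = Arrays.asList(
    new Tuple2<>(1, 150),
    new Tuple2<>(2, 100),
    new Tuple2<>(3, 80),
    new Tuple2<>(3, 90));

// Turn both lists into key-value RDDs
JavaPairRDD<Integer, String> nameRDD = sc.parallelizePairs(nameList);
JavaPairRDD<Integer, Integer> scoreRDD = sc.parallelizePairs(scoreList);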
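The two pair RDDs above are the inputs for a join by key. As a rough illustration of what an inner join such as JavaPairRDD.join produces for this data, the sketch below emulates the semantics with plain Java collections so that it runs without Spark. The class JoinSketch and its helpers entry and innerJoin are hypothetical names introduced for this example only; they are not part of Spark or of the course code.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

class JoinSketch {
    // Hypothetical helper: builds a key-value pair (stands in for Tuple2).
    static <K, V> Map.Entry<K, V> entry(K key, V value) {
        return new SimpleEntry<>(key, value);
    }

    // Emulates inner-join-by-key semantics: every left value is paired with
    // every right value that has the same key; keys present on only one
    // side are dropped.
    static <K, V, W> List<Map.Entry<K, Map.Entry<V, W>>> innerJoin(
            List<Map.Entry<K, V>> left, List<Map.Entry<K, W>> right) {
        List<Map.Entry<K, Map.Entry<V, W>>> out = new ArrayList<>();
        for (Map.Entry<K, V> l : left) {
            for (Map.Entry<K, W> r : right) {
                if (l.getKey().equals(r.getKey())) {
                    out.add(entry(l.getKey(), entry(l.getValue(), r.getValue())));
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Same data as nameList and scoreList in the notes.
        List<Map.Entry<Integer, String>> names = Arrays.asList(
                entry(1, "xuruyun"), entry(2, "liangyongqi"),
                entry(3, "wangfei"), entry(3, "annie"));
        List<Map.Entry<Integer, Integer>> scores = Arrays.asList(
                entry(1, 150), entry(2, 100), entry(3, 80), entry(3, 90));

        // Key 3 has two names and two scores, so it yields 2 x 2 = 4 pairs;
        // together with keys 1 and 2 the join has 6 results.
        for (Map.Entry<Integer, Map.Entry<String, Integer>> e : innerJoin(names, scores)) {
            System.out.println("(" + e.getKey() + ", ("
                    + e.getValue().getKey() + ", " + e.getValue().getValue() + "))");
        }
    }
}
```

In Spark itself the corresponding call would be nameRDD.join(scoreRDD), which returns a JavaPairRDD<Integer, Tuple2<String, Integer>>. The order of the results in a real job is not guaranteed, unlike in this single-threaded sketch.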
Which Storage Level to Choose?
Spark's storage levels are meant to provide different trade-offs between memory usage and CPU efficiency. We recommend going through the following process to select one:
If your RDDs fit comfortably with the default storage level (MEMORY_ONLY), leave them that way. This is the most CPU-efficient option, allowing operations on the RDDs to run as fast as possible.
If not, try using MEMORY_ONLY_SER and selecting a fast serialization library to make the objects much more space-efficient, but still reasonably fast to access.
Don't spill to disk unless the functions that computed your datasets are expensive, or they filter a large amount of the data. Otherwise, recomputing a partition may be as fast as reading it from disk.
Use the replicated storage levels if you want fast fault recovery (e.g. if using Spark to serve requests from a web application). All the storage levels provide full fault tolerance by recomputing lost data, but the replicated ones let you continue running tasks on the RDD without waiting to recompute a lost partition.
In environments with high amounts of memory or multiple applications, the experimental OFF_HEAP mode has several advantages:
It allows multiple executors to share the same pool of memory in Tachyon.
It significantly reduces garbage collection costs.
Cached data is not lost if individual executors crash.
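Summary: which storage level to choose?
Spark's storage levels trade memory usage against CPU efficiency. In brief:
Keep the default MEMORY_ONLY if the RDDs fit comfortably in memory; it is the most CPU-efficient option.
Otherwise try MEMORY_ONLY_SER with a fast serialization library, which saves space while staying reasonably fast to access.
Spill to disk only when the datasets are expensive to compute or the computation filters out a large amount of data; otherwise recomputing a partition may be as fast as reading it from disk.
Use a replicated level for fast fault recovery, for example when serving requests from a web application. Every level tolerates faults by recomputing lost data, but the replicated levels let tasks keep running on the RDD without waiting for a lost partition to be recomputed.
With large amounts of memory or multiple applications, the experimental OFF_HEAP mode lets multiple executors share one pool of memory in Tachyon, significantly reduces garbage collection costs, and keeps cached data when individual executors crash.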
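The selection process for storage levels can be sketched as a small decision function. This is only one reading of the guidance, written in plain Java so that it runs without Spark. The class StorageLevelChooser, the method choose and its four boolean parameters are hypothetical; the returned strings are names of real Spark storage levels. Falling back to MEMORY_AND_DISK_SER when data is expensive to recompute is an assumption of this sketch, and the experimental OFF_HEAP mode is left out.

```java
class StorageLevelChooser {
    // Hypothetical helper, not part of Spark. Returns the name of a storage
    // level following the selection process described in the notes.
    static String choose(boolean fitsDeserialized,
                         boolean fitsSerialized,
                         boolean expensiveToRecompute,
                         boolean needFastRecovery) {
        String level;
        if (fitsDeserialized) {
            // The RDDs fit comfortably in memory: keep the default.
            level = "MEMORY_ONLY";
        } else if (fitsSerialized || !expensiveToRecompute) {
            // Serialize to save space; if partitions are cheap to recompute,
            // do not spill to disk even when not everything fits.
            level = "MEMORY_ONLY_SER";
        } else {
            // Assumption of this sketch: spill serialized data to disk only
            // when recomputing it would be expensive.
            level = "MEMORY_AND_DISK_SER";
        }
        // Replicated variants carry the suffix _2 (two replicas).
        return needFastRecovery ? level + "_2" : level;
    }

    public static void main(String[] args) {
        System.out.println(choose(true, true, false, false));   // MEMORY_ONLY
        System.out.println(choose(false, true, false, false));  // MEMORY_ONLY_SER
        System.out.println(choose(false, false, true, false));  // MEMORY_AND_DISK_SER
        System.out.println(choose(false, true, false, true));   // MEMORY_ONLY_SER_2
    }
}
```

With Spark on the classpath, the chosen level would be applied to an RDD with a call such as nameRDD.persist(StorageLevel.MEMORY_ONLY_SER()), using org.apache.spark.storage.StorageLevel.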