Spark Operators and Summary, Part 3

2021-12-18 23:50:00

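Study notes for the Developer Academy course "Big Data Real-Time Computing Architecture, Spark Quick Start: Spark Operators and Summary, Part 3". The notes follow the course closely to help readers learn the material quickly.

Course URL: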

https://developer.aliyun.com/learning/course/100/detail/1693


Contents:

1. JoinOperator code

2. Choosing a storage level

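import java.util.Arrays;
import java.util.List;
import org.apache.spark.api.java.JavaPairRDD;
import scala.Tuple2;

// Simulated collections: (id, name) pairs and (id, score) pairs.
List<Tuple2<Integer, String>> nameList = Arrays.asList(
    new Tuple2<>(1, "xuruyun"),
    new Tuple2<>(2, "liangyongqi"),
    new Tuple2<>(3, "wangfei"),
    new Tuple2<>(3, "annie"));

List<Tuple2<Integer, Integer>> scoreList = Arrays.asList(
    new Tuple2<>(1, 150),
    new Tuple2<>(2, 100),
    new Tuple2<>(3, 80),
    new Tuple2<>(3, 90));

// sc is the JavaSparkContext created earlier in the program.
JavaPairRDD<Integer, String> nameRDD = sc.parallelizePairs(nameList);
JavaPairRDD<Integer, Integer> scoreRDD = sc.parallelizePairs(scoreList);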
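The notes stop before the join itself. In Spark the call would be `nameRDD.join(scoreRDD)`, which returns a `JavaPairRDD<Integer, Tuple2<String, Integer>>`. To make the result concrete without a cluster, the following plain-Java sketch emulates inner-join semantics on the same name and score pairs. It is not Spark code: the class `JoinSketch` and the helpers `pair`, `innerJoin`, and `demo` are hypothetical names introduced only for illustration, and the row order here is deterministic, whereas Spark does not guarantee any order.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map.Entry;

// Plain-Java emulation of inner-join semantics; no Spark dependency.
class JoinSketch {

    // Hypothetical helper: builds an immutable (key, value) pair.
    static <K, V> Entry<K, V> pair(K key, V value) {
        return new SimpleImmutableEntry<>(key, value);
    }

    // For every key present on both sides, emit one row per matching
    // (left value, right value) combination, as an inner join does.
    static <K, V, W> List<Entry<K, Entry<V, W>>> innerJoin(
            List<Entry<K, V>> left, List<Entry<K, W>> right) {
        List<Entry<K, Entry<V, W>>> joined = new ArrayList<>();
        for (Entry<K, V> l : left) {
            for (Entry<K, W> r : right) {
                if (l.getKey().equals(r.getKey())) {
                    joined.add(pair(l.getKey(), pair(l.getValue(), r.getValue())));
                }
            }
        }
        return joined;
    }

    // Joins the same sample data used in the notes.
    static List<Entry<Integer, Entry<String, Integer>>> demo() {
        List<Entry<Integer, String>> names = Arrays.asList(
                pair(1, "xuruyun"), pair(2, "liangyongqi"),
                pair(3, "wangfei"), pair(3, "annie"));
        List<Entry<Integer, Integer>> scores = Arrays.asList(
                pair(1, 150), pair(2, 100), pair(3, 80), pair(3, 90));
        return innerJoin(names, scores);
    }

    public static void main(String[] args) {
        for (Entry<Integer, Entry<String, Integer>> row : demo()) {
            System.out.println(row.getKey() + " -> " + row.getValue());
        }
    }
}
```

Keys 1 and 2 each produce one row. Key 3 appears twice on both sides, so it produces four rows (every name paired with every score), for six rows in total. This multiplication on duplicate keys is the behaviour to keep in mind when joining pair RDDs.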

Which Storage Level to Choose?

Spark's storage levels are meant to provide different trade-offs between memory usage and CPU efficiency. We recommend going through the following process to select one:

If your RDDs fit comfortably with the default storage level (MEMORY_ONLY), leave them that way. This is the most CPU-efficient option, allowing operations on the RDDs to run as fast as possible.

If not, try using MEMORY_ONLY_SER and selecting a fast serialization library to make the objects much more space-efficient, but still reasonably fast to access.

Don't spill to disk unless the functions that computed your datasets are expensive, or they filter a large amount of the data. Otherwise,recomputing a partition may be as fast as reading it from disk.

Use the replicated storage levels if you want fast fault recovery (e.g. if using Spark to serve requests from a web application). All the storage levels provide full fault tolerance by recomputing lost data, but the replicated ones let you continue running tasks on the RDD without waiting to recompute a lost partition.

In environments with high amounts of memory or multiple applications, the experimental OFF_HEAP mode has several advantages:

It allows multiple executors to share the same pool of memory in Tachyon.

It significantly reduces garbage collection costs.

Cached data is not lost if individual executors crash.
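The selection steps above can be read as a small decision procedure. The sketch below encodes them in plain Java, under the assumption that the four questions can be answered with simple yes/no flags. The class `StorageLevelChooser`, the method `choose`, and its parameter names are hypothetical; only the returned strings correspond to the names of Spark's built-in storage levels. In a real job the chosen level would be passed to `persist`, for example `rdd.persist(StorageLevel.MEMORY_ONLY_SER())`.

```java
// Illustrative only: encodes the selection steps as a decision function.
// The parameter names are hypothetical; the returned strings match the
// names of Spark's built-in storage levels.
class StorageLevelChooser {

    static String choose(boolean fitsDeserialized,
                         boolean fitsSerialized,
                         boolean expensiveToRecompute,
                         boolean needFastRecovery) {
        String level;
        if (fitsDeserialized) {
            // Step 1: the default is the most CPU-efficient option.
            level = "MEMORY_ONLY";
        } else if (fitsSerialized || !expensiveToRecompute) {
            // Steps 2 and 3: serialize to save space; do not spill to
            // disk when recomputing a partition is cheap.
            level = "MEMORY_ONLY_SER";
        } else {
            // Step 3: spill to disk only when recomputation is expensive.
            level = "MEMORY_AND_DISK_SER";
        }
        // Step 4: the "_2" variants replicate each partition on two nodes.
        return needFastRecovery ? level + "_2" : level;
    }

    public static void main(String[] args) {
        System.out.println(choose(true, true, false, false));   // MEMORY_ONLY
        System.out.println(choose(false, true, false, false));  // MEMORY_ONLY_SER
        System.out.println(choose(false, false, true, false));  // MEMORY_AND_DISK_SER
        System.out.println(choose(false, false, true, true));   // MEMORY_AND_DISK_SER_2
    }
}
```

The experimental OFF_HEAP mode is left out of the sketch because the passage presents it as a separate option for specific environments, not as a step in the selection process.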

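Summary: choosing a storage level

Spark's storage levels trade memory usage against CPU efficiency. The recommended selection process is as follows.

If your RDDs fit comfortably in memory at the default storage level (MEMORY_ONLY), keep the default. It is the most CPU-efficient option and lets operations on the RDDs run as fast as possible.

If they do not fit, try MEMORY_ONLY_SER together with a fast serialization library. The objects become far more compact while access stays reasonably fast.

Do not spill to disk unless the functions that computed the dataset are expensive or they filter out a large amount of data. Otherwise, recomputing a partition may be as fast as reading it from disk.

Use the replicated storage levels when you need fast fault recovery, for example when Spark serves requests from a web application. Every storage level is fully fault tolerant because lost data can be recomputed, but the replicated levels let tasks keep running on the RDD without waiting for a lost partition to be recomputed.

In environments with large amounts of memory or multiple applications, the experimental OFF_HEAP mode has several advantages:

Multiple executors can share the same pool of memory in Tachyon.

Garbage collection costs drop significantly.

Cached data is not lost if individual executors crash.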
