Spark textFile and Sorting, Part 1

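Study notes for the Developer Academy course "Big Data Real-Time Computing Framework: Spark Quick Start: Spark textFile and Sorting, Part 1". The notes follow the course closely so that readers can pick up the material quickly.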

Course URL:

https://developer.aliyun.com/learning/course/100/detail/1694

Contents:

1. Data serialization

2. Java serialization

3. Kryo serialization

1. Data Serialization

Serialization plays an important role in the performance of any distributed application. Formats that are slow to serialize objects into, or consume a large number of bytes, will greatly slow down the computation. Often, this will be the first thing you should tune to optimize a Spark application.

Spark aims to strike a balance between convenience (allowing you to work with any Java type in your operations) and performance. It provides two serialization libraries:

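In short: serialization has a major effect on the performance of any distributed application. A format that is slow to serialize into, or that takes up many bytes, slows the whole computation, so it is often the first thing to tune in a Spark application.

Spark tries to balance convenience (any Java type can be used in operations) against performance, and it ships with two serialization libraries.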

Java serialization: By default, Spark serializes objects using Java's ObjectOutputStream framework, and can work with any class you create that implements java.io.Serializable. You can also control the performance of your serialization more closely by extending java.io.Externalizable. Java serialization is flexible but often quite slow, and leads to large serialized formats for many classes.

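In short: Java serialization is the default and works with any class that implements java.io.Serializable. Implementing java.io.Externalizable gives finer control over serialization performance. The approach is flexible, but it is often slow and produces large serialized output for many classes.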
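To make the point about large serialized formats concrete, here is a minimal sketch that uses only the JDK, not Spark. The classes PlainPoint and CompactPoint and the object JavaSerializationDemo are hypothetical names invented for this illustration. It serializes two small objects with ObjectOutputStream and prints how many bytes each produces, so the output can be compared with the 8 bytes of actual field data.

```scala
import java.io.{ByteArrayOutputStream, Externalizable, ObjectInput, ObjectOutput, ObjectOutputStream}

// Hypothetical record type that relies on default Java serialization.
class PlainPoint(val x: Int, val y: Int) extends Serializable

// Same data, but the class writes its own fields through Externalizable.
// Externalizable requires a public no-argument constructor.
class CompactPoint(var x: Int, var y: Int) extends Externalizable {
  def this() = this(0, 0)

  override def writeExternal(out: ObjectOutput): Unit = {
    out.writeInt(x)
    out.writeInt(y)
  }

  override def readExternal(in: ObjectInput): Unit = {
    x = in.readInt()
    y = in.readInt()
  }
}

object JavaSerializationDemo {
  // Serialize with ObjectOutputStream and return the number of bytes produced.
  def serializedSize(obj: AnyRef): Int = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    try out.writeObject(obj) finally out.close()
    bytes.size()
  }

  def main(args: Array[String]): Unit = {
    println(s"Serializable:   ${serializedSize(new PlainPoint(1, 2))} bytes")
    println(s"Externalizable: ${serializedSize(new CompactPoint(1, 2))} bytes")
  }
}
```

Both objects carry two Int fields. Any bytes beyond those 8 are stream headers and class metadata, which is the overhead the passage above refers to.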

Kryo serialization: Spark can also use the Kryo library (version 2) to serialize objects more quickly. Kryo is significantly faster and more compact than Java serialization (often as much as 10x), but does not support all Serializable types and requires you to register the classes you'll use in the program in advance for best performance.

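In short: Kryo is considerably faster and more compact than Java serialization, often by as much as 10x. The trade-off is that it does not support every Serializable type, and for best performance the classes used in the program should be registered in advance.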
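One way to see the size difference on your own data is to serialize the same object with both of Spark's serializer implementations. The sketch below assumes spark-core is on the classpath. The Reading case class and the SerializerSizeSketch object are hypothetical names, and no SparkContext is started. The numbers depend on the Spark and Kryo versions in use, so none are quoted here.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.serializer.{JavaSerializer, KryoSerializer, Serializer}

// Hypothetical record type, used only for this comparison.
case class Reading(sensorId: Int, value: Double)

object SerializerSizeSketch {
  // Serialize one object with the given serializer and return the byte count.
  def sizeWith(serializer: Serializer, obj: Reading): Int =
    serializer.newInstance().serialize(obj).remaining()

  def main(args: Array[String]): Unit = {
    // loadDefaults = false: ignore spark.* JVM system properties.
    val conf = new SparkConf(false)
      .registerKryoClasses(Array[Class[_]](classOf[Reading]))

    val sample = Reading(sensorId = 42, value = 3.14)
    println(s"Java: ${sizeWith(new JavaSerializer(conf), sample)} bytes")
    println(s"Kryo: ${sizeWith(new KryoSerializer(conf), sample)} bytes")
  }
}
```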

You can switch to using Kryo by initializing your job with a SparkConf and calling conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer"). This setting configures the serializer used for not only shuffling data between worker nodes but also when serializing RDDs to disk. The only reason Kryo is not the default is because of the custom registration requirement, but we recommend trying it in any network-intensive application.

Spark automatically includes Kryo serializers for the many commonly-used core Scala classes covered in the AllScalaRegistrar from the Twitter chill library.

To register your own custom classes with Kryo, use the registerKryoClasses method.

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setMaster(...).setAppName(...)

conf.registerKryoClasses(Array(classOf[MyClass1], classOf[MyClass2]))

val sc = new SparkContext(conf)

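In short: to switch to Kryo, initialize the job with a SparkConf and call conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer").

The setting applies both to data shuffled between worker nodes and to RDDs serialized to disk. Kryo is not the default only because of the custom registration requirement, and it is worth trying in any network-intensive application. Spark already includes Kryo serializers for many commonly used core Scala classes, namely those covered by AllScalaRegistrar in the Twitter chill library. Custom classes are registered with the registerKryoClasses method.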

The Kryo documentation describes more advanced registration options, such as adding custom serialization code. If your objects are large, you may also need to increase the spark.kryoserializer.buffer config. This value needs to be large enough to hold the largest object you will serialize.

Finally, if you don't register your custom classes, Kryo will still work, but it will have to store the full class name with each object, which is wasteful.

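In short: the Kryo documentation covers more advanced registration options, such as adding custom serialization code.

For large objects, the spark.kryoserializer.buffer setting may also need to be increased so that it can hold the largest object to be serialized. Unregistered custom classes still work with Kryo, but the full class name is then stored with every object, which wastes space.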
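The settings discussed in this section can be gathered into one configuration, sketched below under a few assumptions. The application name, the buffer sizes, and the classes MyClass1 and MyClass2 are placeholders. The keys spark.kryoserializer.buffer.max and spark.kryo.registrationRequired come from Spark's configuration reference rather than from these notes. The former caps how far the buffer may grow, and the latter makes Kryo fail on unregistered classes instead of silently storing full class names.

```scala
import org.apache.spark.SparkConf

// Hypothetical application classes, stand-ins for your own types.
case class MyClass1(id: Int)
case class MyClass2(name: String)

object KryoConfSketch {
  def buildConf(): SparkConf =
    new SparkConf(false) // false: skip loading spark.* JVM system properties
      .setAppName("kryo-conf-sketch") // placeholder name
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // Initial buffer size; illustrative value, tune for your objects.
      .set("spark.kryoserializer.buffer", "1m")
      // Upper bound the buffer may grow to; illustrative value.
      .set("spark.kryoserializer.buffer.max", "128m")
      // Fail fast when a class was not registered.
      .set("spark.kryo.registrationRequired", "true")
      .registerKryoClasses(Array[Class[_]](classOf[MyClass1], classOf[MyClass2]))

  def main(args: Array[String]): Unit =
    // Print every Kryo-related key that ended up in the configuration.
    buildConf().getAll
      .filter { case (key, _) => key.contains("kryo") || key == "spark.serializer" }
      .foreach { case (key, value) => println(s"$key = $value") }
}
```

Turning on spark.kryo.registrationRequired is a design choice: it trades some convenience for early detection of classes that would otherwise pay the class-name overhead described above.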