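1. Background
When a job in an EMR cluster writes data to OSS, the data is first buffered locally and then uploaded to OSS in a single pass. EMR supports two buffering strategies: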
- disk
- off-heap
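The two buffering strategies suit slightly different scenarios:
- The local disk buffer works in any scenario and can handle uploads of large files.
- The off-heap memory buffer outperforms the disk buffer but is limited by available memory. In the implementation, off-heap allocations are capped within a fixed range. When data is produced faster than it can be uploaded, the output stream blocks until the in-flight upload tasks complete.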
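
The blocking behaviour described for the off-heap strategy can be illustrated with a small bounded pool of direct buffers. This is only a sketch under stated assumptions, not the EMR implementation: the class `BoundedDirectBufferPool` and its methods are hypothetical names, and the pool size stands in for whatever cap the real implementation applies.

```java
import java.nio.ByteBuffer;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

/** Hypothetical fixed-size pool of direct (off-heap) buffers; illustrative only. */
final class BoundedDirectBufferPool {
    private final BlockingQueue<ByteBuffer> free;

    BoundedDirectBufferPool(int bufferCount, int bufferSizeBytes) {
        free = new ArrayBlockingQueue<>(bufferCount);
        for (int i = 0; i < bufferCount; i++) {
            // allocateDirect reserves memory outside the JVM heap.
            free.add(ByteBuffer.allocateDirect(bufferSizeBytes));
        }
    }

    /** Writer side: blocks while every buffer is still held by an in-flight upload. */
    ByteBuffer acquire() {
        try {
            return free.take();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for a buffer", e);
        }
    }

    /** Like acquire(), but gives up and returns null after the timeout. */
    ByteBuffer tryAcquire(long timeoutMillis) {
        try {
            return free.poll(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for a buffer", e);
        }
    }

    /** Upload side: returns the buffer to the pool once its block has been uploaded. */
    void release(ByteBuffer buffer) {
        buffer.clear();
        free.add(buffer);
    }

    /** Number of buffers currently available to the writer. */
    int available() {
        return free.size();
    }
}
```

With a pool of two buffers, a third `acquire()` call stalls until an upload thread calls `release()`, which is the back-pressure effect the text describes. Capping the pool trades write throughput for a predictable off-heap footprint.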
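Potential issue:
- Jobs submitted to Yarn: with the off-heap strategy, the job may exceed its memory allocation and be killed by Yarn. Set the memory parameters with particular care, otherwise job stability will suffer.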
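
One way to reason about this risk is to check that heap, direct buffers, and other native overhead together fit inside the memory requested for the container. The helper below is a hedged sketch: `ContainerMemoryBudget`, its method names, and the simple additive rule are illustrative assumptions, and the numbers are made up for the example. In practice the direct-buffer ceiling is governed by the JVM flag `-XX:MaxDirectMemorySize`, and on Spark the non-heap allowance is usually set through `spark.executor.memoryOverhead`.

```java
/** Hypothetical helper; the names and the additive budgeting rule are assumptions. */
final class ContainerMemoryBudget {
    /** Estimated process footprint: JVM heap + direct buffers + other native overhead. */
    static long requiredMb(long heapMb, long directBufferMb, long otherOverheadMb) {
        return heapMb + directBufferMb + otherOverheadMb;
    }

    /** True if the estimate fits inside the memory requested for the Yarn container. */
    static boolean fits(long containerMb, long heapMb, long directBufferMb, long otherOverheadMb) {
        return requiredMb(heapMb, directBufferMb, otherOverheadMb) <= containerMb;
    }

    public static void main(String[] args) {
        // Example figures only: 4 GB container, 3 GB heap, 384 MB other overhead.
        System.out.println(fits(4096, 3072, 640, 384));  // 640 MB of direct buffers
        System.out.println(fits(4096, 3072, 1024, 384)); // 1 GB of direct buffers
    }
}
```

If the check fails, either the container request has to grow or the off-heap buffer has to shrink; otherwise the disk strategy is the safer choice.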
2. Usage
Set "fs.oss.upload.bufferType" in the job parameters. Valid values are "disk" and "off-heap". For example:
1. hadoop fs -Dfs.oss.upload.bufferType=disk -put a.txt oss://xxx/xxx/
2. Hadoop job:
import org.apache.hadoop.conf.Configuration;
Configuration conf = new Configuration();
conf.set("fs.oss.upload.bufferType", "off-heap");
...
3. Spark job:
val conf = new SparkConf()
conf.set("spark.hadoop.fs.oss.upload.bufferType", "off-heap")
...
3. Benchmark
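VPC network, SSD cloud disk / ultra cloud disk, MN4 instance type with 4 cores and 16 GB of memory. The test measures pure data write time.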
| File size | Block size | Concurrency | Disk buffer (SSD cloud disk) | Disk buffer (ultra cloud disk) | Off-heap buffer | Improvement vs. SSD cloud disk (%) | Improvement vs. ultra cloud disk (%) |
|---|---|---|---|---|---|---|---|
| 1024MB | 256KB | 5 | 23009ms | 20773ms | 18661ms | +18.8% | +10.2% |
| 1024MB | 1MB | 5 | 11310ms | 18524ms | 10233ms | +9.5% | +44.8% |
| 1024MB | 4MB | 5 | 10318ms | 18001ms | 10191ms | +1.5% | +43.4% |
| 1024MB | 16MB | 5 | 10212ms | 17796ms | 10184ms | +0.3% | +42.8% |
| 1024MB | 64MB | 5 | 10945ms | 18612ms | 10216ms | +6.7% | +45.1% |
| 1024MB | 128MB | 5 | 13240ms | 20181ms | OOM: Direct buffer memory | N/A | N/A |
| 256MB | | | 4511ms | 4968ms | 4636ms | -2.7% | |
| | | | 2417ms | 4474ms | 2381ms | | +46.8% |
| | | | | 4386ms | 2433ms | -0.7% | +44.3% |
| | | | | 4337ms | 2465ms | -1.3% | +43.2% |
| | | | 3232ms | 5273ms | 2411ms | +33.7% | +54.3% |
| | | | 4392ms | 6197ms | 3118ms | +29.0% | +49.7% |
| | | | 1252ms | 1337ms | | +0% | +6.4% |
| | | | 611ms | 1117ms | 577ms | +5.6% | +48.3% |
| | | | 567ms | 1084ms | 559ms | +1.4% | +48.4% |
| | | | 597ms | 1108ms | 624ms | -4.5% | +43.7% |
| | | | 1569ms | 1491ms | 1499ms | +4.5% | -0.5% |
| | | | 1459ms | 1730ms | 1412ms | +3.2% | +18.4% |
| | | | 459ms | 417ms | 383ms | +16.6% | +8.2% |
| | | | 221ms | 307ms | 220ms | | +28.3% |
| | | | 254ms | 327ms | 198ms | +22.0% | +39.4% |
| | | | 431ms | 398ms | 418ms | +3.0% | -5% |
| | | | 412ms | 425ms | 400ms | +2.9% | +5.9% |
| | | | | 405ms | 443ms | -5.9% | -9.3% |
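
The percentage columns appear to be the relative reduction in write time, that is, (disk time - off-heap time) / disk time. The source does not state the formula, so the sketch below is an assumption that happens to reproduce most complete rows of the table; `BufferSpeedup` and its methods are hypothetical names.

```java
import java.util.Locale;

/** Hypothetical helper that recomputes the improvement columns; the formula is assumed. */
final class BufferSpeedup {
    /** Relative improvement in percent: (disk - offHeap) / disk * 100. */
    static double improvementPercent(long diskMillis, long offHeapMillis) {
        return (diskMillis - offHeapMillis) * 100.0 / diskMillis;
    }

    /** Formats with an explicit sign and one decimal, e.g. "+9.5%". */
    static String format(double percent) {
        return String.format(Locale.ROOT, "%+.1f%%", percent);
    }

    public static void main(String[] args) {
        // Figures from the 1024MB file, 1MB block row of the table.
        System.out.println(format(improvementPercent(11310, 10233))); // vs. SSD cloud disk
        System.out.println(format(improvementPercent(18524, 10233))); // vs. ultra cloud disk
    }
}
```

For that row the sketch yields +9.5% and +44.8%, matching the table. A negative result means the off-heap buffer was slower than the disk buffer for that configuration.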