Data Locality in Hadoop refers to the "proximity" of the data with respect to the Mapper tasks working on that data.
1. Why is data locality important?
When a dataset is stored in HDFS, it is divided into blocks and stored on the DataNodes of the Hadoop cluster. When a MapReduce job is executed against the dataset, the individual Mappers process those blocks (input splits). If a Mapper cannot get the data from the node it is running on, the data needs to be copied over the network from the DataNode which has the data to the node which is executing the Mapper task. Imagine a MapReduce job with over 1000 Mappers, each of them trying to copy data from another DataNode in the cluster at the same time; this would cause serious network congestion, because all the Mappers are attempting to copy data simultaneously (not an ideal approach). Hence it is always effective and cheap to move the computation closer to the data than to move the data closer to the computation.
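To make the congestion argument concrete, here is a back-of-the-envelope estimate of the traffic involved. The block size and mapper count below are illustrative assumptions (128 MB is a common default HDFS block size), not figures from any particular cluster:

```python
BLOCK_SIZE_MB = 128   # assumed HDFS block size (a common default)
NUM_MAPPERS = 1000    # one Mapper per input split/block, as in the text

def network_traffic_mb(non_local_fraction: float) -> float:
    """MB that must cross the network when a given fraction of
    Mappers are NOT running on a node that holds their data."""
    return NUM_MAPPERS * BLOCK_SIZE_MB * non_local_fraction

# If every Mapper is data-local, nothing crosses the network;
# if none are, roughly 128,000 MB (~125 GB) must be shipped at once.
print(network_traffic_mb(0.0))
print(network_traffic_mb(1.0))
```

Even at modest non-local fractions, the aggregate transfer is large enough to saturate cluster links when it happens all at once, which is exactly why moving the computation is cheaper.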
2. How is data proximity defined?
When the JobTracker (MRv1) or ApplicationMaster (MRv2) receives a request to run a job, it looks at which nodes in the cluster have sufficient resources to execute the Mappers and Reducers of that job. At the same time, serious consideration is made to decide on which nodes the individual Mappers will be executed, based on where the data for each Mapper is located.
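The placement preference described above (data-local first, then same rack, then anywhere) can be sketched as a small selection function. This is a simplified illustration, not the actual JobTracker/ApplicationMaster logic; all names (`block_replicas`, `rack_of`, `free_nodes`) are hypothetical:

```python
def pick_node(block_replicas, rack_of, free_nodes):
    """Choose a node to run a Mapper on, preferring locality.

    block_replicas: nodes holding a replica of the Mapper's input block
    rack_of:        dict mapping node name -> rack name
    free_nodes:     nodes with enough free resources for the Mapper
    """
    free = set(free_nodes)
    # 1. Prefer a free node that already holds the data (Data Local).
    for node in block_replicas:
        if node in free:
            return node, "DATA_LOCAL"
    # 2. Otherwise a free node on the same rack as a replica (Rack Local).
    replica_racks = {rack_of[n] for n in block_replicas}
    for node in free:
        if rack_of[node] in replica_racks:
            return node, "RACK_LOCAL"
    # 3. Last resort: any free node on a different rack.
    if free:
        return next(iter(free)), "OFF_RACK"
    return None, None

rack_of = {"n1": "r1", "n2": "r1", "n3": "r2"}
# The block lives on n1, but only n2 and n3 are free, so the
# scheduler falls back to the rack-local node n2.
print(pick_node(["n1"], rack_of, ["n2", "n3"]))
```

Real schedulers also weigh queue capacity, delay scheduling, and speculative execution, but the locality preference order is the same.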
3. Data Local
When the node holding the data and the node executing the Mapper are one and the same, we call it Data Local. In this case the data is as close to the computation as it can be. The JobTracker (MRv1) or ApplicationMaster (MRv2) prefers to execute a Mapper on a node that already has the data that Mapper needs.
4. Rack Local
Although Data Local is the ideal choice, it is not always possible to execute the Mapper on the same node as the data, due to resource constraints on a busy cluster. In such instances it is preferred to run the Mapper on a different node but on the same rack as the node which has the data. The data then moves between nodes, from the node that holds it to the node on the same rack that executes the Mapper; we call this Rack Local.
5. Different Rack
On a busy cluster, sometimes even Rack Local is not possible. In that case, a node on a different rack is chosen to execute the Mapper, and the data is copied from the node that has it to the node executing the Mapper on the other rack. This is the least desirable scenario.
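The three scenarios above can be summarized as a simple classification, given a Mapper's node and the nodes holding its data. This is an illustrative sketch; the function name and rack map are assumptions, not a Hadoop API:

```python
def locality_level(mapper_node, data_nodes, rack_of):
    """Classify a Mapper placement as one of the three locality levels.

    mapper_node: node the Mapper runs on
    data_nodes:  nodes holding a replica of the Mapper's input data
    rack_of:     dict mapping node name -> rack name
    """
    if mapper_node in data_nodes:
        return "Data Local"
    if rack_of[mapper_node] in {rack_of[n] for n in data_nodes}:
        return "Rack Local"
    return "Different Rack"

rack_of = {"n1": "r1", "n2": "r1", "n3": "r2"}
print(locality_level("n1", ["n1"], rack_of))  # Data Local
print(locality_level("n2", ["n1"], rack_of))  # Rack Local
print(locality_level("n3", ["n1"], rack_of))  # Different Rack
```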