該文為個人學習筆記，僅供參考。

更多内容關注本人Halo1部落格

Hadoop概述

概述

Hadoop是一個由Apache基金會所開發的分布式系統基礎架構。

使用者可以在不了解分布式底層細節的情況下，開發分布式程式。充分利用叢集的威力進行高速運算和存儲。

Apache Hadoop 原本來源于 Google 一款名為MapReduce的程式設計模型包。

GFS -> HDFS

MapReduce -> MapReduce

BigTable -> HBase

Hadoop主要組成

Hadoop Common：為其他Hadoop子產品提供基礎設施。
Hadoop HDFS：一個高可靠、高吞吐量的分布式檔案系統。
- 分布式
- 安全性
- 副本資料
Hadoop MapReduce：一個分布式的離線并行計算架構。
- 分布式
- 思想
  - 分而治之
    
    大資料集分為小的資料集
    
    每個資料集，進行邏輯業務處理（map）
    
    合并統計資料結果（reduce）
Hadoop YARN：一個新的MapReduce架構，任務排程與資源管理。
分布式資源管理架構
- 管理整個叢集的資源（記憶體、CPU核數）
- 配置設定排程叢集的資源

HDFS 分布式檔案系統

HDFS架構

NameNode 用于儲存中繼資料，處理用戶端請求管理節點

管理檔案系統的命名空間。維護着檔案系統樹及整棵樹内所有的檔案和目錄。這些資訊以命名空間鏡像檔案和編輯日志檔案的形式永久儲存在本地磁盤上。

Client 發起請求

代表使用者通過namenode和datanode互動來通路整個檔案系統。提供檔案系統接口。

DataNode 儲存具體資料

檔案系統的工作節點。他們根據需要存儲并檢索資料塊（受用戶端或namenode排程）。

SecondaryNode 用于同步中繼資料，

HDFS沒有namenode将無法運作。是以secondarynode作為第二名稱節點，用來監控HDFS狀态的輔助背景程式，每隔一段時間擷取HDFS中繼資料的快照。

Hadoop讀寫流程

Hadoop基礎知識筆記Hadoop概述HDFS 分布式檔案系統Yarn叢集資源管理系統MapReduce程式設計模型附錄

HDFS寫操作

Hadoop基礎知識筆記Hadoop概述HDFS 分布式檔案系統Yarn叢集資源管理系統MapReduce程式設計模型附錄

用戶端向Namenode發起寫操作請求，Namenode檢查檔案是否存在并檢查權限，如果存在，則傳回exist，不能上傳，不可覆寫，如果不存在，那麼Namenode會建立一個臨時檔案用于存儲。
Namenode放置副本并遵循就近原則，即如果用戶端在該叢集且有Datanode節點，那麼優先在該節點放置第一塊副本，再将另外兩塊副本放置到另外兩台Datanode（檔案副本數通常為3，可根據需求及叢集性能自定義），并根據距離排序，将Datanode清單傳回給用戶端。

如果為多叢集，假定3個叢集，則每個叢集中一台Datanode節點儲存一份副本
用戶端根據Namenode傳回的Datanode清單連接配接第一個節點，且第一個節點與其餘兩個節點根據順序保持串聯連接配接關系，這種并聯連接配接關系被成為Pipeline（管道）連接配接。
用戶端将檔案切分成block（block預設128MB）

在寫入過程中，block又被切分成package（預設為64KB），以流式向DataNode寫入block檔案。
Datanode各節點每傳輸完一個block後，會傳回驗證資訊

每次驗證是在寫完一個block後，并非在傳輸package完成時
完成檔案寫入，關閉輸出流。
發送完成信号給Namenode

在傳輸過程中，其中一個Datanode挂掉了，Namenode會在整個流程結束，收集并驗證Datanode的資訊，發現此檔案的副本數沒有達到要求，則将挂掉的datanode移出Pipeline，然後再尋找另一個可用Datanode節點儲存副本

HDFS讀操作

Hadoop基礎知識筆記Hadoop概述HDFS 分布式檔案系統Yarn叢集資源管理系統MapReduce程式設計模型附錄

用戶端發起向Namenode讀請求并擷取中繼資料資訊。
用戶端根據Namenode傳回的資訊，就近選擇Datanode節點并與各個節點建立輸入流。
Datanode向輸入流中寫資料。
用戶端下載下傳完成block後驗證，保證塊資料的完整性。

就近原則：本地，最近的距離；同機架，次之的距離；other（資料中心），最遠的距離；

HDFS啟動流程

心跳機制

namenode從叢集中的每個datanode接收心跳信号和快狀态報告（Blockreport）。

接收心跳信号

預設每3秒一次，表示改datanode節點工作正常

心跳傳回結果帶有namenode給該datanode的指令，如複制删除等等。

超過10分鐘namenode沒有收到某個datanode的心跳，則認為該節點不可用。
塊狀态報告

datanode啟動後，向namenode注冊，周期性（1小時）的向namenode報告塊資訊。

資料損壞處理

block建立時會産生checksum，當datanode讀取block的時候，它會重新計算得到checksum，如果與初始checksum不同，則說明該block已經損壞，Client讀取其他datanode上的block。那麼弄得标記該塊已損壞，複制block以達到預期的副本數量。

Namenode中繼資料合并同步

NameNode的中繼資料操作先往edits檔案中寫，

當edits檔案達到一定的門檻值(3600秒或一定的事務數量)的時候，會開啟合并的流程。

合并流程:

1.當開始合并的時候，SecondaryNameNode會把edits和fsimage拷貝到所在伺服器所在記憶體中，合并生成名為fsimage.ckpt的檔案。

2.将fsimage.ckpt檔案拷貝到NameNode上，删除原有的fsimage，并将fsimage.ckpt重命名為fsimage。

3.當SecondaryNameNode将edits和fsimage拷貝走之後， NameNode會立刻生成一個edits.new檔案，用于記錄新來的中繼資料，當合并完成之後，原有的edits檔案才會被删除，并将edits.new檔案重命名為edits檔案。

在達到一定門檻值開啟下一輪合并，隻拷貝edits檔案。

啟動流程

啟動HDFS相關程序，進入安全模式（隻讀不寫）
加載中繼資料，namenode等待datanode注冊
datanode周期性的向namenode發送心跳
離開安全模式

Yarn叢集資源管理系統

Yarn架構

ResourceManger 任務的排程和資源的管理（CPU、記憶體）管理叢集上資源使用的資料總管。

Container 對環境的抽象，封裝了CPU，記憶體，環境變量等等多元的資源。運作在叢集中所有節點上且能夠啟動和監控的容器。

NodeManger 單個節點的資源管理，節點管理器

application master 為任務程式申請資源，任務的監控和容錯

Yarn運作機制

Hadoop基礎知識筆記Hadoop概述HDFS 分布式檔案系統Yarn叢集資源管理系統MapReduce程式設計模型附錄

MapReduce程式設計模型

概述

MapReduce應用廣泛的原因之一就是其易用性，提供了一個高度抽象畫二變得非常簡單的程式設計模型，他是總結大量應用的共同特點的基礎上抽象出來的分布式計算架構，在其程式設計模型中，任務可以被分解成互相獨立的子問題。任務過程分為兩個處理階段：map階段和reduce階段。每階段都以鍵-值對作為輸入和輸出，其資料類型可自定義，需要編寫map和reduce函數。

MapReduce作業（job）是用戶端需要執行的一個工作單元：它包括輸入資料、MapReduce程式和配置資訊。Hadoop将作業分成若幹個任務（task）來執行，其中包括兩類任務：map任務和reduce任務。

MapReduce程式設計模型給出分布式程式設計方法的5個步驟：

疊代，周遊輸入資料，将其解析成key/value對；
将輸入key/value對映射map成另外一些key/value對；
根據key對中間結果進行分組（grouping）；
以組為機關對資料進行歸約；
疊代，将最終産生的key/value對儲存到輸出檔案中。

過程及作用

執行順序：InputFormat - OutputFormat - Map - Shuffle - Reduce

Hadoop基礎知識筆記Hadoop概述HDFS 分布式檔案系統Yarn叢集資源管理系統MapReduce程式設計模型附錄

InputFormat

OutputFormat

Map

Reduce

Shuffle

Map Shuffle Phase

進入環形緩沖區（預設100MB）

預設情況下，當達到環形緩沖區記憶體的80%，将會将緩沖區中的資料spill到本地磁盤中（溢出到MapTask所運作的NodeManager機器的本地磁盤中）
溢寫并不是立即将緩沖區中的資料溢寫到本地磁盤，而是需要經曆一些操作
- 分區paritioner

分區決定Map Task輸出的資料進入哪個Reduce，且分區數量等同于Reduce數量。

預設情況下，key采用HashParitioner

自定義Paritioner用途

解決資料傾斜：另一種需要我們自己定義一個Partitioner的情況是各個Reduce task處理的鍵值對數量極不平衡。對于某些資料集，由于很多不同的key的hash值都一樣，導緻這些鍵值對都被分給同一個Reducer處理，而其他的Reducer處理的鍵值對很少，進而拖延整個任務的進度。當然，編寫自己的Partitioner必須要保證具有相同key值的鍵值對分發到同一個Reducer。
對于map輸出的每一個鍵值對，系統都會給定一個partition，partition值預設通過計算key的hash值後對Reduce task的數量取模獲得。如果一個鍵值對的partition值為1，意味着這個鍵值對會交給第一個Reducer處理。
```
//HashPartitioner
int getParitition（key, value, numreducetask） {
	return ( key.hashCode&Integer.maxValue)%numreducetask;
}
           
```
排序sorter

會對每個分區中的資料進行排序，預設情況下依據key進行排序。
- spill溢寫
  
  将分區排序後的資料寫到本地磁盤的一個檔案中，重複上述操作，産生多個小檔案。
當溢寫結束後

[可選]combiner：Map端的reduce，在分區前

[可選]compress：減少資料量，減少網絡IO處理，但壓縮消耗CPU性能，也需要時間。

Reduce Shuffle Phase

Reduce端的shuffle主要包括三個階段，copy，sort(merge)，reduce

Copy
排序（merge）
分組group

Map-side tuning properties

Property name	Type	Default value	Description
mapreduce.task.io.sort.mb	int	100	The size, in megabytes, of the memory buffer to use while sorting map output.
mapreduce.map.sort.spill.percent	float	0.80	The threshold usage proportion for both the map output memory buffer and the record boundaries index to start the process of spilling to disk.
mapreduce.task.io.sort.facto	int	10	The maximum number of streams to merge at once when sorting files. This property is also used in the reduce. It’s fairly common to increase this to 100.
mapreduce.map.combine.minspills	int	3	The minimum number of spill files needed for the combiner to run (if a combiner is specified).
mapreduce.map.output.compress	boolean	false	Whether to compress map outputs.
mapreduce.map.output.compress.codec	Class name	org.apache.hadoop.io.compress.DefaultCodec	The compression codec to use for map outputs.
mapreduce.shuffle.max.threads	int	The number of worker threads per node manager for serving the map outputs to reducers. This is a clusterwide setting and cannot be by individual jobs. 0 means use the Netty default of twice the number of available processors.

Reduce-side tuning properties

Property name	Type	Default value	Description
mapreduce.reduce.shuffle.parallelcopies	int	5	The number of threads used to copy map outputs to the reducer.
mapreduce.reduce.shuffle.maxfetchfailures	int	10	The number of times a reducer tries to fetch a map output before reporting the error.
mapreduce.task.io.sort.factor	int	10	The maximum number of streams to merge at once when sorting files. This property is also used in the map.
mapreduce.reduce.shuffle.input.buffer.percent	float	0.70	The proportion of total heap size to be allocated to the map outputs buffer during the copy phase of the shuffle.
mapreduce.reduce.shuffle.merge.percent	float	0.66	The threshold usage proportion for the map outputs buffer (defined by mapred.job.shuffle.input.buffer.percent)for starting the process of merging the outputs and spilling to disk.
mapreduce.reduce.merge.inmem.threshol	int	1000	The threshold number of map outputs for starting the process of merging the outputs and spilling to disk. A value of 0 or less means there is no threshold, and the spill behavior is governed solely by mapreduce.reduce.shuffle.merge.percent.
mapreduce.reduce.input.buffer.percent	float	0.0	The proportion of total heap size to be used for retaining map outputs in memory during the reduce. For the reduce phase to begin, the size of map outputs in memory must be no more than this size. By default, all map outputs are merged to disk before the reduce begins, to give the reducers as much memory as possible. However, if your reducers require less memory,this value may be increased to minimize the number of trips to disk

附錄

叢集架構設計示例：

master	slave1	slave2	slave3
HDFS	datanode	datanode	datanode
namenode 9820	secondaryNamenode
namenode web 9870	secondaryNamenode web 9868
Yarn	nodemanager	nodemanager	nodemanager
resourcemanager	resourcemanager
resourcemanager web 8088
曆史服務	HistroryServer 10020
HistroryServer web 19888

服務端口

2.x、3.x版本端口差别

摘自網絡

分類	應用	Haddop 2.x port	Haddop 3 port
NNPorts	Namenode	8020	9820
NNPorts	NN HTTP UI	50070	9870
NNPorts	NN HTTPS UI	50470	9871
SNN ports	SNN HTTP	50091	9869
SNN ports	SNN HTTP UI	50090	9868
DN ports	DN IPC	50020	9867
DN ports	DN	50010	9866
DN ports	DN HTTP UI	50075	9864
DN ports	Namenode	50475	9865

元件	節點	預設端口	配置	用途說明
HDFS	DataNode	50010	dfs.datanode.address	datanode服務端口，用于資料傳輸
HDFS	DataNode	50075	dfs.datanode.http.address	http服務的端口
HDFS	DataNode	50475	dfs.datanode.https.address	https服務的端口
HDFS	DataNode	50020	dfs.datanode.ipc.address	ipc服務的端口
HDFS	NameNode	50070	dfs.namenode.http-address	http服務的端口
HDFS	NameNode	50470	dfs.namenode.https-address	https服務的端口
HDFS	NameNode	8020	fs.defaultFS	接收Client連接配接的RPC端口，用于擷取檔案系統metadata資訊。
HDFS	journalnode	8485	dfs.journalnode.rpc-address	RPC服務
HDFS	journalnode	8480	dfs.journalnode.http-address	HTTP服務
HDFS	ZKFC	8019	dfs.ha.zkfc.port	ZooKeeper FailoverController，用于NN HA
YARN	ResourceManager	8032	yarn.resourcemanager.address	RM的applications manager(ASM)端口
YARN	ResourceManager	8030	yarn.resourcemanager.scheduler.address	scheduler元件的IPC端口
YARN	ResourceManager	8031	yarn.resourcemanager.resource-tracker.address	IPC
YARN	ResourceManager	8033	yarn.resourcemanager.admin.address	IPC
YARN	ResourceManager	8088	yarn.resourcemanager.webapp.address	http服務端口
YARN	NodeManager	8040	yarn.nodemanager.localizer.address	localizer IPC
YARN	NodeManager	8042	yarn.nodemanager.webapp.address	http服務端口
YARN	NodeManager	8041	yarn.nodemanager.address	NM中container manager的端口
YARN	JobHistory Server	10020	mapreduce.jobhistory.address	IPC
YARN	JobHistory Server	19888	mapreduce.jobhistory.webapp.address	http服務端口
HBase	Master	60000	hbase.master.port	IPC
HBase	Master	60010	hbase.master.info.port	http服務端口
HBase	RegionServer	60020	hbase.regionserver.port	IPC
HBase	RegionServer	60030	hbase.regionserver.info.port	http服務端口
HBase	HQuorumPeer	2181	hbase.zookeeper.property.clientPort	HBase-managed ZK mode，使用獨立的ZooKeeper叢集則不會啟用該端口。
HBase	HQuorumPeer	2888	hbase.zookeeper.peerport	HBase-managed ZK mode，使用獨立的ZooKeeper叢集則不會啟用該端口。
HBase	HQuorumPeer	3888	hbase.zookeeper.leaderport	HBase-managed ZK mode，使用獨立的ZooKeeper叢集則不會啟用該端口。
Hive	Metastore	9083	/etc/default/hive-metastore中export PORT=來更新預設端口
Hive	HiveServer	10000	/etc/hive/conf/hive-env.sh中export HIVE_SERVER2_THRIFT_PORT=來更新預設端口
ZooKeeper	Server	2181	/etc/zookeeper/conf/zoo.cfg中clientPort=	對用戶端提供服務的端口
ZooKeeper	Server	2888	/etc/zookeeper/conf/zoo.cfg中server.x=[hostname]:nnnnn[:nnnnn]，标藍部分	follower用來連接配接到leader，隻在leader上監聽該端口。
ZooKeeper	Server	3888	/etc/zookeeper/conf/zoo.cfg中server.x=[hostname]:nnnnn[:nnnnn]，标藍部分	用于leader選舉的。隻在electionAlg是1,2或3(預設)時需要。

Halo：基于Java的開源部落格，感謝halo提供如此強大的部落格系統。 ↩︎

Hadoop基礎知識筆記Hadoop概述HDFS 分布式檔案系統Yarn叢集資源管理系統MapReduce程式設計模型附錄

Hadoop概述

概述

Hadoop主要組成

Hadoop Common：為其他Hadoop子產品提供基礎設施。

Hadoop HDFS：一個高可靠、高吞吐量的分布式檔案系統。

Hadoop MapReduce：一個分布式的離線并行計算架構。

Hadoop YARN：一個新的MapReduce架構，任務排程與資源管理。

HDFS 分布式檔案系統

HDFS架構

Hadoop讀寫流程

HDFS寫操作

HDFS讀操作

HDFS啟動流程

心跳機制

合并流程:

啟動流程

Yarn叢集資源管理系統

Yarn架構

Yarn運作機制

MapReduce程式設計模型

概述

過程及作用

InputFormat

OutputFormat

Map

Reduce

Shuffle

Map Shuffle Phase

Reduce Shuffle Phase

Map-side tuning properties

Reduce-side tuning properties

附錄

叢集架構設計示例：

服務端口

繼續閱讀