Hadoop Enterprise Development Scenario Case Study
1 Case Requirements
(1) Requirement: count the number of occurrences of each word in 1 GB of data. Hardware: 3 servers, each with 4 GB of memory and a 4-core, 4-thread CPU.
(2) Analysis:
1 GB / 128 MB (default block size) = 8 MapTasks; plus 1 ReduceTask and 1 MRAppMaster, 10 tasks in total.
On average each node runs 10 tasks / 3 servers ≈ 3–4 tasks (distributed as 4 / 3 / 3).
2 HDFS Parameter Tuning
(1) Edit hadoop-env.sh:
export HDFS_NAMENODE_OPTS="-Dhadoop.security.logger=INFO,RFAS -Xmx1024m"
export HDFS_DATANODE_OPTS="-Dhadoop.security.logger=ERROR,RFAS -Xmx1024m"
(2) Edit hdfs-site.xml:
<!-- The NameNode has a pool of worker threads; the default is 10 -->
<property>
<name>dfs.namenode.handler.count</name>
<value>21</value>
</property>
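The value 21 follows a commonly cited sizing rule of thumb: set dfs.namenode.handler.count to roughly 20 × ln(cluster size). A quick sanity check for this 3-node cluster (a minimal sketch; Python's math.log is the natural logarithm and int() truncates):
python3 -c 'import math; print(int(20 * math.log(3)))'
# 20 × ln(3) ≈ 21.97, truncated to 21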
(3) Edit core-site.xml:
<!-- Set the trash retention time to 60 minutes -->
<property>
<name>fs.trash.interval</name>
<value>60</value>
</property>
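With fs.trash.interval set to 60, a file removed through the HDFS shell is moved to the current user's .Trash directory and only purged after 60 minutes, so accidental deletions stay recoverable. A quick illustration (the file name is hypothetical):
hadoop fs -rm /wcinput/obsolete.txt   # moved to trash rather than deleted outright
hadoop fs -expunge                    # permanently remove trash checkpoints older than the interval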
(4) Distribute the configuration to all three servers
rsync -av <file-to-distribute> <username>@<hostname>:<destination-directory>
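For example, pushing the three files edited in this section to another server might look like the following (the user name atguigu and the install path /opt/module/hadoop-3.1.3 are illustrative assumptions):
# run from the local etc/hadoop directory
rsync -av hadoop-env.sh hdfs-site.xml core-site.xml atguigu@hadoop103:/opt/module/hadoop-3.1.3/etc/hadoop/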
3 MapReduce Parameter Tuning
(1) Edit mapred-site.xml:
<!-- Size of the circular (sort/spill) buffer; default 100 MB -->
<property>
<name>mapreduce.task.io.sort.mb</name>
<value>100</value>
</property>
<!-- Spill threshold of the circular buffer; default 0.80 -->
<property>
<name>mapreduce.map.sort.spill.percent</name>
<value>0.80</value>
</property>
<!-- Number of streams merged at once during a merge; default 10 -->
<property>
<name>mapreduce.task.io.sort.factor</name>
<value>10</value>
</property>
<!-- MapTask memory; default 1 GB. The MapTask heap size (mapreduce.map.java.opts) defaults to the same value -->
<property>
<name>mapreduce.map.memory.mb</name>
<value>-1</value>
<description>
The amount of memory to request from the scheduler for each map task. If this is not specified or is non-positive, it is inferred from mapreduce.map.java.opts and mapreduce.job.heap.memory-mb.ratio. If java-opts are also not specified, we set it to 1024.
</description>
</property>
<!-- Number of CPU vcores per MapTask; default 1 -->
<property>
<name>mapreduce.map.cpu.vcores</name>
<value>1</value>
</property>
<!-- Maximum retries for a failed MapTask; default 4 -->
<property>
<name>mapreduce.map.maxattempts</name>
<value>4</value>
</property>
<!-- Number of parallel copiers each ReduceTask uses to fetch map output; default 5 -->
<property>
<name>mapreduce.reduce.shuffle.parallelcopies</name>
<value>5</value>
</property>
<!-- Fraction of ReduceTask memory allocated to the shuffle input buffer; default 0.70 -->
<property>
<name>mapreduce.reduce.shuffle.input.buffer.percent</name>
<value>0.70</value>
</property>
<!-- Fraction of the buffer at which data starts spilling to disk; default 0.66 -->
<property>
<name>mapreduce.reduce.shuffle.merge.percent</name>
<value>0.66</value>
</property>
<!-- ReduceTask memory; default 1 GB. The ReduceTask heap size (mapreduce.reduce.java.opts) defaults to the same value -->
<property>
<name>mapreduce.reduce.memory.mb</name>
<value>-1</value>
<description>The amount of memory to request from the scheduler for each reduce task. If this is not specified or is non-positive, it is inferred from mapreduce.reduce.java.opts and mapreduce.job.heap.memory-mb.ratio. If java-opts are also not specified, we set it to 1024.
</description>
</property>
<!-- Number of CPU vcores per ReduceTask; default 1 (raised to 2 here) -->
<property>
<name>mapreduce.reduce.cpu.vcores</name>
<value>2</value>
</property>
<!-- Maximum retries for a failed ReduceTask; default 4 -->
<property>
<name>mapreduce.reduce.maxattempts</name>
<value>4</value>
</property>
<!-- Fraction of MapTasks that must complete before resources are requested for ReduceTasks; default 0.05 -->
<property>
<name>mapreduce.job.reduce.slowstart.completedmaps</name>
<value>0.05</value>
</property>
<!-- If the task reads no data within the default 10 minutes, it is forcibly timed out and killed (value in milliseconds) -->
<property>
<name>mapreduce.task.timeout</name>
<value>600000</value>
</property>
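These values can also be overridden per job instead of cluster-wide: the examples jar goes through Hadoop's generic option parsing, so -D property=value pairs placed before the input/output paths take precedence over mapred-site.xml. A sketch with illustrative memory values:
hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.3.jar wordcount \
  -D mapreduce.map.memory.mb=2048 \
  -D mapreduce.reduce.memory.mb=2048 \
  /wcinput /wcoutput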
(2) Distribute the configuration file to the servers
rsync -av <file-to-distribute> <username>@<hostname>:<destination-directory>
4 YARN Parameter Tuning
(1) Edit yarn-site.xml:
<!-- Scheduler selection; the Capacity Scheduler is the default -->
<property>
<description>The class to use as the resource scheduler.</description>
<name>yarn.resourcemanager.scheduler.class</name>
<value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
</property>
<!-- Number of threads the ResourceManager uses to handle scheduler requests; default 50. If more than 50 jobs are submitted, this can be raised, but not beyond 3 servers × 4 threads = 12 threads (in practice no more than 8 once other processes are accounted for) -->
<property>
<description>Number of threads to handle scheduler interface.</description>
<name>yarn.resourcemanager.scheduler.client.thread-count</name>
<value>8</value>
</property>
<!-- Whether YARN auto-detects hardware capabilities; default false. If the node runs many other applications, manual configuration is recommended; if it runs nothing else, auto-detection can be used -->
<property>
<description>Enable auto-detection of node capabilities such as memory and CPU.</description>
<name>yarn.nodemanager.resource.detect-hardware-capabilities</name>
<value>false</value>
</property>
<!-- Whether logical processors (e.g. hyperthreads) count as CPU cores; default false, i.e. physical cores are used -->
<property>
<description>Flag to determine if logical processors (such as hyperthreads) should be counted as cores. Only applicable on Linux when yarn.nodemanager.resource.cpu-vcores is set to -1 and yarn.nodemanager.resource.detect-hardware-capabilities is true.
</description>
<name>yarn.nodemanager.resource.count-logical-processors-as-cores</name>
<value>false</value>
</property>
<!-- Multiplier for converting physical cores to vcores; default 1.0 -->
<property>
<description>Multiplier to determine how to convert physical cores to vcores. This value is used if yarn.nodemanager.resource.cpu-vcores is set to -1 (which implies auto-calculate vcores) and yarn.nodemanager.resource.detect-hardware-capabilities is set to true. The number of vcores will be calculated as number of CPUs * multiplier.
</description>
<name>yarn.nodemanager.resource.pcores-vcores-multiplier</name>
<value>1.0</value>
</property>
<!-- Memory available to the NodeManager; default 8 GB, reduced to 4 GB here -->
<property>
<description>Amount of physical memory, in MB, that can be allocated for containers. If set to -1 and yarn.nodemanager.resource.detect-hardware-capabilities is true, it is automatically calculated (in case of Windows and Linux). In other cases, the default is 8192 MB.
</description>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>4096</value>
</property>
<!-- Number of NodeManager CPU vcores; defaults to 8 when not auto-detected from hardware, set to 4 here -->
<property>
<description>Number of vcores that can be allocated for containers. This is used by the RM scheduler when allocating resources for containers. This is not used to limit the number of CPUs used by YARN containers. If it is set to -1 and yarn.nodemanager.resource.detect-hardware-capabilities is true, it is
automatically determined from the hardware in case of Windows and Linux. In other cases, number of vcores is 8 by default.
</description>
<name>yarn.nodemanager.resource.cpu-vcores</name>
<value>4</value>
</property>
<!-- Minimum memory per container; default 1 GB -->
<property>
<description>The minimum allocation for every container request at the RM in MBs. Memory requests lower than this will be set to the value of this property. Additionally, a node manager that is configured to have
less memory than this value will be shut down by the resource manager.
</description>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>1024</value>
</property>
<!-- Maximum memory per container; default 8 GB, reduced to 2 GB here -->
<property>
<description>The maximum allocation for every container request at the RM in MBs. Memory requests higher than this will throw an InvalidResourceRequestException.
</description>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>2048</value>
</property>
<!-- Minimum vcores per container; default 1 -->
<property>
<description>The minimum allocation for every container request at the RM in terms of virtual CPU cores. Requests lower than this will be set to the value of this property. Additionally, a node manager that is configured to have fewer virtual cores than this value will be shut down by the
resource manager.
</description>
<name>yarn.scheduler.minimum-allocation-vcores</name>
<value>1</value>
</property>
<!-- Maximum vcores per container; default 4, reduced to 2 here -->
<property>
<description>The maximum allocation for every container request at the RM in terms of virtual CPU cores. Requests higher than this will throw an InvalidResourceRequestException.
</description>
<name>yarn.scheduler.maximum-allocation-vcores</name>
<value>2</value>
</property>
<!-- Virtual-memory check; on by default, turned off here -->
<property>
<description>Whether virtual memory limits will be enforced for containers.</description>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
</property>
<!-- Ratio of virtual memory to physical memory; default 2.1 -->
<property>
<description>Ratio between virtual memory to physical memory when setting memory limits for containers. Container allocations are expressed in terms of physical memory, and virtual memory usage is allowed to exceed this allocation by this ratio.
</description>
<name>yarn.nodemanager.vmem-pmem-ratio</name>
<value>2.1</value>
</property>
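Taken together: each NodeManager offers 4 GB and 4 vcores, and a single container may receive 1–2 GB and 1–2 vcores, so one node can host at most four minimum-sized or two maximum-sized containers, consistent with the 3–4 tasks per node estimated in section 1. After YARN restarts, the registered NodeManagers can be listed from the shell:
yarn node -list -all   # lists every NodeManager; per-node memory/vcores are also visible on the RM web UI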
(2) Distribute the configuration file to the servers
rsync -av <file-to-distribute> <username>@<hostname>:<destination-directory>
5 Run the Program
(1) Restart the YARN cluster
sbin/stop-yarn.sh
sbin/start-yarn.sh
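After the restart, a quick sanity check is to run jps on each server and confirm the expected daemons are up (which daemon runs where depends on the cluster layout):
jps   # expect ResourceManager on the RM node and NodeManager on every worker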
(2) Run the WordCount program
hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.3.jar wordcount /wcinput /wcoutput
Note: run the command from the Hadoop installation directory. /wcinput is the directory containing the 1 GB of data to be counted, and /wcoutput is the directory where the results will be written.
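If the input directory has not been created yet, the test data can be staged first (assuming the 1 GB file work.txt sits in the current local directory):
hadoop fs -mkdir -p /wcinput
hadoop fs -put work.txt /wcinput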
(3) Watch the job on the YARN web UI
URL: hadoop103:8088
(4) Results
Original contents of /wcinput/work.txt: (screenshot not reproduced here)
Result: the output directory /wcoutput is generated.
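Because the job ran with a single ReduceTask, all counts land in one part file, which can be read directly from HDFS:
hadoop fs -cat /wcoutput/part-r-00000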