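Essential Interview Skill: Hive SQL Optimization

王知無 (Big Data Technology and Architecture)

Hive SQL covers most offline data-processing scenarios in the big data field. Optimizing Hive SQL is therefore a skill we must master, and it is sure to come up in interviews. I expect a candidate to be able to name about 80% of the optimization points below in order to pass on this topic.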

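Hive optimization goals
- Higher execution efficiency with limited resources
Common problems
- Data skew
- Setting the number of map tasks
- Setting the number of reduce tasks
- Others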

Hive execution

HQL --> Job --> Map/Reduce
Execution plan
explain [extended] hql
Example
select col,count(1) from test2 group by col;
explain select col,count(1) from test2 group by col;
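Hive table optimization
Partitioning
- Static partitions
- Dynamic partitions, enabled with:
- set hive.exec.dynamic.partition=true;
- set hive.exec.dynamic.partition.mode=nonstrict;
Bucketing
- set hive.enforce.bucketing=true;
- set hive.enforce.sorting=true;
- Bucketing keeps rows with the same key clustered together as far as possible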
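Hive job optimization
Parallel execution
- Hive turns each query into several stages; stages that do not depend on each other can run in parallel, which shortens execution time
- set hive.exec.parallel=true;
- set hive.exec.parallel.thread.number=8;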
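Local execution
- set hive.exec.mode.local.auto=true;
- A job actually runs in local mode only when it meets all of the following conditions:
- The job's input data size must be less than hive.exec.mode.local.auto.inputbytes.max (default 128 MB)
- The job's number of map tasks must be less than hive.exec.mode.local.auto.tasks.max (default 4)
- The job's number of reduce tasks must be 0 or 1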
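The local-mode conditions listed in this section can be written as a single check. The following is a minimal sketch, not Hive's actual implementation: the function name `can_run_in_local_mode` and its parameters are hypothetical, and the defaults simply mirror the values quoted in the text (128 MB and 4).

```python
MB = 1024 * 1024


def can_run_in_local_mode(input_bytes, num_maps, num_reduces,
                          max_input_bytes=128 * MB, max_tasks=4):
    """Return True when all three local-mode conditions from the text hold.

    max_input_bytes stands in for hive.exec.mode.local.auto.inputbytes.max,
    max_tasks for hive.exec.mode.local.auto.tasks.max (hypothetical names).
    """
    small_input = input_bytes < max_input_bytes   # input smaller than the byte limit
    few_maps = num_maps < max_tasks               # fewer map tasks than the limit
    few_reduces = num_reduces in (0, 1)           # zero or one reduce task
    return small_input and few_maps and few_reduces


small_job = can_run_in_local_mode(64 * MB, num_maps=2, num_reduces=1)
big_input = can_run_in_local_mode(256 * MB, num_maps=2, num_reduces=1)
many_reduces = can_run_in_local_mode(64 * MB, num_maps=2, num_reduces=3)
print(small_job, big_input, many_reduces)
```

A job that fails any one of the three conditions still runs on the cluster even when hive.exec.mode.local.auto is true.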
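Merging small input files for a job
- set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
- How many files are merged into one split is determined by the size limit mapred.max.split.size
Merging small output files for a job
- set hive.merge.smallfiles.avgsize=256000000; when the average output file size is below this value, an extra job is started to merge the files
- set hive.merge.size.per.task=64000000; the size of each file after merging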
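JVM reuse
- set mapred.job.reuse.jvm.num.tasks=20;
- JVM reuse lets a job hold on to its slots until the job finishes. This is very useful for jobs with many tasks or many small files, because it reduces execution time. The value should not be set too high, however: some jobs have reduce tasks, and until those reduce tasks finish, the slots held by the map tasks cannot be released, so other jobs may have to wait.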
Compressing data
- set hive.exec.compress.output=true;
- set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
- set mapred.output.compression.type=BLOCK;
- set hive.exec.compress.intermediate=true;
- set hive.intermediate.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
- set hive.intermediate.compression.type=BLOCK;

  Intermediate compression applies to the data passed between the multiple jobs of a Hive query. For intermediate compression, choose a codec with low CPU cost.

  The final output of a Hive query can also be compressed.
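Hive Map optimization
set mapred.map.tasks=10; does not take effect directly; the number of maps is computed as follows
(1) Default number of maps
- default_num=total_size/block_size;
(2) Desired number of maps
- goal_num=mapred.map.tasks;
(3) Size of the data each map processes
- split_size=max(mapred.min.split.size,block_size);
- split_num=total_size/split_size;
(4) Computed number of maps
- compute_map_num=min(split_num,max(default_num,goal_num))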
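Steps (1) to (4) can be followed with a short calculation. This is a minimal sketch of the simplified formula as stated in this article, not Hadoop's actual split logic; the function name, its parameters, and the use of ceiling division are assumptions made for illustration.

```python
def ceil_div(a, b):
    """Integer ceiling division (assumption: partial blocks count as one)."""
    return -(-a // b)


def compute_map_num(total_size, block_size, min_split_size, goal_num):
    """Apply steps (1)-(4): min(split_num, max(default_num, goal_num))."""
    default_num = ceil_div(total_size, block_size)     # (1) default number of maps
    split_size = max(min_split_size, block_size)       # (3) size handled per map
    split_num = ceil_div(total_size, split_size)
    return min(split_num, max(default_num, goal_num))  # (4), goal_num is step (2)


MB = 1024 * 1024
total = 10 * 1024 * MB  # 10 GB of input, 128 MB blocks

# mapred.min.split.size left tiny: the block size decides the split size
maps_default = compute_map_num(total, 128 * MB, min_split_size=1, goal_num=10)
# mapred.min.split.size raised to 256 MB: half as many maps
maps_bigger_split = compute_map_num(total, 128 * MB, min_split_size=256 * MB, goal_num=10)

print(maps_default, maps_bigger_split)  # 80 40
```

Under this model, raising mapred.min.split.size above the block size is what reduces the map count.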
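Based on the analysis above, setting the number of maps can be summarized as follows:
- To increase the number of maps, set mapred.map.tasks to a larger value
- To reduce the number of maps, set mapred.min.split.size to a larger value; there are two cases
- Case 1: the input is huge in total size but does not consist of small files. Increase mapred.min.split.size.
- Case 2: the input consists of a huge number of small files, each smaller than blockSize. Increasing mapred.min.split.size does not help here; use CombineFileInputFormat to merge multiple input paths into one InputSplit for each mapper, which reduces the number of mappers.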
Map-side aggregation
- set hive.map.aggr=true;
Speculative execution
- mapred.map.tasks.speculative.execution
Hive Shuffle optimization
Map side
- io.sort.mb
- io.sort.spill.percent
- min.num.spills.for.combine
- io.sort.factor
- io.sort.record.percent
Reduce side
- mapred.reduce.parallel.copies
- mapred.reduce.copy.backoff
- mapred.job.shuffle.input.buffer.percent
Hive Reduce optimization
Queries that need a reduce phase
- Aggregate functions
- sum,count,distinct...
- Advanced queries
- group by,join,distribute by,cluster by...
- order by is a special case: it uses only one reduce task
Speculative execution
- mapred.reduce.tasks.speculative.execution
- hive.mapred.reduce.tasks.speculative.execution
Setting the number of reduce tasks
- set mapred.reduce.tasks=10; sets the number directly
- hive.exec.reducers.max default: 999
- hive.exec.reducers.bytes.per.reducer default: 1G
- Formula used when the number is not set directly
- numRTasks = min[maxReducers,input.size/perReducer]
- maxReducers = hive.exec.reducers.max
- perReducer = hive.exec.reducers.bytes.per.reducer
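The reducer formula numRTasks = min[maxReducers, input.size/perReducer] can be tried out with a few input sizes. This is a minimal sketch under stated assumptions: the function name is hypothetical, the defaults are the values quoted in this article (999 reducers and 1G, taken here as 1,000,000,000 bytes), and rounding up with a minimum of one reducer is an assumption rather than something the article states.

```python
def estimate_reducers(input_size, bytes_per_reducer=1_000_000_000, max_reducers=999):
    """numRTasks = min(maxReducers, input.size / perReducer).

    bytes_per_reducer stands in for hive.exec.reducers.bytes.per.reducer,
    max_reducers for hive.exec.reducers.max.
    """
    # Ceiling division so that a partial chunk still gets a reducer (assumption).
    wanted = -(-input_size // bytes_per_reducer)
    # Never return fewer than one reducer (assumption).
    return max(1, min(max_reducers, wanted))


GB = 1_000_000_000
small = estimate_reducers(10 * GB)     # 10 chunks of 1 GB each
huge = estimate_reducers(5000 * GB)    # capped by max_reducers
tuned = estimate_reducers(10 * GB, bytes_per_reducer=GB // 2)  # smaller chunks, more reducers

print(small, huge, tuned)  # 10 999 20
```

Lowering the bytes-per-reducer value is therefore one way to get more reducers without fixing the count through mapred.reduce.tasks.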
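Hive query optimization
join optimization
- hive.optimize.skewjoin=true; set this to true if skew occurs during a join
- set hive.skewjoin.key=100000; a join key with more records than this value is treated as skewed and optimized
- mapjoin:
- set hive.auto.convert.join=true;
- hive.mapjoin.smalltable.filesize defaults to 25 MB
- select /*+ mapjoin(A) */ f.a,f.b from A t join B f on (f.a=t.a)
- In short, mapjoin fits the following scenarios:
- One of the joined tables is very small
- Non-equi join operations
- bucket join:
- Both tables are bucketed in the same way
- The bucket counts of the two tables are multiples of each other
- create table `order`(cid int,price float) clustered by(cid) into 32 buckets;
- create table customer(id int,first string) clustered by(id) into 32 buckets;
- select price from `order` t join customer s on t.cid=s.id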
join before optimization
- select m.cid,u.id from `order` m join customer u on m.cid=u.id where m.dt='2013-12-12';
join after optimization
- select m.cid,u.id from (select cid from `order` where dt='2013-12-12')m join customer u on m.cid=u.id;
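The rewrite filters the order rows before the join instead of after it, so both forms must return the same rows. The sketch below checks only that equivalence, using Python's built-in sqlite3 module as a stand-in for Hive; it says nothing about Hive's execution cost. The table name `orders` (used to avoid quoting the reserved word) and the sample rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table orders(cid int, dt text);
    create table customer(id int);
    insert into orders values
        (1, '2013-12-12'), (2, '2013-12-12'),
        (1, '2013-12-13'), (3, '2013-12-13');
    insert into customer values (1), (2), (3);
""")

# Before: join first, filter on dt afterwards.
before = """
    select m.cid, u.id
    from orders m join customer u on m.cid = u.id
    where m.dt = '2013-12-12'
"""

# After: filter on dt inside a subquery, then join the smaller result.
after = """
    select m.cid, u.id
    from (select cid from orders where dt = '2013-12-12') m
    join customer u on m.cid = u.id
"""

rows_before = sorted(conn.execute(before).fetchall())
rows_after = sorted(conn.execute(after).fetchall())
print(rows_before == rows_after)  # True
```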
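group by optimization
- hive.groupby.skewindata=true; set this to true if skew occurs during group by
- set hive.groupby.mapaggr.checkinterval=100000; the number of rows after which map-side aggregation checks the size of its grouping keys and aggregation state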
count distinct optimization
Before optimization
- select count(distinct id) from tablename
After optimization
- select count(1) from (select distinct id from tablename) tmp;
- select count(1) from (select id from tablename group by id) tmp;
Before optimization, with several count distinct expressions
- select a,sum(b),count(distinct c),count(distinct d) from test group by a
After optimization
- select a,sum(b) as b,count(c) as c,count(d) as d from(select a,0 as b,c,null as d from test group by a,c union all select a,0 as b,null as c,d from test group by a,d union all select a,b,null as c,null as d from test)tmp1 group by a;
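The union all rewrite of the query with two count(distinct ...) expressions is easier to trust after running it next to the original. The sketch below does that with Python's built-in sqlite3 module standing in for Hive, so it demonstrates result equivalence only, not performance. The sample rows in `test` are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table test(a text, b int, c int, d int);
    insert into test values
        ('x', 1, 10, 100),
        ('x', 2, 10, 200),
        ('x', 3, 20, 200),
        ('y', 4, 30, 300),
        ('y', 5, 30, 300);
""")

# Original form: two count(distinct ...) in one group by.
original = """
    select a, sum(b), count(distinct c), count(distinct d)
    from test group by a
"""

# Rewritten form: de-duplicate (a, c) and (a, d) with group by,
# keep b from the raw rows, then count the non-null values per a.
rewritten = """
    select a, sum(b) as b, count(c) as c, count(d) as d
    from (
        select a, 0 as b, c, null as d from test group by a, c
        union all
        select a, 0 as b, null as c, d from test group by a, d
        union all
        select a, b, null as c, null as d from test
    ) tmp1
    group by a
"""

rows_original = sorted(conn.execute(original).fetchall())
rows_rewritten = sorted(conn.execute(rewritten).fetchall())
print(rows_original == rows_rewritten)  # True
```

The rewrite works because count(c) and count(d) skip the null placeholders, so each branch of the union contributes to exactly one output column.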
