天天看點

hive優化之鍊式存儲加壓縮(Parque+Snappy、ORC+Snappy)

總體優點:檔案小,加載資料量變少

第一部分:隻用Parque格式存儲資料

  • 建立表,使用列式存儲Parquet格式存儲資料
  • 導入的資料檔案格式:必須是Parquet格式
CREATE TABLE db_yhd.track_log_parquet(
id              STRING,
url             STRING,
......
linkPosition    STRING,
buttonPosition  STRING
)
PARTITIONED BY (date_str STRING ,hour_str STRING )
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' 
LINES TERMINATED BY '\n'
STORED AS PARQUET;
           

注:與非parquet表創表差別在最後一行

  • 加載資料到鍊式存儲的分區表中
INSERT INTO TABLE track_log_parquet 
PARTITION ( date_str="20150828" , hour_str="18" ) 
select
    id,url,referer,keyword,type,guid,pageId,moduleId,linkId,attachedInfo,sessionId,trackerU,trackerType,ip,trackerSrc,cookie,orderCode,trackTime,endUserId,firstLink,sessionViewNo,productId,curMerchantId,provinceId,cityId,fee,edmActivity,edmEmail,edmJobId,ieVersion,platform,internalKeyword,resultSum,currentPage,linkPosition,buttonPosition
from 
   db_yhd.track_log
where 
   date_str="20150828" and hour_str="18"  
           
INSERT INTO TABLE track_log_parquet 
PARTITION ( date_str="20150828" , hour_str="19" ) 
select
    id,url,referer,keyword,type,guid,pageId,moduleId,linkId,attachedInfo,sessionId,trackerU,trackerType,ip,trackerSrc,cookie,orderCode,trackTime,endUserId,firstLink,sessionViewNo,productId,curMerchantId,provinceId,cityId,fee,edmActivity,edmEmail,edmJobId,ieVersion,platform,internalKeyword,resultSum,currentPage,linkPosition,buttonPosition
from 
   db_yhd.track_log
where 
   date_str="20150828" and hour_str="19"  
           
  • 測試,對比鍊式存儲和不是鍊式存儲的資料結果是否改變,以及大小的變動

—非鍊式存儲:

select
   date_str,hour_str,count(url) as pv 
from
    db_yhd.track_log 
where 
    date_str="20150828" and length(trim(url))>0
group by
    date_str,hour_str
           

—鍊式存儲:

select
   date_str,hour_str,count(url) as pv 
from
    db_yhd.track_log_parquet 
where 
    date_str="20150828" and length(trim(url))>0
group by
    date_str,hour_str
           

現象:查詢的結果相同,但map的輸出大小不同,鍊式存儲明顯小很多

第二部分:Parquet列式存儲+snappy壓縮

創表時,在parquet表的基礎上設定壓縮格式

set parquet.compression=snappy;
           

後面同上,最後現象:

查詢的結果相同,但map的輸出大小不同,parquet鍊式存儲+snappy比parquet列式存儲更小

第三部分:orc+snappy

創表時也可以設定壓縮格式

CREATE TABLE db_yhd.track_log_orc_snappy(
id              STRING,
url             STRING,
......
linkPosition    STRING,
buttonPosition  STRING
)
PARTITIONED BY (date_str STRING ,hour_str STRING )
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' 
LINES TERMINATED BY '\n'
STORED AS ORC
TBLPROPERTIES ("orc.compress"="SNAPPY") ;
           

注:

STORED AS ORC:聲明orc列式存儲

TBLPROPERTIES (“orc.compress”=“SNAPPY”) :聲明snappy壓縮