總體優點:檔案小,加載資料量變少
第一部分:隻用Parque格式存儲資料
- 建立表,使用列式存儲Parquet格式存儲資料
- 導入的資料檔案格式:必須是Parquet格式
CREATE TABLE db_yhd.track_log_parquet(
id STRING,
url STRING,
......
linkPosition STRING,
buttonPosition STRING
)
PARTITIONED BY (date_str STRING ,hour_str STRING )
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS PARQUET;
注:與非parquet表創表差別在最後一行
- 加載資料到鍊式存儲的分區表中
INSERT INTO TABLE track_log_parquet
PARTITION ( date_str="20150828" , hour_str="18" )
select
id,url,referer,keyword,type,guid,pageId,moduleId,linkId,attachedInfo,sessionId,trackerU,trackerType,ip,trackerSrc,cookie,orderCode,trackTime,endUserId,firstLink,sessionViewNo,productId,curMerchantId,provinceId,cityId,fee,edmActivity,edmEmail,edmJobId,ieVersion,platform,internalKeyword,resultSum,currentPage,linkPosition,buttonPosition
from
db_yhd.track_log
where
date_str="20150828" and hour_str="18"
INSERT INTO TABLE track_log_parquet
PARTITION ( date_str="20150828" , hour_str="19" )
select
id,url,referer,keyword,type,guid,pageId,moduleId,linkId,attachedInfo,sessionId,trackerU,trackerType,ip,trackerSrc,cookie,orderCode,trackTime,endUserId,firstLink,sessionViewNo,productId,curMerchantId,provinceId,cityId,fee,edmActivity,edmEmail,edmJobId,ieVersion,platform,internalKeyword,resultSum,currentPage,linkPosition,buttonPosition
from
db_yhd.track_log
where
date_str="20150828" and hour_str="19"
- 測試,對比鍊式存儲和不是鍊式存儲的資料結果是否改變,以及大小的變動
—非鍊式存儲:
select
date_str,hour_str,count(url) as pv
from
db_yhd.track_log
where
date_str="20150828" and length(trim(url))>0
group by
date_str,hour_str
—鍊式存儲:
select
date_str,hour_str,count(url) as pv
from
db_yhd.track_log_parquet
where
date_str="20150828" and length(trim(url))>0
group by
date_str,hour_str
現象:查詢的結果相同,但map的輸出大小不同,鍊式存儲明顯小很多
第二部分:Parquet列式存儲+snappy壓縮
創表時,在parquet表的基礎上設定壓縮格式
set parquet.compression=snappy;
後面同上,最後現象:
查詢的結果相同,但map的輸出大小不同,parquet鍊式存儲+snappy比parquet列式存儲更小
第三部分:orc+snappy
創表時也可以設定壓縮格式
CREATE TABLE db_yhd.track_log_orc_snappy(
id STRING,
url STRING,
......
linkPosition STRING,
buttonPosition STRING
)
PARTITIONED BY (date_str STRING ,hour_str STRING )
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS ORC
TBLPROPERTIES ("orc.compress"="SNAPPY") ;
注:
STORED AS ORC:聲明orc列式存儲
TBLPROPERTIES (“orc.compress”=“SNAPPY”) :聲明snappy壓縮