HiveSQL常用技巧總結

HiveSQL常用技巧

- 一、去重技巧 group by <-- distinct
- - 在2019年購買後又退款的使用者
- 二、聚合技巧--利用視窗函數grouping sets/cube/rollup
- - 1. grouping sets
  - 使用者性别分布及每個性格的城市分布
  - 2. cube
  - 性别、城市、等級的各種組合的使用者分布
  - 3. rollup
  - 同時計算出每個月的支付金額，以及每年的支付金額
- 三、union all 時可以開啟并發執行
- - 每個使用者的支付和退款金額彙總
- 四、使用lateral view 進行行列轉換
- - 每個品類的購買使用者數
- 五、表連接配接優化
- 六、解決資料傾斜
- - 資料傾斜表現
  - 資料傾斜産生原因和解決辦法
- 七、如何計算按月累計去重
- - 用sum() over() 計算按一定周期進行累計求和，如何計算按月累計去重？

一、去重技巧 group by <-- distinct

###取出user_trade表中全部支付使用者

原來寫法：

select distinct user_name
from user_trade
where dt >'0';

優化寫法：

select user_name
from user_trade
where dt >'0'
group by user_name;

執行時長對比：

原有寫法：

HiveSQL常用技巧總結

優化寫法：

HiveSQL常用技巧總結

在2019年購買後又退款的使用者

原有寫法：

select a.user_name
from
	(select distinct  user_name
	from user_trade
	where year(dt)=2019) a 
	join 
	(select distict user_name
	from user_refund 
	where year(dt)=2019 ) b on a.user_name= b.name;

優化寫法：

select a.user_name
from
	(select   user_name
	from user_trade
	where year(dt)=2019
	group by user_name ) a 
	join 
	(select  user_name
	from user_refund 
	where year(dt)=2019
	group by user_name ) b on a.user_name= b.name;

注意：使用場景僅限于去重，不可以用在去重計算中count(distinct ** )

當資料量比較大且重複值比較多的時候，使用group by 去重，再count() 計數，比count(distinct ** )效率高

二、聚合技巧–利用視窗函數grouping sets/cube/rollup

1. grouping sets

使用者的性别分布、城市分布、等級分布

正常寫法：

性别分布：

select sex, count(distinct user_id) from user_info group by sex;

城市分布：

```
	select city,
			count(distinct user_id)
	from user_info
	group by city;
	```
	
	等級分布：
	
	```
	select level,
			count(distinct user_id)
	from user_info
	group by level;
	```
	缺點：分别寫三次SQL,要執行三次，重複工作費時

優化寫法：

select sex,
		city, 
		level,
		count(distinct user_id)
from user_info
grouping sets(sex,city,level);

HiveSQL常用技巧總結

等價于将不同次元的group by 結果進行union all

使用者性别分布及每個性格的城市分布

正常寫法：

性别分布

select sex,
		count(distinct user_id)
from user_info
group by sex;

每個性别的城市分布：

select sex,
		city,
		count(distinct user_id)
from user_info
group by sex, city;

優化寫法：

select sex,
		city,
		count(distinct user_id)
from user_info
group by sex, city
grouping sets (set,(set,city));

HiveSQL常用技巧總結

注意：第二列為空即為性别的分布；

2. cube

cube：根據group by次元的所有組合進行聚合

性别、城市、等級的各種組合的使用者分布

正常寫法：

select sex,
		city,
		level,
		count(distinct user_id)
from user_info
group by sex, city, level
grouping sets (set,city,level, (set,city),(sex,level),(city,level),(sex,city,level));

優化寫法：

select sex,
		city,
		level,
		count(distinct user_id)
from user_info
group by sex, city, level
with cube;

3. rollup

rollup: 以最左側的次元為主，進行層級聚合，是cube的子集

同時計算出每個月的支付金額，以及每年的支付金額

正常寫法：

select a.dt, 
		sum(a.year_amount),
		sum(a.month_amount)
from
	(select substr(dt,1,4) as dt,
			sum(pay_amount) as year_amount,
			0 as month_amount
	from user_trade
	where dt>'0'
	gourp by substr(dt,1,4)
	union all
	select substr(dt,1,7) as dt,
			0 as year_amount,
			sum(pay_amount) as month_amount
	from user_trade
	where dt>'0'
	group by substr(dt,1,7) )  a
group by a.dt;

HiveSQL常用技巧總結

優化寫法：

select  year(dt) as year,
		month(dt) as month,
		sum(pay_amount)
from user_trade
where dt>'0'
group by year(dt),month(dt)
with rollup;

HiveSQL常用技巧總結

三、union all 時可以開啟并發執行

參數設定：set hive.exec.parallel=true 開啟并發執行，提高執行效率

每個使用者的支付和退款金額彙總

select  a.user_name,
		a.pay_amount,
		a.refund_amount
from
	(select  user_name,
	 		sum(pay_amount) as pay_amount,
	 		0 as refund_amount
	from user_trade
	where dt>'0'
	group by user_name
	union all
	select user_name,
			0 as pay_amound,
			sum(refund_amount) as refund_amount
	from user_refund
	where dt>'0'
	group by user_name) a 
group by a.user_name;

時間對比：

未開啟并發：

HiveSQL常用技巧總結

開啟并發：

HiveSQL常用技巧總結

四、使用lateral view 進行行列轉換

每個品類的購買使用者數

select b.category,
		count(distinct a.user_name)
from user_goods_category a
lateral view explode(split(category_detail,',')) b as category
group by b.category;

HiveSQL常用技巧總結

explode:行轉列函數

concat_ws(’,’ , collect_set(column)) :列轉行函數

五、表連接配接優化

小表在前，大表在後

hive 預設查詢最後一個表是大表，會将其他表緩存起來，然後掃描最後那個表
使用相同的連接配接鍵

當對3個或以上表進行join連接配接時，如果每個on 都使用相同的連接配接鍵，那麼隻會産生一個MapReduce job
盡早過濾資料
邏輯複雜時引入中間表

六、解決資料傾斜

資料傾斜表現

任務進度長時間維持在99%（或者100%），任務監控頁面，隻有少量（1個或者幾個）reduce子任務未完成。因為其處理的資料量和其他reduce差異比較大

資料傾斜産生原因和解決辦法

空值産生的資料傾斜

解決：如果兩個表連接配接時，使用的連接配接條件有很多空值時，建議在連接配接條件中增加過濾

列如：on a.user_id=b.user_id and a.user_id is not null
大小表連接配接（一張表很大，另一張表很小）

解決：将小表放在記憶體中，在map端做join

select b.*
from a join b on a.*=ｂ.*;

兩個表連接配接條件的字段資料類型不一緻

解決：兩個表連接配接條件的字段資料類型轉換成一緻的

列如：on a.user_id = cast(b.user_id as string)

七、如何計算按月累計去重

用sum() over() 計算按一定周期進行累計求和，如何計算按月累計去重？

2017年、2018年按月累計去重的購買使用者數

select b.year,
		b.month,
		sum(b.user_nem) over (partition by b.year order by b.month)
from
		(select a.year,
				a.month,
				count(distinct a.user_name) user_num
		from
				(select  year(dt) as dt,
						user_name,
						min(month(dt)) as month
				from user_trade
				where year(dt) in (2017,2018)
				group by year(dt), user_name ) a
		group by a.year, a.month ) b 
order by b.year, b.month
limit 24;

HiveSQL常用技巧總結

第二種寫法:

set hive.mapred.mode=nonstrict;
select 
from 
	(select  substr(dt,1,7 ) as month,
			user_name
	from user_trade
	where year(dt) in (2017,2018)
	group by substr(dt,1,7 ) as month,
			user_name) a
	cross join
	(select 
	from user_trade
	where )

HiveSQL常用技巧總結