天天看点

hive窗口函数经典案列--over()函数

每个用户截止到每月为止的最大单月访问次数和累计到该月的总访问次数

原始数据
三个字段的意思:
用户名,月份,访问次数
A,2015-01,5 
A,2015-01,15 
B,2015-01,5 
A,2015-01,8 
B,2015-01,25 
A,2015-01,5 


目标数据
用户名,月份,访问次数  最大访问次数  总访问次数
A 1 20 20 20 
A 2 10 30 20 
...
B 4 50 82 82
           

解决方法:

方法一:

select name, 
month,
 pv, 
sum(pv) over (partition by name order by month asc rows between unbounded preceding and current row) as spv,
max(pv) over (partition by name order by month asc rows between unbounded preceding and current row) as mpv from exercise_pv_temp;

第一个要点:sum max count min avg partition by name order by month asc rows between unbounded preceding and current row partition by name:严格来说是分区,事实上,你完全可以理解成是分组
order by month asc: 每组数据按照month升序排序 

rows between A and B: 到底哪些记录作为一组来计算,添加一个窗口的边界

A:unbounded preceding (从起始数据开始)
3 preceding (当前记录不算,往前数3条) 
current row 

B: current row 
3 following (当前记录不算,往后数3条)
unbounded following 到这一组的最后为止 

row between 5 preceding and 2 following 前5条记录 当前记录 后2条记录 这个窗口的长度是:8
           

方法二(笛卡尔积,不推荐)

create table exercise_pv_temp as select name, month, sum(pv) as pv from exercise_pv group by name, month;

select aa.bname, aa.bmonth, aa.bpv, sum(aa.apv) as sumpv, max(aa.apv) as maxpv from ( select a.name as aname, a.month as amonth, a.pv as apv, b.name as bname, b.month as bmonth, b.pv as bpv from exercise_pv_temp a join exercise_pv_temp b on a.name = b.name ) aa where aa.amonth <= aa.bmonth group by aa.bname, aa.bmonth, aa.bpv order by aa.bname, aa.bmonth;