pig可以輕松擷取TOP n。書上有例子
hive中比較麻煩,沒有直接實作的函數,可以寫udf實作。還有個比較簡單的實作方法:
用row_number,生成排名序列号。然後外部分組後按這個序列号多慮,樣例代碼如下
select a.*
from(
select 品牌,管道,檔期,count/sum/其它() as num row_number() over (partition by 品牌,管道 order by num desc ) rank
from table_name
where 品牌,管道 限制條件
group by 品牌,管道,檔期
)a
where a.rank<=
其實 排序有三個函數
(1)row_number:排序後,順序下來,相同項按先後順序排序,1,2,3,4,5
(2)rank:排序後,遇到資料相同項時序号一緻,後面并留白一位,比如,1,2,2,4,4,6
dense_rank:在遇到資料相同項時,序号一緻,不留白位,如 1,2,2,3,3,4,4,5
具體用例可以參見:http://www.cnblogs.com/dycg/p/4260283.html
我自己設計的代碼
##統計國内,各省份的城市排名
select b.*
from
(select country,
province,
city,
cnt,
row_number() over (partition by country,province order by cnt desc) rank
from
(select country,
province,
city,
count() as cnt
from tb_pmp_region_report_hive_mapping
where country = '中國'
group by country,province,city
) a
)b
where b.rank<=
表a統計出基本資料,從a中加排名項。然後,按排名項過濾。内部group後,外部不需要group by
需要注意的是,加排名項時,不應該使用group。
如果有group,那麼row_number中的order by項必須是group内的字段,否則報錯,如下段代碼報錯
select b.*
from
(select country,
province,
city,
cnt,
row_number() over (partition by country,province order by cnt desc) rank
from
(select country,
province,
city,
count() as cnt
from tb_pmp_region_report_hive_mapping
where country = '中國'
group by country,province,city
) a
group by country,province,city
)b
where b.rank<=
執行報錯:
FAILED: SemanticException Failed to breakup Windowing invocations into Groups. At least 1 group must only depend on input columns. Also check for circular dependencies.
Underlying error: org.apache.hadoop.hive.ql.parse.SemanticException: Line 7:62 Expression not in GROUP BY key ‘cnt’
hive>