天天看點

hive分組排序 取top N

pig可以輕松擷取TOP n。書上有例子

hive中比較麻煩,沒有直接實作的函數,可以寫udf實作。還有個比較簡單的實作方法:
用row_number,生成排名序列号。然後外部分組後按這個序列号多慮,樣例代碼如下
           
select a.*
from(   
    select 品牌,管道,檔期,count/sum/其它() as num row_number() over (partition by 品牌,管道 order by num desc ) rank
    from table_name
    where 品牌,管道 限制條件
    group by 品牌,管道,檔期
    )a
where a.rank<=
           

其實 排序有三個函數

(1)row_number:排序後,順序下來,相同項按先後順序排序,1,2,3,4,5

(2)rank:排序後,遇到資料相同項時序号一緻,後面并留白一位,比如,1,2,2,4,4,6

dense_rank:在遇到資料相同項時,序号一緻,不留白位,如 1,2,2,3,3,4,4,5

具體用例可以參見:http://www.cnblogs.com/dycg/p/4260283.html

我自己設計的代碼

##統計國内,各省份的城市排名
select b.*
from
(select country,
    province,
    city,
    cnt,
    row_number() over (partition by country,province order by cnt desc) rank
from 
    (select country,
            province,
            city,
            count() as cnt
    from tb_pmp_region_report_hive_mapping
    where country = '中國'
    group by country,province,city
    ) a
)b
where b.rank<=
           

表a統計出基本資料,從a中加排名項。然後,按排名項過濾。内部group後,外部不需要group by

需要注意的是,加排名項時,不應該使用group。
如果有group,那麼row_number中的order by項必須是group内的字段,否則報錯,如下段代碼報錯
           
select b.*
from
(select country,
    province,
    city,
    cnt,
    row_number() over (partition by country,province order by cnt desc) rank
from 
    (select country,
            province,
            city,
            count() as cnt
    from tb_pmp_region_report_hive_mapping
    where country = '中國'
    group by country,province,city
    ) a
    group by country,province,city
)b
where b.rank<=
           

執行報錯:

FAILED: SemanticException Failed to breakup Windowing invocations into Groups. At least 1 group must only depend on input columns. Also check for circular dependencies.

Underlying error: org.apache.hadoop.hive.ql.parse.SemanticException: Line 7:62 Expression not in GROUP BY key ‘cnt’

hive>