HIVE 的 HQL 總結

1. group by 與 where 與 having

A. having隻能用在group by之後，對分組後的結果進行篩選(即使用having的前提條件是分組)。使用having字句對分組後的結果進行篩選

B. where肯定在group by 之前,當結合在一起時，where在前，group by 在後。即先對select xx from xx的記錄集合用where進行篩選，然後再使用group by 對篩選後的結果進行分組

C. where後的條件表達式裡不允許使用聚合函數，而having可以

2. 排序 order by ，sort by , ,distribute by, cluster by三者差別

A. order by : order by 字段名1 asc/desc ,會對輸入做全局排序，是以隻有一個reducer，會導緻當輸入規模較大時，需要較長的計算時間。

B. sort by : 不是全局排序，其在資料進入reducer前完成排序。是以，如果用sort by進行排序，并且設定mapred.reduce.tasks>1，則sort by隻保證每個reducer的輸出有序，不保證全局有序.

C. distribute by根據distribute by指定的内容将資料分到同一個reducer。

D. Cluster by 除了具有Distribute by的功能外，還會對該字段進行排序。是以，常常認為cluster by = distribute by + sort by

3. JOIN 總結

join

A. left join：以左表為基準，即對于A left join B on A.x=B.x這句SQL來說，查詢的結果是：表A中的結果會全部顯示出來，但是如果表A中含有的x選項，但是表B卻沒有，這時會在結果集中B的x欄目顯示null。

tb1、tb2兩表共有 + tb1的獨有
select * from tb1 a left join tb2 b on a.deptId = b.id;

B. right join：以右表為基準，即對于A right join B on A.x=B.x這句SQL來說，查詢的結果是：表B中的結果會全部顯示出來，但是如果表B中含有的x選項，表A中沒有，這時會在結果集中A的x欄目顯示null。

tb1、tb2兩表共有+tb2的獨有
select * from tb1 a right join tb2 b on a.deptId = b.id;

C. inner join：隻有兩者共有的字段才會在結果集中顯示

ta1與ta2共同擁有的原生
 select * from tb1 a inner join tb2 b on a.deptId = b.id;

D. ta1的獨有 Select * from tb1 a left join tb2 b on a.deptid = b.id and b.id is null;

E. ta2的獨有 Select * from tb1 a right join tb2 b on a.deptid = b.id and a.id is null;

F. full outer join： tab1與tab2都有 select * from tb1 a full outer join tb2 b on a.deptid = b.id

G. A的獨有+B的獨有

select * from tbl_emp a left join tbl_dept b on a.deptid = b.id where b.id is null
 Union  
 select * from tbl_emp a right join tbl_dept b on a.deptid = b.id where a.deptid is null;

4. NULL 與多個group by

A. 一個字段是否為空， is null, is not null

B. 多個字段GROUP BY， group by 字段1，字段2，字段3

C. 聚合函數， sum（字段） as 别名

5.建立臨時表----> 自連接配接視圖-----> 進行比較統計執行個體

視圖（View）

視圖（view）是一種虛拟存在的表，是一個邏輯表，本身并不包含資料。作為一個select語句儲存在資料字典中的。

1）簡單：使用視圖的使用者完全不需要關心後面對應的表的結構、關聯條件和篩選條件，對使用者來說已經是過濾好的複合條件的結果集。

2）安全：使用視圖的使用者隻能通路他們被允許查詢的結果集，對表的權限管理并不能限制到某個行某個列，但是通過視圖就可以簡單的實作。

總而言之，使用視圖的大部分情況是為了保障資料安全性，提高查詢效率。

對業務邏輯複雜的，查詢過程中可以建立一個臨時表

eg: 求單月通路次數和總通路次數

使用者名，月份，通路次數

A,2015-01,5
 A,2015-01,15
 B,2015-01,5
 A,2015-01,8
 B,2015-01,25
 A,2015-01,5
 A,2015-02,4
 A,2015-02,6
 B,2015-02,10
 B,2015-02,5
 A,2015-03,16
 A,2015-03,22
 B,2015-03,23
 B,2015-03,10
 B,2015-03,1

(1) 建立表導入資料

//建立表
 create external table if not exists t_access(
 uname string comment '使用者名',
 umonth string comment '月份',
 ucount int comment '通路次數'
 ) comment '使用者通路表' 
 row format delimited fields terminated by "," 
 //導入資料
 load data local inpath "/zhm/hiv/access.txt" into table t_access;

結果需求

每個使用者截止到每月為止的最大單月通路次數和累計到該月的總通路次數，結果資料格式如下

HIVE 的 HQL 總結

解決方案

（1）. 根據使用者+月份進行分組, 建立臨時表，先求出當月通路次數

--求當月通路次數
 create table tmp_access(
 name string,
 mon string,
 num int
 ); 
 
 insert into table tmp_access 
 select uname,umonth,sum(ucount)
 from t_access t group by t.uname,t.umonth;

（2）. 要統計每個月的和，需要tmp_access進行自連接配接視圖，然後進行比較聚合

create view tmp_view as 
	select a.name anme,a.mon amon,a.num anum,b.name bname,b.mon bmon,b.num 
	bnum from tmp_access a join tmp_access b 
	on a.name=b.name;

（3）. 進行比較統計

select anme,amon,anum,max(bnum) as max_access,sum(bnum) as sum_access 
	from tmp_view 
	where amon>=bmon 
	group by anme,amon,anum;

5. hive中的concat，concat_ws，collect_set用法

A. concat 與 concat_ws

concat(string s1, string s2, string s3)把字元串類型的資料連接配接起來

concat( name, ‘:’, age) 就相當于把 name 列和 age 列用逗号連接配接起來了

concat_ws(’,’,數組) 相當于把數組中的元素，用逗号連接配接起來了

注意：concat需要的是字元串，有時候需要使用cast将資料類型轉換成string

B. cast 強制轉換， cast(value as type)

eg: cast(字段 as string)

C. collect_set函數與group by有兩個作用:

第一個是去重，去除 group by 後的重複元素，

第二個是形成一個集合，将 group by 後屬于同一組的第三列集合起來成為一個集合。

Hive不允許直接通路非group by字段；
對于非group by字段，可以用Hive的collect_set函數收集這些字段，傳回一個數組；
使用數字下标，可以直接通路數組中的元素；

D. collect_set函數與group by 與 contact_ws結合使用

就是将這些元素以逗号分隔形成字元串

有表結構 a string , b string , c int
e f 1
d k 4
e f 1
d k 8
select a, b, concat_ws(‘,’ , collect_set(cast(c as string)))from table group by a,b;

E. 在Hive中求出一個資料表中在某天内首次登陸的人；可以借助collect_set來處理sql:

select count(a.id)   
from (select id,collect_set(time) as t from t_action_login where time<='20150906' group by id) 
as a    where size(a.t)=1 and a.t[0]='201809‘’

6. hive substr函數

substr（key, a,b)函數, 從第a個開始，取後面b個

eg: 執行個體：求每一年最大氣溫的那一天 + 溫度

2010012325表示在2010年01月23日的氣溫為25度

2001010529

2013010619

2013010722

2013010812

2013010929

2013011023

select substr(datestr,1,4),max(substr(datestr,9,2)) from table group by substr(datestr,1,4);

7. Hive 中 case when 的用法執行個體

A. case x when y then z then a end

select id,case course when ‘english’ then ‘chinese’ else ‘其他’ end as id from test;

B. case when 條件 then result else result end

case when birthday<‘1968’ then ‘50up’

when birthday<‘1978’ then ‘40-50’

when birthday<‘1983’ then ‘35-40’

when birthday<‘1988’ then ‘30-35’

when birthday<‘1992’ then ‘26-30’

when birthday<‘1998’ then ‘20-25’

else ‘other’ end,

C.執行個體

一個成績表： // 字段解釋：id, 學号，課程，成績

INSERT INTO

course

VALUES (3, 2, ‘yuwen’, 77);

INSERT INTO

course

VALUES (4, 2, ‘shuxue’, 88);

INSERT INTO

course

VALUES (5, 3, ‘yuwen’, 98);

INSERT INTO

course

VALUES (6, 3, ‘shuxue’, 65);

求：所有數學課程成績大于國文課程成績的學生的學号

解決問題思維，建立一張臨時表，學号，數學成績，國文成績

create view tmp_course_view as

select sid, case course when “shuxue” then score else 0 end as shuxue,

case course when “yuwen” then score else 0 end as yuwen from course;

以sid分組合并取各成績最大值

create view tmp_course_view1 as

select aa.sid, max(aa.shuxue) as shuxue, max(aa.yuwen) as yuwen from tmp_course_view aa group by sid;

比較結果

select * from tmp_course_view1 where shuxue > yuwen;

8. get_json_object 内置函數###

解析json的字元串json_string,傳回path指定的内容。如果輸入的json字元串無效，那麼傳回NULL。

get_json_object(’{“movie”:“594”,“rate”:“4”,“timeStamp”:“978302268”,“uid”:“1”}’,’$.movie’);

9. 日期内置函數

date_format ：按指定格式傳回時間date 如：date_format(“2016-06-22”,“MM-dd”)=06-22
date_add(‘2008-12-31’, 1) = ‘2009-01-01’. 從開始時間startdate加上days
date_sub(‘2008-12-31’, 1) = ‘2008-12-30’.從開始時間startdate減去days
datediff(‘2009-03-01’, ‘2009-02-27’) = 2.計算開始時間startdate到結束時間enddate相差的天數
to_date(“1970-01-01 00:00:00”) = “1970-01-01”.傳回時間字元串的日期部分
unix_timestamp(‘2009-03-20’, ‘yyyy-MM-dd’) = 1237532400， unix_timestamp(‘2009-03-20 11:30:01’) = 1237573801
from_unixtime(1250111000,“yyyy-MM-dd”) 得到2009-03-12
year(“1970-01-01”) = 1970. 傳回時間字元串的年份部分
quarter(‘2015-04-08’) = 2. 傳回目前時間屬性哪個季度如quarter(‘2015-04-08’) = 2
month(“1970-11-01”) = 11.傳回時間字元串的月份部分
day(“1970-11-01”) = 1.傳回時間字元串的天
hour(‘12:58:59’) = 12.傳回時間字元串的小時
minute(string date) 傳回時間字元串的分鐘
second(string date) 傳回時間字元串的秒

HIVE 的 HQL 總結