hive内置函數_Hive的内置函數

定義：

UDF(User-Defined-Function)，使用者自定義函數對資料進行處理。

UDTF(User-Defined Table-Generating Functions) 用來解決輸入一行輸出多行(On-to-many maping) 的需求。

UDAF(User Defined Aggregation Function)使用者自定義聚合函數，操作多個資料行，産生一個資料行。

用法：

1、UDF函數可以直接應用于select語句，對查詢結構做格式化處理後，再輸出内容。

2、編寫UDF函數的時候需要注意一下幾點：

a)自定義UDF需要繼承org.apache.hadoop.hive.ql.UDF。

b)需要實作evaluate函。

c)evaluate函數支援重載。

hive的本地模式：

大多數的Hadoop job是需要hadoop提供的完整的可擴充性來處理大資料的。不過，有時hive的輸入資料量是非常小的。在這種情況下，為查詢出發執行任務的時間消耗可能會比實際job的執行時間要多的多。對于大多數這種情況，hive可以通過本地模式在單台機器上處理所有的任務。對于小資料集，執行時間會明顯被縮短。

如此一來，對資料量比較小的操作，就可以在本地執行，這樣要比送出任務到叢集執行效率要快很多。

配置如下參數，可以開啟Hive的本地模式：

hive> set hive.exec.mode.local.auto=true;(預設為false)

hive内置函數_Hive的内置函數

當一個job滿足如下條件才能真正使用本地模式：

1.job的輸入資料大小必須小于參數：hive.exec.mode.local.auto.inputbytes.max(預設128MB)

2.job的map數必須小于參數：hive.exec.mode.local.auto.tasks.max(預設4)

3.job的reduce數必須為0或者1

hive 中視窗函數row_number,rank,dense_ran,ntile分析函數的用法

示例資料：

sql語句：

select id,

name,

sal,

rank()over(partition by name order by sal desc ) rp,

dense_rank() over(partition by name order by sal desc ) drp,

row_number()over(partition by name order by sal desc) rmp

from f_test

結果：

10 b 17 1 1 1

3 b 13 2 2 2

4 b 12 3 3 3

8 b 11 4 4 4

9 a 16 1 1 1

6 a 15 2 2 2

11 a 14 3 3 3

5 a 14 3 3 4

7 a 13 5 4 5

2 a 12 6 5 6

1 a 10 7 6 7

hive内置函數_Hive的内置函數

ntile

ntile(n)，用于将分組資料按照順序切分成n片，傳回目前切片值

ntile不支援rows between，比如 ntile(2) over(partition by cookieid order by createtime rows between 3 preceding and current row)

如果切片不均勻，預設增加第一個切片的分布

比如需求為:求sal前50%的人

select *from (

select id,

name,

sal,

NTILE(2) over(partition by name order by sal desc ) rn

from f_test

) t where t.rn=1

Hive已定義函數介紹：

1、字元串長度函數：length

文法: length(string A)

傳回值: int

舉例：

hive> select length(‘abcedfg’) fromdual;7

2、字元串反轉函數：reverse

文法: reverse(string A)

傳回值: string

說明：傳回字元串A的反轉結果

舉例：

hive> select reverse(‘abcedfg’) fromdual;

gfdecba

3、字元串連接配接函數：concat

文法: concat(string A, string B…)

傳回值: string

說明：傳回輸入字元串連接配接後的結果，支援任意個輸入字元串

舉例：

hive> select concat(‘abc’,'def’,'gh’) fromdual;

abcdefgh

4、帶分隔符字元串連接配接函數：concat_ws

文法: concat_ws(string SEP, string A, string B…)

傳回值: string

說明：傳回輸入字元串連接配接後的結果，SEP表示各個字元串間的分隔符

舉例：

hive> select concat_ws(‘,’,'abc’,'def’,'gh’) from dual;

abc,def,gh

5、字元串截取函數：substr,substring

文法: substr(string A, int start),substring(string A, int start)

傳回值: string

說明：傳回字元串A從start位置到結尾的字元串

舉例：

hive> select substr(‘abcde’,3) fromdual;

cde

hive> select substring(‘abcde’,3) fromdual;

cde

hive> select substr(‘abcde’,-1) fromdual; (和ORACLE相同)

6、類型轉換

類型轉換：case

select cast(1 as float); --1.0select cast('2016-05-22' as date); --2016-05-22

字元串轉大寫函數：upper,ucase

字元串轉小寫函數：lower,lcase

文法: lower(string A) lcase(string A)

傳回值: string

說明：傳回字元串A的小寫格式

舉例：

hive> select lower(‘abSEd’) fromdual;

absed

hive> select lcase(‘abSEd’) fromdual;

absed

7、左右去除空格函數

左邊去空格函數：ltrim

右邊去空格函數：rtrim

8、正規表達式替換函數：regexp_replace

文法: regexp_replace(string A, string B, string C)

傳回值: string

說明：将字元串A中的符合java正規表達式B的部分替換為C。注意，在有些情況下要使用轉義字元

舉例：

hive> select regexp_replace(‘foobar’, ‘oo|ar’, ”) fromdual;

9、正規表達式解析函數：regexp_extract

文法: regexp_extract(string subject, string pattern, int index)

傳回值: string

說明：将字元串subject按照pattern正規表達式的規則拆分，傳回index指定的字元。注意，在有些情況下要使用轉義字元

舉例：

hive> select regexp_extract(‘foothebar’, ‘foo(.*?)(bar)’, 1) fromdual;

the

hive> select regexp_extract(‘foothebar’, ‘foo(.*?)(bar)’, 2) fromdual;

bar

hive> select regexp_extract(‘foothebar’, ‘foo(.*?)(bar)’, 0) fromdual;

foothebar

10、URL解析函數：parse_url，parse_url_tuple(UDTF)

文法: parse_url(string urlString, string partToExtract [, string keyToExtract])，parse_url_tuple功能類似parse_url()，但它可以同時提取多個部分并傳回

傳回值: string

說明：傳回URL中指定的部分。partToExtract的有效值為：HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, and USERINFO.

舉例：

hive> select parse_url(‘http://facebook.com/path1/p.php?k1=v1&k2=v2#Ref1′, ‘HOST’) from dual;

facebook.com

hive> select parse_url_tuple('http://facebook.com/path1/p.php?k1=v1&k2=v2#Ref1', 'QUERY:k1', 'QUERY:k2');

v1 v2

11、json解析函數：get_json_object

文法: get_json_object(string json_string, string path)

傳回值: string

說明：解析json的字元串json_string,傳回path指定的内容。如果輸入的json字元串無效，那麼傳回NULL。

舉例：

hive> selectget_json_object(‘{“store”:> {“fruit”:\[{"weight":8,"type":"apple"},{"weight":9,"type":"pear"}],> “bicycle”:{“price”:19.95,”color”:”red”}>},>“email”:”[email protected]_for_json_udf_test.net”,>“owner”:”amy”>}> ‘,’$.owner’) fromdual;

amy

12、集合查找函數: find_in_set

文法: find_in_set(string str, string strList)

傳回值: int

說明: 傳回str在strlist第一次出現的位置，strlist是用逗号分割的字元串。如果沒有找該str字元，則傳回0(隻能是逗号分隔，不然傳回0)

舉例：

hive> select find_in_set(‘ab’,'ef,ab,de’) from dual;

2hive> select find_in_set(‘at’,'ef,ab,de’) from dual;

13、行轉列：explode (posexplode Available as of Hive 0.13.0)

說明：将輸入的一行數組或者map轉換成列輸出

文法：explode(array (or map))

舉例：

hive> select explode(split(concat_ws(',','1','2','3','4','5','6','7','8','9'),',')) fromtest.dual;1

14、多行轉換：lateral view

說明：lateral view用于和json_tuple，parse_url_tuple，split, explode等UDTF一起使用，它能夠将一行資料拆成多行資料，在此基礎上可以對拆分後的資料進行聚合。

舉例：

假設我們有一張表pageAds，它有兩列資料，第一列是pageid string，第二列是adid_list，即用逗号分隔的廣告ID集合：

string pageid

Array adid_list

"front_page"

[1, 2, 3]

"contact_page"

[3, 4, 5]

要統計所有廣告ID在所有頁面中出現的次數。

首先分拆廣告ID：

SELECT pageid, adid

FROM pageAds LATERAL VIEW explode(adid_list) adTable AS adid;

執行結果如下：

string pageid

int adid

"front_page"

"contact_page"

解釋一下，from後面是你的表名，在表名後面加lateral view explode。。。(你的行轉列sql) ，還必須要起一個别名，我這個字段的别名為sp。然後再看看select後面的 s.*，就是原表的字段，我這裡面隻有一個字段，且為X

多個lateral view的sql類如：

select * from exampletable lateral view explode(col1) mytable1 as mycol1 lateral view explode(mycol1) mytable2 as mycol2;

15、union結果集合并

union将多個select語句的結果集合并為一個獨立的結果集

create table dw_oute_numbs as

select'step1' as step,count(distinct remote_addr) as numbs from ods_click_pageviews where datestr='2013-09-20' and request like '/item%'union

select'step2' as step,count(distinct remote_addr) as numbs from ods_click_pageviews where datestr='2013-09-20' and request like '/category%'union

select'step3' as step,count(distinct remote_addr) as numbs from ods_click_pageviews where datestr='2013-09-20' and request like '/order%'union

select'step4' as step,count(distinct remote_addr) as numbs from ods_click_pageviews where datestr='2013-09-20' and request like '/index%';

+---------------------+----------------------+--+

| dw_oute_numbs.step | dw_oute_numbs.numbs |

+---------------------+----------------------+--+

| step1 　　　　　　　　| 1029 　　　　　　　　　|

| step2 　　　　　　　　| 1029 　　　　　　　　　|

| step3 　　　　　　　　| 1028 　　　　　　　　　|

| step4 　　　　　　　　| 1018 　　　　　　　　　|

+---------------------+----------------------+--+

抽取一行資料轉換到新表的多列樣例：

http_referer是擷取的帶參數請求路徑，其中非法字元用\做了轉義，根據路徑解析出位址，查詢條件等存入新表中，

drop table ifexists t_ods_tmp_referurl;

create table t_ ods _tmp_referurlasSELECT a.*,b.*FROM ods_origin_weblog a LATERAL VIEW parse_url_tuple(regexp_replace(http_referer,"\"", ""), 'HOST', 'PATH','QUERY', 'QUERY:id') b as host, path, query, query_id;

複制表，并将時間截取到日：

drop table ifexists t_ods_tmp_detail;

create table t_ods_tmp_detailas

select b.*,substring(time_local,0,10) asdaystr,

substring(time_local,11) astmstr,

substring(time_local,5,2) asmonth,

substring(time_local,8,2) asday,

substring(time_local,11,2) ashour

From t_ ods _tmp_referurl b;

hive内置函數_Hive的内置函數

繼續閱讀

hive 自定義函數UDF

大資料學習筆記-------------------(26)

Hive内置函數大全 Hive内置函數大全