
Using lateral view and the explode function in Hive, plus the collect_set function

    Outline:

    1. Overview

    2. explode usage examples

    3. Why lateral view is needed

    4. explode with lateral view: example 1

    5. The reverse direction: collapsing exploded rows back into one column

    6. collect_set() examples

    7. substr() example

    8. concat_ws() example

1. Overview

       Strictly speaking, explode and lateral view have no business in a relational database: they exist precisely to operate on data that violates first normal form (every attribute must be atomic), which runs against relational design principles in both transactional systems and data warehouses. Since big-data platforms became widespread, however, metrics such as pv and uv on the business side are often kept in non-relational stores, very likely as JSON. When that data is loaded straight into a Hive-based warehouse, the ETL process has to parse it, and that is exactly where explode and lateral view shine.

2. explode usage examples

       explode unpacks a field with an array (or map) structure into one row per element. Hive does ship native map, struct and array column types, but they require the element types to be declared when the table is defined, so this walkthrough stores the data as a plain delimited string and builds the array with split() instead. The use case is as follows.
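Before the table-based walkthrough, a minimal standalone sketch of what explode does, using inline array and map literals (illustrative only; assumes a Hive version recent enough to allow SELECT without FROM):

-- explode over an inline array: one row per element (A, B, C)
select explode(array('A','B','C')) as item;

-- explode over an inline map: one row per key/value pair
select explode(map('k1','v1','k2','v2')) as (k, v);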

2.1 Creating the table:

drop table explode_lateral_view;

create table explode_lateral_view(

`year_month` string,

`week_sales` string

)

ROW FORMAT DELIMITED

FIELDS TERMINATED BY '|'

STORED AS textfile;

The contents of data.txt:

$ cat data.txt 

202001|w1_100,w2_125,w3_150,w4_198

202002|w1_210,w2_233,w3_268,w4_192

202003|w1_305,w2_350,w3_372,w4_395

2.2 Loading and inspecting the data:

Load the data, either non-interactively from the shell:

hive -e "load data local inpath '/home/hdp-credit/workdir/yaoyingzhe/test_shell/one.txt' into table explode_lateral_view;"

or from the Hive CLI:

hive> load data local inpath '/home/hdp-bx-ai/workdir/tmp/data.txt' into table explode_lateral_view;

Loading data to table hdp_bx_ai.explode_lateral_view

Table hdp_bx_ai.explode_lateral_view stats: [numFiles=1, totalSize=69]

OK

Time taken: 1.121 seconds

The table now contains:

hive> select * from explode_lateral_view;

OK

202001    w1_100,w2_125,w3_150,w4_198

202002    w1_210,w2_233,w3_268,w4_192

202003    w1_305,w2_350,w3_372,w4_395

Time taken: 0.325 seconds, Fetched: 3 row(s)

hive> desc explode_lateral_view;

OK

year_month              string                                      

week_sales              string                                      

Time taken: 0.363 seconds, Fetched: 2 row(s)

2.3 Using explode:

First, split the delimited string into an array (no explode yet); the queries and results:

hive> select year_month,split(week_sales,',') from explode_lateral_view;

OK

202001    ["w1_100","w2_125","w3_150","w4_198"]

202002    ["w1_210","w2_233","w3_268","w4_192"]

202003    ["w1_305","w2_350","w3_372","w4_395"]

Time taken: 0.215 seconds, Fetched: 3 row(s)

hive> select year_month,split(week_sales,',')[0] from explode_lateral_view;

OK

202001    w1_100

202002    w1_210

202003    w1_305

Time taken: 0.187 seconds, Fetched: 3 row(s)

hive> select year_month,split(week_sales,',')[1] from explode_lateral_view;

OK

202001    w2_125

202002    w2_233

202003    w2_350

Time taken: 0.184 seconds, Fetched: 3 row(s)

hive> select year_month,split(week_sales,',')[2] from explode_lateral_view;

OK

202001    w3_150

202002    w3_268

202003    w3_372

Time taken: 0.185 seconds, Fetched: 3 row(s)

hive> select year_month,split(week_sales,',')[3] from explode_lateral_view;

OK

202001    w4_198

202002    w4_192

202003    w4_395

Time taken: 0.345 seconds, Fetched: 3 row(s)
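One caveat worth noting before moving on: the second argument to split() is a Java regular expression, not a literal string, so regex metacharacters have to be escaped. A quick sketch with a hypothetical literal (not from our table):

-- '|' is a regex metacharacter, so it must be escaped; returns ["w1","w2","w3"]
select split('w1|w2|w3', '\\|');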

3. Why lateral view is needed

3.1 We now have one array per row, but what we actually want is one row per array element, still keyed by year_month:

hive> select year_month,split(week_sales,',') from explode_lateral_view;

OK

202001    ["w1_100","w2_125","w3_150","w4_198"]

202002    ["w1_210","w2_233","w3_268","w4_192"]

202003    ["w1_305","w2_350","w3_372","w4_395"]

The obvious attempt is to wrap the split in explode:

hive> select year_month,explode(split(week_sales,',')) as goods_id from explode_lateral_view;

FAILED: SemanticException [Error 10081]: UDTF's are not supported outside the SELECT clause, nor nested in expressions

The query fails with SemanticException [Error 10081]: explode is a UDTF, and Hive allows a UDTF neither nested inside another expression nor mixed with ordinary columns in the same SELECT list.

As an aside, here are the three classes of Hive functions, using user-defined functions (UDFs) as the example:

Creating a custom UDF:

  1. Create a Java class that extends UDF;

  2. Override the evaluate() method;

  3. Package it into a jar;

  4. Load the jar and register the function:

    hive> add jar /home/hyxy/XXX.jar;

    hive> create temporary function {function_name} as 'com.hyxy.hive.udf.xxx';

  5. The three function classes:

    a. UDF: one row in --> one row out, e.g. split(comma_split_str, ',')

    b. UDAF: many rows in --> one row out, e.g. collect_set()

    c. UDTF: one row in --> many rows out, e.g. explode()
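To make the three classes concrete, a small sketch against our sample table (all three functions are Hive built-ins):

-- UDF: one row in, one row out
select split(week_sales, ',') from explode_lateral_view;

-- UDAF: many rows in, one row out
select collect_set(year_month) from explode_lateral_view;

-- UDTF: one row in, many rows out; note it stands alone in the SELECT list
select explode(split(week_sales, ',')) from explode_lateral_view;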

As the last query hints, when a UDTF is used it must be the only expression in the SELECT list; to get other columns alongside its output, LATERAL VIEW has to step in.

4. explode with lateral view: example 1

Using LATERAL VIEW:

The point of a lateral view is to accompany explode (or any other UDTF): in a single statement it yields the result set obtained after unpacking a single row into multiple rows.

select explode(split(week_sales,',')) as goods_id from explode_lateral_view;

OK

w1_100

w2_125

w3_150

w4_198

w1_210

w2_233

w3_268

w4_192

w1_305

w2_350

w3_372

w4_395

Time taken: 0.191 seconds, Fetched: 12 row(s)

hive> select year_month,w_sale from explode_lateral_view lateral view explode(split(week_sales,',')) tb_all_sale as w_sale;

OK

202001    w1_100

202001    w2_125

202001    w3_150

202001    w4_198

202002    w1_210

202002    w2_233

202002    w3_268

202002    w4_192

202003    w1_305

202003    w2_350

202003    w3_372

202003    w4_395

Time taken: 0.306 seconds, Fetched: 12 row(s)

hive> select year_month,split(w_sale,'_')[0] week_no,split(w_sale,'_')[1] sale_num from explode_lateral_view lateral view explode(split(week_sales,',')) tb_all_sale as w_sale;

OK

202001    w1    100

202001    w2    125

202001    w3    150

202001    w4    198

202002    w1    210

202002    w2    233

202002    w3    268

202002    w4    192

202003    w1    305

202003    w2    350

202003    w3    372

202003    w4    395

Time taken: 0.19 seconds, Fetched: 12 row(s)

Here, lateral view explode(split(week_sales,',')) tb_all_sale acts as a virtual table that is joined, Cartesian-product style, with each row of the original explode_lateral_view table.
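One practical caveat: when explode yields an empty array for some source row, a plain lateral view silently drops that row. The standard LATERAL VIEW OUTER variant keeps the row and pads the generated column with NULL; a minimal sketch:

select year_month, w_sale
from explode_lateral_view
lateral view outer explode(split(week_sales, ',')) tb_all_sale as w_sale;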

Lateral views can also be stacked. The example below is taken against a variant of explode_lateral_view that carries three extra string columns: a comma-separated goods_id, a sale_info JSON array, and a comma-separated area (output abridged):

hive> select goods_id2,sale_info,area2 from explode_lateral_view LATERAL VIEW explode(split(goods_id,','))goods as goods_id2 LATERAL VIEW explode(split(area,','))area as area2 ;

OK

1    [{"source":"7fresh","monthSales":4900,"userCount":1900,"score":"9.9"},{"source":"jd","monthSales":2090,"userCount":78981,"score":"9.8"},{"source":"jdmart","monthSales":6987,"userCount":1600,"score":"9.0"}]    a:shandong

2    [{"source":"7fresh","monthSales":4900,"userCount":1900,"score":"9.9"},{"source":"jd","monthSales":2090,"userCount":78981,"score":"9.8"},{"source":"jdmart","monthSales":6987,"userCount":1600,"score":"9.0"}]    b:beijing

Time taken: 0.135 seconds, Fetched: 2 row(s)
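The same stacking works against our own table and yields the full cross product of the generated columns. A sketch that explodes week_sales and then re-splits each element (3 source rows x 4 sales x 2 parts = 24 rows); note the second lateral view may reference w_sale from the first:

select year_month, w_sale, part
from explode_lateral_view
lateral view explode(split(week_sales, ',')) t1 as w_sale
lateral view explode(split(w_sale, '_')) t2 as part;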

Finally, with the statement below we can turn each month's single-line record of four weekly sales into a fully two-dimensional table:

hive> select year_month,split(w_sale,'_')[0] week_no,split(w_sale,'_')[1] sale_num from explode_lateral_view lateral view explode(split(week_sales,',')) tb_all_sale as w_sale;

(Output: the same 12-row result shown above.)

5. The reverse direction: collapsing exploded rows back into one column

hive> select * from explode_lateral_view;

OK

202001    w1_100,w2_125,w3_150,w4_198

202002    w1_210,w2_233,w3_268,w4_192

202003    w1_305,w2_350,w3_372,w4_395

Time taken: 0.19 seconds, Fetched: 3 row(s)

hive> select year_month,w_sale from explode_lateral_view lateral view explode(split(week_sales,',')) tb_all_sale as w_sale;

OK

202001    w1_100

202001    w2_125

202001    w3_150

202001    w4_198

202002    w1_210

202002    w2_233

202002    w3_268

202002    w4_192

202003    w1_305

202003    w2_350

202003    w3_372

202003    w4_395

Time taken: 0.192 seconds, Fetched: 12 row(s)

hive>

SELECT  year_month

        ,collect_set(w_sale) as w_sales_set

from

(

select year_month,w_sale from explode_lateral_view lateral view explode(split(week_sales,',')) tb_all_sale as w_sale

) tb_final

group by year_month;

Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1

2020-08-07 10:14:22,820 Stage-1 map = 0%,  reduce = 0%

2020-08-07 10:14:39,553 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 3.45 sec

2020-08-07 10:14:52,248 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 3.45 sec

MapReduce Total cumulative CPU time: 3 seconds 450 msec

Ended Job = job_1595390479217_783551

MapReduce Jobs Launched: 

Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 5.27 sec   HDFS Read: 0 HDFS Write: 0 SUCCESS

Total MapReduce CPU Time Spent: 5 seconds 270 msec

OK

202001    ["w1_100","w2_125","w3_150","w4_198"]

202002    ["w1_210","w2_233","w3_268","w4_192"]

202003    ["w1_305","w2_350","w3_372","w4_395"]

Time taken: 51.212 seconds, Fetched: 3 row(s)
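Note that collect_set de-duplicates and makes no ordering guarantee. When duplicates must be preserved, collect_list (a Hive built-in since 0.13.0) is the drop-in alternative; a sketch:

SELECT year_month, collect_list(w_sale) as w_sales_list
from (
  select year_month, w_sale
  from explode_lateral_view
  lateral view explode(split(week_sales, ',')) tb_all_sale as w_sale
) tb_final
group by year_month;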

hive>

SELECT  year_month

        ,concat_ws('@',collect_set(w_sale)) as w_sales_str

from

(

select year_month,w_sale from explode_lateral_view lateral view explode(split(week_sales,',')) tb_all_sale as w_sale

) tb_final

group by year_month;

Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1

2020-08-07 10:15:56,555 Stage-1 map = 0%,  reduce = 0%

2020-08-07 10:16:57,202 Stage-1 map = 0%,  reduce = 0%

2020-08-07 10:17:05,465 Stage-1 map = 67%,  reduce = 0%, Cumulative CPU 4.42 sec

2020-08-07 10:17:07,533 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 5.08 sec

2020-08-07 10:17:19,963 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 7.35 sec

MapReduce Total cumulative CPU time: 7 seconds 350 msec

Ended Job = job_1595390479217_783605

MapReduce Jobs Launched: 

Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 7.35 sec   HDFS Read: 0 HDFS Write: 0 SUCCESS

Total MapReduce CPU Time Spent: 7 seconds 350 msec

OK

202001    w1_100@w2_125@w3_150@w4_198

202002    w1_210@w2_233@w3_268@w4_192

202003    w1_305@w2_350@w3_372@w4_395

Time taken: 107.979 seconds, Fetched: 3 row(s)

6. collect_set() examples

No discussion of explode() is complete without collect_set(). collect_set(col) accepts only primitive types; within each group it de-duplicates the values of a column and aggregates them into an array-typed field. For example, to gather the score values under each no, simply group by no and apply collect_set to score (the generic query appears near the end of this section). A real-table example:

hive> desc yz_hive_callback_user_info;

OK

user_no                 string                  user_no             

click_id                string                  click_id            

date_str                string                  date_str            

rule_code               string                  rule_code           

Time taken: 0.424 seconds, Fetched: 4 row(s)

hive> select collect_set(rule_code) from yz_hive_callback_user_info;

OK

["R00004","E00001","E00002","C00012","C00002","C00014"]

Time taken: 73.458 seconds, Fetched: 1 row(s)

hive> select date_str,collect_set(rule_code) from yz_hive_callback_user_info group by date_str ;

2019-10-10    ["R00003","E00032","E00033","C00037"]

2019-10-11    ["E00024","C00005","E00026","C00033","E00022"]

2019-10-12    ["R00008","C00018","C00031","E00015"]

select no,collect_set(score) from tablss group by no;

This collapses many rows into a single array per group (the generic query above assumes a hypothetical table tablss with columns no and score). Note the restrictions: the values must share one primitive type, and the function takes exactly one column argument.
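Because collect_set takes exactly one primitive-typed column, a common workaround when two fields are wanted together is to concat them into one string first; a sketch against the callback table above:

select date_str, collect_set(concat(user_no, ':', rule_code)) as user_rules
from yz_hive_callback_user_info
group by date_str;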

7. substr() example

substr() is the substring function. Its syntax is substr(string A, int start, int len) and it returns a string: the portion of A that begins at position start and runs for len characters. Note that positions are 1-based, not 0-based. (The queries below assume a variant of explode_lateral_view with a goods_id string column holding '1,2,3,4,5,6,7,8,9'.)

hive> select substr(goods_id,1,3),goods_id from explode_lateral_view;

OK

1,2    1,2,3,4,5,6,7,8,9

Time taken: 0.219 seconds, Fetched: 1 row(s)

hive> select substr(goods_id,1,4),goods_id from explode_lateral_view;

OK

1,2,    1,2,3,4,5,6,7,8,9

Time taken: 0.349 seconds, Fetched: 1 row(s)
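substr also supports a two-argument form, substr(A, start), which runs to the end of the string, and start may be negative to count back from the end; a quick sketch against the same literal:

select substr('1,2,3,4,5,6,7,8,9', 5);    -- '3,4,5,6,7,8,9' (position 5 through the end)
select substr('1,2,3,4,5,6,7,8,9', -3);   -- '8,9' (the last three characters)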

8. concat_ws() example

Goal: in Hive, merge all complaint contents (WTNR) that share the same phone number (LDHM), separated by colons:

SELECT B.LDHM, concat_ws(':',collect_set(b.WTNR))

FROM (

        SELECT A.LDHM, A.DJRQ, A.WTNR

        FROM TEST1_12366 A

        WHERE A.LDHM IS NOT NULL AND LENGTH(A.LDHM) > 5

        ORDER BY A.LDHM, A.DJRQ

     ) B

GROUP BY B.LDHM;
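For reference, concat_ws also accepts individual string arguments rather than an array, and, unlike concat, it skips NULL arguments instead of returning NULL; a quick sketch:

select concat_ws(':', 'a', NULL, 'b');    -- 'a:b' (the NULL argument is skipped)
select concat('a', NULL, 'b');            -- NULL (concat nulls out the whole result)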

