Hive sorting, partitioned tables, bucketed tables, and Hive functions

1. Sorting

1.1 Order By: global sort

Key point: only one reducer is used, so all rows end up in a single partition and the result is totally ordered.
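
As a minimal sketch, a global sort on the emp table used in the later examples could look like the following. Sorting on deptno is only an illustrative choice.

```sql
-- Global sort: the final ordering is done by a single reducer,
-- so the output is one totally ordered result.
select * from emp order by deptno desc;
```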

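1.2 Sort By: sorting within each reducer (per-partition sort)

Key point: there are multiple reducers, and therefore multiple partitions.

Note: when sort by is used on its own with multiple reducers, rows are distributed randomly across the reducers, and each reducer sorts only the rows it receives. Each output file is sorted, but the result as a whole is not.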

insert overwrite local directory '/opt/module/hive/datas/sort-result/'
       select * from emp sort by deptno desc ;
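
The per-reducer behaviour only shows up when more than one reducer runs. A minimal sketch, assuming a MapReduce-based session; the value 3 is an arbitrary illustrative choice, not something the text prescribes:

```sql
-- Assumption: 3 reducers, chosen only for illustration.
set mapreduce.job.reduces=3;

-- Print the current value to confirm the setting took effect.
set mapreduce.job.reduces;

-- Each reducer writes its own file, sorted by deptno descending.
insert overwrite local directory '/opt/module/hive/datas/sort-result/'
select * from emp sort by deptno desc;
```

With more than one reducer, the output directory would be expected to hold one sorted file per reducer rather than a single globally sorted file.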

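1.3 Distribute By: partitioning

Key point: specifies the column used to assign rows to reducers (partitions). Rows with the same value in that column go to the same reducer.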

insert overwrite local directory '/opt/module/hive/datas/distribute-result/' 
       select * from emp distribute by deptno  sort by  empno desc ;
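
To make the assignment more concrete, here is a hedged sketch. It assumes 3 reducers and assumes rows are assigned by the hash of the column modulo the number of reducers; the preview query is conceptual, and the real assignment is made by the execution engine.

```sql
-- Assumption: 3 reducers, an arbitrary illustrative value.
set mapreduce.job.reduces=3;

-- Conceptual preview only: which of the 3 partitions each deptno would map to
-- if rows are assigned by hash(deptno) modulo the number of reducers.
select distinct deptno, pmod(hash(deptno), 3) as assumed_partition from emp;

-- distribute by must be written before sort by.
select * from emp distribute by deptno sort by empno desc;
```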

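1.4 Cluster By: partition and sort

Key point: cluster by is equivalent to using distribute by and sort by together, provided both use the same column and the sort is ascending.

-- The following two statements are equivalent:
select * from emp distribute by deptno sort by deptno asc ;
select * from emp cluster by deptno ;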

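2. Partitioned tables

2.1 Problem: Hive does not rely on indexes (index support was removed in Hive 3.0), so a query on an ordinary table scans all of the data by brute force.

2.2 Essence: a Hive partitioned table is really a set of directories; the table's data is maintained across multiple directories, one per partition.

2.3 Creating a partitioned table (using dept data to simulate log data)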

dept_20200401.log

dept_20200402.log

dept_20200403.log

create table dept_partition (
          deptno int, dname string, loc string
       )
       partitioned by (day string)  -- the table's partition column is day, of type string
       row format delimited fields terminated by '\t' ;

       load data local inpath '/opt/module/hive/datas/dept_20200401.log' into table dept_partition partition(day='20200401');
       load data local inpath '/opt/module/hive/datas/dept_20200402.log' into table dept_partition partition(day='20200402');
       load data local inpath '/opt/module/hive/datas/dept_20200403.log' into table dept_partition partition(day='20200403');

Query the data in a partition
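
A minimal sketch of such queries against the dept_partition table created above. Filtering on the partition column lets Hive read only the matching directories.

```sql
-- Read a single partition; only the day=20200401 directory needs to be scanned.
select * from dept_partition where day='20200401';

-- Read several partitions at once.
select * from dept_partition where day='20200401' or day='20200402';
```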

2.4 Partition operations on a partitioned table:

1) List the partitions of a partitioned table

show partitions table_name;
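
Applied to the table from 2.3, this could look like the sketch below. The desc formatted statement is an addition not mentioned in the text; it is a standard Hive command that also reports the partition columns.

```sql
-- One line per partition, e.g. day=20200401.
show partitions dept_partition;

-- The detailed table description lists the partition columns
-- under its "Partition Information" section.
desc formatted dept_partition;
```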

2) Add partitions

Add a single partition:

alter table dept_partition add partition(day='20200404');

Add multiple partitions (separated by spaces):

alter table dept_partition add partition(day='20200405') partition(day='20200406');

3) Drop partitions

Drop a single partition:

alter table dept_partition drop partition(day='20200404');

Drop multiple partitions (separated by commas):

alter table dept_partition drop partition(day='20200405'), partition(day='20200406');

2.5 Two-level partitions

create table dept_partition2 (
          deptno int, dname string, loc string
       )
       partitioned by (day string,hour string)  
       row format delimited fields terminated by '\t' ;
       
       load data local inpath '/opt/module/hive/datas/dept_20200401.log' into table dept_partition2 partition(day='20200402',hour='02');
       load data local inpath '/opt/module/hive/datas/dept_20200402.log' into table dept_partition2 partition(day='20200402',hour='03');
       load data local inpath '/opt/module/hive/datas/dept_20200403.log' into table dept_partition2 partition(day='20200402',hour='04');
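
A minimal sketch of reading one second-level partition from the table just loaded:

```sql
-- Filter on both partition columns to read a single hour of a single day.
select * from dept_partition2 where day='20200402' and hour='03';
```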

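2.6 Ways to associate data with a partition:

1) Create the partition directory manually, then repair the partitions

Create the partition directory:

hadoop fs -mkdir -p /user/hive/warehouse/mydb.db/dept_partition/day=20200404

Upload the data to the partition directory:

hadoop fs -put dept_20200401.log /user/hive/warehouse/mydb.db/dept_partition/day=20200404

Repair the partitions in Hive:

msck repair table dept_partition;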
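
2) Create the partition directory manually, then add the matching partition in Hive

Create the partition directory:

hadoop fs -mkdir -p /user/hive/warehouse/mydb.db/dept_partition/day=20200405

Upload the data to the partition directory:

hadoop fs -put dept_20200402.log /user/hive/warehouse/mydb.db/dept_partition/day=20200405

Add the partition manually in Hive:

alter table dept_partition add partition(day='20200405');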
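
3) Create the partition directory manually, then load the data into the matching partition in Hive

Create the partition directory:

hadoop fs -mkdir -p /user/hive/warehouse/mydb.db/dept_partition/day=20200406

Load the data into the specified partition in Hive:

load data local inpath '/opt/module/hive/datas/dept_20200403.log' into table dept_partition partition(day='20200406') ;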
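
Whichever of the approaches is used, the result can be checked from Hive. A minimal sketch, assuming the commands in this section were run as written:

```sql
-- The partitions registered in this section should appear in the list.
show partitions dept_partition;

-- Spot-check one of them, here the partition registered by msck repair.
select * from dept_partition where day='20200404';
```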

2.7 Dynamic partitions

1) Create a table to load with dynamic partitions

create table dept_dy_partition (
  deptno int, dname string
  )
partitioned by (loc string)  
row format delimited fields terminated by '\t' ;
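
Inserts that name no static partition are typically rejected while Hive runs in strict mode. The following is a hedged sketch of the session settings usually applied first; whether they are needed depends on the Hive version and its defaults.

```sql
-- Enable dynamic partitioning (already the default in many Hive versions).
set hive.exec.dynamic.partition=true;

-- nonstrict allows every partition column to be resolved dynamically;
-- strict mode requires at least one static partition.
set hive.exec.dynamic.partition.mode=nonstrict;
```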

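2) Insert data into the dynamic partitions

a. insert ... values: the last value (1000) is used as the partition column loc

insert into table dept_dy_partition values(11,'TEST',1000);

b. insert ... select: the last column returned by the select determines the partition

insert into table dept_dy_partition partition(loc) select * from dept ;

c. load data without a partition clause, in Hive versions that support it

load data local inpath '/opt/module/hive/datas/dept.txt' into table dept_dy_partition ;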
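
To confirm which partitions were created, a sketch along these lines could be run afterwards. The loc value 1000 comes from the values insert shown in this section.

```sql
-- One partition per distinct loc value is expected.
show partitions dept_dy_partition;

-- Read back the row written by the values insert.
select * from dept_dy_partition where loc='1000';
```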

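3. Bucketed tables

3.1 Bucketed table: a bucketed table splits the data files into several parts, one per bucket. Where partitioning separates data by directory, bucketing separates it by file.

3.2 Creating a bucketed table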

create table stu_buck
   (
   id int , name string
   )
   clustered by(id)
   into 4 buckets
   row format delimited fields terminated by '\t';
   
   load  data  inpath '/student.txt' into table stu_buck ;

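3.3 Points to note when loading data into a bucketed table:

1) Set the number of reducers to -1 so that the job decides how many reducers to use, or set it to a value greater than or equal to the number of buckets.
2) Put the data on HDFS first, then run the load.
3) Do not use local mode.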
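
Put together, these points could be sketched as the session below. It assumes /student.txt has already been uploaded to HDFS, as in the load statement in 3.2.

```sql
-- Let the job decide how many reducers to use.
set mapreduce.job.reduces=-1;

-- Turn off automatic local mode.
set hive.exec.mode.local.auto=false;

-- Load from an HDFS path (note: no "local" keyword).
load data inpath '/student.txt' into table stu_buck;
```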

3.4 Loading data into a bucketed table with insert

insert into table stu_buck select * from student_insert ;
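
Once the table is populated, one way to inspect a single bucket is Hive's table sampling clause. A minimal sketch, assuming stu_buck has been loaded as above:

```sql
-- Read only the rows that fall in the first of the 4 buckets on id.
select * from stu_buck tablesample(bucket 1 out of 4 on id);
```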

4. Functions

4.1 List the built-in functions

show functions ;

4.2 See how a function is used

desc function function_name;

desc function extended function_name;
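
For instance, with the built-in upper function (picked here only as an illustration):

```sql
-- Short description of the function.
desc function upper;

-- Longer description, typically including a usage example.
desc function extended upper;
```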

4.3 Commonly used functions

1) nvl
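
nvl(value, default_value) returns default_value when value is NULL, and value otherwise. A minimal sketch; the ename and comm columns of emp are assumptions about that table's schema and are not taken from the text above.

```sql
-- Literal check that needs no table.
select nvl(null, -1), nvl(5, -1);

-- Assumed columns: ename, comm. Replace a NULL commission with -1.
select ename, comm, nvl(comm, -1) from emp;
```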

Disable local mode

set hive.exec.mode.local.auto=false;
