1. Hive 的交互方式

第一种交互方式：bin/hive

第二种交互方式：使用 sql 语句或者 sql 脚本进行交互

2. Hive 的基本操作

2.1 数据库操作

创建数据库：

create database if not exists myhive;

创建数据库并指定位置：

create database myhive location '/myhive';

设置数据库键值对信息：

create database foo with dbproperties ('owner'='itcast','date'='20190120');

查看数据库更多详细信息：

desc database extended myhive;

删除数据库：

drop database myhive;

强制删除数据库，包含数据库下面的表一起删除：（效果相当于 rm -rf，慎用！！！）

drop database myhive cascade;

2.2 数据库表的操作

创建表的语法

create [external] table [if not exists] table_name (
col_name data_type [comment '字段描述信息']
col_name data_type [comment '字段描述信息'])
[comment '表的描述信息']
[partitioned by (col_name data_type,...)]
[clustered by (col_name,col_name,...)]
[sorted by (col_name [asc|desc],...) into num_buckets buckets]
[row format row_format]
[storted as ....]
[location '指定表的路径']

说明：

create table

创建一个指定名字的表。如果相同名字的表已经存在，则抛出异常；用户可以用 IF NOT EXISTS 选项来忽略这个异常。

external

可以让用户创建一个外部表，在建表的同时指定一个指向实际数据的路径（LOCATION），Hive 创建内部表时，会将数据移动到数据仓库指向的路径；若创建外部表，仅记录数据所在的路径，不对数据的位置做任何改变。在删除表的时候，内部表的元数据和数据会被一起删除，而外部表只删除元数据，不删除数据。

comment

表示注释,默认不能使用中文

partitioned by

表示使用表分区,一个表可以拥有一个或者多个分区，每一个分区单独存在一个目录下 .

clustered by

对于每一个表分文件， Hive可以进一步组织成桶，也就是说桶是更为细粒度的数据范围划分。Hive也是针对某一列进行桶的组织。

sorted by

指定排序字段和排序规则

row format

指定表文件字段分隔符

storted as

指定表文件的存储格式, 常用格式:SEQUENCEFILE, TEXTFILE, RCFILE,如果文件数据是纯文本，可以使用 STORED AS TEXTFILE。如果数据需要压缩，使用 storted as SEQUENCEFILE。

location

指定表文件的存储路径

Hive 建表字段类型

分类	类型	描述	字面量示例
原始类型	BOOLEAN	true/false	TRUE
TINYINT	1字节的有符号整数, -128~127	1Y
SMALLINT	2个字节的有符号整数，-32768~32767	1S
INT	4个字节的带符号整数	1
BIGINT	8字节带符号整数	1L
FLOAT	4字节单精度浮点数	1.0
DOUBLE	8字节双精度浮点数	1.0
DEICIMAL	任意精度的带符号小数	1.0
STRING	字符串，变长	“a”,’b’
VARCHAR	变长字符串	“a”,’b’
CHAR	固定长度字符串	“a”,’b’
BINARY	字节数组	无法表示
TIMESTAMP	时间戳，毫秒值精度	122327493795
DATE	日期	‘2016-03-29’
INTERVAL	时间频率间隔
复杂类型	ARRAY	有序的的同类型的集合	array(1,2)
MAP	key-value,key必须为原始类型，value可以任意类型	map(‘a’,1,’b’,2)
STRUCT	字段集合,类型可以不同	struct(‘1’,1,1.0), named_stract(‘col1’,’1’,’col2’,1,’clo3’,1.0)
UNION	在有限取值范围内的一个值	create_union(1,’a’,63)

建表入门:

use myhive;
create table stu(id int,name string);
insert into stu values (1,"zhangsan");  #插入数据
select * from stu;

创建表并指定字段之间的分隔符

create  table if not exists stu2(id int ,name string) row format delimited fields terminated by '\t';

创建表并指定表文件的存放路径

create  table if not exists stu2(id int ,name string) row format delimited fields terminated by '\t' location '/user/stu2';

根据查询结果创建表

# 通过复制表结构和表内容创建新表
create table stu3 as select * from stu2;

根据已经存在的表结构创建表

create table stu4 like stu;

外部表的操作

外部表因为是指定其他的hdfs路径的数据加载到表当中来，所以hive表会认为自己不完全独占这份数据，所以删除hive表的时候，数据仍然存放在hdfs当中，不会删掉.

内部表和外部表的使用场景

每天将收集到的网站日志定期流入HDFS文本文件。在外部表（原始日志表）的基础上做大量的统计分析，用到的中间表、结果表使用内部表存储，数据通过SELECT+INSERT进入内部表。

创建老师表

create external table teacher (t_id string,t_name string) row format delimited fields terminated by '\t';

创建学生表

create external table student (s_id string,s_name string,s_birth string , s_sex string ) row format delimited fields terminated by '\t';

加载数据

load data local inpath '/export/servers/hivedatas/student.csv' into table student;

加载数据并覆盖已有数据

load data local inpath '/export/servers/hivedatas/student.csv' overwrite  into table student;

从hdfs文件系统向表中加载数据（需要提前将数据上传到hdfs文件系统）

cd /export/servers/hivedatas
hdfs dfs -mkdir -p /hivedatas
hdfs dfs -put techer.csv /hivedatas/
load data inpath '/hivedatas/teacher.csv' into table teacher;

分区表的操作

在大数据中，最常用的一种思想就是分治，我们可以把大的文件切割划分成一个个的小的文件，这样每次操作一个小的文件就会很容易了，同样的道理，在hive当中也是支持这种思想的，就是我们可以把大的数据，按照每月，或者天进行切分成一个个的小的文件,存放在不同的文件夹中.

创建分区表语法

create table score(s_id string,c_id string, s_score int) partitioned by (month string) row format delimited fields terminated by '\t';

创建一个表带多个分区

create table score2 (s_id string,c_id string, s_score int) partitioned by (year string,month string,day string) row format delimited fields terminated by '\t';

加载数据到分区表中

load data local inpath '/export/servers/hivedatas/score.csv' into table score partition (month='201806');

加载数据到多分区表中

load data local inpath '/export/servers/hivedatas/score.csv' into table score2 partition(year='2018',month='06',day='01');

多分区表联合查询(使用

union all

)

select * from score where month = '201806' union all select * from score where month = '201806';

查看分区

show  partitions  score;

添加一个分区

alter table score add partition(month='201805');

删除分区

alter table score drop partition(month = '201806');

分区表综合练习

需求描述

：

现在有一个文件score.csv文件，存放在集群的这个目录下/scoredatas/month=201806，这个文件每天都会生成，存放到对应的日期文件夹下面去，文件别人也需要公用，不能移动。需求，创建hive对应的表，并将数据加载到表中，进行数据统计分析，且删除表之后，数据不能删除

数据准备

：

hdfs dfs -mkdir -p /scoredatas/month=201806
hdfs dfs -put score.csv /scoredatas/month=201806/

创建外部分区表，并指定文件数据存放目录

create external table score4(s_id string, c_id string,s_score int) partitioned by (month string) row format delimited fields terminated by '\t' location '/scoredatas';

进行表的修复

(建立表与数据文件之间的一个关系映射)

msck repair table score4;

分桶表操作

分桶，就是将数据按照指定的字段进行划分到多个文件当中去,分桶就是MapReduce中的分区.

开启 Hive 的分桶功能

set hive.enforce.bucketing=true;

设置 Reduce 个数

set mapreduce.job.reduces=3;

创建分桶表

create table course (c_id string,c_name string,t_id string) clustered by(c_id) into 3 buckets row format delimited fields terminated by '\t';

桶表的数据加载，由于通标的数据加载通过hdfs dfs -put文件或者通过load data均不好使，只能通过insert overwrite

创建普通表，并通过insert overwriter的方式将普通表的数据通过查询的方式加载到桶表当中去

创建普通表

create table course_common (c_id string,c_name string,t_id string) row format delimited fields terminated by '\t';

普通表中加载数据

load data local inpath '/export/servers/hivedatas/course.csv' into table course_common;

通过insert overwrite给桶表中加载数据

insert overwrite table course select * from course_common cluster by(c_id);

修改表结构

重命名:

alter  table  old_table_name  rename  to  new_table_name;

把表score4修改成score5

alter table score4 rename to score5;

增加/修改列信息:

查询表结构

desc score5;

添加列

alter table score5 add columns (mycol string, mysco int);

更新列

alter table score5 change column mysco mysconew int;

删除表

drop table score5;

Hive表中加载数据

直接向分区表中插入数据

create table score3 like score;

insert into table score3 partition(month ='201807') values ('001','002','100');

通过load方式加载数据

load data local inpath '/export/servers/hivedatas/score.csv' overwrite into table score partition(month='201806');

通过查询方式加载数据

create table score4 like score;
insert overwrite table score4 partition(month = '201806') select s_id,c_id,s_score from score;

Hive（数据仓库） Hive 的交互方式和基本操作1. Hive 的交互方式2. Hive 的基本操作

1. Hive 的交互方式

第一种交互方式：bin/hive

第二种交互方式：使用 sql 语句或者 sql 脚本进行交互

2. Hive 的基本操作

2.1 数据库操作

2.2 数据库表的操作

创建表的语法

Hive 建表字段类型

外部表的操作

分区表的操作

分区表综合练习

分桶表操作

修改表结构

Hive表中加载数据

继续阅读

【分类算法】什么是分类算法定义分类与聚类分类过程方法

申请评分模型拒绝推断（RI）方法申请评分模型拒绝推断（RI）方法

Sql优化一：sql语句优化

Nacos 2.0 升级前后性能对比压测

hadoop 用MR实现join操作

Centos7 下 Hadoop 2.6.4 分布式集群环境搭建摘要集群准备安装JDK 安装 Hadoop 2.6.4 部署 slaver1-slaver4 启动 hadoop 集群成功了

尚硅谷—韩顺平—图解 Java设计模式（结构型）（55～）

Storm编译打包过程中遇到的一些问题及解决方法

MapReduce的几个企业级经典面试案例MapReduce的几个企业级经典面试案例

9.spark Core 进阶2--Cashe

浅谈企业活动中进行数据分析的重要性

ubuntu14.04下安装hbse1.0.1.1

User Defined Hadoop DataType

Ambari介绍和架构原理

NOSQL安全攻击

win10本地scala和spark安装安装scala安装spark