一、分區概念:分區最主要的目的:實作快速查詢
為什麼要建立分區:
單個表資料量越來越大的時候,在Hive select查詢中一般會掃描整個表内容(暴力掃描),會消耗很多時間做沒必要的工作。有時候隻需要掃描表中關心的一部分資料,是以建表時引入了partition概念。
(1)、Hive的分區和mysql的分區差異:
mysql分區是将表中的字段拿來直接作為分區字段,而hive的分區則是分區字段不在表中。
(2)、怎麼分區:
根據業務分區,(完全看業務場景)選取id、年、月、日、男女性别、年齡段或者是能平均将資料分到不同檔案中最好,分區不好将直接導緻查詢結果延遲。
(3)、分區細節:
1、一個表可以擁有一個或者多個分區,每個分區以檔案夾的形式單獨存在表檔案夾的目錄下。
2、表和列名不區分大小寫。
3、分區是以字段的形式在表結構中存在,通過describe table指令可以檢視到字段存在(算是一個僞列),但是該字段不存放實際的資料内容,僅僅是分區的表示。
4、分區有一級、二級、三級和多級分區:
5、建立動态分區、靜态分區、混合分區:
動态分區:可以動态加載資料或分區個數不固定
靜态分區:可以靜态加入資料或分區個數确定
混合分區:動态和靜态結合加入資料,主分區必須為靜态分區,副分區可以為動态分區
example:
一、通過load的方式加載靜态分區資料
##看一下我們的資料
[[email protected] partitionData]$ cat pt_stu_nv.txt
095041211011,zhangyan,21
095041211034,xiaoli,18
095041211056,xiaoxin,13
[[email protected] partitionData]$ cat pt_stu_nan.txt
095041211001,xiaozhang,21
095041211029,xiaopeng,18
095041211089,xiaohong,11
##建立分區表,需要注意分區字段不在表字段中
hive (hwzhdb)> create external table static_partition_stu(
> id int,
> name string
> )
> partitioned by (age int)
> row format delimited fields terminated by ','
> stored as textfile;
OK
Time taken: 0.308 seconds
##加載分區資料,必須在表名後面加上分區字段和值。此時我們是加載的本地資料,生産來說的話是需要加載hdfs上的資料,因為資料做清洗操作後是存放在hdfs上面的。
hive (hwzhdb)> load data local inpath '/home/hadoop/data/partitionData/pt_stu_nan.txt' into table static_partition_stu partition(gender='nan');
Loading data to table hwzhdb.static_partition_stu partition (gender=nan)
Partition hwzhdb.static_partition_stu{gender=nan} stats: [numFiles=1, numRows=0, totalSize=76, rawDataSize=0]
OK
Time taken: 1.525 seconds
hive (hwzhdb)> load data local inpath '/home/hadoop/data/partitionData/pt_stu_nv.txt' into table static_partition_stu partition(gender='nv');
Loading data to table hwzhdb.static_partition_stu partition (gender=nv)
Partition hwzhdb.static_partition_stu{gender=nv} stats: [numFiles=1, numRows=0, totalSize=72, rawDataSize=0]
OK
Time taken: 0.757 seconds
hive (hwzhdb)> select * from static_partition_stu;
OK
static_partition_stu.id static_partition_stu.name static_partition_stu.age static_partition_stu.gender
NULL xiaozhang 21 nan
NULL xiaopeng 18 nan
NULL xiaohong 11 nan
NULL zhangyan 21 nv
NULL xiaoli 18 nv
NULL xiaoxin 13 nv
Time taken: 0.554 seconds, Fetched: 6 row(s)
hive (hwzhdb)> select * from static_partition_stu where gender='nan';
OK
static_partition_stu.id static_partition_stu.name static_partition_stu.age static_partition_stu.gender
NULL xiaozhang 21 nan
NULL xiaopeng 18 nan
NULL xiaohong 11 nan
Time taken: 0.541 seconds, Fetched: 3 row(s)
hive (hwzhdb)> select * from static_partition_stu where gender='nv';
OK
static_partition_stu.id static_partition_stu.name static_partition_stu.age static_partition_stu.gender
NULL zhangyan 21 nv
NULL xiaoli 18 nv
NULL xiaoxin 13 nv
Time taken: 0.143 seconds, Fetched: 3 row(s)
二、通過insert的方式加載靜态分區資料
##檢視資料
hive (hwzhdb)> select * from emp;
OK
emp.empno emp.ename emp.job emp.mgr emp.hiredate emp.sal emp.comm emp.deptno
7369 SMITH CLERK 7902 1980-12-17 00:00:00 800.0 NULL 20
7499 ALLEN SALESMAN 7698 1981-02-20 00:00:00 1600.0 300.0 30
7521 WARD SALESMAN 7698 1981-02-22 00:00:00 1250.0 500.0 30
7566 JONES MANAGER 7839 1981-04-02 00:00:00 2975.0 NULL 20
7654 MARTIN SALESMAN 7698 1981-09-28 00:00:00 1250.0 1400.0 3
7698 BLAKE MANAGER 7839 1981-05-01 00:00:00 2850.0 NULL 30
7782 CLARK MANAGER 7839 1981-06-09 00:00:00 2450.0 NULL 10
7788 SCOTT ANALYST 7566 1982-12-09 00:00:00 3000.0 NULL 20
7839 KING PRESIDENT NULL 1981-11-17 00:00:00 5000.0 NULL 10
7844 TURNER SALESMAN 7698 1981-09-08 00:00:00 1500.0 0.0 30
7876 ADAMS CLERK 7788 1983-01-12 00:00:00 1100.0 NULL 20
7900 JAMES CLERK 7698 1981-12-03 00:00:00 950.0 NULL 30
7902 FORD ANALYST 7566 1981-12-03 00:00:00 3000.0 NULL 20
7934 MILLER CLERK 7782 1982-01-23 00:00:00 1300.0 NULL 10
##建立分區表,将部門編号作為分區字段
hive (hwzhdb)> create external table static_partition_insert_emp(
> empno int,
> ename string,
> job string,
> mgr int,
> hiredate string,
> sal double,
> comm double
> )
> partitioned by (deptno int)
> row format delimited fields terminated by '\t'
> stored as textfile;
OK
Time taken: 0.154 seconds
##通過insert的方式從emp中向static_partition_insert_emp中加載資料,需要注意,因為此時emp表中的字段數量跟static_partition_insert_emp的字段數量是不對應的,将emp中的deptno字段作為static_partition_insert_emp表中的分區字段了,是以從emp中查出來的字段要一個個對應static_partition_insert_emp中的字段
insert into table static_partition_insert_emp partition(deptno=10)
select empno,ename,job,mgr,hiredate,sal,comm from emp where deptno=10;
insert into table static_partition_insert_emp partition(deptno=20)
select empno,ename,job,mgr,hiredate,sal,comm from emp where deptno=20;
insert into table static_partition_insert_emp partition(deptno=30)
select empno,ename,job,mgr,hiredate,sal,comm from emp where deptno=30;
hive (hwzhdb)> select * from static_partition_insert_emp;
OK
static_partition_insert_emp.empno static_partition_insert_emp.ename static_partition_insert_emp.job static_partition_insert_emp.mgr static_partition_insert_emp.hiredate static_partition_insert_emp.sal static_partition_insert_emp.comm static_partition_insert_emp.deptno
7782 CLARK MANAGER 7839 1981-06-09 00:00:00 2450.0 NULL 10
7839 KING PRESIDENT NULL 1981-11-17 00:00:00 5000.0 NULL 10
7934 MILLER CLERK 7782 1982-01-23 00:00:00 1300.0 NULL 10
7369 SMITH CLERK 7902 1980-12-17 00:00:00 800.0 NULL 20
7566 JONES MANAGER 7839 1981-04-02 00:00:00 2975.0 NULL 20
7788 SCOTT ANALYST 7566 1982-12-09 00:00:00 3000.0 NULL 20
7876 ADAMS CLERK 7788 1983-01-12 00:00:00 1100.0 NULL 20
7902 FORD ANALYST 7566 1981-12-03 00:00:00 3000.0 NULL 20
7499 ALLEN SALESMAN 7698 1981-02-20 00:00:00 1600.0 300.0 30
7521 WARD SALESMAN 7698 1981-02-22 00:00:00 1250.0 500.0 30
7654 MARTIN SALESMAN 7698 1981-09-28 00:00:00 1250.0 1400.0 30
7698 BLAKE MANAGER 7839 1981-05-01 00:00:00 2850.0 NULL 30
7844 TURNER SALESMAN 7698 1981-09-08 00:00:00 1500.0 0.0 30
7900 JAMES CLERK 7698 1981-12-03 00:00:00 950.0 NULL 30
Time taken: 0.124 seconds, Fetched: 14 row(s)
hive (hwzhdb)> select * from static_partition_insert_emp where deptno=10;
OK
static_partition_insert_emp.empno static_partition_insert_emp.ename static_partition_insert_emp.job static_partition_insert_emp.mgr static_partition_insert_emp.hiredate static_partition_insert_emp.sal static_partition_insert_emp.comm static_partition_insert_emp.deptno
7782 CLARK MANAGER 7839 1981-06-09 00:00:00 2450.0 NULL 10
7839 KING PRESIDENT NULL 1981-11-17 00:00:00 5000.0 NULL 10
7934 MILLER CLERK 7782 1982-01-23 00:00:00 1300.0 NULL 10
Time taken: 0.221 seconds, Fetched: 3 row(s)
hive (hwzhdb)> select * from static_partition_insert_emp where deptno=20;
OK
static_partition_insert_emp.empno static_partition_insert_emp.ename static_partition_insert_emp.job static_partition_insert_emp.mgr static_partition_insert_emp.hiredate static_partition_insert_emp.sal static_partition_insert_emp.comm static_partition_insert_emp.deptno
7369 SMITH CLERK 7902 1980-12-17 00:00:00 800.0 NULL 20
7566 JONES MANAGER 7839 1981-04-02 00:00:00 2975.0 NULL 20
7788 SCOTT ANALYST 7566 1982-12-09 00:00:00 3000.0 NULL 20
7876 ADAMS CLERK 7788 1983-01-12 00:00:00 1100.0 NULL 20
7902 FORD ANALYST 7566 1981-12-03 00:00:00 3000.0 NULL 20
Time taken: 0.113 seconds, Fetched: 5 row(s)
hive (hwzhdb)> select * from static_partition_insert_emp where deptno=30;
OK
static_partition_insert_emp.empno static_partition_insert_emp.ename static_partition_insert_emp.job static_partition_insert_emp.mgr static_partition_insert_emp.hiredate static_partition_insert_emp.sal static_partition_insert_emp.comm static_partition_insert_emp.deptno
7499 ALLEN SALESMAN 7698 1981-02-20 00:00:00 1600.0 300.0 30
7521 WARD SALESMAN 7698 1981-02-22 00:00:00 1250.0 500.0 30
7654 MARTIN SALESMAN 7698 1981-09-28 00:00:00 1250.0 1400.0 30
7698 BLAKE MANAGER 7839 1981-05-01 00:00:00 2850.0 NULL 30
7844 TURNER SALESMAN 7698 1981-09-08 00:00:00 1500.0 0.0 30
7900 JAMES CLERK 7698 1981-12-03 00:00:00 950.0 NULL 30
Time taken: 0.119 seconds, Fetched: 6 row(s)
三、多級靜态分區,因為一個分區對應的是hdfs上面的一個目錄,多級分區則相當于hdfs上的多級目錄
比如我們還是從emp表中拿資料:通過 job和deptno進行多級分區
hive (hwzhdb)> select * from emp;
OK
emp.empno emp.ename emp.job emp.mgr emp.hiredate emp.sal emp.comm emp.deptno
7369 SMITH CLERK 7902 1980-12-17 00:00:00 800.0 NULL 20
7499 ALLEN SALESMAN 7698 1981-02-20 00:00:00 1600.0 300.0 30
7521 WARD SALESMAN 7698 1981-02-22 00:00:00 1250.0 500.0 30
7566 JONES MANAGER 7839 1981-04-02 00:00:00 2975.0 NULL 20
7654 MARTIN SALESMAN 7698 1981-09-28 00:00:00 1250.0 1400.0 30
7698 BLAKE MANAGER 7839 1981-05-01 00:00:00 2850.0 NULL 30
7782 CLARK MANAGER 7839 1981-06-09 00:00:00 2450.0 NULL 10
7788 SCOTT ANALYST 7566 1982-12-09 00:00:00 3000.0 NULL 20
7839 KING PRESIDENT NULL 1981-11-17 00:00:00 5000.0 NULL 10
7844 TURNER SALESMAN 7698 1981-09-08 00:00:00 1500.0 0.0 30
7876 ADAMS CLERK 7788 1983-01-12 00:00:00 1100.0 NULL 20
7900 JAMES CLERK 7698 1981-12-03 00:00:00 950.0 NULL 30
7902 FORD ANALYST 7566 1981-12-03 00:00:00 3000.0 NULL 20
7934 MILLER CLERK 7782 1982-01-23 00:00:00 1300.0 NULL 10
Time taken: 0.083 seconds, Fetched: 14 row(s)
##插入分區資料
insert into table static_mut_partition_insert_emp partition(job='MANAGER',deptno=10)
select empno,ename,mgr,hiredate,sal,comm from emp where job='MANAGER' and deptno=10;
##通過指定分區查出資料
hive (hwzhdb)> select * from static_mut_partition_insert_emp where job='MANAGER' and deptno=10;
OK
static_mut_partition_insert_emp.empno static_mut_partition_insert_emp.ename static_mut_partition_insert_emp.mgr static_mut_partition_insert_emp.hiredate static_mut_partition_insert_emp.sal static_mut_partition_insert_emp.comm static_mut_partition_insert_emp.job static_mut_partition_insert_emp.deptno
7782 CLARK 7839 1981-06-09 00:00:00 2450.0 NULL MANAGER 10
Time taken: 0.149 seconds, Fetched: 1 row(s)
四、動态分區,當分區個數不确定或者分區個數過多的時候可以使用動态分區,動态分區隻能使用insert的方式插入資料,并且使用動态分區需要先把動态分區的設定改為非嚴格模式,預設為嚴格模式(set hive.exec.dynamic.partition.mode=nonstrict; 如果動态分區為關閉狀态的話還需要打開動态分區的設定set hive.exec.dynamic.partition=true;)
##看一下我們的資料,我們以年齡作為分區字段
hive (hwzhdb)> select * from stus;
OK
stus.id stus.name stus.sge
NULL zhangyan 21
NULL xiaoli 18
NULL xiaoxin 13
NULL xiaozhang 21
NULL xiaopeng 18
NULL xiaohong 1
Time taken: 0.087 seconds, Fetched: 6 row(s)
##建立動态分區表
hive (hwzhdb)> create table dynamic_insert_stu(
> id int,
> name string
> )
> partitioned by (age int)
> row format delimited fields terminated by '\t'
> stored as textfile;
OK
Time taken: 0.069 seconds
##從stus表中向dynamic_insert_stu中插入資料
hive (hwzhdb)> set hive.exec.dynamic.partition.mode=nonstrict;
insert into table dynamic_insert_stu partition(age)
select id,name,age from stus;
五、混合分區(待續...)
總結: 靜态分區表和動态分區表的差別?
1.建表無差別。
2.靜态分區能通過load和insert加載資料,不僅要指定分區字段,也要指定對應的值
load data local inpath ‘/home/hadoop/data/partitionData/pt_stu_nan.txt’ into table static_partition_stu partition(gender=‘nan’);
insert into table static_partition_insert_emp partition(deptno=10)
select empno,ename,job,mgr,hiredate,sal,comm from emp where deptno=10;
動态分區隻能通過insert的方式加載資料,隻需要指定分區字段就可以,但是必須要通過設定打開動态分區,并且動态分區的字段隻能放在select的最後一個字段,如果是多個動态分區字段,那對應放在select的最後幾個就行。
insert into table dynamic_insert_stu partition(age) select id,name,age
from stus;
Attention:
1、動态分區與靜态分區還有一個細微的差别是,靜态分區一定會建立分區,不管SELECT語句的結果有沒有資料。而動态分區,隻有在SELECT結果的記錄數>0的時候,才會建立分區。是以在不同的業務場景下,可能會選擇不同的方案。
2、另外使用動态分區時需要注意的比較重要的一點是,動态分區會為每一個分區配置設定reduce數。比如說你在腳本上面寫了:set
mapred.reduce.tasks=100;
這在分區值很多的情況下,會成為一個災難,容易直接把namenode給搞挂掉(因為namenode大量操作檔案),是非常危險的。是以使用動态分區時,一定要清楚地知道産生的動态分區值,并且合理地設定reduce數量。