天天看點

Hive分區分桶詳解

一、分區概念:分區最主要的目的:實作快速查詢

為什麼要建立分區:

單個表資料量越來越大的時候,在Hive select查詢中一般會掃描整個表内容(暴力掃描),會消耗很多時間做沒必要的工作。有時候隻需要掃描表中關心的一部分資料,是以建表時引入了partition概念。

(1)、Hive的分區和mysql的分區差異:

mysql分區是将表中的字段拿來直接作為分區字段,而hive的分區則是分區字段不在表中。

(2)、怎麼分區:

根據業務分區,(完全看業務場景)選取id、年、月、日、男女性别、年齡段或者是能平均将資料分到不同檔案中最好,分區不好将直接導緻查詢結果延遲。

(3)、分區細節:

1、一個表可以擁有一個或者多個分區,每個分區以檔案夾的形式單獨存在表檔案夾的目錄下。

2、表和列名不區分大小寫。

3、分區是以字段的形式在表結構中存在,通過describe table指令可以檢視到字段存在(算是一個僞列),但是該字段不存放實際的資料内容,僅僅是分區的表示。

4、分區有一級、二級、三級和多級分區:

5、建立動态分區、靜态分區、混合分區:

動态分區:可以動态加載資料或分區個數不固定

靜态分區:可以靜态加入資料或分區個數确定

混合分區:動态和靜态結合加入資料,主分區必須為靜态分區,副分區可以為動态分區

example:
一、通過load的方式加載靜态分區資料
##看一下我們的資料
[[email protected] partitionData]$ cat pt_stu_nv.txt 
095041211011,zhangyan,21
095041211034,xiaoli,18
095041211056,xiaoxin,13
[[email protected] partitionData]$ cat pt_stu_nan.txt 
095041211001,xiaozhang,21
095041211029,xiaopeng,18
095041211089,xiaohong,11
##建立分區表,需要注意分區字段不在表字段中
hive (hwzhdb)> create external table static_partition_stu(
             > id int,
             > name string
             > )
             > partitioned by (age int)
             > row format delimited fields terminated by ','
             > stored as textfile;
OK
Time taken: 0.308 seconds
##加載分區資料,必須在表名後面加上分區字段和值。此時我們是加載的本地資料,生産來說的話是需要加載hdfs上的資料,因為資料做清洗操作後是存放在hdfs上面的。
hive (hwzhdb)> load data local inpath '/home/hadoop/data/partitionData/pt_stu_nan.txt' into table static_partition_stu partition(gender='nan');
Loading data to table hwzhdb.static_partition_stu partition (gender=nan)
Partition hwzhdb.static_partition_stu{gender=nan} stats: [numFiles=1, numRows=0, totalSize=76, rawDataSize=0]
OK
Time taken: 1.525 seconds
hive (hwzhdb)> load data local inpath '/home/hadoop/data/partitionData/pt_stu_nv.txt' into table static_partition_stu partition(gender='nv');
Loading data to table hwzhdb.static_partition_stu partition (gender=nv)
Partition hwzhdb.static_partition_stu{gender=nv} stats: [numFiles=1, numRows=0, totalSize=72, rawDataSize=0]
OK
Time taken: 0.757 seconds

hive (hwzhdb)> select * from static_partition_stu;
OK
static_partition_stu.id static_partition_stu.name       static_partition_stu.age       static_partition_stu.gender
NULL    xiaozhang       21      nan
NULL    xiaopeng        18      nan
NULL    xiaohong        11      nan
NULL    zhangyan        21      nv
NULL    xiaoli  18      nv
NULL    xiaoxin 13      nv
Time taken: 0.554 seconds, Fetched: 6 row(s)
hive (hwzhdb)> select * from static_partition_stu where gender='nan';
OK
static_partition_stu.id static_partition_stu.name       static_partition_stu.age       static_partition_stu.gender
NULL    xiaozhang       21      nan
NULL    xiaopeng        18      nan
NULL    xiaohong        11      nan
Time taken: 0.541 seconds, Fetched: 3 row(s)
hive (hwzhdb)> select * from static_partition_stu where gender='nv';
OK
static_partition_stu.id static_partition_stu.name       static_partition_stu.age       static_partition_stu.gender
NULL    zhangyan        21      nv
NULL    xiaoli  18      nv
NULL    xiaoxin 13      nv
Time taken: 0.143 seconds, Fetched: 3 row(s)
二、通過insert的方式加載靜态分區資料
##檢視資料
hive (hwzhdb)> select * from emp;
OK
emp.empno       emp.ename       emp.job emp.mgr emp.hiredate    emp.sal emp.comm       emp.deptno
7369    SMITH   CLERK   7902    1980-12-17    00:00:00  800.0   NULL    20
7499    ALLEN   SALESMAN        7698    1981-02-20    00:00:00  1600.0  300.0   30
7521    WARD    SALESMAN        7698    1981-02-22    00:00:00  1250.0  500.0   30
7566    JONES   MANAGER 7839    1981-04-02    00:00:00  2975.0  NULL    20
7654    MARTIN  SALESMAN        7698    1981-09-28    00:00:00  1250.0  1400.0  3
7698    BLAKE   MANAGER 7839    1981-05-01    00:00:00  2850.0  NULL    30
7782    CLARK   MANAGER 7839    1981-06-09    00:00:00  2450.0  NULL    10
7788    SCOTT   ANALYST 7566    1982-12-09    00:00:00  3000.0  NULL    20
7839    KING    PRESIDENT       NULL    1981-11-17    00:00:00  5000.0  NULL    10
7844    TURNER  SALESMAN        7698    1981-09-08    00:00:00  1500.0  0.0     30
7876    ADAMS   CLERK   7788    1983-01-12    00:00:00  1100.0  NULL    20
7900    JAMES   CLERK   7698    1981-12-03    00:00:00  950.0   NULL    30
7902    FORD    ANALYST 7566    1981-12-03    00:00:00  3000.0  NULL    20
7934    MILLER  CLERK   7782    1982-01-23    00:00:00  1300.0  NULL    10
##建立分區表,将部門編号作為分區字段
hive (hwzhdb)> create external table static_partition_insert_emp(
             > empno int,
             > ename string,
             > job string,
             > mgr int,
             > hiredate string,
             > sal double,
             > comm double
             > )
             > partitioned by (deptno int)
             > row format delimited fields terminated by '\t'
             > stored as textfile;
OK
Time taken: 0.154 seconds
##通過insert的方式從emp中向static_partition_insert_emp中加載資料,需要注意,因為此時emp表中的字段數量跟static_partition_insert_emp的字段數量是不對應的,将emp中的deptno字段作為static_partition_insert_emp表中的分區字段了,是以從emp中查出來的字段要一個個對應static_partition_insert_emp中的字段

insert into table static_partition_insert_emp partition(deptno=10)
select empno,ename,job,mgr,hiredate,sal,comm from emp where deptno=10;

insert into table static_partition_insert_emp partition(deptno=20)
select empno,ename,job,mgr,hiredate,sal,comm from emp where deptno=20;

insert into table static_partition_insert_emp partition(deptno=30)
select empno,ename,job,mgr,hiredate,sal,comm from emp where deptno=30;
hive (hwzhdb)> select * from static_partition_insert_emp;
OK
static_partition_insert_emp.empno       static_partition_insert_emp.ename       static_partition_insert_emp.job static_partition_insert_emp.mgr static_partition_insert_emp.hiredate    static_partition_insert_emp.sal static_partition_insert_emp.comm        static_partition_insert_emp.deptno
7782    CLARK   MANAGER 7839    1981-06-09    00:00:00  2450.0  NULL    10
7839    KING    PRESIDENT       NULL    1981-11-17    00:00:00  5000.0  NULL    10
7934    MILLER  CLERK   7782    1982-01-23    00:00:00  1300.0  NULL    10
7369    SMITH   CLERK   7902    1980-12-17    00:00:00  800.0   NULL    20
7566    JONES   MANAGER 7839    1981-04-02    00:00:00  2975.0  NULL    20
7788    SCOTT   ANALYST 7566    1982-12-09    00:00:00  3000.0  NULL    20
7876    ADAMS   CLERK   7788    1983-01-12    00:00:00  1100.0  NULL    20
7902    FORD    ANALYST 7566    1981-12-03    00:00:00  3000.0  NULL    20
7499    ALLEN   SALESMAN        7698    1981-02-20    00:00:00  1600.0  300.0   30
7521    WARD    SALESMAN        7698    1981-02-22    00:00:00  1250.0  500.0   30
7654    MARTIN  SALESMAN        7698    1981-09-28    00:00:00  1250.0  1400.0  30
7698    BLAKE   MANAGER 7839    1981-05-01    00:00:00  2850.0  NULL    30
7844    TURNER  SALESMAN        7698    1981-09-08    00:00:00  1500.0  0.0     30
7900    JAMES   CLERK   7698    1981-12-03    00:00:00  950.0   NULL    30
Time taken: 0.124 seconds, Fetched: 14 row(s)
hive (hwzhdb)> select * from static_partition_insert_emp where deptno=10;
OK
static_partition_insert_emp.empno       static_partition_insert_emp.ename       static_partition_insert_emp.job static_partition_insert_emp.mgr static_partition_insert_emp.hiredate    static_partition_insert_emp.sal static_partition_insert_emp.comm        static_partition_insert_emp.deptno
7782    CLARK   MANAGER 7839    1981-06-09    00:00:00  2450.0  NULL    10
7839    KING    PRESIDENT       NULL    1981-11-17    00:00:00  5000.0  NULL    10
7934    MILLER  CLERK   7782    1982-01-23    00:00:00  1300.0  NULL    10
Time taken: 0.221 seconds, Fetched: 3 row(s)
hive (hwzhdb)> select * from static_partition_insert_emp where deptno=20;
OK
static_partition_insert_emp.empno       static_partition_insert_emp.ename       static_partition_insert_emp.job static_partition_insert_emp.mgr static_partition_insert_emp.hiredate    static_partition_insert_emp.sal static_partition_insert_emp.comm        static_partition_insert_emp.deptno
7369    SMITH   CLERK   7902    1980-12-17    00:00:00  800.0   NULL    20
7566    JONES   MANAGER 7839    1981-04-02    00:00:00  2975.0  NULL    20
7788    SCOTT   ANALYST 7566    1982-12-09    00:00:00  3000.0  NULL    20
7876    ADAMS   CLERK   7788    1983-01-12    00:00:00  1100.0  NULL    20
7902    FORD    ANALYST 7566    1981-12-03    00:00:00  3000.0  NULL    20
Time taken: 0.113 seconds, Fetched: 5 row(s)
hive (hwzhdb)> select * from static_partition_insert_emp where deptno=30;
OK
static_partition_insert_emp.empno       static_partition_insert_emp.ename       static_partition_insert_emp.job static_partition_insert_emp.mgr static_partition_insert_emp.hiredate    static_partition_insert_emp.sal static_partition_insert_emp.comm        static_partition_insert_emp.deptno
7499    ALLEN   SALESMAN        7698    1981-02-20    00:00:00  1600.0  300.0   30
7521    WARD    SALESMAN        7698    1981-02-22    00:00:00  1250.0  500.0   30
7654    MARTIN  SALESMAN        7698    1981-09-28    00:00:00  1250.0  1400.0  30
7698    BLAKE   MANAGER 7839    1981-05-01    00:00:00  2850.0  NULL    30
7844    TURNER  SALESMAN        7698    1981-09-08    00:00:00  1500.0  0.0     30
7900    JAMES   CLERK   7698    1981-12-03    00:00:00  950.0   NULL    30
Time taken: 0.119 seconds, Fetched: 6 row(s)
三、多級靜态分區,因為一個分區對應的是hdfs上面的一個目錄,多級分區則相當于hdfs上的多級目錄
比如我們還是從emp表中拿資料:通過 job和deptno進行多級分區
hive (hwzhdb)> select * from emp;
OK
emp.empno       emp.ename       emp.job emp.mgr emp.hiredate    emp.sal emp.comm       emp.deptno
7369    SMITH   CLERK   7902    1980-12-17    00:00:00  800.0   NULL    20
7499    ALLEN   SALESMAN        7698    1981-02-20    00:00:00  1600.0  300.0   30
7521    WARD    SALESMAN        7698    1981-02-22    00:00:00  1250.0  500.0   30
7566    JONES   MANAGER 7839    1981-04-02    00:00:00  2975.0  NULL    20
7654    MARTIN  SALESMAN        7698    1981-09-28    00:00:00  1250.0  1400.0  30
7698    BLAKE   MANAGER 7839    1981-05-01    00:00:00  2850.0  NULL    30
7782    CLARK   MANAGER 7839    1981-06-09    00:00:00  2450.0  NULL    10
7788    SCOTT   ANALYST 7566    1982-12-09    00:00:00  3000.0  NULL    20
7839    KING    PRESIDENT       NULL    1981-11-17    00:00:00  5000.0  NULL    10
7844    TURNER  SALESMAN        7698    1981-09-08    00:00:00  1500.0  0.0     30
7876    ADAMS   CLERK   7788    1983-01-12    00:00:00  1100.0  NULL    20
7900    JAMES   CLERK   7698    1981-12-03    00:00:00  950.0   NULL    30
7902    FORD    ANALYST 7566    1981-12-03    00:00:00  3000.0  NULL    20
7934    MILLER  CLERK   7782    1982-01-23    00:00:00  1300.0  NULL    10
Time taken: 0.083 seconds, Fetched: 14 row(s)
##插入分區資料
insert into table static_mut_partition_insert_emp partition(job='MANAGER',deptno=10)
select empno,ename,mgr,hiredate,sal,comm from emp where job='MANAGER' and deptno=10;
##通過指定分區查出資料
hive (hwzhdb)> select * from static_mut_partition_insert_emp where job='MANAGER' and deptno=10;
OK
static_mut_partition_insert_emp.empno   static_mut_partition_insert_emp.ename   static_mut_partition_insert_emp.mgr     static_mut_partition_insert_emp.hiredate        static_mut_partition_insert_emp.sal     static_mut_partition_insert_emp.comm    static_mut_partition_insert_emp.job     static_mut_partition_insert_emp.deptno
7782    CLARK   7839    1981-06-09    00:00:00  2450.0  NULL    MANAGER 10
Time taken: 0.149 seconds, Fetched: 1 row(s)
四、動态分區,當分區個數不确定或者分區個數過多的時候可以使用動态分區,動态分區隻能使用insert的方式插入資料,并且使用動态分區需要先把動态分區的設定改為非嚴格模式,預設為嚴格模式(set hive.exec.dynamic.partition.mode=nonstrict; 如果動态分區為關閉狀态的話還需要打開動态分區的設定set hive.exec.dynamic.partition=true;)
##看一下我們的資料,我們以年齡作為分區字段
hive (hwzhdb)> select * from stus;
OK
stus.id stus.name       stus.sge
NULL    zhangyan        21
NULL    xiaoli  18
NULL    xiaoxin 13
NULL    xiaozhang       21
NULL    xiaopeng        18
NULL    xiaohong        1
Time taken: 0.087 seconds, Fetched: 6 row(s)
##建立動态分區表
hive (hwzhdb)> create table dynamic_insert_stu(
             > id int,
             > name string
             > )
             > partitioned by (age int)
             > row format delimited fields terminated by '\t'
             > stored as textfile;
OK
Time taken: 0.069 seconds
##從stus表中向dynamic_insert_stu中插入資料
hive (hwzhdb)> set hive.exec.dynamic.partition.mode=nonstrict;

insert into table dynamic_insert_stu partition(age)
select id,name,age from stus;
五、混合分區(待續...)
           

總結: 靜态分區表和動态分區表的差別?

1.建表無差別。

2.靜态分區能通過load和insert加載資料,不僅要指定分區字段,也要指定對應的值

load data local inpath ‘/home/hadoop/data/partitionData/pt_stu_nan.txt’ into table static_partition_stu partition(gender=‘nan’);

insert into table static_partition_insert_emp partition(deptno=10)

select empno,ename,job,mgr,hiredate,sal,comm from emp where deptno=10;

動态分區隻能通過insert的方式加載資料,隻需要指定分區字段就可以,但是必須要通過設定打開動态分區,并且動态分區的字段隻能放在select的最後一個字段,如果是多個動态分區字段,那對應放在select的最後幾個就行。

insert into table dynamic_insert_stu partition(age) select id,name,age

from stus;

Attention:

1、動态分區與靜态分區還有一個細微的差别是,靜态分區一定會建立分區,不管SELECT語句的結果有沒有資料。而動态分區,隻有在SELECT結果的記錄數>0的時候,才會建立分區。是以在不同的業務場景下,可能會選擇不同的方案。

2、另外使用動态分區時需要注意的比較重要的一點是,動态分區會為每一個分區配置設定reduce數。比如說你在腳本上面寫了:set

mapred.reduce.tasks=100;

這在分區值很多的情況下,會成為一個災難,容易直接把namenode給搞挂掉(因為namenode大量操作檔案),是非常危險的。是以使用動态分區時,一定要清楚地知道産生的動态分區值,并且合理地設定reduce數量。

繼續閱讀