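1. Load

Hive performs no transformation when loading data into a table. A load is a pure copy/move operation that places the data files in the location that corresponds to the Hive table.

Syntax: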

load data [local] inpath 'filepath' [overwrite] into table tablename [partition (partcol1=val1, partcol2=val2 ...)]

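Notes:

1. filepath

    Relative path, for example: project/data1

    Absolute path, for example: /user/hive/project/data1

    Full URI, for example: hdfs://namenode:9000/user/hive/project/data1

filepath can refer to a single file (in which case Hive moves that file into the table) or to a directory (in which case Hive moves all files in that directory into the table).

2. local

If local is specified, the load command looks for filepath in the local file system.

The load command then copies the files at filepath to the target file system, which is determined by the table's location attribute. The copied data files are moved to the location of the table's data.

If the local keyword is not specified and filepath is a full URI, Hive uses that URI directly. Otherwise, if no scheme or authority is given, Hive uses the scheme and authority defined in the Hadoop configuration, where fs.default.name specifies the URI of the NameNode.

3. overwrite

If the overwrite keyword is used, the existing contents of the target table (or partition) are deleted, and then the contents of the file or directory that filepath points to are added to the table or partition.

If the target table (or partition) already contains a file whose name conflicts with a file name in filepath, the existing file is replaced by the new file.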

2. Insert

In Hive, insert is mainly used together with a select query to write the query result into a table, for example:

insert overwrite table stu_buck select * from student cluster by(Sno);

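The number of columns in the query result must match the number of columns in the target table.

If the data type of a selected column differs from the type of the corresponding target column, Hive attempts a conversion. The conversion is not guaranteed to succeed, and values that fail to convert become NULL.

You can also insert the rows selected from a table back into that same table, which, when appending with insert into, in effect duplicates its data.

Multi inserts: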

from source_table
insert overwrite table tablename1 [partition (partcol1=val1, partcol2=val2)]
select_statement1
insert overwrite table tablename2 [partition (partcol1=val1, partcol2=val2)]
select_statement2 ...

Dynamic partition inserts:

insert overwrite table tablename partition (partcol1[=val1], partcol2[=val2] ...)
select_statement FROM from_statement

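Dynamic partition values are matched by position. The values selected from the source table are mapped to the output partition columns purely by their position in the select list, not by name: the last columns of the select list supply the partition values, in the order in which the partition columns are declared.

Exporting table data

Syntax: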

insert overwrite [local] directory directory1 select ... from ...

Multiple inserts:

from from_statement
insert overwrite [local] directory directory1 select_statement1
[insert overwrite [local] directory directory2 select_statement2] ...

When data is written to the file system it is serialized as text: columns are separated by ^A and rows are terminated by \n.
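A minimal sketch of reading such an export back outside Hive, assuming the default delimiters described above (^A, that is byte 0x01, between columns and \n between rows). The function name parse_hive_text and the sample content are made up for illustration.

```python
# Default Hive text serialization: columns separated by ^A (0x01), rows by \n.
FIELD_DELIMITER = "\x01"


def parse_hive_text(text):
    """Split exported text into rows, and each row into its column values."""
    return [line.split(FIELD_DELIMITER) for line in text.splitlines() if line]


# Hypothetical content of one exported file with two columns.
sample = "1\x01a\n2\x01b\n"
rows = parse_hive_text(sample)
print(rows)
```

All values come back as strings; any type conversion is up to the reader of the file.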

3. Select

Basic select syntax

select [all | distinct] select_expr, select_expr, ...
from table_reference
join table_other on expr
[where where_condition]
[group by col_list [having condition]]
[cluster by col_list
| [distribute by col_list] [sort by| order by col_list]
]
[limit number]

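Notes:

    1. order by sorts the input globally, so it runs with a single reducer. With large inputs this can take a long time.

    2. sort by is not a global sort. It sorts the rows before they are fed to each reducer. So with sort by and mapred.reduce.tasks > 1, the output of each reducer is sorted, but the overall output is not guaranteed to be globally ordered.

    3. distribute by (column) sends rows to different reducers according to the given column, using hashing.

    4. cluster by (column) does what distribute by does and additionally sorts by that column.

    When distribute by and sort by use the same column, cluster by = distribute by + sort by.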
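The difference between these clauses can be modelled in a few lines of Python. This is only a sketch of the semantics, not of how Hive implements them: the function names, the use of hash(key) % n as the partitioning rule, and the sample student numbers are all assumptions made for illustration.

```python
def distribute_by(rows, key, num_reducers):
    """Assign each row to a reducer by hashing its key (here: hash % n)."""
    reducers = [[] for _ in range(num_reducers)]
    for row in rows:
        reducers[hash(key(row)) % num_reducers].append(row)
    return reducers


def sort_by(reducers, key):
    """Sort rows inside each reducer independently; no global order results."""
    return [sorted(rows, key=key) for rows in reducers]


def cluster_by(rows, key, num_reducers):
    """cluster by = distribute by + sort by on the same column."""
    return sort_by(distribute_by(rows, key, num_reducers), key)


def order_by(rows, key):
    """A global sort: conceptually everything passes through one reducer."""
    return sorted(rows, key=key)


def identity(value):
    return value


# Hypothetical student numbers. Small integers hash to themselves in
# CPython, which keeps this example deterministic.
snos = [95005, 95001, 95004, 95002, 95003, 95006]

clustered = cluster_by(snos, identity, num_reducers=2)
flattened = [sno for reducer in clustered for sno in reducer]
print(clustered)
print(flattened == order_by(snos, identity))
```

Each inner list is sorted, but concatenating the reducers' outputs does not give a globally sorted result, which is the gap that order by closes at the cost of a single reducer.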

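4. Hive join

Besides the inner, left, right, and full joins found in traditional databases, Hive also supports left semi join and cross join, although both of these can be expressed with the former join types.

Hive supports equi-joins (a.id = b.id) but not non-equi joins (a.id > b.id), because non-equi joins are very hard to translate into map/reduce jobs. Hive also supports joining more than two tables.

Keep the following key points in mind when writing join queries:

How each map/reduce job processes a join

The reducer buffers the rows of every table in the join sequence except the last one, and streams the last table through to serialize the results to the file system. This implementation helps reduce memory usage on the reduce side. In practice, put the largest table last; otherwise a lot of memory is wasted on buffering.

The left, right, and full outer keywords control how rows without a match are handled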

select a.val, b.val from a left outer join b on (a.key=b.key)

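Every row of table a produces at least one output row. Where a.key = b.key, the output is a.val, b.val; where no row of b has a matching key, the output is:

    a.val, null

So all rows of table a are preserved.

"a right outer join b" preserves all rows of table b.

The join happens before the where clause

To restrict the output of a join, put the filter condition in the where clause or in the join clause. A common source of confusion here is partitioned tables: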

select a.val, b.val from a
left outer join b on (a.key=b.key)
where a.ds='2009-07-07' and b.ds='2009-07-07'

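This joins table a to table b (an outer join) and lists a.val and b.val. The where clause can also reference other columns as filters. However, as described above, when no row of b matches a row of a, every column of b is null in the output, including ds. The condition b.ds='2009-07-07' therefore discards every joined row for which no matching b.key was found, and the left outer part of the join becomes irrelevant. The fix is to use the following syntax for the outer join: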

select a.val, b.val from a left outer join b
on (a.key=b.key and
b.ds='2009-07-07' and
a.ds='2009-07-07')

The result of this query is filtered during the join phase, so the problem above does not occur. The same logic applies to right and full outer joins.
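The effect of moving the partition filter from where into on can be reproduced with a small Python model of a left outer join. The helper left_outer_join and the sample rows are hypothetical; only the column names key, val, and ds and the date come from the queries above.

```python
def left_outer_join(left, right, on):
    """Keep every left row; pair it with each matching right row, or None."""
    result = []
    for l in left:
        matches = [r for r in right if on(l, r)]
        if matches:
            result.extend((l, r) for r in matches)
        else:
            result.append((l, None))
    return result


DAY = "2009-07-07"

# Hypothetical sample rows.
a = [{"key": 1, "val": "a1", "ds": DAY},
     {"key": 2, "val": "a2", "ds": DAY},
     {"key": 3, "val": "a3", "ds": DAY}]
b = [{"key": 1, "val": "b1", "ds": DAY},
     {"key": 2, "val": "b2", "ds": "2009-07-08"}]

# Filter in WHERE: join on the key only, then filter the joined rows.
# A null b.ds never equals DAY, so unmatched rows are dropped as well.
joined = left_outer_join(a, b, lambda l, r: l["key"] == r["key"])
where_result = [(l["val"], r["val"]) for l, r in joined
                if l["ds"] == DAY and r is not None and r["ds"] == DAY]

# Filter in ON: the date conditions are part of the match itself.
joined = left_outer_join(
    a, b,
    lambda l, r: l["key"] == r["key"] and l["ds"] == DAY and r["ds"] == DAY)
on_result = [(l["val"], r["val"] if r else None) for l, r in joined]

print(where_result)
print(on_result)
```

With the filter in where, only the fully matched row survives; with the filter in on, every row of a is kept and unmatched rows carry null for b.val. Under standard outer join semantics, a condition on a inside on does not remove rows of a, it only prevents them from matching; to drop rows of a, that condition still belongs in where.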

Joins are not commutative

Joins are left-associative, whether they are left or right joins.

select a.val1, a.val2, b.val, c.val
from a
join b on (a.key = b.key)
left outer join c on (a.key = c.key)

Table a is joined to table b first, discarding every row whose join key has no match; the intermediate result is then joined with table c.

Insert query statements

Multi insert:

create table source_table (id int, name string) row format delimited fields terminated by ',';

create table test_insert1 (id int) row format delimited fields terminated by ',';

create table test_insert2 (name string) row format delimited fields terminated by ',';

from source_table
insert overwrite table test_insert1
select id
insert overwrite table test_insert2
select name;

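Dynamic partition insert

-- Enables dynamic partitioning; the default is false (disabled) in older Hive versions.
set hive.exec.dynamic.partition=true;

-- Dynamic partition mode. The default, strict, requires at least one static partition; nonstrict allows all partition columns to be dynamic.
set hive.exec.dynamic.partition.mode=nonstrict;

Requirement:

    Insert the data in dynamic_partition_table into the matching partitions of the target table d_p_t, according to the date (day).

    Source table: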

create table dynamic_partition_table(day string,ip string)row format delimited fields terminated by ",";

load data local inpath '/root/hivedata/dynamic_partition_table.txt' into table dynamic_partition_table;

2015-05-10,ip1
2015-05-10,ip2
2015-06-14,ip3
2015-06-14,ip4
2015-06-15,ip1
2015-06-15,ip2

Target table:

create table d_p_t(ip string) partitioned by (month string,day string);

Dynamic insert:

insert overwrite table d_p_t partition (month,day)
select ip,substr(day,1,7) as month,day from dynamic_partition_table;
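To see which partitions this statement produces, the positional mapping can be modelled in Python with the six sample rows shown earlier. This is a sketch of the semantics only; the function names are hypothetical, and the slice day[:7] stands in for substr(day,1,7).

```python
from collections import defaultdict

# Sample rows of dynamic_partition_table as (day, ip).
source_rows = [
    ("2015-05-10", "ip1"), ("2015-05-10", "ip2"),
    ("2015-06-14", "ip3"), ("2015-06-14", "ip4"),
    ("2015-06-15", "ip1"), ("2015-06-15", "ip2"),
]


def select_rows(rows):
    """select ip, substr(day,1,7) as month, day"""
    return [(ip, day[:7], day) for day, ip in rows]


def dynamic_partition_insert(selected, num_partition_columns):
    """Use the trailing select columns, by position, as partition values."""
    partitions = defaultdict(list)
    for row in selected:
        data = row[:-num_partition_columns]
        partition_values = row[-num_partition_columns:]
        partitions[partition_values].append(data)
    return dict(partitions)


# d_p_t is partitioned by (month, day), so the last two columns are used.
result = dynamic_partition_insert(select_rows(source_rows), 2)
for (month, day), data in sorted(result.items()):
    print("month=%s/day=%s -> %s" % (month, day, data))
```

Swapping the last two columns in the select list would swap the values written to month and day, regardless of the aliases, because only the position counts.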

Exporting query results to the file system

    Save the query result to a given directory, which can be local or on HDFS:

insert overwrite local directory '/root/123456'
select * from t_p;

insert overwrite directory '/aaa/test'
select * from t_p;

Joins in Hive

Prepare the data

a.txt:

1,a
2,b
3,c
4,d
7,y
8,u

b.txt:

2,bb
3,cc
7,yy
9,pp

Create the tables:

create table a(id int,name string)
row format delimited fields terminated by ',';

create table b(id int,name string)
row format delimited fields terminated by ',';

Load the data:

load data local inpath '/root/hivedata/a.txt' into table a;

load data local inpath '/root/hivedata/b.txt' into table b;

Experiments:

    ** inner join

select * from a inner join b on a.id=b.id;

select a.id,a.name from a join b on a.id = b.id;

select a.* from a join b on a.id = b.id;

+-------+---------+-------+---------+--+
| a.id  | a.name  | b.id  | b.name  |
+-------+---------+-------+---------+--+
| 2     | b       | 2     | bb      |
| 3     | c       | 3     | cc      |
| 7     | y       | 7     | yy      |
+-------+---------+-------+---------+--+

    ** left join

select * from a left join b on a.id=b.id;

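In addition to the matched rows shown for the inner join, the result contains the rows of a that have no match in b:

| 1     | a       | NULL  | NULL    |
| 4     | d       | NULL  | NULL    |
| 8     | u       | NULL  | NULL    |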

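    ** right join

select * from a right join b on a.id=b.id;

select * from b right join a on b.id=a.id;

The first query keeps every row of b, so in addition to the matched rows it returns:

| NULL  | NULL    | 9     | pp      |

The second query keeps every row of a instead, so it returns the same rows as a left join b, with the columns of b listed first.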

    ** full outer join

select * from a full outer join b on a.id=b.id;

    ** Hive-specific joins

select * from a left semi join b on a.id = b.id;

+-------+---------+--+
| a.id  | a.name  |
+-------+---------+--+
| 2     | b       |
| 3     | c       |
| 7     | y       |
+-------+---------+--+

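    This is equivalent to:

select a.id,a.name from a where a.id in (select b.id from b); -- very inefficient in Hive

select a.id,a.name from a join b on (a.id = b.id);

    The join form gives the same result here because each id appears at most once in b.

    cross join (use with caution)

Returns the Cartesian product of the two tables; no join key is required.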

select a.*,b.* from a cross join b;
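The join experiments above can be cross-checked with a small Python model that uses the same rows as a.txt and b.txt. It is a sketch of the join semantics only; the function names are hypothetical, and None plays the role of NULL.

```python
# Rows of tables a and b as (id, name), taken from a.txt and b.txt.
a = [(1, "a"), (2, "b"), (3, "c"), (4, "d"), (7, "y"), (8, "u")]
b = [(2, "bb"), (3, "cc"), (7, "yy"), (9, "pp")]


def inner_join(left, right):
    return [(l, r) for l in left for r in right if l[0] == r[0]]


def left_join(left, right):
    result = []
    for l in left:
        matches = [r for r in right if r[0] == l[0]]
        if matches:
            result.extend((l, r) for r in matches)
        else:
            result.append((l, None))  # no match: right side is NULL
    return result


def right_join(left, right):
    # Same as a left join with the roles of the two tables swapped.
    return [(l, r) for r, l in left_join(right, left)]


def full_outer_join(left, right):
    matched_ids = {l[0] for l in left}
    unmatched_right = [(None, r) for r in right if r[0] not in matched_ids]
    return left_join(left, right) + unmatched_right


def left_semi_join(left, right):
    # Returns left rows only, each at most once, if any right row matches.
    right_ids = {r[0] for r in right}
    return [l for l in left if l[0] in right_ids]


def cross_join(left, right):
    return [(l, r) for l in left for r in right]


print(left_semi_join(a, b))
print(len(cross_join(a, b)))
```

The cross join of these two small tables already has 6 x 4 = 24 rows, which is why it should be used with caution on real data.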

Built-in JSON function

select get_json_object(line,'$.movie') as movie,get_json_object(line,'$.rate') as rate from rat_json limit 10;
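For readers who want to check what such a query extracts, here is a rough Python equivalent for simple top-level paths. It is a sketch, not Hive's implementation: it only handles paths of the form '$.key', and the sample line is hypothetical (only the field names movie, rate, timeStamp, and uid come from the queries in this article).

```python
import json


def get_json_object(json_text, path):
    """Mimic get_json_object for top-level paths such as '$.movie'.

    Returns None (standing in for NULL) when the input is not valid JSON
    or the key is missing.
    """
    if not path.startswith("$."):
        raise ValueError("this sketch only supports paths of the form '$.key'")
    try:
        obj = json.loads(json_text)
    except ValueError:
        return None
    if not isinstance(obj, dict):
        return None
    value = obj.get(path[2:])
    return None if value is None else str(value)


# Hypothetical line of rating.json.
line = '{"movie":"1193","rate":"5","timeStamp":"978300760","uid":"1"}'

print(get_json_object(line, "$.movie"))
print(get_json_object(line, "$.rate"))
```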

transform example:

    1. First load the rating.json file into a raw Hive table, rat_json

create table rat_json(line string) row format delimited;

load data local inpath '/root/hivedata/rating.json' into table rat_json;

    2. Parse the JSON data into four fields and insert them into a new table, t_rating

drop table if exists t_rating;

create table t_rating(movieid string,rate int,timestring string,uid string)
row format delimited fields terminated by '\t';

insert overwrite table t_rating
select get_json_object(line,'$.movie') as movie,get_json_object(line,'$.rate') as rate,get_json_object(line,'$.timeStamp') as timestring, get_json_object(line,'$.uid') as uid from rat_json limit 10;

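    3. Use transform with Python to convert the Unix time into a weekday

        First create a Python script file

        Python code:

vi weekday_mapper.py

#!/usr/bin/env python
import sys
import datetime

# Read tab-separated rows from stdin and replace the Unix time
# with the ISO weekday (1 = Monday ... 7 = Sunday).
for line in sys.stdin:
    line = line.strip()
    movieid, rating, unixtime, userid = line.split('\t')
    weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
    print('\t'.join([movieid, rating, str(weekday), userid]))

    Save the file.

    Then add the file to Hive as a resource so that it is shipped with the job:

        hive>add FILE /root/hivedata/weekday_mapper.py;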

create table u_data_new as select
transform (movieid, rate, timestring,uid)
using 'python weekday_mapper.py'
as (movieid, rate, weekday,uid)
from t_rating;

select distinct(weekday) from u_data_new limit 10;
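Before running the script through transform, the per-line logic can be checked locally. The sketch below wraps the same conversion in a hypothetical function map_line. One deliberate difference, which is an assumption made here for reproducibility: it interprets the timestamp in UTC, whereas the script above uses the local time zone of the machine it runs on. The sample input line is also made up.

```python
import datetime


def map_line(line, tz=datetime.timezone.utc):
    """Replace the Unix time in a tab-separated row with its ISO weekday."""
    movieid, rating, unixtime, userid = line.strip().split("\t")
    moment = datetime.datetime.fromtimestamp(float(unixtime), tz)
    return "\t".join([movieid, rating, str(moment.isoweekday()), userid])


# Hypothetical row of t_rating: movieid, rate, timestring, uid.
sample = "1193\t5\t978300760\t1\n"
mapped = map_line(sample)
print(mapped)
```

978300760 is 2000-12-31 22:12:40 UTC, a Sunday, so the third field becomes 7. In a time zone ahead of UTC by two hours or more, the original script would report Monday (1) for the same row.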

desc formatted student;
