Hive: order by, sort by, distribute by, and cluster by

2023-07-22 19:17:43

sort by and order by

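As we know, in MapReduce the data within each partition is sorted by key, and there are as many partitions as there are reduce tasks. With only one partition, the data is therefore globally sorted.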

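sort by guarantees order only within each partition, that is, within the output of each reduce task. order by guarantees a global order: Hive sends all rows through a single reducer for the final sort, so the result is ordered as a whole, at the cost of that one reducer handling all the data.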
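The difference can be imitated outside Hive with a small Python simulation. This is only a sketch of the semantics, not Hive's implementation: the function names are hypothetical, and rows are dealt to the simulated reducers round-robin as a stand-in for however Hive actually distributes them under sort by.

```python
def sort_by(rows, num_reducers):
    """Imitate sort by: each simulated reducer sorts only the rows it received."""
    # Assumption: round-robin assignment stands in for Hive's real row distribution.
    partitions = [rows[i::num_reducers] for i in range(num_reducers)]
    return [sorted(p) for p in partitions]


def order_by(rows):
    """Imitate order by: a single reducer sees every row, so the order is global."""
    return sort_by(rows, 1)[0]


rows = [1, 3, 2, 9, 10, 8, 4, 7, 6, 5]
per_reducer = sort_by(rows, 2)
concatenated = [x for part in per_reducer for x in part]

print(per_reducer)     # each inner list is sorted on its own
print(concatenated)    # the concatenation is not sorted as a whole
print(order_by(rows))  # globally sorted
```

Each inner list comes out sorted, but joining them does not give a sorted sequence, which is the behaviour sort by shows once there is more than one reducer.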

To make the difference visible, first set the number of reduce tasks to a value greater than 1:

hive > set mapred.reduce.tasks=2;

Create the test table sortandorder:

create table sortandorder(
    id int
)
row format delimited
stored as textfile;

Load the test data:

hive > load data local inpath '/home/au/sortandorder' into table sortandorder;

The file sortandorder contains:
1
3
2
9
10
8
4
7
6
5

Query the data with order by:

hive > select * from sortandorder order by id;
Result:
1
2
3
4
5
6
7
8
9
10

Query the data with sort by:

hive > select * from sortandorder sort by id;
Result:
1
2
5
7
9
10
3
4
6
8

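With two reduce tasks, the results show that order by sorts globally, while sort by sorts only within each partition: the sort by output consists of two sorted runs (1, 2, 5, 7, 9, 10 and 3, 4, 6, 8), one per reducer. The number of partitions equals the number of reduce tasks.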

distribute by and cluster by

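distribute by: partitions the rows by the specified column or expression, so that rows with the same value are sent to the same reducer and therefore end up in the same output file.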

cluster by: combines distribute by with sort by on the same columns, so cluster by id is shorthand for distribute by id sort by id.
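The two definitions can be sketched as a short Python simulation. It is an illustration under stated assumptions rather than Hive's actual code: the function names are hypothetical, and a plain modulo on a non-negative integer key stands in for Hive's hash partitioning.

```python
def distribute_by(rows, key, num_reducers):
    """Imitate distribute by: rows with the same key land on the same reducer."""
    partitions = [[] for _ in range(num_reducers)]
    for row in rows:
        # Assumption: modulo on a non-negative integer key stands in for Hive's hash partitioning.
        partitions[key(row) % num_reducers].append(row)
    return partitions


def cluster_by(rows, key, num_reducers):
    """Imitate cluster by: distribute by the key, then sort each partition by it."""
    return [sorted(p, key=key) for p in distribute_by(rows, key, num_reducers)]


rows = [1, 3, 2, 9, 10, 8, 4, 7, 6, 5]
distributed = distribute_by(rows, key=lambda r: r, num_reducers=2)
clustered = cluster_by(rows, key=lambda r: r, num_reducers=2)

print(distributed)  # same key -> same partition, arrival order kept
print(clustered)    # same partitions, each one sorted
```

In this sketch the unsorted partitions simply keep arrival order; in Hive, distribute by alone gives no guarantee about the order of rows within a partition.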

Using the sortandorder table from above, partition the rows with distribute by and write the result to the local directory /home/au/distributeandcluster/:

-- reuses the sortandorder table from above
insert overwrite local directory '/home/au/distributeandcluster/'
select id from sortandorder distribute by id;

After the statement runs, the directory /home/au/distributeandcluster/ contains two files (000000_0 and 000001_0), and the rows inside each file are in no guaranteed order.

Likewise, partition the rows with cluster by and write the result to the same local directory /home/au/distributeandcluster/:

-- reuses the sortandorder table from above
insert overwrite local directory '/home/au/distributeandcluster/'
select id from sortandorder cluster by id;

After the statement runs, the directory /home/au/distributeandcluster/ again contains two files (000000_0 and 000001_0), and this time the rows inside each file are sorted.
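Rather than inspecting the output files by eye, the ordering can be checked with a short script along these lines. It is a sketch that assumes each output file holds one integer id per line, as in this single-column example; the function name sorted_report is hypothetical.

```python
from pathlib import Path


def sorted_report(directory):
    """Map each regular, non-hidden file in `directory` to whether its ids are ascending."""
    report = {}
    for path in sorted(Path(directory).iterdir()):
        # Skip subdirectories and hidden files.
        if not path.is_file() or path.name.startswith("."):
            continue
        # Assumption: one integer id per line, nothing else in the file.
        ids = [int(token) for token in path.read_text().split()]
        report[path.name] = ids == sorted(ids)
    return report


if __name__ == "__main__":
    target = Path("/home/au/distributeandcluster/")
    if target.is_dir():
        print(sorted_report(target))
```

Run after the cluster by statement, every file should be reported as sorted; after the distribute by statement, a file may or may not be, since no order is guaranteed there.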
