Goal
This lesson shows how to do some standard SQL analysis in Apache Drill: for example, summarizing data with simple aggregate functions and joining data sources. Note that Apache Drill provides ANSI SQL support, not a "SQL-like" interface.
Queries in This Lesson
Now that you know what the data sources look like in their raw form, from the select * queries, try running some simple but more meaningful queries on each data source. These queries demonstrate how Drill supports ANSI SQL constructs and how Drill combines data from different data sources in a single SELECT statement.
- Show an aggregate query on a single file or table, using GROUP BY, WHERE, HAVING, and ORDER BY clauses.
- Perform joins between the Hive, MapR-DB, and file system data sources.
- Use table and column aliases.
- Create a Drill view.
Aggregation
Set the schema to hive:
0: jdbc:drill:> use hive.`default`;
+-------+-------------------------------------------+
| ok | summary |
+-------+-------------------------------------------+
| true | Default schema changed to [hive.default] |
+-------+-------------------------------------------+
1 row selected
Return sales totals by month:
0: jdbc:drill:> select `month`, sum(order_total) from orders group by `month` order by 2 desc;
+------------+---------+
| month | EXPR$1 |
+------------+---------+
| June | 950481 |
| May | 947796 |
| March | 836809 |
| April | 807291 |
| July | 757395 |
| October | 676236 |
| August | 572269 |
| February | 532901 |
| September | 373100 |
| January | 346536 |
+------------+---------+
10 rows selected
Drill supports SQL aggregate functions such as SUM, MAX, AVG, and MIN. Standard SQL clauses work in Drill queries the same way they do in relational database queries.
Note the back ticks around `month` in the query; they are required only because "month" is a reserved word in SQL.
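As a hedged sketch of the other aggregate functions (this query is not part of the original lesson; it assumes the same hive orders table):

```sql
-- Illustrative only: AVG, MIN, and MAX on the same orders table,
-- grouped by state instead of month.
select state,
       avg(order_total) as avg_order,
       min(order_total) as min_order,
       max(order_total) as max_order
from orders
group by state
order by avg_order desc;
```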
Return the top 20 sales totals by month and state:
0: jdbc:drill:> select `month`, state, sum(order_total) as sales from orders group by `month`, state order by 3 desc limit 20;
+-----------+--------+---------+
| month | state | sales |
+-----------+--------+---------+
| May | ca | 119586 |
| June | ca | 116322 |
| April | ca | 101363 |
| March | ca | 99540 |
| July | ca | 90285 |
| October | ca | 80090 |
| June | tx | 78363 |
| May | tx | 77247 |
| March | tx | 73815 |
| August | ca | 71255 |
| April | tx | 68385 |
| July | tx | 63858 |
| February | ca | 63527 |
| June | fl | 62199 |
| June | ny | 62052 |
| May | fl | 61651 |
| May | ny | 59369 |
| October | tx | 55076 |
| March | fl | 54867 |
| March | ny | 52101 |
+-----------+--------+---------+
20 rows selected
Note the alias for the result of the SUM function. Drill supports both column aliases and table aliases.
HAVING Clause
This query uses the HAVING clause to constrain an aggregate result.
Set the dfs.clicks workspace:
0: jdbc:drill:> use dfs.clicks;
+-------+-----------------------------------------+
| ok | summary |
+-------+-----------------------------------------+
| true | Default schema changed to [dfs.clicks] |
+-------+-----------------------------------------+
1 row selected
Return the total number of clicks for devices that registered high click-throughs:
0: jdbc:drill:> select t.user_info.device, count(*) from `clicks/clicks.json` t group by t.user_info.device having count(*) > 1000;
+---------+---------+
| EXPR$0 | EXPR$1 |
+---------+---------+
| IOS5 | 11814 |
| AOS4.2 | 5986 |
| IOS6 | 4464 |
| IOS7 | 3135 |
| AOS4.4 | 1562 |
| AOS4.3 | 3039 |
+---------+---------+
6 rows selected
The aggregate is a count of the records for each distinct mobile device in the clickstream data. Only devices that registered more than 1000 transactions qualify for the result set.
UNION Operator
Use the same workspace as before (dfs.clicks).
Combine clickstream activities from before and after the marketing campaign:
0: jdbc:drill:> select t.trans_id transaction, t.user_info.cust_id customer from `clicks/clicks.campaign.json` t union all select u.trans_id, u.user_info.cust_id from `clicks/clicks.json` u limit 5;
+-------------+------------+
| transaction | customer |
+-------------+------------+
| 35232 | 18520 |
| 31995 | 17182 |
| 35760 | 18228 |
| 37090 | 17015 |
| 37838 | 18737 |
+-------------+------------+
This UNION ALL query returns rows that exist in the two files (including any rows duplicated across those files): clicks.campaign.json and clicks.json.
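For contrast, newer Drill releases also support UNION without ALL, which removes duplicate rows. A hedged sketch against the same two files (not part of the original lesson; it assumes a Drill version with UNION DISTINCT support):

```sql
-- Illustrative only: UNION (no ALL) de-duplicates rows that appear
-- in both files before the limit is applied.
select t.trans_id, t.user_info.cust_id
from `clicks/clicks.campaign.json` t
union
select u.trans_id, u.user_info.cust_id
from `clicks/clicks.json` u
limit 5;
```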
Subqueries
Set the Hive workspace:
0: jdbc:drill:> use hive.`default`;
+-------+-------------------------------------------+
| ok | summary |
+-------+-------------------------------------------+
| true | Default schema changed to [hive.default] |
+-------+-------------------------------------------+
1 row selected
Compare order totals across states:
0: jdbc:drill:> select ny_sales.cust_id, ny_sales.total_orders, ca_sales.total_orders
from
(select o.cust_id, sum(o.order_total) as total_orders from hive.orders o where state = 'ny' group by o.cust_id) ny_sales
left outer join
(select o.cust_id, sum(o.order_total) as total_orders from hive.orders o where state = 'ca' group by o.cust_id) ca_sales
on ny_sales.cust_id = ca_sales.cust_id
order by ny_sales.cust_id
limit 20;
+------------+------------+------------+
| cust_id | ny_sales | ca_sales |
+------------+------------+------------+
| 1001 | 72 | 47 |
| 1002 | 108 | 198 |
| 1003 | 83 | null |
| 1004 | 86 | 210 |
| 1005 | 168 | 153 |
| 1006 | 29 | 326 |
| 1008 | 105 | 168 |
| 1009 | 443 | 127 |
| 1010 | 75 | 18 |
| 1012 | 110 | null |
| 1013 | 19 | null |
| 1014 | 106 | 162 |
| 1015 | 220 | 153 |
| 1016 | 85 | 159 |
| 1017 | 82 | 56 |
| 1019 | 37 | 196 |
| 1020 | 193 | 165 |
| 1022 | 124 | null |
| 1023 | 166 | 149 |
| 1024 | 233 | null |
+------------+------------+------------+
This example demonstrates Drill's support for subqueries.
CAST Function
Use the maprdb workspace:
0: jdbc:drill:> use maprdb;
+-------+-------------------------------------+
| ok | summary |
+-------+-------------------------------------+
| true | Default schema changed to [maprdb] |
+-------+-------------------------------------+
1 row selected (0.088 seconds)
Return customer data with the appropriate data types:
0: jdbc:drill:> select cast(row_key as int) as cust_id, cast(t.personal.name as varchar(20)) as name,
cast(t.personal.gender as varchar(10)) as gender, cast(t.personal.age as varchar(10)) as age,
cast(t.address.state as varchar(4)) as state, cast(t.loyalty.agg_rev as dec(7,2)) as agg_rev,
cast(t.loyalty.membership as varchar(20)) as membership
from customers t limit 5;
+----------+----------------------+-----------+-----------+--------+----------+-------------+
| cust_id | name | gender | age | state | agg_rev | membership |
+----------+----------------------+-----------+-----------+--------+----------+-------------+
| 10001 | "Corrine Mecham" | "FEMALE" | "15-20" | "va" | 197.00 | "silver" |
| 10005 | "Brittany Park" | "MALE" | "26-35" | "in" | 230.00 | "silver" |
| 10006 | "Rose Lokey" | "MALE" | "26-35" | "ca" | 250.00 | "silver" |
| 10007 | "James Fowler" | "FEMALE" | "51-100" | "me" | 263.00 | "silver" |
| 10010 | "Guillermo Koehler" | "OTHER" | "51-100" | "mn" | 202.00 | "silver" |
+----------+----------------------+-----------+-----------+--------+----------+-------------+
Note the following features of this query:
- The CAST function is required for every column in the table. It returns the binary MapR-DB/HBase data as readable integers and strings. Alternatively, you can use the CONVERT_TO/CONVERT_FROM functions to decode the columns; CONVERT_TO/CONVERT_FROM are more efficient than CAST in most cases. Use CONVERT_TO only to convert binary types to any type other than VARCHAR.
- The row_key column functions as the primary key of the table (in this case, a customer ID).
- The table alias t is required; otherwise, the column family names would be parsed as table names and the query would return an error.
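As a hedged sketch of the CONVERT_FROM alternative mentioned above (not part of the original lesson; the 'UTF8' encoding is an assumption and must match how the bytes were actually written to MapR-DB):

```sql
-- Illustrative only: decode binary columns with CONVERT_FROM instead
-- of CAST. Encoding names such as 'UTF8' or 'INT_BE' must match the
-- stored byte format of each column.
select convert_from(row_key, 'UTF8') as cust_id,
       convert_from(t.personal.name, 'UTF8') as name
from customers t limit 5;
```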
Remove the quotes from the strings:
You can use the regexp_replace function to remove the quotes around the strings in the query results. For example, to return the state name va rather than "va":
0: jdbc:drill:> select cast(row_key as int), regexp_replace(cast(t.address.state as varchar(10)),'"','')
from customers t limit 1;
+------------+------------+
| EXPR$0 | EXPR$1 |
+------------+------------+
| 10001 | va |
+------------+------------+
1 row selected
CREATE VIEW Command
0: jdbc:drill:> use dfs.views;
+-------+----------------------------------------+
| ok | summary |
+-------+----------------------------------------+
| true | Default schema changed to [dfs.views] |
+-------+----------------------------------------+
1 row selected
Use a mutable workspace:
A mutable (or writable) workspace is a workspace that supports "write" operations. This attribute is part of the storage plugin configuration. You can create Drill views and tables in mutable workspaces.
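For reference, writability is a per-workspace flag in the storage plugin configuration. A minimal sketch of a dfs plugin workspace entry (the location path shown here is an assumption, not taken from the original lesson):

```json
{
  "workspaces": {
    "views": {
      "location": "/mapr/demo.mapr.com/data/views",
      "writable": true,
      "defaultInputFormat": null
    }
  }
}
```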
Create a view on a MapR-DB table:
0: jdbc:drill:> create or replace view custview as select cast(row_key as int) as cust_id,
cast(t.personal.name as varchar(20)) as name,
cast(t.personal.gender as varchar(10)) as gender,
cast(t.personal.age as varchar(10)) as age,
cast(t.address.state as varchar(4)) as state,
cast(t.loyalty.agg_rev as dec(7,2)) as agg_rev,
cast(t.loyalty.membership as varchar(20)) as membership
from maprdb.customers t;
+-------+-------------------------------------------------------------+
| ok | summary |
+-------+-------------------------------------------------------------+
| true | View 'custview' created successfully in 'dfs.views' schema |
+-------+-------------------------------------------------------------+
1 row selected
Drill provides CREATE OR REPLACE VIEW syntax, similar to relational databases, for creating views. The OR REPLACE option makes it easier to update a view later without having to remove it first. Note that the FROM clause in this example must refer to maprdb.customers; the MapR-DB tables are not directly visible from the dfs.views workspace.
Unlike a traditional database, where creating views is typically a DBA/developer-driven operation, file-system-based views in Drill are very lightweight. A view is simply a special file with a specific extension (.drill). You can store views in your local file system or point to a specific workspace, and you can specify any query against any Drill data source in the body of the CREATE VIEW statement.
Drill provides a decentralized metadata model: it can query metadata defined in data sources such as Hive, HBase, and the file system, and it also supports creating metadata in the file system.
Query the data in the view:
0: jdbc:drill:> select * from custview limit 1;
+----------+-------------------+-----------+----------+--------+----------+-------------+
| cust_id | name | gender | age | state | agg_rev | membership |
+----------+-------------------+-----------+----------+--------+----------+-------------+
| 10001 | "Corrine Mecham" | "FEMALE" | "15-20" | "va" | 197.00 | "silver" |
+----------+-------------------+-----------+----------+--------+----------+-------------+
1 row selected
Once users know what data is available by exploring it directly in the file system, views can be used to read the data into downstream tools such as Tableau and MicroStrategy for analysis and visualization. To these tools, a view simply appears as a "table" with selectable "columns".
Query Across Data Sources
Continue using dfs.views for these queries.
Join the customer view and the orders table:
0: jdbc:drill:> select membership, sum(order_total) as sales from hive.orders, custview
where orders.cust_id=custview.cust_id
group by membership order by 2;
+------------+------------+
| membership | sales |
+------------+------------+
| "basic" | 380665 |
| "silver" | 708438 |
| "gold" | 2787682 |
+------------+------------+
3 rows selected
In this query, we are reading data from the MapR-DB table (represented by custview) and combining it with the orders table in Hive. When running cross-data-source queries like this, you need to use fully qualified table/view names. For example, the orders table is prefixed by "hive," which is the name of the storage plugin registered with Drill. No prefix is needed for "custview" because we explicitly switched to the dfs.views workspace where custview is stored.
Note: If the results of any of your queries appear to be truncated because the rows are wide, set the maximum display width to 10000.
This command does not use a semicolon.
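A hedged sketch of that setting (this assumes the standard sqlline shell that Drill ships with):

```
!set maxwidth 10000
```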
Join the customer, orders, and clickstream data:
0: jdbc:drill:> select custview.membership, sum(orders.order_total) as sales from hive.orders, custview,
dfs.`/mapr/demo.mapr.com/data/nested/clicks/clicks.json` c
where orders.cust_id=custview.cust_id and orders.cust_id=c.user_info.cust_id
group by custview.membership order by 2;
+------------+------------+
| membership | sales |
+------------+------------+
| "basic" | 372866 |
| "silver" | 728424 |
| "gold" | 7050198 |
+------------+------------+
3 rows selected
This three-way join selects from three different data sources in a single query:
- the hive.orders table
- custview (a view of the HBase customers table)
- the clicks.json file
The join column for both pairs of join conditions is the cust_id column. The views workspace is used for this query so that custview is accessible; the hive.orders table is again readily queryable.
However, note that the JSON file is not directly visible from the views workspace, so the query specifies the full path to the file:
dfs.`/mapr/demo.mapr.com/data/nested/clicks/clicks.json`
Lesson 3: Run Queries on Complex Data Types →