
Learning Apache Drill with the MapR Sandbox, Lesson 2: Run Queries with ANSI SQL (Translated)

Goal

This lesson shows how to do some standard SQL analysis in Apache Drill: for example, summarizing data with simple aggregate functions and joining data sources. Note that Apache Drill provides ANSI SQL support, not a "SQL-like" interface.

Queries in This Lesson

Now that you know what the data sources look like in their raw form from the select * queries, try running some simple but more meaningful queries on each data source. These queries demonstrate how Drill supports ANSI SQL constructs and how Drill can combine data from different data sources in a single SELECT statement.

  • Show an aggregate query on a single file or table, using the GROUP BY, WHERE, HAVING, and ORDER BY clauses.
  • Perform joins between the Hive, MapR-DB, and file system data sources.
  • Use table and column aliases.
  • Create a Drill view.

Aggregation

Set the schema to hive:

0: jdbc:drill:> use hive.`default`;
+-------+-------------------------------------------+
|  ok   |                  summary                  |
+-------+-------------------------------------------+
| true  | Default schema changed to [hive.default]  |
+-------+-------------------------------------------+
1 row selected 
           

Return sales totals by month:

0: jdbc:drill:> select `month`, sum(order_total) from orders group by `month` order by 2 desc;
+------------+---------+
|   month    | EXPR$1  |
+------------+---------+
| June       | 950481  |
| May        | 947796  |
| March      | 836809  |
| April      | 807291  |
| July       | 757395  |
| October    | 676236  |
| August     | 572269  |
| February   | 532901  |
| September  | 373100  |
| January    | 346536  |
+------------+---------+
10 rows selected 
           

Drill supports SQL aggregate functions such as SUM, MAX, AVG, and MIN. Standard SQL clauses work in Drill queries the same way they do in relational database queries.

Note the back ticks around the column name `month` in the query; they are required only because "month" is a reserved word in SQL.
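
As a quick illustration of the other aggregate functions mentioned above, here is a minimal sketch against the same orders table (the state and order_total column names are taken from the queries in this lesson):

-- MIN, MAX, and AVG work on order_total the same way SUM does.
select state,
       min(order_total) as min_order,
       max(order_total) as max_order,
       avg(order_total) as avg_order
from orders
group by state
limit 5;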

Return the top 20 sales totals by month and state:

0: jdbc:drill:> select `month`, state, sum(order_total) as sales from orders group by `month`, state order by 3 desc limit 20;
+-----------+--------+---------+
|   month   | state  |  sales  |
+-----------+--------+---------+
| May       | ca     | 119586  |
| June      | ca     | 116322  |
| April     | ca     | 101363  |
| March     | ca     | 99540   |
| July      | ca     | 90285   |
| October   | ca     | 80090   |
| June      | tx     | 78363   |
| May       | tx     | 77247   |
| March     | tx     | 73815   |
| August    | ca     | 71255   |
| April     | tx     | 68385   |
| July      | tx     | 63858   |
| February  | ca     | 63527   |
| June      | fl     | 62199   |
| June      | ny     | 62052   |
| May       | fl     | 61651   |
| May       | ny     | 59369   |
| October   | tx     | 55076   |
| March     | fl     | 54867   |
| March     | ny     | 52101   |
+-----------+--------+---------+
20 rows selected 
           

Note the alias for the result of the SUM function. Drill supports both column aliases and table aliases.

HAVING Clause

This query uses the HAVING clause to constrain an aggregate result.

Set the workspace to dfs.clicks:

0: jdbc:drill:> use dfs.clicks;
+-------+-----------------------------------------+
|  ok   |                 summary                 |
+-------+-----------------------------------------+
| true  | Default schema changed to [dfs.clicks]  |
+-------+-----------------------------------------+
1 row selected
           

Return the total number of clicks for devices that show high click activity:

0: jdbc:drill:> select t.user_info.device, count(*) from `clicks/clicks.json` t  group by t.user_info.device having count(*) > 1000;
+---------+---------+
| EXPR$0  | EXPR$1  |
+---------+---------+
| IOS5    | 11814   |
| AOS4.2  | 5986    |
| IOS6    | 4464    |
| IOS7    | 3135    |
| AOS4.4  | 1562    |
| AOS4.3  | 3039    |
+---------+---------+
6 rows selected
           

The aggregate is a count of the records for each distinct mobile device in the clickstream data. Only devices that registered more than 1000 transactions qualify for the result set.
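
For comparison, a WHERE clause filters individual rows before they are aggregated, whereas HAVING filters the aggregated groups. Here is a minimal sketch of the same count restricted with WHERE instead, assuming the clicks.json records also carry a user_info.state field as in the sandbox sample data:

-- WHERE filters rows before grouping; HAVING would filter the groups afterward.
select t.user_info.device, count(*)
from `clicks/clicks.json` t
where t.user_info.state = 'ca'
group by t.user_info.device;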

UNION Operator

Continue using the same workspace as above (dfs.clicks).

Combine click activity from before and after the marketing campaign:

0: jdbc:drill:> select t.trans_id transaction, t.user_info.cust_id customer from `clicks/clicks.campaign.json` t union all select u.trans_id, u.user_info.cust_id  from `clicks/clicks.json` u limit 5;
+-------------+------------+
| transaction |  customer  |
+-------------+------------+
| 35232       | 18520      |
| 31995       | 17182      |
| 35760       | 18228      |
| 37090       | 17015      |
| 37838       | 18737      |
+-------------+------------+  
           

The UNION ALL query returns all rows that exist in the two files, clicks.campaign.json and clicks.json (including any rows that are duplicated across those files).
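
By contrast, UNION without ALL removes duplicate rows from the combined result. A minimal sketch, assuming your Drill version supports plain UNION in addition to UNION ALL (plain UNION was added in later Drill releases):

-- UNION (without ALL) eliminates rows duplicated across the two inputs.
select t.trans_id, t.user_info.cust_id from `clicks/clicks.campaign.json` t
union
select u.trans_id, u.user_info.cust_id from `clicks/clicks.json` u
limit 5;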

Subqueries

Set the workspace to hive:

0: jdbc:drill:> use hive.`default`;
+-------+-------------------------------------------+
|  ok   |                  summary                  |
+-------+-------------------------------------------+
| true  | Default schema changed to [hive.default]  |
+-------+-------------------------------------------+
1 row selected
           

Compare order totals across states:

0: jdbc:drill:> select ny_sales.cust_id, ny_sales.total_orders, ca_sales.total_orders
from
(select o.cust_id, sum(o.order_total) as total_orders from hive.orders o where state = 'ny' group by o.cust_id) ny_sales
left outer join
(select o.cust_id, sum(o.order_total) as total_orders from hive.orders o where state = 'ca' group by o.cust_id) ca_sales
on ny_sales.cust_id = ca_sales.cust_id
order by ny_sales.cust_id
limit 20;
+------------+------------+------------+
|  cust_id   |  ny_sales  |  ca_sales  |
+------------+------------+------------+
| 1001       | 72         | 47         |
| 1002       | 108        | 198        |
| 1003       | 83         | null       |
| 1004       | 86         | 210        |
| 1005       | 168        | 153        |
| 1006       | 29         | 326        |
| 1008       | 105        | 168        |
| 1009       | 443        | 127        |
| 1010       | 75         | 18         |
| 1012       | 110        | null       |
| 1013       | 19         | null       |
| 1014       | 106        | 162        |
| 1015       | 220        | 153        |
| 1016       | 85         | 159        |
| 1017       | 82         | 56         |
| 1019       | 37         | 196        |
| 1020       | 193        | 165        |
| 1022       | 124        | null       |
| 1023       | 166        | 149        |
| 1024       | 233        | null       |
+------------+------------+------------+
           

This example demonstrates Drill's support for subqueries.

CAST Function

Use the maprdb workspace:

0: jdbc:drill:> use maprdb;
+-------+-------------------------------------+
|  ok   |               summary               |
+-------+-------------------------------------+
| true  | Default schema changed to [maprdb]  |
+-------+-------------------------------------+
1 row selected (0.088 seconds)
           

Return customer data with appropriate data types:

0: jdbc:drill:> select cast(row_key as int) as cust_id, cast(t.personal.name as varchar(20)) as name, 
cast(t.personal.gender as varchar(10)) as gender, cast(t.personal.age as varchar(10)) as age,
cast(t.address.state as varchar(4)) as state, cast(t.loyalty.agg_rev as dec(7,2)) as agg_rev, 
cast(t.loyalty.membership as varchar(20)) as membership
from customers t limit 5;
+----------+----------------------+-----------+-----------+--------+----------+-------------+
| cust_id  |         name         |  gender   |    age    | state  | agg_rev  | membership  |
+----------+----------------------+-----------+-----------+--------+----------+-------------+
| 10001    | "Corrine Mecham"     | "FEMALE"  | "15-20"   | "va"   | 197.00   | "silver"    |
| 10005    | "Brittany Park"      | "MALE"    | "26-35"   | "in"   | 230.00   | "silver"    |
| 10006    | "Rose Lokey"         | "MALE"    | "26-35"   | "ca"   | 250.00   | "silver"    |
| 10007    | "James Fowler"       | "FEMALE"  | "51-100"  | "me"   | 263.00   | "silver"    |
| 10010    | "Guillermo Koehler"  | "OTHER"   | "51-100"  | "mn"   | 202.00   | "silver"    |
+----------+----------------------+-----------+-----------+--------+----------+-------------+
           

Note the following features of this query:

  • The CAST function is required for every column in the table. It returns the MapR-DB/HBase binary data as readable integers and strings. Alternatively, you can use the CONVERT_TO/CONVERT_FROM functions to decode the string columns; in most cases CONVERT_TO/CONVERT_FROM are more efficient than CAST. Use CONVERT_TO only to convert binary types to any type other than VARCHAR. (See the sketch after this list.)
  • The row_key column functions as the primary key of the table (in this case, the customer ID).
  • The table alias t is required; otherwise the column family names would be parsed as table names and the query would return an error.
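
As a rough illustration of the CONVERT_FROM alternative mentioned in the first point, here is a minimal sketch against the same customers table, assuming the string columns are stored as UTF-8 encoded binary (other columns may require different encodings):

-- CONVERT_FROM decodes binary column values using the named encoding.
select convert_from(row_key, 'UTF8')          as cust_id,
       convert_from(t.personal.name, 'UTF8')  as name,
       convert_from(t.address.state, 'UTF8')  as state
from customers t
limit 5;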

Remove the quotes from the strings:

You can use the regexp_replace function to remove the quotes around strings in the query results. For example, to return the state name va for Virginia rather than "va":

0: jdbc:drill:> select cast(row_key as int), regexp_replace(cast(t.address.state as varchar(10)),'"','')
from customers t limit 1;
+------------+------------+
|   EXPR$0   |   EXPR$1   |
+------------+------------+
| 10001      | va         |
+------------+------------+
1 row selected  
           

CREATE VIEW Command

0: jdbc:drill:> use dfs.views;
+-------+----------------------------------------+
|  ok   |                summary                 |
+-------+----------------------------------------+
| true  | Default schema changed to [dfs.views]  |
+-------+----------------------------------------+
1 row selected
           

Use a mutable workspace:

A mutable (or writable) workspace is a workspace that supports "write" operations. This property is configured as part of the storage plugin. You can create Drill views and tables in mutable workspaces.
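
For reference, here is a minimal sketch of what such a workspace entry might look like in the dfs storage plugin configuration (the location path is an assumption based on the sandbox layout; check the actual plugin configuration in the Drill Web UI):

"workspaces": {
  "views": {
    "location": "/mapr/demo.mapr.com/data/views",
    "writable": true,
    "defaultInputFormat": null
  }
}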

Create a view on the MapR-DB table:

0: jdbc:drill:> create or replace view custview as select cast(row_key as int) as cust_id,
cast(t.personal.name as varchar(20)) as name, 
cast(t.personal.gender as varchar(10)) as gender, 
cast(t.personal.age as varchar(10)) as age, 
cast(t.address.state as varchar(4)) as state,
cast(t.loyalty.agg_rev as dec(7,2)) as agg_rev,
cast(t.loyalty.membership as varchar(20)) as membership
from maprdb.customers t;
+-------+-------------------------------------------------------------+
|  ok   |                           summary                           |
+-------+-------------------------------------------------------------+
| true  | View 'custview' created successfully in 'dfs.views' schema  |
+-------+-------------------------------------------------------------+
1 row selected
           

Drill provides CREATE OR REPLACE VIEW syntax, similar to relational databases, for creating views. The OR REPLACE option makes it easier to update the view later without having to drop it first. Note that the FROM clause in this example must reference maprdb.customers; the MapR-DB tables are not directly visible to the dfs.views workspace.

Unlike views in traditional databases, which are typically a DBA/developer-driven operation, file-system-based views in Drill are very lightweight. A view is simply a special file with a specific extension (.drill). You can store views in your local file system or point them at a specific workspace. You can specify a query against any Drill data source in the body of the CREATE VIEW statement.
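
Because the view lives in the dfs.views workspace, it can also be referenced with a fully qualified name from any other workspace. A minimal sketch, assuming custview was created as shown above:

-- The fully qualified name works regardless of the current workspace.
select * from dfs.views.custview limit 1;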

Drill provides a decentralized metadata model. Drill can query metadata defined in data sources such as Hive, HBase, and the file system. Drill also supports creating metadata in the file system.

Query the data through the view:

0: jdbc:drill:> select * from custview limit 1;
+----------+-------------------+-----------+----------+--------+----------+-------------+
| cust_id  |       name        |  gender   |   age    | state  | agg_rev  | membership  |
+----------+-------------------+-----------+----------+--------+----------+-------------+
| 10001    | "Corrine Mecham"  | "FEMALE"  | "15-20"  | "va"   | 197.00   | "silver"    |
+----------+-------------------+-----------+----------+--------+----------+-------------+
1 row selected
           

Once users know what data is available by exploring it directly in the file system, views can be used to expose that data to downstream tools such as Tableau and MicroStrategy for analysis and visualization. To these tools, a view simply appears as a "table" with selectable "columns".

Query Across Data Sources

Continue using the dfs.views workspace.

Join the customers view and the orders table:

0: jdbc:drill:> select membership, sum(order_total) as sales from hive.orders, custview
where orders.cust_id=custview.cust_id
group by membership order by 2;
+------------+------------+
| membership |   sales    |
+------------+------------+
| "basic"    | 380665     |
| "silver"   | 708438     |
| "gold"     | 2787682    |
+------------+------------+
3 rows selected
           

In this query, we are reading data from the MapR-DB table (represented by custview) and combining it with the orders table information in Hive. When doing cross-data-source queries like this, you need to use fully qualified table/view names. For example, the orders table is prefixed with "hive", which is the name of the storage plugin registered with Drill. We do not use any prefix for custview because we explicitly switched to the dfs.views workspace, where custview is stored.

Note: If the results of any of your queries appear to be truncated because the rows are wide, you can increase the maximum display width to 10000.
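
A minimal sketch of that setting, assuming the standard sqlline !set syntax used by the Drill shell:

0: jdbc:drill:> !set maxwidth 10000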

這組指令不使用分号。

Join the customer, order, and clickstream data:

0: jdbc:drill:> select custview.membership, sum(orders.order_total) as sales from hive.orders, custview,
dfs.`/mapr/demo.mapr.com/data/nested/clicks/clicks.json` c 
where orders.cust_id=custview.cust_id and orders.cust_id=c.user_info.cust_id 
group by custview.membership order by 2;
+------------+------------+
| membership |   sales    |
+------------+------------+
| "basic"    | 372866     |
| "silver"   | 728424     |
| "gold"     | 7050198    |
+------------+------------+
3 rows selected
           

This three-way join selects from three different data sources in a single query:

  • the hive.orders table
  • custview (a view of the HBase customers table)
  • the clicks.json file

The join column for both join conditions is the cust_id column. The views workspace is used for this query so that custview is accessible; the hive.orders table can also be queried without difficulty.

However, note that the JSON file is not directly visible from the views workspace, so the query specifies the full path to the file:

dfs.`/mapr/demo.mapr.com/data/nested/clicks/clicks.json`
           

Lesson 3: Run Queries on Complex Data Types →