Drill: An Advanced Query Tool for Big Data

Introduction

Apache Drill is a low-latency distributed interactive query engine for massive datasets, covering structured, semi-structured, and nested data. It uses ANSI-SQL-compatible syntax and supports backends such as the local file system, HDFS, HBase, and MongoDB, and data formats including Parquet, JSON, CSV, TSV, and PSV. Inspired by Google's Dremel, Drill targets interactive business-intelligence analysis over petabyte-scale data on clusters of thousands of nodes.

Installation

Drill can be installed on a single machine or on a cluster, and supports Linux, Windows, and Mac OS X. For simplicity, we set it up on a single Linux machine (CentOS 6.3) for evaluation.

Prepare the packages:

  • JDK 7: jdk-7u75-linux-x64.tar.gz
  • Drill: apache-drill-0.8.0.tar.gz

Install into the $WORK (/path/to/work) directory: extract the JDK and Drill into the java and drill directories respectively, and create symlinks to make future upgrades easier:

.
├── drill
│   ├── apache-drill -> apache-drill-0.8.0
│   └── apache-drill-0.8.0
├── init.sh
└── java
    ├── jdk -> jdk1.7.0_75
    └── jdk1.7.0_75
           

Then add an init.sh script to set up the Java-related environment variables:

export WORK="/path/to/work"
export JAVA="$WORK/java/jdk/bin/java"
export JAVA_HOME="$WORK/java/jdk"
           

Startup

To run in a single-machine environment, just start bin/sqlline:

$ cd $WORK
$ . ./init.sh
$ ./drill/apache-drill/bin/sqlline -u jdbc:drill:zk=local
Drill log directory /var/log/drill does not exist or is not writable, defaulting to ...
... org.glassfish.jersey.server.ApplicationHandler initialize
INFO: Initiating Jersey application, version Jersey: ...
sqlline version ...
0: jdbc:drill:zk=local>
           

The -u jdbc:drill:zk=local option means we use the local Drill instance, with no need to start ZooKeeper; in a cluster environment you would configure and start ZooKeeper and put its address here instead. Once it has started, you can type commands at the 0: jdbc:drill:zk=local> prompt.
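Besides sqlline, a running drillbit also exposes an HTTP REST interface on its web port (8047 by default). The sketch below is not part of the original article: it builds the JSON body that Drill's query.json REST endpoint expects and posts it with the Python standard library; host, port, and paths are assumptions you should adjust to your deployment.

```python
import json
import urllib.request

# Default drillbit web port; an assumption, adjust to your deployment.
DRILL_URL = "http://localhost:8047/query.json"

def build_query_payload(sql):
    """Build the request body for Drill's REST query endpoint."""
    return {"queryType": "SQL", "query": sql}

def run_query(sql, url=DRILL_URL):
    """POST a SQL statement to a running drillbit and return the parsed JSON response."""
    body = json.dumps(build_query_payload(sql)).encode("utf-8")
    req = urllib.request.Request(url, data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))

# Example (requires a local drillbit to be running):
#   rows = run_query("select * from dfs.`/path/to/work/drill/data/test.csv`")
```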

Trying It Out

Drill's sample-data directory contains Parquet-format demo data that you can query right away:

0: jdbc:drill:zk=local> select * from dfs.`/path/to/work/drill/apache-drill/sample-data/nation.parquet` limit 5;
+-------------+------------+-------------+------------+
| N_NATIONKEY |   N_NAME   | N_REGIONKEY | N_COMMENT  |
+-------------+------------+-------------+------------+
| 0           | ALGERIA    | 0           |  haggle. carefully f |
| 1           | ARGENTINA  | 1           | al foxes promise sly |
| 2           | BRAZIL     | 1           | y alongside of the p |
| 3           | CANADA     | 1           | eas hang ironic, sil |
| 4           | EGYPT      | 4           | y above the carefull |
+-------------+------------+-------------+------------+
5 rows selected (0.741 seconds)
           

The database name used here has the form dfs.`absolute path to a local file` (Parquet, JSON, CSV, etc.). As you can see, anyone familiar with SQL can use Drill with essentially no learning curve. Parquet files, however, need dedicated tools to view and edit, which is inconvenient; we will cover that separately later. The rest of this article demonstrates with the more accessible CSV and JSON formats.

Create the following test.csv file in $WORK/data:

1101,SteveEurich,Steve,Eurich,16,StoreT
1102,MaryPierson,Mary,Pierson,16,StoreT
1103,LeoJones,Leo,Jones,16,StoreTem
1104,NancyBeatty,Nancy,Beatty,16,StoreT
1105,ClaraMcNight,Clara,McNight,16,Store
           

Then query it:

0: jdbc:drill:zk=local> select * from dfs.`/path/to/work/drill/data/test.csv`;
+------------+
|  columns   |
+------------+
| ["1101","SteveEurich","Steve","Eurich","16","StoreT"] |
| ["1102","MaryPierson","Mary","Pierson","16","StoreT"] |
| ["1103","LeoJones","Leo","Jones","16","StoreTem"] |
| ["1104","NancyBeatty","Nancy","Beatty","16","StoreT"] |
| ["1105","ClaraMcNight","Clara","McNight","16","Store"] |
+------------+
5 rows selected (0.082 seconds)
           

The result looks slightly different from before: a CSV file has no place to store column names, so Drill collects every field into a single array column named columns. To project specific fields you index into it with columns[n], for example:

0: jdbc:drill:zk=local> select columns[0], columns[3] from dfs.`/path/to/work/drill/data/test.csv`;
+------------+------------+
|   EXPR$0   |   EXPR$1   |
+------------+------------+
| 1101       | Eurich     |
| 1102       | Pierson    |
| 1103       | Jones      |
| 1104       | Beatty     |
| 1105       | McNight    |
+------------+------------+
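For intuition, Drill's columns-array view of a CSV file can be mimicked in plain Python: every row becomes a list of strings, and columns[n] is simply list indexing. This is only a sketch of the model, not Drill itself, run over the same test.csv content:

```python
import csv
import io

# Same rows as the test.csv file above.
TEST_CSV = """\
1101,SteveEurich,Steve,Eurich,16,StoreT
1102,MaryPierson,Mary,Pierson,16,StoreT
1103,LeoJones,Leo,Jones,16,StoreTem
1104,NancyBeatty,Nancy,Beatty,16,StoreT
1105,ClaraMcNight,Clara,McNight,16,Store
"""

# Each CSV record is exposed as one array value, like Drill's `columns` column.
rows = [row for row in csv.reader(io.StringIO(TEST_CSV))]

# Equivalent of: select columns[0], columns[3] ...
projected = [(row[0], row[3]) for row in rows]
print(projected[0])  # → ('1101', 'Eurich')
```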
           

The CSV format is too simple to show off Drill's real strengths, so the more advanced features below are demonstrated with a JSON file, whose structure is much closer to Parquet.

Create the following test.json file in $WORK/data:

{
  "ka1": 1,
  "kb1": 10,
  "kc1": "vc11",
  "kd1": [
    {
      "ka2": 10,
      "kb2": 100,
      "kc2": "vc1010"
    }
  ]
}
{
  "ka1": 2,
  "kb1": 20,
  "kc1": "vc22",
  "kd1": [
    {
      "ka2": 20,
      "kb2": 200,
      "kc2": "vc2020"
    }
  ]
}
{
  "ka1": 3,
  "kb1": 30,
  "kc1": "vc33",
  "kd1": [
    {
      "ka2": 30,
      "kb2": 300,
      "kc2": "vc3030"
    }
  ]
}
           

As you can see, this JSON file has multiple levels of nesting and is considerably more complex than the CSV file above, and querying nested data is exactly where Drill shines.
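Note that the file is not one JSON array but a stream of concatenated top-level objects; Drill treats each object as one row. Python's json.load cannot read such a stream directly, but json.JSONDecoder.raw_decode can walk it. A sketch over a trimmed two-row sample (the field values here are illustrative):

```python
import json

# A trimmed stream of top-level objects, shaped like test.json.
STREAM = """
{"ka1": 1, "kb1": 10, "kc1": "vc11", "kd1": [{"ka2": 10, "kb2": 100, "kc2": "vc1010"}]}
{"ka1": 2, "kb1": 20, "kc1": "vc22", "kd1": [{"ka2": 20, "kb2": 200, "kc2": "vc2020"}]}
"""

def iter_json_objects(text):
    """Yield each top-level JSON object from a concatenated-object stream."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # Skip whitespace between objects.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        obj, pos = decoder.raw_decode(text, pos)
        yield obj

rows = list(iter_json_objects(STREAM))
print(rows[0]["kc1"])            # → vc11
print(rows[1]["kd1"][0]["kc2"])  # → vc2020
```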

0: jdbc:drill:zk=local> select * from dfs.`/path/to/work/drill/data/test.json`;
+------------+------------+------------+------------+
|    ka1     |    kb1     |    kc1     |    kd1     |
+------------+------------+------------+------------+
| 1          | 10         | vc11       | [{"ka2":10,"kb2":100,"kc2":"vc1010"}] |
| 2          | 20         | vc22       | [{"ka2":20,"kb2":200,"kc2":"vc2020"}] |
| 3          | 30         | vc33       | [{"ka2":30,"kb2":300,"kc2":"vc3030"}] |
+------------+------------+------------+------------+
3 rows selected (0.098 seconds)
           

select * only digs out the first level of the data; deeper levels are rendered as their raw JSON. We obviously care about more than the first level, and from here we can query however we like:

0: jdbc:drill:zk=local> select sum(ka1), avg(kd1[0].kb2) from dfs.`/path/to/work/drill/data/test.json`;
+------------+------------+
|   EXPR$0   |   EXPR$1   |
+------------+------------+
| 6          | 200.0      |
+------------+------------+
1 row selected (... seconds)
           

Here kd1[0] is the syntax for reaching into the nested, second-level table.
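As a sanity check, the same sum/avg can be computed in plain Python over rows shaped like the sample file. The numeric values below are the illustrative ones used in the sample, not part of the original article:

```python
# Rows shaped like test.json; the numeric values are illustrative.
rows = [
    {"ka1": 1, "kd1": [{"kb2": 100}]},
    {"ka1": 2, "kd1": [{"kb2": 200}]},
    {"ka1": 3, "kd1": [{"kb2": 300}]},
]

# Equivalent of: select sum(ka1), avg(kd1[0].kb2) ...
total = sum(r["ka1"] for r in rows)
average = sum(r["kd1"][0]["kb2"] for r in rows) / len(rows)
print(total, average)  # → 6 200.0
```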

0: jdbc:drill:zk=local> select kc1, kd1[0].kc2 from dfs.`/path/to/work/drill/data/test.json` where kd1[0].kb2 = 100 and ka1 = 1;
+------------+------------+
|    kc1     |   EXPR$1   |
+------------+------------+
| vc11       | vc1010     |
+------------+------------+
1 row selected (... seconds)
           

Create a view:

0: jdbc:drill:zk=local> create view dfs.tmp.tmpview as select kd1[0].kb2 from dfs.`/path/to/work/drill/data/test.json`;
+------------+------------+
|     ok     |  summary   |
+------------+------------+
| true       | View 'tmpview' created successfully in 'dfs.tmp' schema |
+------------+------------+
1 row selected (... seconds)

0: jdbc:drill:zk=local> select * from dfs.tmp.tmpview;
+------------+
|   EXPR$0   |
+------------+
| 100        |
| 200        |
| 300        |
+------------+
3 rows selected (... seconds)

           

You can also flatten the nested second-level table, merging kd1[0]..kd1[n] into one flat table:

0: jdbc:drill:zk=local> select kddb.kdtable.kc2 from (select flatten(kd1) kdtable from dfs.`/path/to/work/drill/data/test.json`) kddb;
+------------+
|   EXPR$0   |
+------------+
| vc1010     |
| vc2020     |
| vc3030     |
+------------+
3 rows selected (0.083 seconds)
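The semantics of flatten are easy to mimic: emit one output row per element of each row's nested kd1 array. A Python sketch over rows shaped like the sample file (field values illustrative):

```python
# Rows shaped like test.json, each with a nested kd1 array.
rows = [
    {"kc1": "vc11", "kd1": [{"kc2": "vc1010"}]},
    {"kc1": "vc22", "kd1": [{"kc2": "vc2020"}]},
    {"kc1": "vc33", "kd1": [{"kc2": "vc3030"}]},
]

# flatten(kd1): one output row per element of every kd1 array,
# then project kc2 from the flattened records.
flattened = [elem for row in rows for elem in row["kd1"]]
kc2_values = [elem["kc2"] for elem in flattened]
print(kc2_values)  # → ['vc1010', 'vc2020', 'vc3030']
```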
           

The usage details do differ from MySQL in places, and complex logic across multiple table levels takes effort; to use Drill fluently you will need to read the official documentation carefully and practice a lot...

A few takeaways of my own:

Data sources Drill can query:

  • MapR-FS
  • MapR-DB
  • MongoDB
  • File system
  • Hive
  • HBase

For details, see: http://doc.mapr.com/display/MapR/Connecting+Apache+Drill+to+Data+Sources

So far I have not found any code that performs insert, update, or delete operations in Drill, and the tests at https://github.com/apache/drill/blob/master/exec/jdbc/src/test/java/org/apache/drill/jdbc/test/ make no mention of insert, update, or delete support either.

A good tutorial for further study:

http://www.javacodegeeks.com/2015/02/analyze-highly-dynamic-datasets-apache-drill.html

Practice is what matters most; everything else is just talk...

Original article (in Chinese): http://segmentfault.com/a/1190000002652348