Sqoop is, at heart, a command-line tool; compared with HDFS or MapReduce there is no deep theory behind it.
We can list Sqoop's available commands with sqoop help:
16/11/13 20:10:17 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6
usage: sqoop COMMAND [ARGS]

Available commands:
  codegen            Generate code to interact with database records
  create-hive-table  Import a table definition into Hive
  eval               Evaluate a SQL statement and display the results
  export             Export an HDFS directory to a database table
  help               List available commands
  import             Import a table from a database to HDFS
  import-all-tables  Import tables from a database to HDFS
  import-mainframe   Import datasets from a mainframe server to HDFS
  job                Work with saved jobs
  list-databases     List available databases on a server
  list-tables        List available tables in a database
  merge              Merge results of incremental imports
  metastore          Run a standalone Sqoop metastore
  version            Display version information

See 'sqoop help COMMAND' for information on a specific command.
Among these, the most frequently used commands are still import and export.
1. codegen
codegen maps the records of a relational database table to a Java source file, the compiled Java class, and a packaged jar. The generated Java file contains one field for each column of the table. The generated class and jar files are also used by the metastore feature. The full option list for this command is available via sqoop help codegen.
Example:

sqoop codegen --connect jdbc:mysql://localhost:3306/test --table order_info --outdir /home/xiaosi/test/ --username root --password root

The example above generates Java code from the order_info table of the test database; --outdir specifies where the generated Java source is written.
The run produces the following output:
16/11/13 21:50:34 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6
Enter password:
16/11/13 21:50:38 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
16/11/13 21:50:38 INFO tool.CodeGenTool: Beginning code generation
16/11/13 21:50:38 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `order_info` AS t LIMIT 1
16/11/13 21:50:38 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `order_info` AS t LIMIT 1
16/11/13 21:50:38 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /opt/hadoop-2.7.2
Note: /tmp/sqoop-xiaosi/compile/ea41fe40e1f12f6b052ad9fe4a5d9710/order_info.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
16/11/13 21:50:39 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-xiaosi/compile/ea41fe40e1f12f6b052ad9fe4a5d9710/order_info.jar
We can also use --bindir to set the output directory for the compiled class file and the packaged jar:
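The original post omits the exact command here; a plausible invocation, reconstructed from the output below (the --bindir and --outdir paths are assumptions based on the log and the summary that follows), would be:

sqoop codegen --connect jdbc:mysql://localhost:3306/test --table order_info --bindir /home/xiaosi/data --outdir /home/xiaosi/test --username root -P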
16/11/13 21:53:55 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6
Enter password:
16/11/13 21:53:58 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
16/11/13 21:53:58 INFO tool.CodeGenTool: Beginning code generation
16/11/13 21:53:58 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `order_info` AS t LIMIT 1
16/11/13 21:53:58 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `order_info` AS t LIMIT 1
16/11/13 21:53:58 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /opt/hadoop-2.7.2
Note: /home/xiaosi/data/order_info.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
16/11/13 21:53:59 INFO orm.CompilationManager: Writing jar file: /home/xiaosi/data/order_info.jar
The example above writes the compiled class file (order_info.class) and the packaged jar (order_info.jar) to /home/xiaosi/data, while the Java source (order_info.java) goes to /home/xiaosi/test.
2. create-hive-table
This command was already used in the previous post, [Sqoop Import and Export]; it creates a Hive table whose schema matches that of the relational database table. The full option list is available via sqoop help create-hive-table.
sqoop create-hive-table --connect jdbc:mysql://localhost:3306/test --table employee --username root --password root --fields-terminated-by ','
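As a quick sanity check (a sketch; it assumes the hive CLI is on the PATH and that the table landed in the default database, neither of which the original post states), you can inspect the table definition Sqoop created:

hive -e 'DESCRIBE employee;'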
3. eval
The eval tool lets Sqoop run SQL statements directly against the relational database. Before running an import, you can use it to verify that the relevant SQL statement is correct, and it prints the results to the console.
3.1 Evaluating a SELECT query
With eval we can evaluate any kind of SQL query. Take the order_info table in the test database as an example:
sqoop eval --connect jdbc:mysql://localhost:3306/test --username root --query "select * from order_info limit 3" -P
Output:
16/11/13 22:25:19 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6
Enter password:
16/11/13 22:25:22 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
------------------------------------------------------------
| id              | order_time | business |
------------------------------------------------------------
| 358574046793404 | 2016-04-05 | flight   |
| 358574046794733 | 2016-08-03 | hotel    |
| 358574050631177 | 2016-05-08 | vacation |
------------------------------------------------------------
3.2 Evaluating an INSERT statement
The eval tool is not limited to SELECT queries; it also works for statements that modify data, which means we can use eval to run INSERT statements. The following command inserts a new row into the order_info table of the test database:
sqoop eval --connect jdbc:mysql://localhost:3306/test --username root --query "insert into order_info (id, order_time, business) values('358574050631166', '2016-11-13', 'hotel')" -P
Output:
16/11/13 22:29:42 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6
Enter password:
16/11/13 22:29:44 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
16/11/13 22:29:44 INFO tool.EvalSqlTool: 1 row(s) updated.
If the command executes successfully, the count of updated rows is displayed on the console. Alternatively, we can query MySQL for the row we just inserted:
mysql> select * from order_info where id = "358574050631166";
+-----------------+------------+----------+
| id              | order_time | business |
+-----------------+------------+----------+
| 358574050631166 | 2016-11-13 | hotel    |
+-----------------+------------+----------+
1 row in set (0.00 sec)
4. export
Exports data from HDFS into a relational database. The full option list is available via sqoop help export.
Here is an example of employee data stored in an HDFS file:
hadoop fs -text /user/xiaosi/employee/* | less

yoona,qunar,創新事業部
xiaosi,qunar,創新事業部
jim,ali,淘寶
kom,ali,淘寶
lucy,baidu,搜尋
jim,ali,淘寶
Before exporting data from HDFS to a relational database, you must create a table in the database to receive the data, for example:
CREATE TABLE `employee` (
  `name` varchar(255) DEFAULT NULL,
  `company` varchar(255) DEFAULT NULL,
  `depart` varchar(255) DEFAULT NULL
);
Now run the export:
sqoop export --connect jdbc:mysql://localhost:3306/test --table employee --export-dir /user/xiaosi/employee --username root -m 1 --fields-terminated-by ',' -P

16/11/13 23:40:49 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
16/11/13 23:40:49 INFO mapreduce.Job: Running job: job_local611430785_0001
16/11/13 23:40:49 INFO mapred.LocalJobRunner: OutputCommitter set in config null
16/11/13 23:40:49 INFO mapred.LocalJobRunner: OutputCommitter is org.apache.sqoop.mapreduce.NullOutputCommitter
16/11/13 23:40:49 INFO mapred.LocalJobRunner: Waiting for map tasks
16/11/13 23:40:49 INFO mapred.LocalJobRunner: Starting task: attempt_local611430785_0001_m_000000_0
16/11/13 23:40:49 INFO mapred.Task: Using ResourceCalculatorProcessTree : [ ]
16/11/13 23:40:49 INFO mapred.MapTask: Processing split: Paths:/user/xiaosi/employee/part-m-00000:0+120
16/11/13 23:40:49 INFO Configuration.deprecation: map.input.file is deprecated. Instead, use mapreduce.map.input.file
16/11/13 23:40:49 INFO Configuration.deprecation: map.input.start is deprecated. Instead, use mapreduce.map.input.start
16/11/13 23:40:49 INFO Configuration.deprecation: map.input.length is deprecated. Instead, use mapreduce.map.input.length
16/11/13 23:40:49 INFO mapreduce.AutoProgressMapper: Auto-progress thread is finished. keepGoing=false
16/11/13 23:40:49 INFO mapred.LocalJobRunner:
16/11/13 23:40:49 INFO mapred.Task: Task:attempt_local611430785_0001_m_000000_0 is done. And is in the process of committing
16/11/13 23:40:49 INFO mapred.LocalJobRunner: map
16/11/13 23:40:49 INFO mapred.Task: Task 'attempt_local611430785_0001_m_000000_0' done.
16/11/13 23:40:49 INFO mapred.LocalJobRunner: Finishing task: attempt_local611430785_0001_m_000000_0
16/11/13 23:40:49 INFO mapred.LocalJobRunner: map task executor complete.
16/11/13 23:40:50 INFO mapreduce.Job: Job job_local611430785_0001 running in uber mode : false
16/11/13 23:40:50 INFO mapreduce.Job: map 100% reduce 0%
16/11/13 23:40:50 INFO mapreduce.Job: Job job_local611430785_0001 completed successfully
16/11/13 23:40:50 INFO mapreduce.Job: Counters: 20
	File System Counters
		FILE: Number of bytes read=22247825
		FILE: Number of bytes written=22732498
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=126
		HDFS: Number of bytes written=0
		HDFS: Number of read operations=12
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=0
	Map-Reduce Framework
		Map input records=6
		Map output records=6
		Input split bytes=136
		Spilled Records=0
		Failed Shuffles=0
		Merged Map outputs=0
		GC time elapsed (ms)=0
		Total committed heap usage (bytes)=245366784
	File Input Format Counters
		Bytes Read=0
	File Output Format Counters
		Bytes Written=0
16/11/13 23:40:50 INFO mapreduce.ExportJobBase: Transferred 126 bytes in 2.3492 seconds (53.6344 bytes/sec)
16/11/13 23:40:50 INFO mapreduce.ExportJobBase: Exported 6 records.
Once the export finishes, we can query the employee table in MySQL:
mysql> select name, company from employee;
+--------+---------+
| name   | company |
+--------+---------+
| yoona  | qunar   |
| xiaosi | qunar   |
| jim    | ali     |
| kom    | ali     |
| lucy   | baidu   |
| jim    | ali     |
+--------+---------+
6 rows in set (0.00 sec)
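A usage note: re-running the same export would insert all the rows again. When the target table has a usable key column, the export tool can update existing rows instead. The following is only a sketch, not from the original post; it assumes name can serve as the key, which is dubious for this particular table since it contains duplicate names:

sqoop export --connect jdbc:mysql://localhost:3306/test --table employee --export-dir /user/xiaosi/employee --username root -m 1 --fields-terminated-by ',' --update-key name --update-mode allowinsert -P

With --update-mode allowinsert, rows that match the key are updated and rows that do not are inserted.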
5. import
Imports data from a database table into HDFS or Hive. The full option list is available via sqoop help import.
sqoop import --connect jdbc:mysql://localhost:3306/test --target-dir /user/xiaosi/data/order_info --query 'select * from order_info where $CONDITIONS' -m 1 --username root -P
The command above imports the result of a query into HDFS, at the location given by --target-dir. Note that --query cannot be combined with --table, and the WHERE clause must contain the literal token $CONDITIONS, which Sqoop replaces at runtime with the split condition for each map task.
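With -m 1, Sqoop simply substitutes a trivially true condition for $CONDITIONS, as visible in the log further down. The token earns its keep in parallel imports; a sketch (the splitting column is an assumption, not from the original post) that spreads the same query across four map tasks:

sqoop import --connect jdbc:mysql://localhost:3306/test --target-dir /user/xiaosi/data/order_info --query 'select * from order_info where $CONDITIONS' --split-by id -m 4 --username root -P

Here Sqoop computes the range of id, divides it into four intervals, and replaces $CONDITIONS with a different range predicate in each mapper. The log below is from the single-mapper command above: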
16/11/14 12:08:50 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
16/11/14 12:08:50 INFO mapreduce.Job: Running job: job_local127577466_0001
16/11/14 12:08:50 INFO mapred.LocalJobRunner: OutputCommitter set in config null
16/11/14 12:08:50 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
16/11/14 12:08:50 INFO mapred.LocalJobRunner: OutputCommitter is org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
16/11/14 12:08:50 INFO mapred.LocalJobRunner: Waiting for map tasks
16/11/14 12:08:50 INFO mapred.LocalJobRunner: Starting task: attempt_local127577466_0001_m_000000_0
16/11/14 12:08:50 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
16/11/14 12:08:50 INFO mapred.Task: Using ResourceCalculatorProcessTree : [ ]
16/11/14 12:08:50 INFO db.DBInputFormat: Using read commited transaction isolation
16/11/14 12:08:50 INFO mapred.MapTask: Processing split: 1=1 AND 1=1
16/11/14 12:08:50 INFO db.DBRecordReader: Working on split: 1=1 AND 1=1
16/11/14 12:08:50 INFO db.DBRecordReader: Executing query: select * from order_info where ( 1=1 ) AND ( 1=1 )
16/11/14 12:08:50 INFO mapreduce.AutoProgressMapper: Auto-progress thread is finished. keepGoing=false
16/11/14 12:08:50 INFO mapred.LocalJobRunner:
16/11/14 12:08:51 INFO mapred.Task: Task:attempt_local127577466_0001_m_000000_0 is done. And is in the process of committing
16/11/14 12:08:51 INFO mapred.LocalJobRunner:
16/11/14 12:08:51 INFO mapred.Task: Task attempt_local127577466_0001_m_000000_0 is allowed to commit now
16/11/14 12:08:51 INFO output.FileOutputCommitter: Saved output of task 'attempt_local127577466_0001_m_000000_0' to hdfs://localhost:9000/user/xiaosi/data/order_info/_temporary/0/task_local127577466_0001_m_000000
16/11/14 12:08:51 INFO mapred.LocalJobRunner: map
16/11/14 12:08:51 INFO mapred.Task: Task 'attempt_local127577466_0001_m_000000_0' done.
16/11/14 12:08:51 INFO mapred.LocalJobRunner: Finishing task: attempt_local127577466_0001_m_000000_0
16/11/14 12:08:51 INFO mapred.LocalJobRunner: map task executor complete.
16/11/14 12:08:51 INFO mapreduce.Job: Job job_local127577466_0001 running in uber mode : false
16/11/14 12:08:51 INFO mapreduce.Job: map 100% reduce 0%
16/11/14 12:08:51 INFO mapreduce.Job: Job job_local127577466_0001 completed successfully
16/11/14 12:08:51 INFO mapreduce.Job: Counters: 20
	File System Counters
		FILE: Number of bytes read=22247784
		FILE: Number of bytes written=22732836
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=0
		HDFS: Number of bytes written=3710
		HDFS: Number of read operations=4
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=3
	Map-Reduce Framework
		Map input records=111
		Map output records=111
		Input split bytes=87
		Spilled Records=0
		Failed Shuffles=0
		Merged Map outputs=0
		GC time elapsed (ms)=0
		Total committed heap usage (bytes)=245366784
	File Input Format Counters
		Bytes Read=0
	File Output Format Counters
		Bytes Written=3710
16/11/14 12:08:51 INFO mapreduce.ImportJobBase: Transferred 3.623 KB in 2.5726 seconds (1.4083 KB/sec)
16/11/14 12:08:51 INFO mapreduce.ImportJobBase: Retrieved 111 records.
We can inspect the imported data on HDFS under the path given by --target-dir:
hadoop fs -text /user/xiaosi/data/order_info/* | less

358574046793404,2016-04-05,flight
358574046794733,2016-08-03,hotel
358574050631177,2016-05-08,vacation
358574050634213,2015-04-28,train
358574050634692,2016-04-05,tuan
358574050650524,2015-07-26,hotel
358574050654773,2015-01-23,flight
358574050668658,2015-01-23,hotel
358574050730771,2016-11-06,train
358574050731241,2016-05-08,car
358574050743865,2015-01-23,vacation
358574050767666,2015-04-28,train
358574050767971,2015-07-26,flight
358574050808288,2016-05-08,hotel
358574050816828,2015-01-23,hotel
358574050818220,2015-04-28,car
358574050821877,2013-08-03,flight
Another example:
sqoop import --connect jdbc:mysql://localhost:3306/test --table order_info --columns "business,id,order_time" -m 1 --username root -P
Because no --target-dir is given, a new directory order_info appears under /user/xiaosi/ on HDFS, named after the source table. Its contents:
flight,358574046793404,2016-04-05
hotel,358574046794733,2016-08-03
vacation,358574050631177,2016-05-08
train,358574050634213,2015-04-28
tuan,358574050634692,2016-04-05
6. import-all-tables
Imports every table of a database into HDFS, with each table stored in its own directory. The full option list is available via sqoop help import-all-tables.
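The original post shows no example for this command; a minimal sketch (the --warehouse-dir path is an assumption) that imports every table of the test database, each into its own subdirectory of the warehouse directory:

sqoop import-all-tables --connect jdbc:mysql://localhost:3306/test --warehouse-dir /user/xiaosi/warehouse -m 1 --username root -P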
7. list-databases
This command lists all database names on a relational database server:
sqoop list-databases --connect jdbc:mysql://localhost:3306 --username root -P

16/11/14 14:30:11 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6
Enter password:
16/11/14 14:30:14 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
information_schema
hive_db
mysql
performance_schema
phpmyadmin
test
8. list-tables
This command lists all table names in a given database:
sqoop list-tables --connect jdbc:mysql://localhost:3306/test --username root -P

16/11/14 14:32:08 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6
Enter password:
16/11/14 14:32:10 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
PageView
book
bookID
cc
city_click
country
country2
cup
employee
flightOrder
hotel_book_info
hotel_info
order_info
stu
stu2
stu3
stuInfo
student
9. merge
This command merges two datasets on HDFS, de-duplicating rows as it goes. The full option list is available via sqoop help merge.
For example, suppose the HDFS path /user/xiaosi/old holds one imported dataset:
id  name
1   a
2   b
3   c
and the HDFS path /user/xiaosi/new holds another dataset, imported after the first:
id  name
1   a2
2   b
3   c
The merged result is then:
id  name
1   a2
2   b
3   c
Run the following command:
sqoop merge --new-data /user/xiaosi/new/part-m-00000 --onto /user/xiaosi/old/part-m-00000 --target-dir /user/xiaosi/final --jar-file /home/xiaosi/test/testmerge.jar --class-name testmerge --merge-key id
Note: within a single dataset, no two rows should share the same primary key, otherwise data loss may occur.
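The --jar-file and --class-name arguments point to a record class describing the dataset. As shown in section 1, such a jar can be produced with codegen; a sketch (the table name testmerge and the paths are assumptions inferred from the merge command above):

sqoop codegen --connect jdbc:mysql://localhost:3306/test --table testmerge --bindir /home/xiaosi/test --outdir /home/xiaosi/test --username root -P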
10. metastore
Stores metadata for Sqoop jobs. If no metastore instance is started, metadata is kept in the default directory ~/.sqoop. To change the storage location, edit the sqoop-site.xml configuration file.
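For example, the storage location and listen port are controlled by the sqoop.metastore.server.location and sqoop.metastore.server.port properties in sqoop-site.xml (a sketch; the path value below is illustrative, not from the original post):

<property>
  <name>sqoop.metastore.server.location</name>
  <!-- illustrative path; point this at a shared filesystem location -->
  <value>/home/xiaosi/sqoop-metastore/shared.db</value>
</property>
<property>
  <name>sqoop.metastore.server.port</name>
  <!-- 16000 is the default, matching the startup log below -->
  <value>16000</value>
</property>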
Start a metastore instance:
sqoop metastore

16/11/14 14:44:40 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6
16/11/14 14:44:40 WARN hsqldb.HsqldbMetaStore: The location for metastore data has not been explicitly set. Placing shared metastore files in /home/xiaosi/.sqoop/shared-metastore.db
[Server@52308be6]: [Thread[main,5,main]]: checkRunning(false) entered
[Server@52308be6]: [Thread[main,5,main]]: checkRunning(false) exited
[Server@52308be6]: [Thread[main,5,main]]: setDatabasePath(0,file:/home/xiaosi/.sqoop/shared-metastore.db)
[Server@52308be6]: [Thread[main,5,main]]: checkRunning(false) entered
[Server@52308be6]: [Thread[main,5,main]]: checkRunning(false) exited
[Server@52308be6]: [Thread[main,5,main]]: setDatabaseName(0,sqoop)
[Server@52308be6]: [Thread[main,5,main]]: putPropertiesFromString(): [hsqldb.write_delay=false]
[Server@52308be6]: [Thread[main,5,main]]: checkRunning(false) entered
[Server@52308be6]: [Thread[main,5,main]]: checkRunning(false) exited
[Server@52308be6]: Initiating startup sequence...
[Server@52308be6]: Server socket opened successfully in 3 ms.
[Server@52308be6]: Database [index=0, id=0, db=file:/home/xiaosi/.sqoop/shared-metastore.db, alias=sqoop] opened sucessfully in 153 ms.
[Server@52308be6]: Startup sequence completed in 157 ms.
[Server@52308be6]: 2016-11-14 14:44:40.414 HSQLDB server 1.8.0 is online
[Server@52308be6]: To close normally, connect and execute SHUTDOWN SQL
[Server@52308be6]: From command line, use [Ctrl]+[C] to abort abruptly
16/11/14 14:44:40 INFO hsqldb.HsqldbMetaStore: Server started on port 16000 with protocol HSQL
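Once the metastore is running, other Sqoop clients can point their job commands at it with --meta-connect (a sketch; the hostname is an assumption) to list the jobs stored there:

sqoop job --list --meta-connect jdbc:hsqldb:hsql://localhost:16000/sqoop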
11. job
This command creates a saved Sqoop job. The job is not executed immediately; it must be run manually later. Its purpose is to make Sqoop commands as reusable as possible. The full option list is available via sqoop help job.
sqoop job --create listTablesJob -- list-tables --connect jdbc:mysql://localhost:3306/test --username root -P
The command above defines a job that lists all tables in the test database.
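Before executing it, you can confirm the job was saved and inspect its definition with the standard job options:

sqoop job --list
sqoop job --show listTablesJob

To actually run the saved job, use --exec: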
sqoop job --exec listTablesJob
The command above executes the job we just defined; the output is:
16/11/14 19:51:44 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6
Enter password:
16/11/14 19:51:47 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
PageView
book
bookID
cc
city_click
country
country2
cup
employee
flightOrder
hotel_book_info
hotel_info
order_info
stu
stu2
stu3
stuInfo
student
Note: there must be a space between -- and list-tables (the Sqoop command the job runs); they cannot be written back to back.
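Saved jobs are most useful for incremental imports, where the metastore remembers the last imported value between runs. A sketch, not from the original post (the check column and job name are assumptions; id is used since the order_info rows shown earlier carry numeric ids):

sqoop job --create orderIncrementalJob -- import --connect jdbc:mysql://localhost:3306/test --table order_info --target-dir /user/xiaosi/data/order_info --incremental append --check-column id --last-value 0 -m 1 --username root -P

Each time this job is executed, Sqoop imports only rows with id greater than the recorded last value, then updates that value in the saved job.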
From the book 《Hadoop海量資料處理 技術詳解與項目實戰》 (Hadoop Massive Data Processing: Techniques Explained with Project Practice).