sqoop导入mysql到hive_sqoop 从mysql直接导入到hive表

mysql的数据库数据过大，做数据分析，需要从 mysql 转向hadoop。

1，遇到的问题

从mysql转数据到hive中，本想用parquet格式，但是一直都没有成功，提示

Hive import and create hive table is not compatible with importing into ParquetFile format.

sqoop不管是mysql直接到hive。还是把mysql导出成parquet文件，然后在把parquet文件，在导入到hive的外部表，都没有成功

存为avro格式也是一样。

2，安装sqoop

下载：http://mirrors.shu.edu.cn/apache/sqoop/1.4.7/

# tar zxvf sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz

# cp -r sqoop-1.4.7.bin__hadoop-2.6.0 /bigdata/sqoop

3，配置sqoop

3.1，配置用户环境变量

# cd ~

# vim .bashrc

export SQOOP_HOME=/bigdata/sqoop

export PATH=$ZOOKEEPER_HOME/bin:$SPARK_HOME/bin:$HIVE_HOME/bin:/bigdata/hadoop/bin:$SQOOP_HOME/bin:$PATH

# source .bashrc

3.2，配置sqoop-env.sh

# vim /bigdata/sqoop/sqoop-env.sh

#Set path to where bin/hadoop is available

export HADOOP_COMMON_HOME=/bigdata/hadoop

#Set path to where hadoop-*-core.jar is available

export HADOOP_MAPRED_HOME=/bigdata/hadoop

#set the path to where bin/hbase is available

#export HBASE_HOME=

#Set the path to where bin/hive is available

export HIVE_HOME=/bigdata/hive

export HIVE_CONF_DIR=/bigdata/hive/conf //要加上，不然会提示hiveconf找不到

#Set the path for where zookeper config dir is

export ZOOCFGDIR=/bigdata/zookeeper/conf

3.3，导入数据

# sqoop import \

--connect jdbc:mysql://10.0.0.237:3306/bigdata \

--username root \

--password ******* \

--table track_app \

-m 1 \

--warehouse-dir /user/hive/warehouse/tanktest.db \

--hive-database tanktest \

--create-hive-table \

--hive-import \

--hive-table track_app

这样就可以导入了，不过导入hive后，在hdfs上面存储的文件格式是文本形势。

hive> describe formatted track_app;

# col_name data_type comment

id int

log_date int

log_time int

user_id int

ticket string

# Detailed Table Information

Database: tanktest

Owner: root

CreateTime: Fri Feb 15 18:08:55 CST 2019

LastAccessTime: UNKNOWN

Retention: 0

Location: hdfs://bigserver1:9000/user/hive/warehouse/tanktest.db/track_app

Table Type: MANAGED_TABLE

Table Parameters:

comment Imported by sqoop on 2019/02/15 18:08:42

numFiles 1

numRows 0

rawDataSize 0

totalSize 208254

transient_lastDdlTime 1550225337

# Storage Information

SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

InputFormat: org.apache.hadoop.mapred.TextInputFormat //文本格式，也可以在hdfs上面，打开文件查看内容

OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

Compressed: No

Num Buckets: -1

Bucket Columns: []

Sort Columns: []

Storage Desc Params:

field.delim \u0001

line.delim \n

serialization.format \u0001

Time taken: 0.749 seconds, Fetched: 56 row(s)

注意：

导致hive表后，通过sql(hive，spark-sql)的方式，看一下，能不能查询到数据。如果查不到数据，说明没有导入成功。

4，sqoop参数

Import和export参数解释

Common arguments:

--connect ：连接RDBMS的jdbc连接字符串，例如：–connect jdbc:mysql:// MYSQL_SERVER:PORT/DBNAME。

--connection-manager ：

--hadoop-home ：

--username ：连接RDBMS所使用的用户名。

--password ：连接RDBMS所使用的密码，明文。

--password-file ：使用文件存储密码。

-p ：交互式连接RDBMS的密码。

Import control arguments:

--append ：追加数据到HDFS已经存在的文件中。

--as-sequencefile ：import序列化的文件。

--as-textfile ：import文本文件，默认。

--columns

：指定列import，逗号分隔，比如：–columns “id,name”。

--delete-target-dir ：删除存在的import目标目录。

--direct ：直连模式，速度更快(HBase不支持)

--split-by ：分割导入任务所使用的字段，需要明确指定，推荐使用主键。

--inline-lob-limit < n > ：设置内联的BLOB对象的大小。

--fetch-size ：一次从数据库读取n个实例，即n条数据。

-e,--query ：构建表达式执行。

--target-dir ：指定HDFS目标存储目录。

--warehouse-dir ：可以指定为-warehouse-dir/user/hive/warehouse/即导入数据的存放路径，如果该路径不存在，会首先创建。

--table ：将要导入到hive的表。

--where ：指定where从句，如果有双引号，注意转义 \$CONDITIONS，不能用or，子查询，join。

-z,--compress ：开启压缩。

--null-string ：string列为空指定为此值。

--null-non-string ：非string列为空指定为此值，-null这两个参数are optional, 如果不设置，会指定为”null”。

--autoreset-to-one-mapper ：如果没有主键和split-by用one mapper import (split-by和此选项不共存)。

-m,--num-mappers ：建立n个并发执行import，默认4个线程。

Incremental import arguments:

--check-column ：Source column to check for incremental change

--incremental ：Define an incremental import of type ‘append’ or ‘lastmodified’

--last-value ：Last imported value in the incremental check column

Hive arguments:

--create-hive-table ：自动推断表字段类型直接建表，hive-overwrite功能可以替代掉了，但Hive里此表不能存在，不然操作会报错。

--hive-database ：指定要把HDFS数据导入到哪个Hive库。

--hive-table ：设置到Hive当中的表名。

--hive-delims-replacement ：导入到hive时用自定义的字符替换掉\n, \r, and \01。

--hive-drop-import-delims ：导入到hive时删除字段中\n, \r，\t and \01等符号；避免字段中有空格导致导入数据被截断。

--hive-home

--hive-import ：将HDFS数据导入到Hive中，会自动创建Hive表，使用hive的默认分隔符。

--hive-overwrite ：对Hive表进行覆盖操作(需配合--hive-import使用，如果Hive里没有表会先创建之)，不然就是追加数据。

--hive-partition-key ：hive分区的key。

--hive-partition-value ：hive分区的值。

--map-column-hive ：类型匹配，SQL类型对应到hive类型。

HBase arguments:

--column-family < family > ：把内容导入到hbase当中，默认是用主键作为split列。

--hbase-create-table ：创建Hbase表。

--hbase-row-key < col > ：指定字段作为row key ，如果输入表包含复合主键，用逗号分隔。

--hbase-table < table-name > ：指定hbase表。