
Hive Big Data Project in Practice

With the Hadoop and Hive environments in place, we can start using Hive for database work. Hive provides HQL (SQL-like) statements, and the basic workflow is much like MySQL's; the difference is that aggregation operations in Hive are executed by Hadoop's underlying MapReduce jobs.

Below, taking a game company's analytics needs (games, users, and so on) as an example, we use Hive to produce statistics such as per-game activity and user usage.

(1) Generating the data

Since real data from a game company is hard to come by, we simply build our own with Python scripts. Hadoop and Hive are installed on CentOS, which ships with Python 2.7 by default, so the scripts can be written and run directly on CentOS.

Building the user data: simulate 1,000 users.

import random

def getUser():
    location_list = ['BJ', 'SH', 'TJ', 'GZ', 'SZ']
    fd = open('userinfo.txt', 'w+')
    for i in range(1000):
        userid = str(1000 + i)
        age = str(random.randrange(10, 40))
        area = random.choice(location_list)
        user_money = str(i)
        str_tmp = userid + ',' + age + ',' + area + ',' + user_money + '\n'
        fd.write(str_tmp)
    fd.close()

if __name__ == '__main__':
    getUser()

Building the game info: simulate 4 games in total.

def getGame():
    game_list = ['LandLord', 'Buffle', 'Farm', 'PuQ']
    fd = open('gameinfo.txt', 'w+')
    for i in range(4):
        gameid = str(i)
        gamename = game_list[i]
        str_tmp = gameid + ',' + gamename + '\n'
        fd.write(str_tmp)
    fd.close()

if __name__ == '__main__':
    getGame()

Building the play-time data: simulate 10 days of records of how long each user plays.

import os
import random
import datetime

def gameTime():
    game_list = [0, 1, 2, 3]
    time_list = [10, 15, 20, 20, 50, 60, 90]

    for j in range(10):
        fdate = (datetime.datetime.now() + datetime.timedelta(days=j)).strftime('%Y-%m-%d')
        # one sub-directory per day, e.g. gametime/2020-02-09/
        if not os.path.exists('gametime/{}'.format(fdate)):
            os.makedirs('gametime/{}'.format(fdate))
        fd = open('gametime/{}/gametime_{}.txt'.format(fdate, fdate), 'w+')
        for i in range(1000):
            userid = str(1000 + i)
            gameid = str(random.choice(game_list))
            gametime = str(random.choice(time_list))
            str_tmp = fdate + ',' + userid + ',' + gameid + ',' + gametime + '\n'
            fd.write(str_tmp)
        fd.close()

if __name__ == '__main__':
    gameTime()

Building the user fee data: simulate each user's spend for the same 10 days.

import os
import random
import datetime

def userFee():
    game_list = [0, 1, 2, 3]
    money_list = [10, 15, 30, 27, 55, 66, 90]

    # create one sub-directory per day, e.g. userfee/2020-02-09/
    for j in range(10):
        fdate = (datetime.datetime.now() + datetime.timedelta(days=j)).strftime('%Y-%m-%d')
        if not os.path.exists('userfee/{}'.format(fdate)):
            os.makedirs('userfee/{}'.format(fdate))

    for j in range(10):
        fdate = (datetime.datetime.now() + datetime.timedelta(days=j)).strftime('%Y-%m-%d')
        fd = open('userfee/{}/userfee_{}.txt'.format(fdate, fdate), 'w+')
        for i in range(1000):
            userid = str(1000 + i)
            gameid = str(random.choice(game_list))
            fee = str(random.choice(money_list))
            str_tmp = fdate + ',' + userid + ',' + gameid + ',' + fee + '\n'
            fd.write(str_tmp)
        fd.close()

if __name__ == '__main__':
    userFee()

(2) Storing the data in Hive

The data above lives in local files; the next step is to store it on Hadoop, which we do through Hive by creating databases and tables and then loading the data into them.

First create the following directories in HDFS; they will be used as the Hive storage locations.

[hadoop@master ~]$ hdfs dfs -mkdir /stat
[hadoop@master ~]$ hdfs dfs -mkdir /stat/data/
[hadoop@master ~]$ hdfs dfs -mkdir /stat/data/gameinfo
[hadoop@master ~]$ hdfs dfs -mkdir /stat/data/gametime
[hadoop@master ~]$ hdfs dfs -mkdir /stat/data/userinfo
[hadoop@master ~]$ hdfs dfs -mkdir /stat/data/userfee

           

Then enter the hive shell and create a stat database to hold the generated data, plus an analysis database to hold the statistics that Hive will produce.

hive> create database stat;
hive> create database analysis;
           

Now create the four tables gameinfo, gametime, userinfo and userfee:

hive > use stat;
hive > create table if not EXISTS gameinfo ( fgameid int, fgamename string)
     > row format delimited fields terminated by ','
     > location '/stat/data/gameinfo';
hive > create table if not EXISTS userinfo ( fuserid int, fage int, farea string, fmoney int)
     > row format delimited fields terminated by ','
     > location '/stat/data/userinfo';
hive > create table if not EXISTS gametime ( fdate string, fuserid int, fgameid int, fgametime int)
     > partitioned by (dt string)
     > row format delimited fields terminated by ','
     > location '/stat/data/gametime';
hive > create table if not EXISTS userfee ( fdate string, fuserid int, fgameid int, ffee int)
     > partitioned by (dt string)
     > row format delimited fields terminated by ','
     > location '/stat/data/userfee';
           

Note that the columns must line up with the data generated in step (1) so that it loads correctly; the names used here (fgameid, fdate and so on) are the ones referenced by the analysis queries later. gametime and userfee are partitioned tables: each day's data is stored separately, so the dt partition column is declared when the tables are created.

With the tables and columns defined, the data can be loaded.

如下将userinfo和gameinfo的資料存入hive:

hive > load data local inpath 'userinfo.txt' into table stat.userinfo ; 
hive > load data local inpath 'gameinfo.txt' into table stat.gameinfo ;
           

The two partitioned tables are populated instead by registering each day's directory with alter table ... add partition:

hive > alter table stat.gametime add if not EXISTS partition (dt = '2020-02-09')
     > location '/stat/data/gametime/2020-02-09/';
hive > alter table stat.gametime add if not EXISTS partition (dt = '2020-02-10')
     > location '/stat/data/gametime/2020-02-10/';
hive > alter table stat.gametime add if not EXISTS partition (dt = '2020-02-11')
     > location '/stat/data/gametime/2020-02-11/';
hive > alter table stat.gametime add if not EXISTS partition (dt = '2020-02-12')
     > location '/stat/data/gametime/2020-02-12/';
hive > alter table stat.gametime add if not EXISTS partition (dt = '2020-02-13')
     > location '/stat/data/gametime/2020-02-13/';
hive > alter table stat.gametime add if not EXISTS partition (dt = '2020-02-14')
     > location '/stat/data/gametime/2020-02-14/';
hive > alter table stat.gametime add if not EXISTS partition (dt = '2020-02-15')
     > location '/stat/data/gametime/2020-02-15/';
hive > alter table stat.gametime add if not EXISTS partition (dt = '2020-02-16')
     > location '/stat/data/gametime/2020-02-16/';
hive > alter table stat.gametime add if not EXISTS partition (dt = '2020-02-17')
     > location '/stat/data/gametime/2020-02-17/';

hive > alter table stat.userfee add if not EXISTS partition (dt = '2020-02-09')
     > location '/stat/data/userfee/2020-02-09/';
hive > alter table stat.userfee add if not EXISTS partition (dt = '2020-02-10')
     > location '/stat/data/userfee/2020-02-10/';
hive > alter table stat.userfee add if not EXISTS partition (dt = '2020-02-11')
     > location '/stat/data/userfee/2020-02-11/';
hive > alter table stat.userfee add if not EXISTS partition (dt = '2020-02-12')
     > location '/stat/data/userfee/2020-02-12/';
hive > alter table stat.userfee add if not EXISTS partition (dt = '2020-02-13')
     > location '/stat/data/userfee/2020-02-13/';
hive > alter table stat.userfee add if not EXISTS partition (dt = '2020-02-14')
     > location '/stat/data/userfee/2020-02-14/';
hive > alter table stat.userfee add if not EXISTS partition (dt = '2020-02-15')
     > location '/stat/data/userfee/2020-02-15/';
hive > alter table stat.userfee add if not EXISTS partition (dt = '2020-02-16')
     > location '/stat/data/userfee/2020-02-16/';
hive > alter table stat.userfee add if not EXISTS partition (dt = '2020-02-17')
     > location '/stat/data/userfee/2020-02-17/';
           

This makes the generated play-time and fee data visible in Hive (the per-day files from step (1) must of course first be uploaded into the corresponding HDFS directories, for example with hdfs dfs -put, since the partition locations only point Hive at those paths). You can see, though, that driving these statements through the Hive shell is fairly clumsy: the statements differ only in the date. It would be much more efficient to pass the date in as a parameter from a script. That requires starting the hiveserver2 process and going through JDBC or beeline; in particular, the Java API or Python can be used to access Hive remotely, and managing the Hive warehouse from scripts is considerably more efficient.
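
For illustration only, here is a minimal Python sketch of that idea, issuing the partition statements in a loop through hiveserver2. It assumes hiveserver2 is running on the master node at its default port 10000 and that the PyHive package is installed; neither is set up anywhere in this article.

# Sketch only: register the daily partitions via hiveserver2 (assumes PyHive and a running hiveserver2)
from pyhive import hive

conn = hive.Connection(host='master', port=10000, username='hadoop')  # host and user are assumptions
cur = conn.cursor()
for day in range(9, 19):                      # the ten dates 2020-02-09 .. 2020-02-18
    dt = '2020-02-{:02d}'.format(day)
    for table in ('gametime', 'userfee'):     # both partitioned tables use the same directory layout
        cur.execute(
            "alter table stat.{0} add if not exists partition (dt='{1}') "
            "location '/stat/data/{0}/{1}/'".format(table, dt)
        )
conn.close()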

The resulting HDFS directory structure looks like this:

[Figure: the /stat/data directory tree in HDFS]

Back in Hive, check that the data loaded correctly:

hive> select * from stat.gameinfo;
OK
0       LandLord
1       Buffle
2       Farm
3       PuQ
Time taken: 0.212 seconds, Fetched: 4 row(s)
hive> select * from stat.gametime where dt='2020-02-09' limit 1;
OK
2020-02-09      1000    1       60      2020-02-09
Time taken: 0.498 seconds, Fetched: 1 row(s)
           

(3) Statistical analysis with Hive

From the four business tables we now want to extract useful patterns. For example, joining the game info table with the play-time table gives each game's daily activity (how many users played each game on each day, which tells us the most popular games to recommend); joining the user info table with the fee table gives spending by age, i.e. which age group the games appeal to most; and joining the game info table with the fee table gives each game's revenue (a sketch of this last query appears at the end of this section). The HQL for this kind of analysis, entered in the hive shell with its aggregations and join conditions, is handed off automatically to the underlying MapReduce engine, and the results go into the second database, analysis.

First, the HQL that computes game activity. (The target table analysis.gameactive is not created in the listings above; it must already exist, partitioned by dt, with columns matching the select list.)

hql = '''insert overwrite table analysis.gameactive partition (dt='2020-02-17')
select gt.fdate as fdate, gi.fgamename as fgamename, count(gt.fuserid) as fcount
from stat.gametime gt, stat.gameinfo as gi
where gt.fgameid = gi.fgameid and fdate = '2020-02-17'
group by gt.fdate, gi.fgamename'''

When more tables are involved, the same query can also be written with an explicit join:

insert overwrite table analysis.gameactive partition (dt='2020-02-17') 
select gt.fdate as fdate,gi.fgamename as fgamename,count(gt.fuserid) as fcount 
from stat.gametime gt 
left join stat.gameinfo gi 
on gt.fgameid = gi.fgameid 
where fdate ='2020-02-17' 
group by gt.fdate,gi.fgamename;
           

The statement uses a left join (left join ... on ...): the on clause gives the join condition and is followed by a where clause with the filter. When using group by, remember that every non-aggregated column in the select list must appear in the group by clause.

The same statement can be entered directly at the hive shell prompt:

hive > insert overwrite table analysis.gameactive partition (dt='2020-02-17')
     > select gt.fdate as fdate, gi.fgamename as fgamename, count(gt.fuserid) as fcount
     > from stat.gametime gt, stat.gameinfo as gi
     > where gt.fgameid = gi.fgameid and fdate = '2020-02-17'
     > group by gt.fdate, gi.fgamename;
           

執行結束後,依次将時間修改成産生資料的10天時間,然後可以查詢結果如下:

hive> select * from analysis.gameactive;
OK
2020-02-09      Buffle  268     2020-02-09
2020-02-09      Farm    221     2020-02-09
2020-02-09      LandLord        256     2020-02-09
2020-02-09      PuQ     255     2020-02-09
2020-02-10      Buffle  223     2020-02-10
2020-02-10      Farm    258     2020-02-10
2020-02-10      LandLord        255     2020-02-10
2020-02-10      PuQ     264     2020-02-10
2020-02-11      Buffle  235     2020-02-11
2020-02-11      Farm    259     2020-02-11
2020-02-11      LandLord        246     2020-02-11
2020-02-11      PuQ     260     2020-02-11
2020-02-12      Buffle  253     2020-02-12
2020-02-12      Farm    241     2020-02-12
2020-02-12      LandLord        266     2020-02-12
2020-02-12      PuQ     240     2020-02-12
2020-02-13      Buffle  222     2020-02-13
2020-02-13      Farm    273     2020-02-13
2020-02-13      LandLord        252     2020-02-13
2020-02-13      PuQ     253     2020-02-13
2020-02-14      Buffle  253     2020-02-14
2020-02-14      Farm    257     2020-02-14
2020-02-14      LandLord        245     2020-02-14
2020-02-14      PuQ     245     2020-02-14
2020-02-15      Buffle  251     2020-02-15
2020-02-15      Farm    239     2020-02-15
2020-02-15      LandLord        250     2020-02-15
2020-02-15      PuQ     260     2020-02-15
2020-02-16      Buffle  261     2020-02-16
2020-02-16      Farm    263     2020-02-16
2020-02-16      LandLord        231     2020-02-16
2020-02-16      PuQ     245     2020-02-16
2020-02-17      Buffle  242     2020-02-17
2020-02-17      Farm    223     2020-02-17
2020-02-17      LandLord        261     2020-02-17
2020-02-17      PuQ     274     2020-02-17
2020-02-18      Buffle  271     2020-02-18
2020-02-18      Farm    253     2020-02-18
2020-02-18      LandLord        226     2020-02-18
2020-02-18      PuQ     250     2020-02-18
Time taken: 0.523 seconds, Fetched: 40 row(s)
           

So for the full ten days we now have the daily number of players for each of the four games.

Next, statistics on players' age groups (again, the target table analysis.gameuserage, partitioned by dt, must already exist), using the following HQL:

hive > insert overwrite table analysis.gameuserage partition (dt='2020-02-11')
     > select gt.fdate as fdate, sum(gt.fgametime) as fgametime, ui.fage as fage
     > from stat.gametime gt, stat.userinfo as ui
     > where gt.fuserid = ui.fuserid and fdate = '2020-02-11'
     > group by ui.fage, gt.fdate;
           

Run it for each day in turn; the results look like this (the first column is the date, the second the total play time, the third the player age, and the last column the dt partition; since ages were generated between 10 and 39, play time is totalled for each of these 30 age groups):

2020-02-18      1695    10      2020-02-18
2020-02-18      1350    11      2020-02-18
2020-02-18      1325    12      2020-02-18
2020-02-18      990     13      2020-02-18
2020-02-18      1355    14      2020-02-18
2020-02-18      1420    15      2020-02-18
2020-02-18      905     16      2020-02-18
2020-02-18      1140    17      2020-02-18
2020-02-18      1580    18      2020-02-18
2020-02-18      1085    19      2020-02-18
2020-02-18      1350    20      2020-02-18
2020-02-18      1525    21      2020-02-18
2020-02-18      1285    22      2020-02-18
2020-02-18      1105    23      2020-02-18
2020-02-18      1035    24      2020-02-18
2020-02-18      1185    25      2020-02-18
2020-02-18      975     26      2020-02-18
2020-02-18      1625    27      2020-02-18
2020-02-18      1370    28      2020-02-18
2020-02-18      1485    29      2020-02-18
2020-02-18      930     30      2020-02-18
2020-02-18      1390    31      2020-02-18
2020-02-18      1250    32      2020-02-18
2020-02-18      1005    33      2020-02-18
2020-02-18      1250    34      2020-02-18
2020-02-18      1280    35      2020-02-18
2020-02-18      970     36      2020-02-18
2020-02-18      1170    37      2020-02-18
2020-02-18      1015    38      2020-02-18
2020-02-18      1015    39      2020-02-18
Time taken: 0.446 seconds, Fetched: 270 row(s)
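
The third analysis mentioned at the start of this section, per-game revenue, is not worked through in the article. As a sketch only, written as a Python string in the same style as the gameactive statement above, it might look like the following; the column names come from the table definitions in step (2), while the query itself and the alias ftotalfee are assumptions rather than the author's code.

# Sketch only: per-game revenue for one day, joining userfee with gameinfo
revenue_hql = '''select uf.fdate as fdate, gi.fgamename as fgamename, sum(uf.ffee) as ftotalfee
from stat.userfee uf
join stat.gameinfo gi on uf.fgameid = gi.fgameid
where uf.fdate = '2020-02-17'
group by uf.fdate, gi.fgamename'''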
           

(4) Using the sqoop tool

sqoop is a bridge tool for exchanging data between an RDBMS and HDFS. Installing it is relatively simple, although when I tried the sqoop 1.99 release its verification kept failing no matter which environment variables and dependency packages I set up, so I fell back to sqoop 1.4.

1. Installation and configuration

The sqoop 1.4 tarball can be downloaded from any mirror and unpacked on CentOS; since the unpacked directory name is long, it is convenient to rename it.

[hadoop@master ~]$ tar -xvf sqoop-1.4.6-cdh5.14.0.tar.gz
[hadoop@master ~]$ mv sqoop-1.4.6-cdh5.14.0 sqoop1.4.6
           

Then set the environment variables:

[hadoop@master ~]$ vi ~/.bashrc
           

Add the SQOOP_HOME path:

export SQOOP_HOME=/home/hadoop/sqoop1.4.6
export PATH=$PATH:$SQOOP_HOME/bin
           

Next, go into the conf directory under the sqoop folder and set up sqoop-env.sh:

[hadoop@master conf]$ mv sqoop-env-template.sh sqoop-env.sh
[hadoop@master conf]$ vi sqoop-env.sh
           

In this file, point sqoop at the Hadoop and Hive installations:

#Set path to where bin/hadoop is available
export HADOOP_COMMON_HOME=/home/hadoop/hadoop-3.1.2

#Set path to where hadoop-*-core.jar is available
export HADOOP_MAPRED_HOME=/home/hadoop/hadoop-3.1.2

#set the path to where bin/hbase is available
#export HBASE_HOME=

#Set the path to where bin/hive is available
export HIVE_HOME=/home/hadoop/hive-3.1.2-bin

#Set the path for where zookeper config dir is
#export ZOOCFGDIR=

Since HBase and ZooKeeper are not used here, their paths are left unset.

That completes the environment setup. Two more jar files are needed: the MySQL JDBC driver and a Java JSON library. Both can be downloaded from the internet and dropped into sqoop's lib directory:

[hadoop@master lib]$ ll java-json.jar
-rw-r--r--. 1 hadoop hadoop 84697 Feb 13 14:35 java-json.jar
[hadoop@master lib]$ ll mysql-connector-java-8.0.16.jar
-rw-r--r--. 1 hadoop hadoop 2293144 Feb 13 12:05 mysql-connector-java-8.0.16.jar
           

2. Getting started

sqoop's syntax looks fairly involved, but it is quite formulaic; follow the pattern and it runs without trouble.

First, test the installation by running sqoop help from the current user's home directory:

[hadoop@master ~]$ sqoop help
Warning: /home/hadoop/sqoop1.4.6/../hbase does not exist! HBase imports will fail.
Please set $HBASE_HOME to the root of your HBase installation.
Warning: /home/hadoop/sqoop1.4.6/../hcatalog does not exist! HCatalog jobs will fail.
Please set $HCAT_HOME to the root of your HCatalog installation.
Warning: /home/hadoop/sqoop1.4.6/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
Warning: /home/hadoop/sqoop1.4.6/../zookeeper does not exist! Accumulo imports will fail.
Please set $ZOOKEEPER_HOME to the root of your Zookeeper installation.
2020-02-13 14:56:14,436 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6-cdh5.14.0
usage: sqoop COMMAND [ARGS]

Available commands:
  codegen            Generate code to interact with database records
  create-hive-table  Import a table definition into Hive
  eval               Evaluate a SQL statement and display the results
  export             Export an HDFS directory to a database table
  help               List available commands
  import             Import a table from a database to HDFS
  import-all-tables  Import tables from a database to HDFS
  import-mainframe   Import datasets from a mainframe server to HDFS
  job                Work with saved jobs
  list-databases     List available databases on a server
  list-tables        List available tables in a database
  merge              Merge results of incremental imports
  metastore          Run a standalone Sqoop metastore
  version            Display version information

See 'sqoop help COMMAND' for information on a specific command.
           

The help output shows sqoop's two main operations, import and export: a MySQL table can be imported into HDFS, and an HDFS directory can be exported back into MySQL as a table. There is also create-hive-table for creating a Hive table directly from a table definition.

例如先将mysql中建立一個資料庫名為sqoop,然後建立一個資料表為gameactive,插入兩行記錄如下:

mysql> select * from gameactive;
Empty set (0.01 sec)

mysql> insert into gameactive values('2020-02-13','puke',100),('2020-02-13','dizhu',150);
Query OK, 2 rows affected (0.14 sec)
Records: 2  Duplicates: 0  Warnings: 0

mysql> select * from gameactive;
+------------+-----------+--------+
| fdate      | fgamename | fcount |
+------------+-----------+--------+
| 2020-02-13 | puke      |    100 |
| 2020-02-13 | dizhu     |    150 |
+------------+-----------+--------+
2 rows in set (0.00 sec)
           

Then run a sqoop command that connects MySQL and Hadoop and copies this table into HDFS:

[hadoop@master ~]$ sqoop import --connect jdbc:mysql://master:3306/sqoop --username root --password Root-123 --table gameactive --target-dir  /stat/test --delete-target-dir --num-mappers 1 --fields-terminated-by ','
           

The command breaks down as follows:

sqoop import \     use the import command

--connect jdbc:mysql://master:3306/sqoop    connect to MySQL over JDBC; the host is master, the port 3306, and sqoop is the database name in MySQL

--username root   the MySQL user to connect as

--password Root-123   the password for that MySQL user

--table gameactive   read the gameactive table from the sqoop database

--target-dir /stat/test   write the table contents to the /stat/test directory in HDFS

--delete-target-dir   if the target directory already exists, delete it first

--num-mappers 1   use one MapReduce map task

--fields-terminated-by ','   separate fields with commas

Running the command starts a MapReduce job in Hadoop to do the transfer:

[hadoop@master ~]$ sqoop import --connect jdbc:mysql://master:3306/sqoop --username root --password Root-123 --table gameactive --target-dir  /stat/test --delete-target-dir --num-mappers 1 --fields-terminated-by ','
Warning: /home/hadoop/sqoop1.4.6/../hbase does not exist! HBase imports will fail.
Please set $HBASE_HOME to the root of your HBase installation.
Warning: /home/hadoop/sqoop1.4.6/../hcatalog does not exist! HCatalog jobs will fail.
Please set $HCAT_HOME to the root of your HCatalog installation.
Warning: /home/hadoop/sqoop1.4.6/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
Warning: /home/hadoop/sqoop1.4.6/../zookeeper does not exist! Accumulo imports will fail.
Please set $ZOOKEEPER_HOME to the root of your Zookeeper installation.
2020-02-13 14:38:17,349 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6-cdh5.14.0
2020-02-13 14:38:17,386 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
2020-02-13 14:38:17,526 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
2020-02-13 14:38:17,529 INFO tool.CodeGenTool: Beginning code generation
Loading class `com.mysql.jdbc.Driver'. This is deprecated. The new driver class is `com.mysql.cj.jdbc.Driver'. The driver is automatically registered via the SPI and manual loading of the driver class is generally unnecessary.
2020-02-13 14:38:18,426 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `gameactive` AS t LIMIT 1
2020-02-13 14:38:18,525 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `gameactive` AS t LIMIT 1
2020-02-13 14:38:18,534 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /home/hadoop/hadoop-3.1.2
Note: /tmp/sqoop-hadoop/compile/3acb0490ad7cb85df80adb8d2b955e47/gameactive.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
2020-02-13 14:38:20,418 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-hadoop/compile/3acb0490ad7cb85df80adb8d2b955e47/gameactive.jar
2020-02-13 14:38:21,378 INFO tool.ImportTool: Destination directory /stat/test is not present, hence not deleting.
2020-02-13 14:38:21,378 WARN manager.MySQLManager: It looks like you are importing from mysql.
2020-02-13 14:38:21,378 WARN manager.MySQLManager: This transfer can be faster! Use the --direct
2020-02-13 14:38:21,378 WARN manager.MySQLManager: option to exercise a MySQL-specific fast path.
2020-02-13 14:38:21,378 INFO manager.MySQLManager: Setting zero DATETIME behavior to convertToNull (mysql)
2020-02-13 14:38:21,386 INFO mapreduce.ImportJobBase: Beginning import of gameactive
2020-02-13 14:38:21,387 INFO Configuration.deprecation: mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
2020-02-13 14:38:21,456 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
2020-02-13 14:38:21,491 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
2020-02-13 14:38:21,768 INFO client.RMProxy: Connecting to ResourceManager at master/192.168.58.159:8032
2020-02-13 14:38:22,841 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/hadoop/.staging/job_1581564243812_0001
2020-02-13 14:38:54,794 INFO db.DBInputFormat: Using read commited transaction isolation
2020-02-13 14:38:55,013 INFO mapreduce.JobSubmitter: number of splits:1
2020-02-13 14:38:55,632 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1581564243812_0001
2020-02-13 14:38:55,634 INFO mapreduce.JobSubmitter: Executing with tokens: []
2020-02-13 14:38:56,023 INFO conf.Configuration: resource-types.xml not found
2020-02-13 14:38:56,023 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
2020-02-13 14:38:56,742 INFO impl.YarnClientImpl: Submitted application application_1581564243812_0001
2020-02-13 14:38:56,923 INFO mapreduce.Job: The url to track the job: http://master:8088/proxy/application_1581564243812_0001/
2020-02-13 14:38:56,924 INFO mapreduce.Job: Running job: job_1581564243812_0001
2020-02-13 14:39:34,699 INFO mapreduce.Job: Job job_1581564243812_0001 running in uber mode : false
2020-02-13 14:39:34,702 INFO mapreduce.Job:  map 0% reduce 0%
2020-02-13 14:39:48,428 INFO mapreduce.Job:  map 100% reduce 0%
2020-02-13 14:39:49,482 INFO mapreduce.Job: Job job_1581564243812_0001 completed successfully
2020-02-13 14:39:49,615 INFO mapreduce.Job: Counters: 32
        File System Counters
                FILE: Number of bytes read=0
                FILE: Number of bytes written=234015
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=87
                HDFS: Number of bytes written=41
                HDFS: Number of read operations=6
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=2
        Job Counters 
                Launched map tasks=1
                Other local map tasks=1
                Total time spent by all maps in occupied slots (ms)=21786
                Total time spent by all reduces in occupied slots (ms)=0
                Total time spent by all map tasks (ms)=10893
                Total vcore-milliseconds taken by all map tasks=10893
                Total megabyte-milliseconds taken by all map tasks=22308864
        Map-Reduce Framework
                Map input records=2
                Map output records=2
                Input split bytes=87
                Spilled Records=0
                Failed Shuffles=0
                Merged Map outputs=0
                GC time elapsed (ms)=404
                CPU time spent (ms)=2520
                Physical memory (bytes) snapshot=124497920
                Virtual memory (bytes) snapshot=3604873216
                Total committed heap usage (bytes)=40763392
                Peak Map Physical memory (bytes)=124497920
                Peak Map Virtual memory (bytes)=3604873216
        File Input Format Counters 
                Bytes Read=0
        File Output Format Counters 
                Bytes Written=41
2020-02-13 14:39:49,631 INFO mapreduce.ImportJobBase: Transferred 41 bytes in 88.1291 seconds (0.4652 bytes/sec)
2020-02-13 14:39:49,646 INFO mapreduce.ImportJobBase: Retrieved 2 records.
           

When it finishes, the result can be inspected in the web UI, or directly with hdfs commands on the /stat/test directory:

[hadoop@master ~]$ hdfs dfs -ls /stat/test
Found 2 items
-rw-r--r--   1 hadoop supergroup          0 2020-02-13 14:39 /stat/test/_SUCCESS
-rw-r--r--   1 hadoop supergroup         41 2020-02-13 14:39 /stat/test/part-m-00000
[hadoop@master ~]$ hdfs dfs -cat /stat/test/part-m-00000
2020-02-13,puke,100
2020-02-13,dizhu,150
           

The contents match the rows created in MySQL exactly, so the import from MySQL into HDFS works.

Going the other way, loading HDFS data into MySQL, is done with the export command.

First create a table getData on the MySQL side:

mysql> use sqoop
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
mysql> create table getData(fdate varchar(64),fgamename varchar(64),fcount int);
Query OK, 0 rows affected (0.78 sec)

mysql> show tables;
+-----------------+
| Tables_in_sqoop |
+-----------------+
| gameactive      |
| getData         |
+-----------------+
2 rows in set (0.00 sec)
           

Then, back on the sqoop side, we push the data previously imported from MySQL into HDFS back into MySQL. That data sits under /stat/test in HDFS, so the sqoop command is:

[hadoop@master ~]$ sqoop export --connect jdbc:mysql://master:3306/sqoop --username root --password Root-123 --table getData -m 1 --export-dir '/stat/test' --fields-terminated-by ','

In detail:

sqoop export \     use the export command

--connect jdbc:mysql://master:3306/sqoop    connect to MySQL over JDBC; the host is master, the port 3306, and sqoop is the database name in MySQL

--username root   the MySQL user to connect as

--password Root-123   the password for that MySQL user

--table getData   write into the getData table in the sqoop database

--export-dir '/stat/test'   export the contents of the HDFS directory /stat/test into the MySQL table

-m 1   short for --num-mappers 1, i.e. use one map task

--fields-terminated-by ','   fields in the HDFS files are comma-separated

When it completes, query the table in MySQL:

mysql> select * from getData;
+------------+-----------+--------+
| fdate      | fgamename | fcount |
+------------+-----------+--------+
| 2020-02-13 | puke      |    100 |
| 2020-02-13 | dizhu     |    150 |
+------------+-----------+--------+
2 rows in set (0.04 sec)
           

As you can see, the data has been written into MySQL, completing the round trip between HDFS and MySQL.
