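I. Flume Overview

1. Definition

Flume is a highly available, highly reliable, distributed system provided by Cloudera for collecting, aggregating, and transporting massive volumes of log data. It is built on a streaming architecture and is flexible and simple to use.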

2. Flume Basic Architecture

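(1) Agent

An Agent is a JVM process that moves data from a source to a destination in the form of events.

An Agent consists of three main components: Source, Channel, and Sink.

(2) Source

The Source is the component that receives data into the Flume Agent. Source components can handle log data of many types and formats, including avro, thrift, exec, jms, spooling directory, netcat, sequence generator, syslog, http, and legacy.

(3) Sink

The Sink continuously polls the Channel for events, removes them in batches, and writes those batches to a storage or indexing system or forwards them to another Flume Agent.

Sink destinations include hdfs, logger, avro, thrift, ipc, file, HBase, solr, and custom sinks.

(4) Channel

The Channel is a buffer that sits between the Source and the Sink, which allows the Source and the Sink to run at different rates. The Channel is thread-safe and can handle writes from several Sources and reads from several Sinks at the same time.

The two built-in Channels covered here are Memory Channel and File Channel.

Memory Channel is an in-memory queue.

Memory Channel is suitable when data loss is acceptable. If data loss matters, Memory Channel should not be used, because a process crash, a machine failure, or a restart will lose the data it holds.

File Channel writes all events to disk, so no data is lost when the process shuts down or the machine goes down.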
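
As a rough sketch of how this choice shows up in an agent configuration, the shell snippet below writes a File Channel definition to a scratch file. The property names `checkpointDir` and `dataDirs` come from the Flume User Guide; the agent name `a1`, the channel name `c1`, and both directory paths are assumptions chosen for illustration.

```shell
#!/usr/bin/env bash
# Write a hypothetical File Channel fragment to a scratch file.
# The paths under /opt/module/flume are assumptions, not values from this article.
conf_file="$(mktemp)"
cat > "$conf_file" <<'EOF'
a1.channels = c1
# Persist events on disk instead of in memory
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /opt/module/flume/checkpoint
a1.channels.c1.dataDirs = /opt/module/flume/data
EOF

# Read the channel type back out of the fragment.
channel_type="$(grep -E '^a1\.channels\.c1\.type' "$conf_file" | awk -F' = ' '{print $2}')"
echo "channel type: $channel_type"
```

Swapping `file` for `memory` (and dropping the two directory lines) gives the in-memory variant used in the examples of this article.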

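(5) Event

The Event is the basic unit of data transfer in Flume; data travels from source to destination as Events. An Event consists of a Header and a Body. The Header stores attributes of the event as key-value pairs, and the Body stores the data itself as a byte array.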

II. Getting Started with Flume

1. Installing and Deploying Flume

(1) Download locations

Flume website: http://flume.apache.org/

Documentation: http://flume.apache.org/FlumeUserGuide.html

Downloads: http://archive.apache.org/dist/flume/

(2) Installation and deployment

(1) Upload apache-flume-1.9.0-bin.tar.gz to the /opt/software directory on the Linux host.

(2) Extract apache-flume-1.9.0-bin.tar.gz into /opt/module/.

tar -zxvf /opt/software/apache-flume-1.9.0-bin.tar.gz -C /opt/module/

(3) Rename apache-flume-1.9.0-bin to flume.

mv /opt/module/apache-flume-1.9.0-bin /opt/module/flume

(4) Delete guava-11.0.2.jar from the lib folder for compatibility with Hadoop 3.1.3.

rm /opt/module/flume/lib/guava-11.0.2.jar
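
If the installation steps are scripted, the last step can be made safe to re-run. The sketch below is one possible way to do that; the function name `remove_bundled_guava` and the `FLUME_HOME` variable are hypothetical helpers introduced here, defaulting to the install path used above.

```shell
#!/usr/bin/env bash
# Remove the bundled Guava jar only if it is present, so the step can be repeated.
# remove_bundled_guava and FLUME_HOME are illustrative names, not part of Flume.
remove_bundled_guava() {
  local lib_dir="$1"
  local jar="$lib_dir/guava-11.0.2.jar"
  if [ -f "$jar" ]; then
    rm "$jar"
    echo "removed $jar"
  else
    echo "nothing to remove in $lib_dir"
  fi
}

remove_bundled_guava "${FLUME_HOME:-/opt/module/flume}/lib"
```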

2. Flume Getting-Started Examples

(1) Official example: monitoring data on a port

1) Requirements:

Use Flume to listen on a port, collect the data sent to that port, and print it to the console.

2) Analysis:

3) Implementation steps:

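(1) Install the netcat tool

[atguigu@hadoop102 software]$ sudo yum install -y nc

(2) Check whether port 44444 is already in use

[atguigu@hadoop102 flume-telnet]$ sudo netstat -tunlp | grep 44444

(3) Create the configuration file

(The configuration file can have any name, as long as the same name is used in the command that starts the agent.)

Create a job folder under the flume directory and change into it.

[atguigu@hadoop102 flume]$ mkdir job
[atguigu@hadoop102 flume]$ cd job/

Create the Flume Agent configuration file netcat-flume-logger.conf in the job folder.

[atguigu@hadoop102 job]$ vim netcat-flume-logger.conf

Add the following content to netcat-flume-logger.conf.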

# Name the components on this agent (a1 is the agent name and can be chosen freely)
# Name the source, sink, and channel
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
# Set the source type, plus the host and port to bind to
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Describe the sink
# Set the sink type; logger prints events to the console as log output
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
# Memory channel holding up to 1000 events, with at most 100 events per transaction
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
# Connect r1 to c1 and k1 to c1
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

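4) Start Flume first so that it listens on the port

[atguigu@hadoop102 flume]$ bin/flume-ng agent --conf conf/ --name a1 --conf-file job/netcat-flume-logger.conf -Dflume.root.logger=INFO,console

agent must come first on the command line.

--conf: the configuration files are stored in the conf/ directory.

--name: followed by the name of the agent.

--conf-file: followed by the path of the custom configuration file.

-Dflume.root.logger=INFO,console: include this when using a logger sink so that events are printed to the console as log output.

-D overrides the flume.root.logger property at runtime; here it sets the console log level to INFO. Log levels include debug, info, warn, and error.

Alternative short form:

[atguigu@hadoop102 flume]$ bin/flume-ng agent -c conf/ -n a1 -f job/netcat-flume-logger.conf -Dflume.root.logger=INFO,console

5) Use netcat to send content to port 44444 on the local machine

Note:

Start the Flume agent listening on the port before starting netcat.

The host passed to nc must match the address the source is bound to. In my case I had to use nc localhost; nc hadoop102 failed with "Ncat: Connection refused." because my configuration file binds to localhost. If the configuration is changed to bind to hadoop102, nc hadoop102 44444 works instead.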

[atguigu@hadoop102 ~]$ nc localhost 44444

hello 

atguigu

(2) Monitoring a single appended file in real time

1) Requirements:

Monitor the Hive log in real time and upload it to HDFS.

2) Analysis:

3) Implementation steps:

1. Create a new configuration file as follows:

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
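# The path after tail -F is the file to monitor; the source type is exec.
# The Hive log lives on the local Linux filesystem, so the exec (execute) source is used: it runs a Linux command to read the file.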
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /opt/module/hive/logs/hive.log


# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Production version:

# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /opt/module/datas/hive.log


# Describe the sink
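# The hdfs sink type uploads the sink output to HDFS
a1.sinks.k1.type = hdfs
# Upload path on HDFS; folders are named by date (year, month, day) and hour
a1.sinks.k1.hdfs.path = /flume/%Y%m%d/%H
# Prefix for uploaded files
a1.sinks.k1.hdfs.filePrefix = logs-
# Whether to round down the timestamp
a1.sinks.k1.hdfs.round = true
# How many time units pass before a new folder is created
a1.sinks.k1.hdfs.roundValue = 1
# Time unit used for rounding
a1.sinks.k1.hdfs.roundUnit = hour
# Whether to use the local timestamp
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# Number of events to accumulate before flushing to HDFS
a1.sinks.k1.hdfs.batchSize = 100
# File type; DataStream writes uncompressed output (compressed types are also supported)
a1.sinks.k1.hdfs.fileType = DataStream
# How often (in seconds) to roll to a new file
# Here a file is rolled every 10 s; rollSize below is set slightly under 128 MB because an HDFS block is 128 MB
a1.sinks.k1.hdfs.rollInterval = 10
# File size (in bytes) that triggers a roll
a1.sinks.k1.hdfs.rollSize = 134217700
# 0 disables rolling based on the number of events
a1.sinks.k1.hdfs.rollCount = 0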

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
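
To make two of the sink settings above more concrete, the following shell sketch expands the `%Y%m%d/%H` escapes with GNU `date` (whose format codes coincide with these particular Flume escapes) and compares `rollSize` with a 128 MiB HDFS block. The fixed epoch timestamp is an arbitrary value chosen for illustration, and UTC is assumed; Flume itself takes the time from the event header or, with `useLocalTimeStamp = true`, from the local clock.

```shell
#!/usr/bin/env bash
# Expand the directory pattern for a fixed, arbitrary timestamp (GNU date syntax).
ts=1597403755   # illustrative epoch seconds, 2020-08-14 11:15:55 UTC
hdfs_dir="$(TZ=UTC date -d "@$ts" +'/flume/%Y%m%d/%H')"
echo "events at $ts land in: $hdfs_dir"

# Compare rollSize with a 128 MiB HDFS block.
block_size=$((128 * 1024 * 1024))
roll_size=134217700
gap=$((block_size - roll_size))
echo "block size: $block_size bytes"
echo "rollSize is $gap bytes below one block"
```

The small gap keeps a rolled file just under one block, so it does not spill a few bytes into a second block.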

2. Start the agent

[atguigu@hadoop102 flume]$ bin/flume-ng agent -c conf/ -f job/file-flume-hdfs2.conf -n a1

Check HDFS: the folder has been created, and when Hive is used, the newly written Hive log entries are uploaded to HDFS.

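4) Summary:

sink hdfs
hdfs path  /flume/%Y%m%d/%H   folders only roll over time if the path contains time escapes along these lines
fileType   DataStream  plain, uncompressed output that can be read directly

------ Rolling files
rollInterval  roll the HDFS file after this much time; unit is seconds; small and mid-sized companies often use 3600
rollSize      roll the HDFS file at this size; unit is bytes; in production usually set slightly below the block size
rollCount     roll the HDFS file after this many events; unit is events; in production set to 0
------ Rolling folders
round       whether to round down the timestamp; the granularity comes from the two parameters below; default false
roundValue  the rounding value for rolling folders; default 1
roundUnit   the rounding unit for rolling folders; default second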
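
The rounding rule can be illustrated with plain shell arithmetic: with `round = true`, the timestamp is rounded down to a multiple of `roundValue` units before the path escapes are expanded. The sketch below assumes a hypothetical setting of `roundValue = 10` and `roundUnit = minute`, which is not used in this article, together with a minute-level path pattern so that the effect is visible. The helper name `round_down` is introduced here for illustration only.

```shell
#!/usr/bin/env bash
# Round an epoch timestamp down to a multiple of (value * unit) seconds.
round_down() {
  local ts="$1" value="$2" unit_seconds="$3"
  local step=$((value * unit_seconds))
  echo $((ts - ts % step))
}

ts=1597403755                         # illustrative epoch seconds (11:15:55 UTC)
rounded="$(round_down "$ts" 10 60)"   # hypothetical roundValue=10, roundUnit=minute
rounded_dir="$(TZ=UTC date -d "@$rounded" +'/flume/%Y%m%d/%H%M')"
echo "$rounded_dir"
```

With these assumed settings, every event from 11:10:00 up to 11:19:59 would be written under the same folder.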

(3) Monitoring multiple new files in a directory in real time

1) Requirements:

Use Flume to watch an entire directory for files and upload them to HDFS.

2) Analysis:

3) Implementation steps:

1. Create the configuration file

a3.sources = r3
a3.sinks = k3
a3.channels = c3

# Describe/configure the source
a3.sources.r3.type = spooldir
a3.sources.r3.spoolDir = /opt/module/flume/upload
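# Suffix appended to files once they are fully ingested; files with this suffix are not uploaded again
a3.sources.r3.fileSuffix = .COMPLETED
# Ignore all files ending in .tmp; they are not uploaded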
a3.sources.r3.ignorePattern = ([^ ]*\.tmp)

# Describe the sink
a3.sinks.k3.type = hdfs
a3.sinks.k3.hdfs.path = /flume/upload/%Y%m%d/%H
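# Prefix for uploaded files
a3.sinks.k3.hdfs.filePrefix = upload-
# Whether to roll folders based on time
a3.sinks.k3.hdfs.round = true
# How many time units pass before a new folder is created
a3.sinks.k3.hdfs.roundValue = 1
# Time unit used for rounding
a3.sinks.k3.hdfs.roundUnit = hour
# Whether to use the local timestamp
a3.sinks.k3.hdfs.useLocalTimeStamp = true
# Number of events to accumulate before flushing to HDFS
a3.sinks.k3.hdfs.batchSize = 100
# File type; DataStream writes uncompressed output (compressed types are also supported)
a3.sinks.k3.hdfs.fileType = DataStream
# How often (in seconds) to roll to a new file
a3.sinks.k3.hdfs.rollInterval = 60
# File size that triggers a roll, roughly 128 MB
a3.sinks.k3.hdfs.rollSize = 134217700
# 0 disables rolling based on the number of events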
a3.sinks.k3.hdfs.rollCount = 0

# Use a channel which buffers events in memory
a3.channels.c3.type = memory
a3.channels.c3.capacity = 1000
a3.channels.c3.transactionCapacity = 100

# Bind the source and sink to the channel
a3.sources.r3.channels = c3
a3.sinks.k3.channel = c3
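
To see which file names the `ignorePattern` in this configuration would skip, the pattern can be tried against sample names with `grep -E -x`, which requires the whole name to match. That whole-name behaviour is assumed here to mirror how Flume applies the Java regex; the helper `is_ignored` and the sample file names are made up for illustration.

```shell
#!/usr/bin/env bash
# Report whether a file name would be skipped by the ignorePattern above.
ignore_pattern='([^ ]*\.tmp)'
is_ignored() {
  printf '%s\n' "$1" | grep -E -x -q -- "$ignore_pattern"
}

for name in b.txt part-0001.tmp "my file.tmp"; do
  if is_ignored "$name"; then
    echo "$name: ignored"
  else
    echo "$name: picked up"
  fi
done
```

Because `[^ ]*` excludes spaces, a name such as `my file.tmp` does not match as a whole and would still be picked up under this assumption.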

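2. Start Flume

[atguigu@hadoop102 flume]$ bin/flume-ng agent -n a3 -c conf/ -f job/files-flume-hdfs.conf

The folder now exists on HDFS. When a new file b.txt is created in the monitored folder, a temporary file ending in .tmp appears in the HDFS folder and after a while becomes a regular file. Because b.txt has already been uploaded to HDFS, it is renamed with the .COMPLETED suffix.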

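4) Summary:

1. This source is meant for uploading files that are already complete.

2. This source scans the specified folder every 500 ms.

3. Do not keep modifying a file inside the monitored folder; doing so causes duplicate uploads.

4. Do not place a file with the same name in the folder twice; doing so brings the agent down.

5. Do not place files that already carry the completed suffix (.COMPLETED) in the folder.

6. Upload the file first and rename it afterwards.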
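
Rules 1, 3, and 6 can be combined into one habit: write the file under a name the source ignores, then rename it once it is complete. The sketch below assumes the `.tmp` ignore pattern from the configuration above; it uses a temporary directory as a stand-in for `/opt/module/flume/upload`, and the file name `b.txt` is only an example.

```shell
#!/usr/bin/env bash
# Stand-in for the spooling directory; the article uses /opt/module/flume/upload.
spool_dir="$(mktemp -d)"

# 1. Write the file under a .tmp name, which the ignorePattern skips.
printf 'hello\nflume\n' > "$spool_dir/b.txt.tmp"

# 2. Rename only after writing has finished. A rename within one filesystem is
#    atomic, so the source never sees a half-written b.txt.
mv "$spool_dir/b.txt.tmp" "$spool_dir/b.txt"

ls "$spool_dir"
```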

(4) Monitoring multiple appended files across multiple directories in real time

(Taildir Source is the most important of these.)

Exec Source is suited to monitoring a single file that is appended to in real time, but it cannot resume from a checkpoint. Spooldir Source is suited to syncing new files, but not to watching and syncing log files that are appended to in real time. Taildir Source is suited to watching multiple files that are appended to in real time, and it can resume from a checkpoint.

1) Requirements:

Use Flume to watch the appended files of entire directories in real time and upload them to HDFS.

2) Analysis:

A logger sink is used here for demonstration:

3) Implementation steps:

1. Create the configuration file:

a3.sources = r3
a3.sinks = k3
a3.channels = c3

# Describe/configure the source
a3.sources.r3.type = TAILDIR
# JSON file that records the inode, absolute path, and last read position of each monitored file
a3.sources.r3.positionFile = /opt/module/flume/tail_dir.json
a3.sources.r3.filegroups = f1 f2
a3.sources.r3.filegroups.f1 = /opt/module/flume/file1/file1.txt
# Files whose names start with file and end with .txt
a3.sources.r3.filegroups.f2 = /opt/module/flume/file2/file.*.txt

# Describe the sink
a3.sinks.k3.type = logger


# Use a channel which buffers events in memory
a3.channels.c3.type = memory
a3.channels.c3.capacity = 1000
a3.channels.c3.transactionCapacity = 100

# Bind the source and sink to the channel
a3.sources.r3.channels = c3
a3.sinks.k3.channel = c3
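
The file-name part of a Taildir file group is a regular expression rather than a shell glob. As a rough check of what the `f2` pattern in this configuration accepts, the sketch below tries it against a few sample names with `grep -E -x`, assuming whole-name matching; the helper `matches_f2` and the sample names are made up for illustration.

```shell
#!/usr/bin/env bash
# Try the file-name part of filegroup f2 against sample names.
# grep -E -x requires the whole name to match the regular expression.
f2_name_regex='file.*.txt'
matches_f2() {
  printf '%s\n' "$1" | grep -E -x -q -- "$f2_name_regex"
}

for name in file2.txt file-2020.txt other.txt file2.log; do
  if matches_f2 "$name"; then
    echo "$name: tailed"
  else
    echo "$name: not tailed"
  fi
done
```

Because the dot before `txt` is unescaped, a name such as `file2_txt` would also match; writing the pattern as `file.*\.txt` is stricter.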

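2. Create the file1 and file2 folders, create file1.txt and file2.txt in them respectively, and start Flume

[atguigu@hadoop102 flume]$ bin/flume-ng agent -n a3 -c conf/ -f job/files-flume-logger.conf -Dflume.root.logger=INFO,console

3. Append content to file1.txt

[atguigu@hadoop102 file1]$ echo "hello file1">>file1.txt

The appended content is printed to the console by the logger sink, and a tail_dir.json file appears whose content is in JSON object form.
Note:
In my run, only content appended after Flume was started was printed; content that already existed was not picked up. (Whether a file with no recorded position is read from the beginning or from the end is governed by the source's skipToEnd option, which defaults to false in the Flume User Guide.)

If Flume is stopped, file1.txt is renamed and appended to, and Flume is then started again, the newly appended content is still printed to the console. This is because tail_dir.json stores the inode of file1.txt and renaming does not change the inode; after the rename, the file1 entry in tail_dir.json is updated to the new name as well.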

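4) Summary:

1. To track multiple directories, configure filegroups and then assign each file group its value (the absolute path of the files).

2. For Taildir to resume from a checkpoint, it must record position information (inode, pos, path); changing any one of the three changes where reading of the file resumes.

3. In Linux, the area that stores a file's metadata is called an inode. Every inode has a number, and the operating system uses inode numbers to tell files apart: internally, Unix/Linux systems identify files by inode number rather than by file name.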
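
The claim that a rename keeps the inode can be checked directly on a Linux host. The sketch below uses GNU `stat` in a temporary directory; the file names are stand-ins for the ones used in this example.

```shell
#!/usr/bin/env bash
# Show that renaming a file keeps its inode number (GNU stat syntax).
work_dir="$(mktemp -d)"
echo "hello file1" >> "$work_dir/file1.txt"

inode_before="$(stat -c %i "$work_dir/file1.txt")"
mv "$work_dir/file1.txt" "$work_dir/file1-renamed.txt"
inode_after="$(stat -c %i "$work_dir/file1-renamed.txt")"

echo "before rename: $inode_before"
echo "after rename:  $inode_after"
```

Both lines print the same number, which is why a position record keyed by inode can still be associated with the file after it has been renamed.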
