Big Data Development: Flume in Practice

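Contents

      • 1. netcat source with a logger sink
        • 1.1 Configuration file
        • 1.2 Starting the agent with console logging
        • 1.3 Connecting to the port
        • 1.4 Testing
      • 2. netcat source with a logger sink, filtering out digit-only lines
        • 2.1 Configuration file
        • 2.2 Starting the agent and connecting
        • 2.3 Testing
      • 3. netcat source with an HDFS sink
        • 3.1 Configuration
        • 3.2 Starting the agent and connecting
        • 3.3 Testing
          • 3.3.1 Checking HDFS
          • 3.3.2 Sending test input
          • 3.3.3 Checking the HDFS output files
      • 4. HTTP source with a logger sink
        • 4.1 Configuration
        • 4.2 Starting the agent
        • 4.3 Sending an HTTP test request
        • 4.4 Viewing the result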

1. netcat source with a logger sink

1.1 Configuration file

# example.conf: a single-node Flume configuration

# Name the components of agent a1
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Configure source r1 of agent a1
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Configure sink k1 of agent a1
a1.sinks.k1.type = logger

# Configure channel c1 of agent a1; the channel buffers event data
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and the sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
           

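This configuration file defines an agent named a1. The agent has a source that listens for data on port 44444 of the local machine, a channel that buffers events, and a sink that writes event data to the console. The file names each component and sets its type and other properties. A single configuration file may define several agents, so when starting Flume you pass the name of the agent that the process should run.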
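
To make the naming scheme concrete, the following Python sketch parses the same properties and prints how the components are wired together. It is an illustration only and not part of Flume, which does its own parsing; the helper names `parse_properties` and `describe_agent` are made up for this example.

```python
CONFIG = """\
a1.sources = r1
a1.sinks = k1
a1.channels = c1
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
a1.sinks.k1.type = logger
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
"""


def parse_properties(text):
    """Parse 'key = value' lines, skipping blank lines and '#' comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def describe_agent(props, agent):
    """Summarise which channels each source and sink of one agent is attached to."""
    summary = {}
    for source in props.get(f"{agent}.sources", "").split():
        summary[f"source {source}"] = {
            "type": props.get(f"{agent}.sources.{source}.type"),
            "channels": props.get(f"{agent}.sources.{source}.channels", "").split(),
        }
    for sink in props.get(f"{agent}.sinks", "").split():
        summary[f"sink {sink}"] = {
            "type": props.get(f"{agent}.sinks.{sink}.type"),
            "channel": props.get(f"{agent}.sinks.{sink}.channel"),
        }
    return summary


wiring = describe_agent(parse_properties(CONFIG), "a1")
for component, details in wiring.items():
    print(component, details)
```

Note the asymmetry in the keys: a source uses `channels` (plural) because it can feed several channels, while a sink uses `channel` (singular) because it drains exactly one.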

1.2 Starting the agent with console logging

./bin/flume-ng agent --conf conf --conf-file ./conf/flume-netcat.conf --name a1 -Dflume.root.logger=INFO,console
           

1.3 Connecting to the port

[~]# telnet localhost 44444
Trying ::1...
telnet: connect to address ::1: Connection refused
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
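
If telnet is not available, a few lines of Python can play the same role. This is a minimal sketch, assuming the agent from 1.2 is listening on localhost:44444 and that the netcat source answers each line with `OK`, as it does in the sessions shown in this article. The function name `send_lines` is chosen here for illustration.

```python
import socket


def send_lines(host, port, lines, timeout=5.0):
    """Send each line to a line-oriented TCP service and collect one reply per line."""
    replies = []
    with socket.create_connection((host, port), timeout=timeout) as conn:
        with conn.makefile("r", encoding="utf-8", newline="\n") as reader:
            for line in lines:
                # Lines are terminated with "\n" only, so unlike telnet
                # no carriage return ends up in the event body.
                conn.sendall((line + "\n").encode("utf-8"))
                replies.append(reader.readline().strip())
    return replies
```

With the agent running, `send_lines("localhost", 44444, ["hello", "flume"])` should return one reply per line sent.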
           

1.4 Testing

[~]# telnet localhost 44444
Trying ::1...
telnet: connect to address ::1: Connection refused
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
hello
OK
word
OK
dzw
OK
ttt
OK
haddop^H
OK
spark
OK
flume
OK
           

The Flume terminal prints each received event as a log entry:

2021-01-19 16:05:27,669 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{} body: 68 65 6C 6C 6F 0D                               hello. }
2021-01-19 16:05:29,842 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{} body: 77 6F 72 64 0D                                  word. }
2021-01-19 16:05:38,846 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{} body: 64 7A 77 0D                                     dzw. }
2021-01-19 16:14:24,955 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{} body: 74 74 74 0D                                     ttt. }
2021-01-19 16:19:43,018 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{} body: 68 61 64 64 6F 70 08 0D                         haddop.. }
2021-01-19 16:19:52,022 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{} body: 73 70 61 72 6B 0D                               spark. }
2021-01-19 16:19:53,289 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{} body: 66 6C 75 6D 65 0D                               flume. }
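
The `body` field in these log lines is a hex dump of the event's bytes, followed by a printable rendering in which non-printable bytes appear as dots. Decoding the hex by hand shows where the trailing dot comes from: telnet ends each line with a carriage return and a line feed, and the carriage return (0x0D) stays in the event body. The short Python sketch below does the decoding; `decode_body` is an illustrative helper, not a Flume API.

```python
def decode_body(hex_dump):
    """Turn the space-separated hex bytes printed by the logger sink into bytes."""
    return bytes.fromhex(hex_dump)


body = decode_body("68 65 6C 6C 6F 0D")
print(body)  # the last byte, 0x0D, is the carriage return sent by telnet

text = body.rstrip(b"\r").decode("ascii")
print(text)
```

The same decoding applied to the `haddop..` line shows a 0x08 byte before the carriage return, which is the backspace (`^H`) typed in the telnet session.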
           

2. netcat source with a logger sink, filtering out digit-only lines

2.1 Configuration file

# Name the components of agent a1
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Configure source r1 of agent a1
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Attach a regex filtering interceptor to the source
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = regex_filter
a1.sources.r1.interceptors.i1.regex = ^[0-9]*$
a1.sources.r1.interceptors.i1.excludeEvents = true

# Configure sink k1 of agent a1
a1.sinks.k1.type = logger

# Configure channel c1 of agent a1; the channel buffers event data
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and the sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
           

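Compared with section 1, only the interceptor block is new. With excludeEvents = true, the regex_filter interceptor drops every event whose body matches ^[0-9]*$, that is, lines made up entirely of digits. Lines that mix letters and digits do not match and still pass through.

2.2 Starting the agent and connecting

Same as in 1.2 and 1.3, using this configuration file.

2.3 Testing

[~]# telnet localhost 44444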
Trying ::1...
telnet: connect to address ::1: Connection refused
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
liuyichang
OK
1234
OK
hand
OK
1199
OK
hahahaah
OK
1
OK
2
OK
3
OK
4dididi
OK
12wd34
OK
Connection closed by foreign host.
           

Check the output. Only the lines that are not made up entirely of digits reach the logger sink:

2021-01-19 17:29:16,832 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{} body: 6C 69 75 79 69 63 68 61 6E 67 0D                liuyichang. }
2021-01-19 17:29:31,836 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{} body: 68 61 6E 64 0D                                  hand. }
2021-01-19 17:30:49,868 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{} body: 68 61 68 61 68 61 61 68 0D                      hahahaah. }
2021-01-19 17:30:53,870 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{} body: 34 64 69 64 69 64 69 0D                         4dididi. }
2021-01-19 17:31:09,362 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{} body: 31 32 77 64 33 34 0D                            12wd34. }
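
The filtering seen above can be reproduced outside Flume with the same regular expression. The Python sketch below is a simplified model, not the interceptor's implementation: `keep_event` is a hypothetical helper, and it strips the trailing carriage return before matching because Python's `$` does not match in front of a trailing `\r`, whereas the Flume output above shows digit-only lines being dropped even though telnet appends one.

```python
import re

DIGITS_ONLY = re.compile(r"^[0-9]*$")


def keep_event(body, pattern=DIGITS_ONLY, exclude_events=True):
    """Return True if the event passes the filter in this simplified model."""
    text = body.rstrip("\r\n")
    matched = pattern.search(text) is not None
    # excludeEvents = true drops matching events; false keeps only matching ones.
    return not matched if exclude_events else matched


inputs = ["liuyichang", "1234", "hand", "1199", "hahahaah",
          "1", "2", "3", "4dididi", "12wd34"]
kept = [line for line in inputs if keep_event(line + "\r")]
print(kept)
```

The surviving lines are the five that appear in the log output of 2.3. Because of the `*` quantifier, an empty line also matches the pattern and is dropped in this model.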
           

3. netcat source with an HDFS sink

3.1 Configuration

# Name the components of agent a1
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Configure source r1 of agent a1
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
# Configure sink k1 of agent a1
#a1.sinks.k1.type = logger
a1.sinks.k1.type = hdfs
# HDFS path to write to
a1.sinks.k1.hdfs.path = hdfs:/flume
# Prefix for the names of the files the sink creates
a1.sinks.k1.hdfs.filePrefix = events
# Round the event timestamp down; this affects time-based escape sequences in hdfs.path
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
# Unit of the rounding value: minute
a1.sinks.k1.hdfs.roundUnit = minute
# Intended to start a new file every minute. Note: hdfs.roundInterval is not among
# the documented HDFS sink properties; the documented setting is hdfs.rollInterval
# (in seconds, default 30).
a1.sinks.k1.hdfs.roundInterval = 60
a1.sinks.k1.hdfs.fileType = DataStream
# Configure channel c1 of agent a1; the channel buffers event data
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and the sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
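
The `round`, `roundValue` and `roundUnit` settings only matter when `hdfs.path` contains time-based escape sequences such as `%Y%m%d` or `%H%M`: the event timestamp is rounded down before the path is expanded, so events from the same 10-minute window land in the same directory. The configuration above uses the fixed path `/flume`, so the rounding has no visible effect there. The Python sketch below imitates the rounding; `round_down` and the path pattern `/flume/%Y%m%d/%H%M` are hypothetical and not taken from this article's setup.

```python
from datetime import datetime


def round_down(ts, round_value, round_unit):
    """Round a timestamp down to a multiple of round_value seconds, minutes or hours."""
    if round_unit == "second":
        return ts.replace(second=ts.second - ts.second % round_value, microsecond=0)
    if round_unit == "minute":
        return ts.replace(minute=ts.minute - ts.minute % round_value,
                          second=0, microsecond=0)
    if round_unit == "hour":
        return ts.replace(hour=ts.hour - ts.hour % round_value,
                          minute=0, second=0, microsecond=0)
    raise ValueError(f"unsupported unit: {round_unit}")


event_time = datetime(2021, 1, 20, 16, 26, 29)
bucket = round_down(event_time, 10, "minute")
path = bucket.strftime("/flume/%Y%m%d/%H%M")  # hypothetical path pattern
print(path)  # /flume/20210120/1620
```

Time-based escape sequences also require each event to carry a timestamp header, or the sink to be told to use the local time via `hdfs.useLocalTimeStamp = true`.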
           

3.2 Starting the agent and connecting

Start the agent:

./bin/flume-ng agent --conf conf --conf-file ./conf/flume-hdfs.conf --name a1 -Dflume.root.logger=INFO,console

Connect to the port:

telnet localhost 44444
           

3.3 Testing

3.3.1 Checking HDFS
[~]# hadoop fs -ls /
Found 10 items
-rw-r--r--   2 root supergroup       1005 2020-12-07 14:57 /core-site.xml
drwxr-xr-x   - root supergroup          0 2020-12-13 17:41 /data
drwxr-xr-x   - root supergroup          0 2020-12-08 11:30 /dzw
drwxr-xr-x   - root supergroup          0 2020-12-14 18:06 /hadoop
drwxr-xr-x   - root supergroup          0 2020-12-29 17:59 /mr_wc
drwxr-xr-x   - root supergroup          0 2020-12-29 17:57 /output
drwxr-xr-x   - root supergroup          0 2020-12-21 15:34 /prodata
drwxr-xr-x   - root supergroup          0 2020-12-08 11:30 /test
drwx-wx-wx   - root supergroup          0 2020-12-14 21:43 /tmp
drwxr-xr-x   - root supergroup          0 2020-12-25 11:40 /user
           

At this point there is no /flume directory.

3.3.2 Sending test input
[apache-flume-1.6.0-bin]# telnet localhost 44444
Trying ::1...
telnet: connect to address ::1: Connection refused
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
qwq
OK
qqdeqd
OK
stupid
OK
liuyichang
OK
100086
OK
sichuan
OK
China
OK
panda
OK
           
3.3.3 Checking the HDFS output files
[~]# hadoop fs -ls /
Found 11 items
-rw-r--r--   2 root supergroup       1005 2020-12-07 14:57 /core-site.xml
drwxr-xr-x   - root supergroup          0 2020-12-13 17:41 /data
drwxr-xr-x   - root supergroup          0 2020-12-08 11:30 /dzw
drwxr-xr-x   - root supergroup          0 2021-01-20 16:26 /flume
drwxr-xr-x   - root supergroup          0 2020-12-14 18:06 /hadoop
drwxr-xr-x   - root supergroup          0 2020-12-29 17:59 /mr_wc
drwxr-xr-x   - root supergroup          0 2020-12-29 17:57 /output
drwxr-xr-x   - root supergroup          0 2020-12-21 15:34 /prodata
drwxr-xr-x   - root supergroup          0 2020-12-08 11:30 /test
drwx-wx-wx   - root supergroup          0 2020-12-14 21:43 /tmp
drwxr-xr-x   - root supergroup          0 2020-12-25 11:40 /user
           

Flume has now created the /flume directory in HDFS automatically.

[~]# hadoop fs -ls /flume
Found 1 items
-rw-r--r--   2 root supergroup         13 2021-01-20 16:26 /flume/events.1611131189758.tmp
[~]# hadoop fs -ls /flume
Found 1 items
-rw-r--r--   2 root supergroup         13 2021-01-20 16:26 /flume/events.1611131189758.tmp
[~]# hadoop fs -ls /flume
Found 2 items
-rw-r--r--   2 root supergroup         21 2021-01-20 16:27 /flume/events.1611131189758
-rw-r--r--   2 root supergroup         12 2021-01-20 16:27 /flume/events.1611131231774.tmp
[~]# hadoop fs -ls /flume
Found 3 items
-rw-r--r--   2 root supergroup         21 2021-01-20 16:27 /flume/events.1611131189758
-rw-r--r--   2 root supergroup         29 2021-01-20 16:27 /flume/events.1611131231774
-rw-r--r--   2 root supergroup         14 2021-01-20 16:27 /flume/events.1611131262116.tmp
[~]# hadoop fs -ls /flume
Found 3 items
-rw-r--r--   2 root supergroup         21 2021-01-20 16:27 /flume/events.1611131189758
-rw-r--r--   2 root supergroup         29 2021-01-20 16:27 /flume/events.1611131231774
-rw-r--r--   2 root supergroup         14 2021-01-20 16:28 /flume/events.1611131262116
[~]# hadoop fs -ls /flume/events.1611131189758
-rw-r--r--   2 root supergroup         21 2021-01-20 16:27 /flume/events.1611131189758
[~]# hadoop fs -cat /flume/events.1611131189758
qwq
qqdeqd
stupid
           

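The lines typed into telnet can be read back from the files under /flume.

Note: why .tmp files appear

While the HDFS sink is writing to a file, the file carries a .tmp suffix. When the file is rolled, it is closed and renamed without the suffix, and later events go to a new .tmp file. The configuration above sets hdfs.roundInterval = 60 with the intention of rolling once a minute, but that name is not among the documented HDFS sink properties. Since hdfs.rollInterval is not set, the sink's default of 30 seconds applies, which is consistent with the listings above, where each file is closed roughly half a minute after it was opened.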

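How can Flume be configured to avoid producing too many small files?

The HDFS sink rolls a file when any one of its triggers (hdfs.rollSize, hdfs.rollCount, hdfs.rollInterval) fires, and a value of 0 disables a trigger. The limits below therefore only take effect if the other triggers are raised or disabled as well.

a. Limit the size of each file, in bytes (here 200 MB, that is, 200 * 1024 * 1024):

a1.sinks.k1.hdfs.rollSize = 209715200

b. Limit the number of events stored in each file:

a1.sinks.k1.hdfs.rollCount = 10000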
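
As a rough illustration of how the roll triggers interact, here is a small Python model of the decision. It is a simplification for explanation, not Flume's code: `should_roll` is a hypothetical helper, and its default thresholds (1024 bytes, 10 events, 30 seconds) are the defaults listed in the Flume documentation for the HDFS sink.

```python
def should_roll(bytes_written, events_written, seconds_open,
                roll_size=1024, roll_count=10, roll_interval=30):
    """Return the triggers that would roll the open file; an empty list means keep writing.

    A threshold of 0 disables the corresponding trigger.
    """
    reasons = []
    if roll_size > 0 and bytes_written >= roll_size:
        reasons.append("size")
    if roll_count > 0 and events_written >= roll_count:
        reasons.append("count")
    if roll_interval > 0 and seconds_open >= roll_interval:
        reasons.append("interval")
    return reasons


# With the defaults, a 21-byte file holding 3 events is rolled by the timer alone.
print(should_roll(21, 3, 30))

# With the count and interval triggers disabled, only the 200 MB size limit applies.
print(should_roll(209715200, 500000, 3600,
                  roll_size=209715200, roll_count=0, roll_interval=0))
```

In the first call only the time trigger fires, which matches the small files seen in 3.3.3. In the second call the file keeps growing until it reaches the size limit.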

4. HTTP source with a logger sink

4.1 Configuration

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Configure the source
a1.sources.r1.type = org.apache.flume.source.http.HTTPSource
a1.sources.r1.bind = master
a1.sources.r1.port = 50020

# Configure the sink
a1.sinks.k1.type = logger

# Configure the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and the sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
           

4.2 Starting the agent

./bin/flume-ng agent --conf conf --conf-file ./conf/flume-http.conf --name a1 -Dflume.root.logger=INFO,console
           

4.3 Sending an HTTP test request

[~]# curl -X POST -d '[{"headers" : {"timestamp" : "434324343","host" : "random_host.example.com"},"body" : "random_body"},{"headers" : {"namenode" : "namenode.example.com","datanode" : "random_datanode.example.com"},"body" : "liuyichang"}]' master:50020
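
The request body is a JSON array of events, each with a `headers` object and a string `body`, which is the format the HTTP source's default JSON handler expects. The same request can be assembled in Python. This is a sketch under the assumption that the agent from 4.2 is reachable at `http://master:50020`; `build_payload` and `post_events` are illustrative names, not part of any Flume client library.

```python
import json
import urllib.request


def build_payload(events):
    """Encode (headers, body) pairs as the JSON array used in the curl command."""
    return json.dumps(
        [{"headers": dict(headers), "body": body} for headers, body in events]
    ).encode("utf-8")


def post_events(url, events, timeout=5.0):
    """POST the events to the given URL and return the HTTP status code."""
    request = urllib.request.Request(
        url,
        data=build_payload(events),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


events = [
    ({"timestamp": "434324343", "host": "random_host.example.com"}, "random_body"),
    ({"namenode": "namenode.example.com",
      "datanode": "random_datanode.example.com"}, "liuyichang"),
]
payload = build_payload(events)
print(payload.decode("utf-8"))
```

With the agent running, `post_events("http://master:50020", events)` sends both events in a single request.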
           

4.4 Viewing the result

2021-01-20 17:20:26,958 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{namenode=namenode.example.com, datanode=random_datanode.example.com} body: 6C 69 75 79 69 63 68 61 6E 67                   liuyichang }