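Table of Contents
- 1. Using netcat as the source with a logger sink
- 1.1 Configuration file
- 1.2 Start the agent
- 1.3 Connect to the port
- 1.4 Test
- 2. Using netcat as the source with a logger sink, filtering out numeric-only lines
- 2.1 Configuration file
- 2.2 Start the agent and connect
- 2.3 Test
- 3. Using netcat as the source with an HDFS sink
- 3.1 Configuration file
- 3.2 Start the agent and connect
- 3.3 Test
- 3.3.1 Check HDFS
- 3.3.2 Send test input
- 3.3.3 Check the HDFS output files
- 4. Using HTTP as the source with a logger sink
- 4.1 Configuration file
- 4.2 Start the agent
- 4.3 Send an HTTP test request
- 4.4 Check the result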
1. Using netcat as the source with a logger sink
1.1 Configuration file
# flume-netcat.conf: a single-node Flume agent configuration
# Name the components of agent a1
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Configure source r1 of agent a1
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
# Configure sink k1 of agent a1
a1.sinks.k1.type = logger
# Configure channel c1 of agent a1; the channel buffers event data
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and the sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
This configuration file defines an agent named a1. a1 has a source that listens for data on local port 44444, a channel that buffers the data, and a sink that writes event data to the console. The file names each component and sets its type and other properties. A single configuration file may define several agents, so you pass an agent name when starting Flume to select which one to run.
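The wiring described above (a source lists its channels under the plural key `channels`, a sink names one channel under the singular key `channel`) is easy to get wrong by hand. The following is a minimal sketch of a sanity check, not part of Flume itself; `parse_properties` and `check_wiring` are hypothetical helper names introduced only for this illustration.

```python
def parse_properties(text):
    """Parse 'key = value' lines, skipping blank lines and '#' comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def check_wiring(props, agent):
    """Return a list of wiring problems; an empty list means none were found."""
    problems = []
    channels = set(props.get(agent + ".channels", "").split())
    for source in props.get(agent + ".sources", "").split():
        # A source may feed several channels, hence the plural key.
        used = props.get("%s.sources.%s.channels" % (agent, source), "").split()
        if not used:
            problems.append("source %s has no channel" % source)
        for name in used:
            if name not in channels:
                problems.append("source %s: unknown channel %s" % (source, name))
    for sink in props.get(agent + ".sinks", "").split():
        # A sink drains exactly one channel, hence the singular key.
        used = props.get("%s.sinks.%s.channel" % (agent, sink))
        if used is None:
            problems.append("sink %s has no channel" % sink)
        elif used not in channels:
            problems.append("sink %s: unknown channel %s" % (sink, used))
    return problems


CONF = """
a1.sources = r1
a1.sinks = k1
a1.channels = c1
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
a1.sinks.k1.type = logger
a1.channels.c1.type = memory
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
"""

print(check_wiring(parse_properties(CONF), "a1"))  # []
```

Changing the last line of `CONF` to reference a channel that is not declared makes the check report `sink k1: unknown channel ...`.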
1.2 Start the agent
./bin/flume-ng agent --conf conf --conf-file ./conf/flume-netcat.conf --name a1 -Dflume.root.logger=INFO,console
1.3 Connect to the port
[[email protected] ~]# telnet localhost 44444
Trying ::1...
telnet: connect to address ::1: Connection refused
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
1.4 Test
[[email protected] ~]# telnet localhost 44444
Trying ::1...
telnet: connect to address ::1: Connection refused
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
hello
OK
word
OK
dzw
OK
ttt
OK
haddop^H
OK
spark
OK
flume
OK
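The Flume terminal prints each received event as a log line. Every body ends in 0D because telnet sends a carriage return at the end of each line.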
2021-01-19 16:05:27,669 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{} body: 68 65 6C 6C 6F 0D hello. }
2021-01-19 16:05:29,842 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{} body: 77 6F 72 64 0D word. }
2021-01-19 16:05:38,846 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{} body: 64 7A 77 0D dzw. }
2021-01-19 16:14:24,955 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{} body: 74 74 74 0D ttt. }
2021-01-19 16:19:43,018 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{} body: 68 61 64 64 6F 70 08 0D haddop.. }
2021-01-19 16:19:52,022 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{} body: 73 70 61 72 6B 0D spark. }
2021-01-19 16:19:53,289 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{} body: 66 6C 75 6D 65 0D flume. }
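The logger sink prints each body as a hex dump followed by a printable rendering in which non-printable bytes appear as dots. As a small sketch of how to read those dumps, the hex column can be decoded back into bytes; `decode_body` is a hypothetical helper name used only here.

```python
def decode_body(hex_dump):
    """Convert a space-separated hex dump such as '68 65 6C' into raw bytes."""
    return bytes.fromhex(hex_dump)


hello = decode_body("68 65 6C 6C 6F 0D")
print(repr(hello))                           # b'hello\r'
print(hello.rstrip(b"\r").decode("ascii"))   # hello

# The 'haddop^H' line: 0x08 is the backspace typed in telnet, 0x0D the carriage return.
haddop = decode_body("68 61 64 64 6F 70 08 0D")
print(repr(haddop))                          # b'haddop\x08\r'
```

This shows why the log renders `hello.` and `haddop..`: the trailing dots stand for the carriage return and the backspace, not for literal periods.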
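2. Using netcat as the source with a logger sink, filtering out numeric-only lines
2.1 Configuration file
# Name the components of agent a1
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Configure source r1 of agent a1
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
# Define a regex filtering interceptor on the source
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = regex_filter
a1.sources.r1.interceptors.i1.regex = ^[0-9]*$
a1.sources.r1.interceptors.i1.excludeEvents = true
# Configure sink k1 of agent a1
a1.sinks.k1.type = logger
# Configure channel c1 of agent a1; the channel buffers event data
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and the sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Compared with section 1, this adds the regex interceptor. With excludeEvents = true, events whose body matches ^[0-9]*$ (digits only) are dropped; everything else, including lines that mix letters and digits, is passed on.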
2.2 Start the agent and connect
Same as sections 1.2 and 1.3, with --conf-file pointing at this configuration file.
2.3 Test
[[email protected] ~]# telnet localhost 44444
Trying ::1...
telnet: connect to address ::1: Connection refused
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
liuyichang
OK
1234
OK
hand
OK
1199
OK
hahahaah
OK
1
OK
2
OK
3
OK
4dididi
OK
12wd34
OK
Connection closed by foreign host.
Check the output:
2021-01-19 17:29:16,832 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{} body: 6C 69 75 79 69 63 68 61 6E 67 0D liuyichang. }
2021-01-19 17:29:31,836 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{} body: 68 61 6E 64 0D hand. }
2021-01-19 17:30:49,868 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{} body: 68 61 68 61 68 61 61 68 0D hahahaah. }
2021-01-19 17:30:53,870 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{} body: 34 64 69 64 69 64 69 0D 4dididi. }
2021-01-19 17:31:09,362 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{} body: 31 32 77 64 33 34 0D 12wd34. }
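The filtering seen in this test can be reproduced outside Flume with a short sketch. It applies the same pattern, `^[0-9]*$`, and drops matching lines, mirroring `excludeEvents = true`. Two things are assumptions of the sketch rather than Flume behaviour: the helper name `keep_event`, and the explicit stripping of the trailing carriage return that telnet appends (Python's `$` tolerates only a trailing `\n`, so the sketch removes the `\r` before matching).

```python
import re

DIGITS_ONLY = re.compile(r"^[0-9]*$")


def keep_event(body):
    """Return True if the line survives a digits-only exclusion filter."""
    # Strip the line terminator first; see the assumption in the lead-in.
    return DIGITS_ONLY.search(body.rstrip("\r\n")) is None


# The lines typed in the telnet session, each followed by telnet's carriage return.
SENT = ["liuyichang", "1234", "hand", "1199", "hahahaah",
        "1", "2", "3", "4dididi", "12wd34"]

kept = [line for line in SENT if keep_event(line + "\r")]
print(kept)  # ['liuyichang', 'hand', 'hahahaah', '4dididi', '12wd34']
```

The surviving lines are the same five that appear in the logger output: only lines made up entirely of digits are removed, so `4dididi` and `12wd34` pass through.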
3. Using netcat as the source with an HDFS sink
3.1 Configuration file
# Name the components of agent a1
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Configure source r1 of agent a1
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
# Configure sink k1 of agent a1
#a1.sinks.k1.type = logger
a1.sinks.k1.type = hdfs
# HDFS path
a1.sinks.k1.hdfs.path = hdfs:/flume
# Prefix of the output file names
a1.sinks.k1.hdfs.filePrefix = events
# Whether to round the timestamp down so that time-based directories switch on the rounded boundary (true: yes)
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
# The rounding unit is minutes
a1.sinks.k1.hdfs.roundUnit = minute
# Roll to a new file every 60 seconds
a1.sinks.k1.hdfs.rollInterval = 60
a1.sinks.k1.hdfs.fileType = DataStream
# Configure channel c1 of agent a1; the channel buffers event data
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and the sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
3.2 Start the agent and connect
Start the agent:
./bin/flume-ng agent --conf conf --conf-file ./conf/flume-hdfs.conf --name a1 -Dflume.root.logger=INFO,console
Connect:
telnet localhost 44444
3.3 Test
3.3.1 Check HDFS
[[email protected] ~]# hadoop fs -ls /
Found 10 items
-rw-r--r-- 2 root supergroup 1005 2020-12-07 14:57 /core-site.xml
drwxr-xr-x - root supergroup 0 2020-12-13 17:41 /data
drwxr-xr-x - root supergroup 0 2020-12-08 11:30 /dzw
drwxr-xr-x - root supergroup 0 2020-12-14 18:06 /hadoop
drwxr-xr-x - root supergroup 0 2020-12-29 17:59 /mr_wc
drwxr-xr-x - root supergroup 0 2020-12-29 17:57 /output
drwxr-xr-x - root supergroup 0 2020-12-21 15:34 /prodata
drwxr-xr-x - root supergroup 0 2020-12-08 11:30 /test
drwx-wx-wx - root supergroup 0 2020-12-14 21:43 /tmp
drwxr-xr-x - root supergroup 0 2020-12-25 11:40 /user
At this point there is no /flume directory.
3.3.2 Send test input
[[email protected] apache-flume-1.6.0-bin]# telnet localhost 44444
Trying ::1...
telnet: connect to address ::1: Connection refused
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
qwq
OK
qqdeqd
OK
stupid
OK
liuyichang
OK
100086
OK
sichuan
OK
China
OK
panda
OK
3.3.3 Check the HDFS output files
[[email protected] ~]# hadoop fs -ls /
Found 11 items
-rw-r--r-- 2 root supergroup 1005 2020-12-07 14:57 /core-site.xml
drwxr-xr-x - root supergroup 0 2020-12-13 17:41 /data
drwxr-xr-x - root supergroup 0 2020-12-08 11:30 /dzw
drwxr-xr-x - root supergroup 0 2021-01-20 16:26 /flume
drwxr-xr-x - root supergroup 0 2020-12-14 18:06 /hadoop
drwxr-xr-x - root supergroup 0 2020-12-29 17:59 /mr_wc
drwxr-xr-x - root supergroup 0 2020-12-29 17:57 /output
drwxr-xr-x - root supergroup 0 2020-12-21 15:34 /prodata
drwxr-xr-x - root supergroup 0 2020-12-08 11:30 /test
drwx-wx-wx - root supergroup 0 2020-12-14 21:43 /tmp
drwxr-xr-x - root supergroup 0 2020-12-25 11:40 /user
Flume has now automatically created the /flume directory in HDFS.
[[email protected] ~]# hadoop fs -ls /flume
Found 1 items
-rw-r--r-- 2 root supergroup 13 2021-01-20 16:26 /flume/events.1611131189758.tmp
[[email protected] ~]# hadoop fs -ls /flume
Found 1 items
-rw-r--r-- 2 root supergroup 13 2021-01-20 16:26 /flume/events.1611131189758.tmp
[[email protected] ~]# hadoop fs -ls /flume
Found 2 items
-rw-r--r-- 2 root supergroup 21 2021-01-20 16:27 /flume/events.1611131189758
-rw-r--r-- 2 root supergroup 12 2021-01-20 16:27 /flume/events.1611131231774.tmp
[[email protected] ~]# hadoop fs -ls /flume
Found 3 items
-rw-r--r-- 2 root supergroup 21 2021-01-20 16:27 /flume/events.1611131189758
-rw-r--r-- 2 root supergroup 29 2021-01-20 16:27 /flume/events.1611131231774
-rw-r--r-- 2 root supergroup 14 2021-01-20 16:27 /flume/events.1611131262116.tmp
[[email protected] ~]# hadoop fs -ls /flume
Found 3 items
-rw-r--r-- 2 root supergroup 21 2021-01-20 16:27 /flume/events.1611131189758
-rw-r--r-- 2 root supergroup 29 2021-01-20 16:27 /flume/events.1611131231774
-rw-r--r-- 2 root supergroup 14 2021-01-20 16:28 /flume/events.1611131262116
[[email protected] ~]# hadoop fs -ls /flume/events.1611131189758
-rw-r--r-- 2 root supergroup 21 2021-01-20 16:27 /flume/events.1611131189758
[[email protected] ~]# hadoop fs -cat /flume/events.1611131189758
qwq
qqdeqd
stupid
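The input sent over telnet can be read back from the files under /flume.
Note: why .tmp files appear
The sink rolls to a new file on a time interval, which the configuration above aims to set to one minute. Events received within the interval are written to a .tmp file; once the interval has passed, that file is closed and renamed without the .tmp suffix, and later events go to a new .tmp file.
How can Flume be configured to avoid producing too many small files?
a. Limit the size a file may reach before it is rolled (in bytes; 200 * 1024 * 1024 here)
a1.sinks.k1.hdfs.rollSize = 209715200
b. Limit the number of events a file may hold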
a1.sinks.k1.hdfs.rollCount = 10000
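To see how these thresholds interact, here is a rough sketch of the roll decision. It assumes, following the Flume user guide, that the HDFS sink rolls a file when any of `rollInterval` (seconds), `rollSize` (bytes), or `rollCount` (events) is reached, that a value of 0 disables the corresponding trigger, and that the defaults are 30, 1024, and 10. The function `should_roll` is a hypothetical simplification, not the sink's actual code.

```python
def should_roll(seconds_open, bytes_written, events_written,
                roll_interval=30, roll_size=1024, roll_count=10):
    """Return True when any enabled threshold is reached; 0 disables a threshold."""
    if roll_interval > 0 and seconds_open >= roll_interval:
        return True
    if roll_size > 0 and bytes_written >= roll_size:
        return True
    if roll_count > 0 and events_written >= roll_count:
        return True
    return False


ROLL_SIZE = 200 * 1024 * 1024
print(ROLL_SIZE)  # 209715200

# With the defaults, 5000 bytes already exceed rollSize, so the file rolls early.
print(should_roll(10, 5000, 3))  # True

# With the larger limits from this section, the same file stays open.
print(should_roll(10, 5000, 3, roll_interval=60,
                  roll_size=ROLL_SIZE, roll_count=10000))  # False
```

The low default values for size and event count are what produce many small files under steady traffic; raising them, or setting them to 0 and relying on the interval alone, leaves fewer and larger files.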
4. Using HTTP as the source with a logger sink
4.1 Configuration file
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Configure the source
a1.sources.r1.type = org.apache.flume.source.http.HTTPSource
a1.sources.r1.bind = master
a1.sources.r1.port = 50020
# Configure the sink
a1.sinks.k1.type = logger
# Configure the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and the sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
4.2 Start the agent
./bin/flume-ng agent --conf conf --conf-file ./conf/flume-http.conf --name a1 -Dflume.root.logger=INFO,console
4.3 Send an HTTP test request
[[email protected] ~]# curl -X POST -d '[{"headers" : {"timestamp" : "434324343","host" : "random_host.example.com"},"body" : "random_body"},{"headers" : {"namenode" : "namenode.example.com","datanode" : "random_datanode.example.com"},"body" : "liuyichang"}]' master:50020
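The request body is a JSON array in which every element is one event with a `headers` map and a string `body`. As a sketch of the same request built programmatically, the snippet below prepares (but does not send) an equivalent POST with the Python standard library. `build_request` is a hypothetical helper name, and the `Content-Type` header is an assumption of this sketch; the host and port are the ones used in this article.

```python
import json
import urllib.request


def build_request(url, events):
    """Build a POST request whose body is a JSON array of Flume-style events."""
    payload = json.dumps(events).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


events = [
    {"headers": {"timestamp": "434324343", "host": "random_host.example.com"},
     "body": "random_body"},
    {"headers": {"namenode": "namenode.example.com",
                 "datanode": "random_datanode.example.com"},
     "body": "liuyichang"},
]

request = build_request("http://master:50020", events)
print(request.get_method())                  # POST
print(len(json.loads(request.data)))         # 2

# Sending is left out so the sketch runs without a Flume agent:
# urllib.request.urlopen(request)
```

Each element of the array becomes a separate Flume event, so this request yields two events on the channel.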
4.4 Check the result
2021-01-20 17:20:26,958 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{namenode=namenode.example.com, datanode=random_datanode.example.com} body: 6C 69 75 79 69 63 68 61 6E 67 liuyichang }