
Flume Sinks --- HDFS Sink

HDFS Sink

This sink writes events into the Hadoop Distributed File System (HDFS). It currently supports creating text and sequence files, with compression in both file types. The files can be rolled (close current file and create a new one) periodically based on the elapsed time, size of data, or number of events. It also buckets/partitions data by attributes like timestamp or the machine where the event originated. The HDFS directory path may contain formatting escape sequences that will be replaced by the HDFS sink to generate a directory/file name to store the events. Using this sink requires Hadoop to be installed so that Flume can use the Hadoop jars to communicate with the HDFS cluster. Note that a version of Hadoop that supports the sync() call is required.

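Before the reference material below, here is a minimal, hedged sketch of a complete agent wired to an HDFS sink. The netcat source, memory channel, and all names (a1, r1, c1, k1, the namenode path) are illustrative placeholders, not prescribed values:

a1.sources = r1
a1.channels = c1
a1.sinks = k1
# Illustrative source and channel; any Flume source/channel pair will do
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
a1.sources.r1.channels = c1
a1.channels.c1.type = memory
# The HDFS sink itself
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = hdfs://namenode/flume/webdata/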

1. Supported Escape Sequences

The following are the escape sequences supported:

| Alias | Description |
| --- | --- |
| %{host} | Substitute value of event header named “host”. Arbitrary header names are supported. |
| %t | Unix time in milliseconds |
| %a | locale’s short weekday name (Mon, Tue, …) |
| %A | locale’s full weekday name (Monday, Tuesday, …) |
| %b | locale’s short month name (Jan, Feb, …) |
| %B | locale’s long month name (January, February, …) |
| %c | locale’s date and time (Thu Mar 3 23:05:25 2005) |
| %d | day of month (01) |
| %e | day of month without padding (1) |
| %D | date; same as %m/%d/%y |
| %H | hour (00…23) |
| %I | hour (01…12) |
| %j | day of year (001…366) |
| %k | hour (0…23) |
| %m | month (01…12) |
| %n | month without padding (1…12) |
| %M | minute (00…59) |
| %p | locale’s equivalent of am or pm |
| %s | seconds since 1970-01-01 00:00:00 UTC |
| %S | second (00…59) |
| %y | last two digits of year (00…99) |
| %Y | year (2010) |
| %z | +hhmm numeric timezone (for example, -0400) |
| %[localhost] | Substitute the hostname of the host where the agent is running |
| %[IP] | Substitute the IP address of the host where the agent is running |
| %[FQDN] | Substitute the canonical hostname of the host where the agent is running |

Note: The escape strings %[localhost], %[IP] and %[FQDN] all rely on Java’s ability to obtain the hostname, which may fail in some networking environments.

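For illustration, here is a hedged sketch of a path combining several of these escapes; the agent/sink names and the namenode URI are placeholders, and events must carry a “host” header as well as, for the date escapes, a “timestamp” header:

# Bucket events per originating host and per day
a1.sinks.k1.hdfs.path = hdfs://namenode/flume/%{host}/%Y-%m-%d
a1.sinks.k1.hdfs.filePrefix = events-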

The file in use will have the name mangled to include ".tmp" at the end. Once the file is closed, this extension is removed. This allows excluding partially complete files in the directory. Required properties are in bold.


Note: For all of the time-related escape sequences, a header with the key “timestamp” must exist among the headers of the event (unless hdfs.useLocalTimeStamp is set to true). One way to add this automatically is to use the Timestamp Interceptor.

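For example, a minimal sketch of attaching the Timestamp Interceptor to a hypothetical source r1, so that every event gets a timestamp header automatically:

a1.sources.r1.interceptors = i1
# The built-in Timestamp Interceptor adds a "timestamp" header at ingest time
a1.sources.r1.interceptors.i1.type = timestamp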

| Name | Default | Description |
| --- | --- | --- |
| **channel** | | |
| **type** | | The component type name, needs to be hdfs |
| **hdfs.path** | | HDFS directory path (eg hdfs://namenode/flume/webdata/) |
| hdfs.filePrefix | FlumeData | Name prefixed to files created by Flume in hdfs directory |
| hdfs.fileSuffix | | Suffix to append to file (eg .avro - NOTE: period is not automatically added) |
| hdfs.inUsePrefix | | Prefix that is used for temporal files that flume actively writes into |
| hdfs.inUseSuffix | .tmp | Suffix that is used for temporal files that flume actively writes into |
| hdfs.rollInterval | 30 | Number of seconds to wait before rolling current file (0 = never roll based on time interval) |
| hdfs.rollSize | 1024 | File size to trigger roll, in bytes (0 = never roll based on file size) |
| hdfs.rollCount | 10 | Number of events written to file before it rolled (0 = never roll based on number of events) |
| hdfs.idleTimeout | 0 | Timeout after which inactive files get closed (0 = disable automatic closing of idle files) |
| hdfs.batchSize | 100 | Number of events written to file before it is flushed to HDFS |
| hdfs.codeC | | Compression codec. One of the following: gzip, bzip2, lzo, lzop, snappy |
| hdfs.fileType | SequenceFile | File format: currently SequenceFile, DataStream or CompressedStream. (1) DataStream will not compress the output file; do not set codeC with it. (2) CompressedStream requires hdfs.codeC to be set to an available codec. |
| hdfs.maxOpenFiles | 5000 | Allow only this number of open files. If this number is exceeded, the oldest file is closed. |
| hdfs.minBlockReplicas | | Specify minimum number of replicas per HDFS block. If not specified, it comes from the default Hadoop config in the classpath. |
| hdfs.writeFormat | Writable | Format for sequence file records. One of Text or Writable. Set to Text before creating data files with Flume, otherwise those files cannot be read by either Apache Impala (incubating) or Apache Hive. |
| hdfs.callTimeout | 10000 | Number of milliseconds allowed for HDFS operations, such as open, write, flush, close. This number should be increased if many HDFS timeout operations are occurring. |
| hdfs.threadsPoolSize | 10 | Number of threads per HDFS sink for HDFS IO ops (open, write, etc.) |
| hdfs.rollTimerPoolSize | 1 | Number of threads per HDFS sink for scheduling timed file rolling |
| hdfs.kerberosPrincipal | | Kerberos user principal for accessing secure HDFS |
| hdfs.kerberosKeytab | | Kerberos keytab for accessing secure HDFS |
| hdfs.proxyUser | | |
| hdfs.round | false | Should the timestamp be rounded down (if true, affects all time based escape sequences except %t) |
| hdfs.roundValue | 1 | Rounded down to the highest multiple of this (in the unit configured using hdfs.roundUnit), less than current time. |
| hdfs.roundUnit | second | The unit of the round down value - second, minute or hour. |
| hdfs.timeZone | Local Time | Name of the timezone that should be used for resolving the directory path, e.g. America/Los_Angeles. |
| hdfs.useLocalTimeStamp | false | Use the local time (instead of the timestamp from the event header) while replacing the escape sequences. |
| hdfs.closeTries | 0 | Number of times the sink must try renaming a file, after initiating a close attempt. If set to 1, this sink will not re-try a failed rename (due to, for example, NameNode or DataNode failure), and may leave the file in an open state with a .tmp extension. If set to 0, the sink will try to rename the file until the file is eventually renamed (there is no limit on the number of times it would try). The file may still remain open if the close call fails but the data will be intact and in this case, the file will be closed only after a Flume restart. |
| hdfs.retryInterval | 180 | Time in seconds between consecutive attempts to close a file. Each close call costs multiple RPC round-trips to the Namenode, so setting this too low can cause a lot of load on the name node. If set to 0 or less, the sink will not attempt to close the file if the first attempt fails, and may leave the file open or with a ".tmp" extension. |
| serializer | TEXT | Other possible options include avro_event or the fully-qualified class name of an implementation of the EventSerializer.Builder interface. |
| serializer.* | | |
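As an illustrative sketch tying several of these properties together (all values are placeholders to adapt, not recommendations): a sink that rolls on size only, writing a compressed stream, which per the table requires an explicit codec:

a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://namenode/flume/webdata/
# Roll on size only (here 128 MB); disable time- and count-based rolling
a1.sinks.k1.hdfs.rollInterval = 0
a1.sinks.k1.hdfs.rollSize = 134217728
a1.sinks.k1.hdfs.rollCount = 0
# CompressedStream must be paired with hdfs.codeC
a1.sinks.k1.hdfs.fileType = CompressedStream
a1.sinks.k1.hdfs.codeC = gzip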

Example for an agent named a1:

a1.channels = c1
a1.sinks = k1
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = /flume/events/%y-%m-%d/%H%M/%S
a1.sinks.k1.hdfs.filePrefix = events-
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute
           

The above configuration will round down the timestamp to the last 10th minute. For example, an event with timestamp 11:54:34 AM, June 12, 2012 will cause the hdfs path to become /flume/events/2012-06-12/1150/00.

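If events may arrive without a timestamp header (and no interceptor adds one), a hedged alternative, per the hdfs.useLocalTimeStamp property above, is to resolve the time escapes against the agent's local clock:

# Use the agent's local time instead of the event's "timestamp" header
a1.sinks.k1.hdfs.useLocalTimeStamp = true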