Flume NG Configuration in Detail
Configuration
Setting up an agent
Flume agent configuration is stored in a local configuration file. It is a text file that follows the Java properties file format. Configurations for one or more agents can be specified in the same configuration file. The configuration describes each source, sink and channel in an agent and how they are wired together to form data flows.
Configuring individual components
Each component (source, sink or channel) in the flow has a name, a type, and a set of properties specific to the instance. For example, an Avro source needs a hostname (or IP address) and a port number on which to receive data. A memory channel can have a maximum queue size ("capacity"), and an HDFS sink needs to know the file system URI, the path under which to create files, the frequency of file rotation ("hdfs.rollInterval"), and so on. All such component properties must be set in the properties file of the hosting Flume agent.
Wiring the pieces together
The agent needs to know which individual components to load and how they are connected in order to constitute the flow. This is done by listing the names of the sources, sinks and channels in the agent and then specifying the connecting channel for each source and sink. For example, consider a flow from an Avro source named avroWeb to an HDFS sink named hdfs-cluster1 via a JDBC channel named jdbc-channel. The configuration file would list these components, with jdbc-channel appearing as the shared channel of both the avroWeb source and the hdfs-cluster1 sink.
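A minimal sketch of such a wiring section in the properties file, assuming a hypothetical agent name agent_foo (the agent name is illustrative; the component names follow the example above):

# agent_foo is a hypothetical agent name; component names as in the example above
agent_foo.sources = avroWeb
agent_foo.sinks = hdfs-cluster1
agent_foo.channels = jdbc-channel
# wire the source and the sink to the shared channel
agent_foo.sources.avroWeb.channels = jdbc-channel
agent_foo.sinks.hdfs-cluster1.channel = jdbc-channel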
flume-ng command line options
Usage: ./flume-ng <command> [options]...

commands:
  help                    display this help text
  agent                   run a Flume agent
  avro-client             run an avro Flume client

global options:
  --conf,-c <conf>        use configs in <conf> directory
  --classpath,-C <cp>     append to the classpath
  --dryrun,-d             do not actually start Flume, just print the command
  -Dproperty=value        sets a JDK system property value

agent options:
  --conf-file,-f          specify a config file (required)
  --name,-n               the name of this agent (required)
  --help,-h               display help text

avro-client options:
  --host,-H <host>        hostname to which events will be sent (required)
  --port,-p <port>        port of the avro source (required)
  --filename,-F <file>    text file to stream to avro source [default: std input]
  --headerFile,-R <file>  headerFile containing headers as key/value pairs on each new line
  --help,-h               display help text

Note that if <conf> directory is specified, then it is always included first in the classpath.
 
Starting an agent
An agent is started using the flume-ng shell script located in the bin directory of the Flume distribution. You need to specify the agent name and the configuration file on the command line:
$ bin/flume-ng agent -n foo -f conf/flume-conf.properties.template
Data ingestion
Flume supports a number of mechanisms to ingest data from external sources.
RPC
An Avro client included in the Flume distribution can send a given file to a Flume Avro source using the Avro RPC mechanism:
$ bin/flume-ng avro-client -H localhost -p 41414 -F /usr/logs/log.10
The above command sends the contents of /usr/logs/log.10 to the Flume source listening on port 41414.
Executing commands
There is also an exec source that executes a given command and consumes its output, one "line" at a time, where a line is text terminated by carriage return ('\r'), line feed ('\n'), or both. Flume does not support tail as a source directly, but a tail command can be wrapped in an exec source to stream a file.
Network streams
Flume supports the following mechanisms to read data from popular log stream types:
1) Avro
2) Syslog
3) Netcat
Flume deployment types
1) Multi-agent flow
2) Consolidation
3) Multiplexing the flow
Configuration
A Flume agent's configuration is read from a file that follows the Java properties file format.
Defining the flow
To define a flow within a single agent, you need to link the sources and sinks via a channel. You list the sources, sinks and channels for the given agent, and then point each source and sink to a channel. A source instance can specify multiple channels, but a sink instance can only specify one channel. The format is as follows:
# List the sources, sinks and channels for the agent
<agent>.sources = <Source>
<agent>.sinks = <Sink>
<agent>.channels = <Channel1> <Channel2>

# set channel for source
<agent>.sources.<Source>.channels = <Channel1> <Channel2> ...

# set channel for sink
<agent>.sinks.<Sink>.channel = <Channel1>
For example, an agent named weblog-agent receives data from an external Avro client and sends it via a memory channel to HDFS. The configuration file weblog.config could look like this:
weblog-agent.sources = avro-AppSrv-source
weblog-agent.sinks = hdfs-Cluster1-sink
weblog-agent.channels = mem-channel-1

# set channel for source
weblog-agent.sources.avro-AppSrv-source.channels = mem-channel-1

# set channel for sink
weblog-agent.sinks.hdfs-Cluster1-sink.channel = mem-channel-1
This will make events flow from avro-AppSrv-source to hdfs-Cluster1-sink through the memory channel mem-channel-1. When the agent is started with weblog.config as its configuration file, it will instantiate that flow.
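A minimal launch command for this flow, assuming the file above is saved as conf/weblog.config (the path is illustrative):

$ bin/flume-ng agent -n weblog-agent -f conf/weblog.config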
Configuring individual components
After defining the flow, you need to set the properties of each source, sink and channel. Each component's property values are set individually:
# Properties for sources
<agent>.sources.<Source>.<someProperty> = <someValue>
...

# Properties for channels
<agent>.channels.<Channel>.<someProperty> = <someValue>
...

# Properties for sinks
<agent>.sinks.<Sink>.<someProperty> = <someValue>
The "type" property must be set for each component so that Flume knows what kind of object it needs to instantiate. Each source, sink and channel type has its own set of required properties for it to function as intended; all of them must be set as needed. In the previous example, we have a flow from avro-AppSrv-source to hdfs-Cluster1-sink through the memory channel mem-channel-1. Here is an example showing the configuration of each of those components:
weblog-agent.sources = avro-AppSrv-source
weblog-agent.sinks = hdfs-Cluster1-sink
weblog-agent.channels = mem-channel-1

# set channel for sources, sinks ...

# properties of avro-AppSrv-source
weblog-agent.sources.avro-AppSrv-source.type = avro
weblog-agent.sources.avro-AppSrv-source.bind = localhost
weblog-agent.sources.avro-AppSrv-source.port = 10000

# properties of mem-channel-1
weblog-agent.channels.mem-channel-1.type = memory
weblog-agent.channels.mem-channel-1.capacity = 1000
weblog-agent.channels.mem-channel-1.transactionCapacity = 100

# properties of hdfs-Cluster1-sink
weblog-agent.sinks.hdfs-Cluster1-sink.type = hdfs
weblog-agent.sinks.hdfs-Cluster1-sink.hdfs.path = hdfs://namenode/flume/webdata/
...
Adding multiple flows in an agent
A single Flume agent can contain several independent flows. You can list multiple sources, sinks and channels in one configuration file, and these components can then be linked to form multiple flows:
# List the sources, sinks and channels for the agent
<agent>.sources = <Source1> <Source2>
<agent>.sinks = <Sink1> <Sink2>
<agent>.channels = <Channel1> <Channel2>
Then you can link the sources and sinks to their corresponding channels to set up two different flows. For example, if you need to set up two flows in the weblog agent, one going from an external Avro client to HDFS and another from the output of a tail command to an Avro sink, here is a configuration that does that:
# List the sources, sinks and channels in the agent
weblog-agent.sources = avro-AppSrv-source1 exec-tail-source2
weblog-agent.sinks = hdfs-Cluster1-sink1 avro-forward-sink2
weblog-agent.channels = mem-channel-1 jdbc-channel-2

## Flow-1 configuration
weblog-agent.sources.avro-AppSrv-source1.channels = mem-channel-1
weblog-agent.sinks.hdfs-Cluster1-sink1.channel = mem-channel-1

## Flow-2 configuration
weblog-agent.sources.exec-tail-source2.channels = jdbc-channel-2
weblog-agent.sinks.avro-forward-sink2.channel = jdbc-channel-2
Configuring a multi-agent flow
To set up a multi-tier flow, you need an Avro sink on the first hop pointing to the Avro source of the next hop. This makes the first Flume agent forward events to the next Flume agent. For example, if you periodically send files (one event per file) using an avro-client to a local Flume agent, this local agent can forward the events to another agent that has the storage attached.
## weblog agent config
# List sources, sinks and channels in the agent
weblog-agent.sources = avro-AppSrv-source
weblog-agent.sinks = avro-forward-sink
weblog-agent.channels = jdbc-channel

# define the flow
weblog-agent.sources.avro-AppSrv-source.channels = jdbc-channel
weblog-agent.sinks.avro-forward-sink.channel = jdbc-channel

# avro sink properties
weblog-agent.sinks.avro-forward-sink.type = avro
weblog-agent.sinks.avro-forward-sink.hostname = 10.1.1.100
weblog-agent.sinks.avro-forward-sink.port = 10000

# configure other pieces
...
## hdfs-agent config
# List sources, sinks and channels in the agent
hdfs-agent.sources = avro-collection-source
hdfs-agent.sinks = hdfs-sink
hdfs-agent.channels = mem-channel

# define the flow
hdfs-agent.sources.avro-collection-source.channels = mem-channel
hdfs-agent.sinks.hdfs-sink.channel = mem-channel

# avro source properties
hdfs-agent.sources.avro-collection-source.type = avro
hdfs-agent.sources.avro-collection-source.bind = 10.1.1.100
hdfs-agent.sources.avro-collection-source.port = 10000

# configure other pieces
...
Here we link the avro-forward-sink of the weblog-agent to the avro-collection-source of the hdfs-agent. The end result is that events coming from the external app server source are eventually stored in HDFS.
Fan-out flow
Flume supports fanning out the flow from one source to multiple channels. There are two modes of fan out: replicating and multiplexing. In the replicating flow, an event is sent to all the configured channels. In the multiplexing case, an event is sent to only a subset of the qualifying channels. To fan out the flow, you specify the list of channels for a source and the policy for fanning it out. This is done by adding a channel "selector", which can be either replicating or multiplexing, and then specifying the selection rules if it is multiplexing. If you don't specify a selector, it is replicating by default:
# List the sources, sinks and channels for the agent
<agent>.sources = <Source1>
<agent>.sinks = <Sink1> <Sink2>
<agent>.channels = <Channel1> <Channel2>

# set list of channels for source (separated by space)
<agent>.sources.<Source1>.channels = <Channel1> <Channel2>

# set channel for sinks
<agent>.sinks.<Sink1>.channel = <Channel1>
<agent>.sinks.<Sink2>.channel = <Channel2>

<agent>.sources.<Source1>.selector.type = replicating
The multiplexing selector has a further set of properties to bifurcate the flow. It requires a mapping from an event attribute to a set of channels. The selector checks the configured header in each event; if its value matches one of the mapped values, the event is sent to all the channels mapped to that value. If there is no match, the event is sent to the channels configured as the default:
# Mapping for multiplexing selector
<agent>.sources.<Source1>.selector.type = multiplexing
<agent>.sources.<Source1>.selector.header = <someHeader>
<agent>.sources.<Source1>.selector.mapping.<Value1> = <Channel1>
<agent>.sources.<Source1>.selector.mapping.<Value2> = <Channel1> <Channel2>
<agent>.sources.<Source1>.selector.mapping.<Value3> = <Channel2>
...

<agent>.sources.<Source1>.selector.default = <Channel2>
The mapping allows the channels for different values to overlap, and the default can contain any number of channels. The following example has a single flow that is multiplexed to two paths. The agent has a single Avro source and two channels linked to two sinks:
# List the sources, sinks and channels in the agent
weblog-agent.sources = avro-AppSrv-source1
weblog-agent.sinks = hdfs-Cluster1-sink1 avro-forward-sink2
weblog-agent.channels = mem-channel-1 jdbc-channel-2

# set channels for source
weblog-agent.sources.avro-AppSrv-source1.channels = mem-channel-1 jdbc-channel-2

# set channel for sinks
weblog-agent.sinks.hdfs-Cluster1-sink1.channel = mem-channel-1
weblog-agent.sinks.avro-forward-sink2.channel = jdbc-channel-2

# multiplexing selector
weblog-agent.sources.avro-AppSrv-source1.selector.type = multiplexing
weblog-agent.sources.avro-AppSrv-source1.selector.header = State
weblog-agent.sources.avro-AppSrv-source1.selector.mapping.CA = mem-channel-1
weblog-agent.sources.avro-AppSrv-source1.selector.mapping.AZ = jdbc-channel-2
weblog-agent.sources.avro-AppSrv-source1.selector.mapping.NY = mem-channel-1 jdbc-channel-2
weblog-agent.sources.avro-AppSrv-source1.selector.default = mem-channel-1
The selector checks a header called "State". If the value is "CA", the event is sent to mem-channel-1; if it is "AZ", to jdbc-channel-2; and if it is "NY", to both. If the "State" header is not set or matches none of the three values, the event goes to mem-channel-1, which is designated as the default.
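To exercise these multiplexing rules from the command line, you can attach the State header to events sent with the avro-client, using the --headerFile option listed earlier. A minimal sketch, assuming a hypothetical headers file named state.headers (one key/value pair per line, as noted in the option description):

# contents of state.headers (hypothetical file)
State = CA

$ bin/flume-ng avro-client -H localhost -p 41414 -R state.headers -F /usr/logs/log.10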
Flume Sources (Source)
Avro Source
Listens on an Avro port and receives events from external Avro client streams. When paired with the built-in AvroSink on another (previous hop) Flume agent, it can create tiered collection topologies.
| Property Name | Default | Description |
| --- | --- | --- |
| type | - | The component type name, needs to be avro |
| bind | - | hostname or IP address to listen on |
| port | - | Port # to bind to |
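A minimal sketch of an Avro source definition, assuming agent a1 and channel ch1 (illustrative names):

# illustrative Avro source for agent a1
a1.sources = avro-src
a1.channels = ch1
a1.sources.avro-src.type = avro
a1.sources.avro-src.bind = 0.0.0.0
a1.sources.avro-src.port = 41414
a1.sources.avro-src.channels = ch1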
Exec Source
This source runs a given Unix command on start-up and expects that process to continuously produce data on standard out (stderr is simply discarded, unless logStdErr = true). If the process exits for any reason, the source also exits and will produce no further data.
| Property Name | Default | Description |
| --- | --- | --- |
| type | - | The component type name, needs to be exec |
| command | - | The command to execute |
| restartThrottle | 10000 | Amount of time (in millis) to wait before attempting a restart |
| restart | false | Whether the executed cmd should be restarted if it dies |
| logStdErr | false | Whether the command's stderr should be logged |
Note: With ExecSource there is no guarantee that the client will be notified if an event fails to be put into the channel. In such cases the data will be lost.

Example:
exec-agent.sources = tail
exec-agent.channels = memoryChannel-1
exec-agent.sinks = logger

exec-agent.sources.tail.type = exec
exec-agent.sources.tail.command = tail -f /var/log/secure
NetCat Source
A netcat-like source that listens on a given port and turns each line of text into an event.

| Property Name | Default | Description |
| --- | --- | --- |
| type | - | The component type name, needs to be netcat |
| bind | - | Host name or IP address to bind to |
| port | - | Port # to bind to |
| max-line-length | 512 | Max line length per event body (in bytes) |
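A minimal sketch of a netcat source, assuming agent a1 and channel ch1 (illustrative names):

# illustrative netcat source for agent a1
a1.sources.nc-src.type = netcat
a1.sources.nc-src.bind = 0.0.0.0
a1.sources.nc-src.port = 6666
a1.sources.nc-src.channels = ch1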
Sequence Generator Source
A source that generates events from an incrementing sequence counter, mainly useful for testing.

| Property Name | Default | Description |
| --- | --- | --- |
| type | - | The component type name, needs to be seq |
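A minimal sketch, assuming agent a1 and channel ch1 (illustrative names):

# illustrative sequence generator source for agent a1
a1.sources.seq-src.type = seq
a1.sources.seq-src.channels = ch1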
Syslog Sources
Reads syslog data and generates Flume events. The UDP source treats an entire message as a single event. The TCP source creates a new event for each string of characters separated by a newline ('\n').
Syslog TCP
| Property Name | Default | Description |
| --- | --- | --- |
| type | - | The component type name, needs to be syslogtcp |
| host | - | Host name or IP address to bind to |
| port | - | Port # to bind to |
For example, a syslog TCP source:
syslog-agent.sources = syslog
syslog-agent.channels = memoryChannel-1
syslog-agent.sinks = logger

syslog-agent.sources.syslog.type = syslogtcp
syslog-agent.sources.syslog.port = 5140
syslog-agent.sources.syslog.host = localhost
Syslog UDP
| Property Name | Default | Description |
| --- | --- | --- |
| type | - | The component type name, needs to be syslogudp |
| host | - | Host name or IP address to bind to |
| port | - | Port # to bind to |
For example, a syslog UDP source:
syslog-agent.sources = syslog
syslog-agent.channels = memoryChannel-1
syslog-agent.sinks = logger

syslog-agent.sources.syslog.type = syslogudp
syslog-agent.sources.syslog.port = 5140
syslog-agent.sources.syslog.host = localhost
Avro Legacy
| Property Name | Default | Description |
| --- | --- | --- |
| type | - | The component type name, needs to be org.apache.flume.source.avroLegacy.AvroLegacySource |
| host | - | The hostname or IP address to bind to |
| port | - | The port # to listen on |
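A minimal sketch of a legacy Avro source, assuming agent a1 and channel ch1 (illustrative names):

# illustrative legacy Avro source for agent a1
a1.sources.legacy-src.type = org.apache.flume.source.avroLegacy.AvroLegacySource
a1.sources.legacy-src.host = 0.0.0.0
a1.sources.legacy-src.port = 41414
a1.sources.legacy-src.channels = ch1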
Thrift Legacy
| Property Name | Default | Description |
| --- | --- | --- |
| type | - | The component type name, needs to be org.apache.flume.source.thriftLegacy.ThriftLegacySource |
| host | - | The hostname or IP address to bind to |
| port | - | The port # to listen on |
Note: The reliability semantics of Flume 1.x differ from those of 0.9.x. The E2E or DFO mode of a 0.9.x agent is not supported by the legacy sources; the only supported 0.9.x mode is Best Effort.
HDFS Sink
This sink writes events into HDFS. The following escape sequences can be used in the HDFS path:

| Alias | Description |
| --- | --- |
| %{host} | host name stored in event header |
| %t | Unix time in milliseconds |
| %a | locale's short weekday name (Mon, Tue, ...) |
| %A | locale's full weekday name (Monday, Tuesday, ...) |
| %b | locale's short month name (Jan, Feb, ...) |
| %B | locale's long month name (January, February, ...) |
| %c | locale's date and time (Thu Mar 3 23:05:25 2005) |
| %d | day of month (01) |
| %D | date; same as %m/%d/%y |
| %H | hour (00..23) |
| %I | hour (01..12) |
| %j | day of year (001..366) |
| %k | hour (0..23) |
| %m | month (01..12) |
| %M | minute (00..59) |
| %P | locale's equivalent of am or pm |
| %s | seconds since 1970-01-01 00:00:00 UTC |
| %S | second (00..59) |
| %y | last two digits of year (00..99) |
| %Y | year (2010) |
| %z | +hhmm numeric timezone (for example, -0400) |
Files in use will have the extension ".tmp" appended to their names. Once a file is closed, this extension is removed, which makes it possible to exclude partially completed files in the directory.
| Name | Default | Description |
| --- | --- | --- |
| type | - | The component type name, needs to be hdfs |
| hdfs.path | - | HDFS directory path (eg hdfs://namenode/flume/webdata/) |
| hdfs.filePrefix | FlumeData | Name prefixed to files created by Flume in hdfs directory |
| hdfs.rollInterval | 30 | Number of seconds to wait before rolling current file |
| hdfs.rollSize | 1024 | File size to trigger roll (in bytes) |
| hdfs.rollCount | 10 | Number of events written to file before it rolled |
| hdfs.batchSize | 1 | number of events written to file before it flushed to HDFS |
| hdfs.txnEventMax | 100 | |
| hdfs.codeC | - | Compression codec. one of following: gzip, bzip2, lzo, snappy |
| hdfs.fileType | SequenceFile | File format - currently SequenceFile or DataStream |
| hdfs.maxOpenFiles | 5000 | |
| hdfs.writeFormat | - | "Text" or "Writable" |
| hdfs.appendTimeout | 1000 | |
| hdfs.callTimeout | 5000 | |
| hdfs.threadsPoolSize | 10 | |
| hdfs.kerberosPrincipal | "" | Kerberos user principal for accessing secure HDFS |
| hdfs.kerberosKeytab | "" | Kerberos keytab for accessing secure HDFS |
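A minimal sketch of an HDFS sink that uses the escape sequences above to bucket data by date, assuming agent a1 and channel ch1 (illustrative names; the date layout in hdfs.path is an example, not taken from the original):

# illustrative HDFS sink for agent a1
a1.sinks.hdfs-sink.type = hdfs
a1.sinks.hdfs-sink.channel = ch1
a1.sinks.hdfs-sink.hdfs.path = hdfs://namenode/flume/webdata/%Y-%m-%d
a1.sinks.hdfs-sink.hdfs.filePrefix = events
a1.sinks.hdfs-sink.hdfs.rollInterval = 300
a1.sinks.hdfs-sink.hdfs.fileType = DataStream

Note that time-based escape sequences are typically resolved from the event's timestamp header, so events reaching this sink are expected to carry one.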
Logger Sink
Logs events at INFO level. Typically useful for testing/debugging purposes.
| Property Name | Default | Description |
| --- | --- | --- |
| type | - | The component type name, needs to be logger |
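A minimal sketch, assuming agent a1 and channel ch1 (illustrative names):

# illustrative logger sink for agent a1
a1.sinks.log-sink.type = logger
a1.sinks.log-sink.channel = ch1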
Avro Sink
This sink supports Flume's tiered collection. Flume events sent to this sink are turned into Avro events and sent to the configured hostname/port pair. The events are taken from the configured channel in batches of the configured batch size.
| Property Name | Default | Description |
| --- | --- | --- |
| type | - | The component type name, needs to be avro |
| hostname | - | The hostname or IP address to bind to |
| port | - | The port # to listen on |
| batch-size | 100 | Number of events to batch together for send |
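A minimal sketch of an Avro sink forwarding events to a next-hop agent, assuming agent a1, channel ch1 and the remote address 10.1.1.100:41414 (all illustrative):

# illustrative Avro sink for agent a1
a1.sinks.avro-sink.type = avro
a1.sinks.avro-sink.channel = ch1
a1.sinks.avro-sink.hostname = 10.1.1.100
a1.sinks.avro-sink.port = 41414
a1.sinks.avro-sink.batch-size = 100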
IRC Sink
The IRC sink takes messages from the attached channel and relays them to configured IRC destinations.
| Property Name | Default | Description |
| --- | --- | --- |
| type | - | The component type name, needs to be irc |
| hostname | - | The hostname or IP address to connect to |
| port | 6667 | The port number of remote host to connect |
| nick | - | Nick name |
| user | - | User name |
| password | - | User password |
| chan | - | channel |
| name | - | |
| splitlines | - | (boolean) |
| splitchars | \n | line separator (if you were to enter the default value into the config file, then you would need to escape the backslash, like this: \\n) |
File Roll Sink
| Property Name | Default | Description |
| --- | --- | --- |
| type | - | The component type name, needs to be file_roll |
| sink.directory | - | |
| sink.rollInterval | 30 | |
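A minimal sketch of a file roll sink writing to a local directory, assuming agent a1, channel ch1 and the output directory /var/log/flume (all illustrative):

# illustrative file roll sink for agent a1
a1.sinks.file-sink.type = file_roll
a1.sinks.file-sink.channel = ch1
a1.sinks.file-sink.sink.directory = /var/log/flume
a1.sinks.file-sink.sink.rollInterval = 30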
Null Sink
Discards all events it receives from the channel.

| Property Name | Default | Description |
| --- | --- | --- |
| type | - | The component type name, needs to be null |
Custom Sink
A custom sink is your own implementation of the Sink interface. The custom sink's class and its dependencies must be included in the agent's classpath. The type of a custom sink is its fully qualified class name (FQCN).
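A minimal sketch, assuming a hypothetical implementation class com.example.MySink that is on the agent classpath, plus agent a1 and channel ch1 (all illustrative):

# com.example.MySink is a hypothetical custom Sink implementation
a1.sinks.custom-sink.type = com.example.MySink
a1.sinks.custom-sink.channel = ch1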
Flume Channels
Channels are the repositories where events are staged on an agent. Sources add events to a channel, and sinks remove events from it.
Memory Channel
Events are stored in an in-memory queue with a configurable maximum size. It is ideal for flows that need higher throughput and can tolerate losing the staged data if the agent fails.
| Property Name | Default | Description |
| --- | --- | --- |
| type | - | The component type name, needs to be memory |
| capacity | 100 | The max number of events stored in the channel |
| transactionCapacity | 100 | The max number of events stored in the channel per transaction |
| keep-alive | 3 | Timeout in seconds for adding or removing an event |
JDBC Channel
Events are stored in a database. The JDBC channel currently supports embedded Derby. This is a durable channel that is ideal for flows where recoverability is important.
| Property Name | Default | Description |
| --- | --- | --- |
| type | - | The component type name, needs to be jdbc |
| db.type | DERBY | Database vendor, needs to be DERBY |
| driver.class | org.apache.derby.jdbc.EmbeddedDriver | Class for vendor's JDBC driver |
| driver.url | (constructed from other properties) | JDBC connection URL |
| db.username | sa | User id for db connection |
| db.password | - | Password for db connection |
| connection.properties.file | - | JDBC Connection property file path |
| create.schema | true | If true, then creates db schema if not there |
| create.index | true | Create indexes to speed up lookups |
| create.foreignkey | true | |
| transaction.isolation | READ_COMMITTED | Isolation level for db session: READ_UNCOMMITTED, READ_COMMITTED, SERIALIZABLE, REPEATABLE_READ |
| maximum.connections | 10 | Max connections allowed to db |
| maximum.capacity | 0 (unlimited) | Max number of events in the channel |
| sysprop.* | - | DB Vendor specific properties |
| sysprop.user.home | - | Home path to store embedded Derby database |
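A minimal sketch of a JDBC channel, assuming agent a1 and the Derby data directory /var/lib/flume (both illustrative); properties not listed fall back to the defaults in the table above:

# illustrative JDBC channel for agent a1
a1.channels = jdbc-ch
a1.channels.jdbc-ch.type = jdbc
a1.channels.jdbc-ch.sysprop.user.home = /var/lib/flume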
Recoverable Memory Channel

| Property Name | Default | Description |
| --- | --- | --- |
| type | - | The component type name, needs to be org.apache.flume.channel.recoverable.memory.RecoverableMemoryChannel |
| wal.dataDir | ${user.home}/.flume/recoverable-memory-channel | |
| wal.rollSize | (0x04000000) | Max size (in bytes) of a single file before we roll |
| wal.minRetentionPeriod | 300000 | Min amount of time (in millis) to keep a log |
| wal.workerInterval | 60000 | How often (in millis) the background worker checks for old logs |
| wal.maxLogsSize | (0x20000000) | Total amt (in bytes) of logs to keep, excluding the current log |
File Channel
Note: The file channel is not yet available.
| Property Name | Default | Description |
| --- | --- | --- |
| type | - | The component type name, needs to be org.apache.flume.channel.file.FileChannel |
Pseudo Transaction Memory Channel
Note: This channel is intended for testing purposes only and is not meant for production use.
| Property Name | Default | Description |
| --- | --- | --- |
| type | - | The component type name, needs to be org.apache.flume.channel.PseudoTxnMemoryChannel |
| capacity | 50 | The max number of events stored in the channel |
| keep-alive | 3 | Timeout in seconds for adding or removing an event |