Flume - Kafka日志平台整合
1. Flume介绍
Flume是Cloudera提供的一个高可用的,高可靠的,分布式的海量日志采集、聚合和传输的系统,Flume支持在日志系统中定制各类数据发送方,用于收集数据;同时,Flume提供对数据进行简单处理,并写到各种数据接受方(可定制)的能力。
agent
agent本身是一个Java进程,运行在日志收集节点—所谓日志收集节点就是服务器节点。
agent里面包含3个核心的组件:source—->channel—–>sink,类似生产者、仓库、消费者的架构。source
source组件是专门用来收集数据的,可以处理各种类型、各种格式的日志数据,包括avro、thrift、exec、jms、spooling directory、netcat、sequence generator、syslog、http、legacy、自定义。channel
source组件把数据收集来以后,临时存放在channel中,即channel组件在agent中是专门用来存放临时数据的——对采集到的数据进行简单的缓存,可以存放在memory、jdbc、file等等。sink
sink组件是用于把数据发送到目的地的组件,目的地包括hdfs、logger、avro、thrift、ipc、file、null、Hbase、solr、自定义。event
将传输的数据进行封装,是flume传输数据的基本单位,如果是文本文件,通常是一行记录,event也是事务的基本单位。event从source,流向channel,再到sink,本身为一个字节数组,并可携带headers(头信息)信息。event代表着一个数据的最小完整单元,从外部数据源来,向外部的目的地去。
2. Kafka Channel && Kafka Sink
2.1 Kafka channel
Kafka channel可以应用在多样的场景中:
- Flume source and sink:
可以为event提供一个高可靠性和高可用的channel; - Flume source and interceptor but no sink:
其他应用可以将Fluem event写入kafka topic中; - With Flume sink, but no source:
提供低延迟、高容错的方式将Fluem event从kafka中写入其他sink,例如:HDFS,HBase或者Solr。
- Kafka Channel配置
加粗部分为必填属性。
Property Name | Default | Description |
---|---|---|
type | – | The component type name, needs to be org.apache.flume.channel.kafka.KafkaChannel |
kafka.bootstrap.servers | – | List of brokers in the Kafka cluster used by the channel This can be a partial list of brokers, but we recommend at least two for HA. The format is comma separated list of hostname:port |
kafka.topic | flume-channel | Kafka topic which the channel will use |
kafka.consumer.group.id | flume | Consumer group ID the channel uses to register with Kafka. Multiple channels must use the same topic and group to ensure that when one agent fails another can get the data Note that having non-channel consumers with the same ID can lead to data loss. |
parseAsFlumeEvent | true | Expecting Avro datums with FlumeEvent schema in the channel. This should be true if Flume source is writing to the channel and false if other producers are writing into the topic that the channel is using. Flume source messages to Kafka can be parsed outside of Flume by using org.apache.flume.source.avro.AvroFlumeEvent provided by the flume-ng-sdk artifact |
migrateZookeeperOffsets | true | When no Kafka stored offset is found, look up the offsets in Zookeeper and commit them to Kafka. This should be true to support seamless Kafka client migration from older versions of Flume. Once migrated this can be set to false, though that should generally not be required. If no Zookeeper offset is found the kafka.consumer.auto.offset.reset configuration defines how offsets are handled. |
pollTimeout | 500 | The amount of time(in milliseconds) to wait in the “poll()” call of the consumer. https://kafka.apache.org/090/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html#poll(long) |
defaultPartitionId | – | Specifies a Kafka partition ID (integer) for all events in this channel to be sent to, unless overriden by partitionIdHeader. By default, if this property is not set, events will be distributed by the Kafka Producer’s partitioner - including by key if specified (or by a partitioner specified by kafka.partitioner.class). |
partitionIdHeader | – | When set, the producer will take the value of the field named using the value of this property from the event header and send the message to the specified partition of the topic. If the value represents an invalid partition the event will not be accepted into the channel. If the header value is present then this setting overrides defaultPartitionId. |
kafka.consumer.auto.offset.reset | latest | What to do when there is no initial offset in Kafka or if the current offset does not exist any more on the server (e.g. because that data has been deleted): earliest: automatically reset the offset to the earliest offset latest: automatically reset the offset to the latest offset none: throw exception to the consumer if no previous offset is found for the consumer’s group anything else: throw exception to the consumer. |
kafka.producer.security.protocol | PLAINTEXT | Set to SASL_PLAINTEXT, SASL_SSL or SSL if writing to Kafka using some level of security. See below for additional info on secure setup. |
kafka.consumer.security.protocol | PLAINTEXT | Same as kafka.producer.security.protocol but for reading/consuming from Kafka. |
more producer/consumer security props | If using SASL_PLAINTEXT, SASL_SSL or SSL refer to Kafka security for additional properties that need to be set on producer/consumer. |
2.2 Kafka Sink
Flume 支持将数据发布到一个kafka topic。目前支持Kafka 0.9.x版本。
- KafkaSink 配置
加粗部分为必填配置
Property | Name | Default Description |
---|---|---|
type | – | Must be set to org.apache.flume.sink.kafka.KafkaSink |
kafka.bootstrap.servers | – | List of brokers Kafka-Sink will connect to, to get the list of topic partitions This can be a partial list of brokers, but we recommend at least two for HA. The format is comma separated list of hostname:port |
kafka.topic | default-flume-topic | The topic in Kafka to which the messages will be published. If this parameter is configured, messages will be published to this topic. If the event header contains a “topic” field, the event will be published to that topic overriding the topic configured here. Arbitrary header substitution is supported, eg. %{header} is replaced with value of event header named “header”. (If using the substitution, it is recommended to set “auto.create.topics.enable” property of Kafka broker to true.) |
flumeBatchSize | 100 | How many messages to process in one batch. Larger batches improve throughput while adding latency. |
kafka.producer.acks | 1 | How many replicas must acknowledge a message before its considered successfully written. Accepted values are 0 (Never wait for acknowledgement), 1 (wait for leader only), -1 (wait for all replicas) Set this to -1 to avoid data loss in some cases of leader failure. |
useFlumeEventFormat | false | By default events are put as bytes onto the Kafka topic directly from the event body. Set to true to store events as the Flume Avro binary format. Used in conjunction with the same property on the KafkaSource or with the parseAsFlumeEvent property on the Kafka Channel this will preserve any Flume headers for the producing side. |
defaultPartitionId | – | Specifies a Kafka partition ID (integer) for all events in this channel to be sent to, unless overriden by partitionIdHeader. By default, if this property is not set, events will be distributed by the Kafka Producer’s partitioner - including by key if specified (or by a partitioner specified by kafka.partitioner.class). |
partitionIdHeader | – | When set, the sink will take the value of the field named using the value of this property from the event header and send the message to the specified partition of the topic. If the value represents an invalid partition, an EventDeliveryException will be thrown. If the header value is present then this setting overrides defaultPartitionId. |
allowTopicOverride | true | When set, the sink will allow a message to be produced into a topic specified by the topicHeader property (if provided). |
topicHeader | topic | When set in conjunction with allowTopicOverride will produce a message into the value of the header named using the value of this property. Care should be taken when using in conjunction with the Kafka Source topicHeader property to avoid creating a loopback. |
kafka.producer.security.protocol | PLAINTEXT | Set to SASL_PLAINTEXT, SASL_SSL or SSL if writing to Kafka using some level of security. See below for additional info on secure setup. |
more producer security props | If using SASL_PLAINTEXT, SASL_SSL or SSL refer to Kafka security for additional properties that need to be set on producer. | |
Other Kafka Producer Properties | – | These properties are used to configure the Kafka Producer. Any producer property supported by Kafka can be used. The only requirement is to prepend the property name with the prefix kafka.producer. For example: kafka.producer.linger.ms |
3. Flume - Kafka配置示例
切换到flume/conf目录下,编辑配置文件:
agent.sources = s1
agent.channels = c1
agent.sinks = k1
# Source Config
agent.sources.s1.type = spooldir
agent.sources.s1.channels = c1
agent.sources.s1.bind = 192.168.100.105
agent.sources.s1.port = 9696
agent.sources.s1.includePattern = *.log
agent.sources.s1.spoolDir = /home/usr/tomcat-test/logs
# Sink Config
## 输出到kafka
agent.sinks.s1.type = org.apache.flume.sink.kafka.KafkaSink
agent.sinks.s1.channel = c1
agent.sinks.s1.topic = test_tomcat_logs
agent.sinks.s1.serializer.class = kafka.serializer.StringEncoder
agent.sinks.s1.brokerList = 192.168.100.105:9092
# Channel Config
agent.channels.c1.type = memory
agent.channels.c1.keep-alive = 10
agent.channels.c1.capacity = 65535
很明显,由配置文件可以了解到:
我们需要读取目录:
/home/usr/tomcat-test/logs
下日志文件;flume连接到kafka的地址是
192.168.100.105:9092
,注意不要配置出错了;flume会将采集后的内容输出到Kafka topic 为
test_tomcat_logs
,所以我们启动zk,kafka后需要打开一个终端消费topic kafkatest的内容。这样就可以看到flume与kafka之间开始工作了。
4. 运行
运行flume直接切换到flume目录执行以下命令即可:
$ bin/flume-ng agent --conf conf --conf-file example.conf --name a1 -Dflume.root.logger=INFO,console
参考资料:
[1] Flume Doc:
http://flume.apache.org/FlumeUserGuide.html#kafka-channel
Flume - Kafka日志平台整合的更多相关文章
- flume+kafka+spark streaming整合
1.安装好flume2.安装好kafka3.安装好spark4.流程说明: 日志文件->flume->kafka->spark streaming flume输入:文件 flume输 ...
- Flume+Kafka+storm的连接整合
Flume-ng Flume是一个分布式.可靠.和高可用的海量日志采集.聚合和传输的系统. Flume的文档可以看http://flume.apache.org/FlumeUserGuide.html ...
- 【转】flume+kafka+zookeeper 日志收集平台的搭建
from:https://my.oschina.net/jastme/blog/600573 flume+kafka+zookeeper 日志收集平台的搭建 收藏 jastme 发表于 10个月前 阅 ...
- 基于Flume+Kafka+ Elasticsearch+Storm的海量日志实时分析平台(转)
0背景介绍 随着机器个数的增加.各种服务.各种组件的扩容.开发人员的递增,日志的运维问题是日渐尖锐.通常,日志都是存储在服务运行的本地机器上,使用脚本来管理,一般非压缩日志保留最近三天,压缩保留最近1 ...
- Flume+Kafka+Storm+Hbase+HDSF+Poi整合
Flume+Kafka+Storm+Hbase+HDSF+Poi整合 需求: 针对一个网站,我们需要根据用户的行为记录日志信息,分析对我们有用的数据. 举例:这个网站www.hongten.com(当 ...
- Flume+Kafka+Storm整合
Flume+Kafka+Storm整合 1. 需求: 有一个客户端Client可以产生日志信息,我们需要通过Flume获取日志信息,再把该日志信息放入到Kafka的一个Topic:flume-to-k ...
- Flume+Kafka整合
脚本生产数据---->flume采集数据----->kafka消费数据------->storm集群处理数据 日志文件使用log4j生成,滚动生成! 当前正在写入的文件在满足一定的数 ...
- 基于Kafka+ELK搭建海量日志平台
早在传统的单体应用时代,查看日志大都通过SSH客户端登服务器去看,使用较多的命令就是 less 或者 tail.如果服务部署了好几台,就要分别登录到这几台机器上看,等到了分布式和微服务架构流行时代,一 ...
- 日志=>flume=>kafka=>spark streaming=>hbase
日志=>flume=>kafka=>spark streaming=>hbase 日志部分 #coding=UTF-8 import random import time ur ...
随机推荐
- 《android开发艺术探索》读书笔记(十三)--综合技术
接上篇<android开发艺术探索>读书笔记(十二)--Bitmap的加载和Cache No1: 使用CrashHandler来获取应用的crash信息 No2: 在Android中单个d ...
- HDU - 1043 A* + 康托 [kuangbin带你飞]专题二
这题我第一次用的bfs + ELFhash,直接TLE,又换成bfs + 康托还是TLE,5000ms都过不了!!我一直调试,还是TLE,我才发觉应该是方法的问题. 今天早上起床怒学了一波A*算法,因 ...
- 聊聊JavaScript-闭包
今天聊聊闭包,网上五花八门的定义和解释很多很多,是不是搞得你很懵逼:每次看闭包,都不同,本来自己懂,看完别人的之后就开始怀疑自己了.在我看来,闭包简单的说就是函数里面套函数,再往大了说就是我函数外面想 ...
- SpringBoot+Mybatis+PageHelper简化分页实现
前言 经过一段时间的测试和修改PageHelper插件逐渐走到了让我觉得靠谱的时候,它功能的就是简化分页的实现,让分页不需要麻烦的多写很多重复的代码. 已经加入我的github模版中:https:// ...
- sqlserver中select造成死锁
死锁过程: select语句使用非聚族索引查询产量信息,会对非聚族索引添加共享锁,由于非聚族索引上没有select的全部数据列,(所以会有书签查找出现,)需要查询产量表.查询产量表时,需要对产量表数据 ...
- CSS3之background-clip
1.属性简介 background-clip:padding|border|content|text|!important 2.兼容性 (1)IE6.7.8不兼容 (2)火狐3.0以上兼容 (3)Ch ...
- VxWorks启动过程详解(上)
vxworks有三种映像: VxWorks Image的文件类型有三种 Loadable Images:由Boot-ROM引导通过网口或串口下载到RAM ROM-based Images(压缩/没有压 ...
- java.sql.SQLException:Column Index out of range,0<1
1.错误描述 java.sql.SQLException:Column Index out of range,0<1 2.错误原因 try { Class.forName("com.m ...
- Flex中利用单选按钮切换柱状图横纵坐标以及描述
1.问题描述 一组单选按钮,有周和月之分,选择"周",柱状图横坐标显示的是周,纵坐标显示的是人数:选择"月",柱状图横坐标显示的月,纵坐标显示的是比率. 2.演 ...
- 多线程下不重复读取SQL Server 表的数据
在进行一些如发送短信.邮件的业务时,我们经常会使用一个表来存储待发送的数据,由后台多个线程不断的从表中读取待发送的数据进行发送,发送完成后再将数据转移到历史表中,这样保证待发送表的数据一般情况下不会太 ...