Apache Flume Study Notes
# Download Flume from http://flume.apache.org/download.html
#############################################
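# Overview: Flume is a highly available, highly reliable, distributed system for collecting,
# aggregating, and moving large volumes of log data. It was originally developed by Cloudera
# and is now an Apache project.
# At its core, Flume collects data from a source and delivers it to a destination (sink).
# To guarantee delivery, events are buffered in a channel before they reach the sink, and they
# are removed from the channel only after they have actually arrived at the sink.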
#############################################
# Upload the archive to the Linux host, then unpack it:
tar zxvf apache-flume-1.8.-bin.tar.gz
rm -rf apache-flume-1.8.-bin.tar.gz
mv apache-flume-1.8.-bin/ flume-1.8.
cd flume-1.8./conf/
cp flume-env.sh.template flume-env.sh
vim flume-env.sh
# In flume-env.sh, set the correct JDK path:
export JAVA_HOME=/usr/local/src/jdk1..0_161
########################################
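# A minimal sketch (bash) for checking the JDK path before starting an agent.
# The function name check_java_home is made up for these notes and is not part of Flume;
# it only assumes that a usable JDK home contains an executable bin/java.

```shell
# Return 0 if the given directory looks like a usable JDK/JRE home,
# i.e. it contains an executable bin/java; return non-zero otherwise.
check_java_home() {
    local home="$1"
    [ -n "$home" ] && [ -x "$home/bin/java" ]
}

# Usage: validate the current JAVA_HOME before launching flume-ng.
if check_java_home "${JAVA_HOME:-}"; then
    echo "JAVA_HOME looks valid: $JAVA_HOME"
else
    echo "JAVA_HOME is unset or has no executable bin/java" >&2
fi
```

# If the check fails, fix the export line in conf/flume-env.sh before going further.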
# Receive data from a network port and sink it to the logger
########################################
# Collection config file: netcat-logger.conf
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the sources
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port =
# Describe the sinks
a1.sinks.k1.type = logger
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity =
a1.channels.c1.transactionCapacity =
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
########################################
# End of collection config file
# Start command
bin/flume-ng agent --conf conf/ --conf-file conf/netcat-logger.conf --name a1 -Dflume.root.logger=INFO,console
# The agent then reports that it is listening: Created serverSocket:sun.nio.ch.ServerSocketChannelImpl[/127.0.0.1:]
# Test from another terminal:
yum install -y telnet
# On a successful connection telnet prints: Connected to localhost. Escape character is '^]'.
telnet localhost
# Type a line of text and check whether the listening terminal receives it:
hello, world.
# Listening side: -- ::, (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:)] Event: { headers:{} body: 68 65 6C 6C 6F 2C 20 77 6F 72 6C 64 2E 0D hello, world.. }
##########################################
# Collect a directory into HDFS. Start HDFS first.
##################################
# File spooldir-hdfs.cnf:
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
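a1.channels = c1
# Describe/configure the sources
# Note: never place a file into the spooling directory under a name that was already used; on a name clash the source raises an error and stops.
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /root/logs
a1.sources.r1.fileHeader = true
# Describe the sinks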
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = /flume/events/%y-%m-%d/%H%M/
a1.sinks.k1.hdfs.filePrefix = events-
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue =
a1.sinks.k1.hdfs.roundUnit = minute
a1.sinks.k1.hdfs.rollInterval =
a1.sinks.k1.hdfs.rollSize =
a1.sinks.k1.hdfs.rollCount =
a1.sinks.k1.hdfs.batchSize =
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# Output file type: the default is SequenceFile; DataStream writes plain text
a1.sinks.k1.hdfs.fileType = DataStream
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity =
a1.channels.c1.transactionCapacity =
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
##################################
# Start command. Files already present in /root/logs are collected to HDFS immediately.
bin/flume-ng agent -c conf/ -f conf/spooldir-hdfs.cnf -n a1 -Dflume.root.logger=INFO,console
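# On success: -- ::, (lifecycleSupervisor--) [INFO - org.apache.flume.instrumentation.MonitoredCounterGroup.start(MonitoredCounterGroup.java:)] Component type: SOURCE, name: r1 started
# Create a file under /root/logs/ and the agent's terminal shows: Writer callback called.
# The file then appears on HDFS as: /flume/events/--//events-.
# Note: with spooldir, never place a file into the source directory /root/logs/ under a name that was already used; on a name clash the source raises an error and stops working.
##########################################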
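# One way to avoid the name clash described above is to pick an unused name before moving a file
# into the spooling directory. The bash sketch below is only an illustration: spool_put and the
# staging directory are made up for these notes, the .COMPLETED check assumes the source's default
# suffix for processed files, and the loop is not safe against several writers running at once.
# The staging directory should sit on the same filesystem as the spooling directory so that the
# final mv is a rename and the source never sees a half-written file.

```shell
# spool_put SRC STAGING SPOOL
# Copy SRC into SPOOL under a name that is not in use there, and print the final path.
spool_put() {
    local src="$1" staging="$2" spool="$3"
    local base dest tmp n=0
    base=$(basename "$src")
    dest="$spool/$base"
    # Skip names taken by pending files or by files already marked as processed.
    while [ -e "$dest" ] || [ -e "$dest.COMPLETED" ]; do
        n=$((n + 1))
        dest="$spool/$base.$n"
    done
    # Write the copy outside the spooling directory, then rename it into place.
    tmp=$(mktemp "$staging/spool_put.XXXXXX") || return 1
    cp "$src" "$tmp" && chmod 644 "$tmp" && mv "$tmp" "$dest" && echo "$dest"
}

# Example call (hypothetical paths):
# spool_put /var/log/app.log /root/staging /root/logs
```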
### Incrementally collect a file whose content keeps changing into HDFS
##########################################
# File tail-hdfs.cnf
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the sources
# The exec source runs the command below and turns each line it prints into an event.
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /root/logs/test.log
a1.sources.r1.channels = c1
# Describe the sinks
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = /flume/tailout/%y-%m-%d/%H%M/
a1.sinks.k1.hdfs.filePrefix = events-
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue =
a1.sinks.k1.hdfs.roundUnit = minute
a1.sinks.k1.hdfs.rollInterval =
a1.sinks.k1.hdfs.rollSize =
a1.sinks.k1.hdfs.rollCount =
a1.sinks.k1.hdfs.batchSize =
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# Output file type: the default is SequenceFile; DataStream writes plain text
a1.sinks.k1.hdfs.fileType = DataStream
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity =
a1.channels.c1.transactionCapacity =
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
##########################################
# Start command. Lines already at the end of /root/logs/test.log (whatever tail prints on startup) are collected to HDFS immediately.
bin/flume-ng agent -c conf -f conf/tail-hdfs.cnf -n a1 -Dflume.root.logger=INFO,console
# Simulate data being written continuously.
while true; do date >> /root/logs/test.log; sleep 1.5; done
##########################################
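# Load balancing
##########################################
# Use three machines in a two-tier Flume setup: the first machine collects events and sends them round-robin to the other two, which receive them and sink them to the target.
scp -r flume-1.8./ slave2:/usr/local/src/
scp -r flume-1.8./ slave3:/usr/local/src/
# slave1 is the first tier; slave2 and slave3 form the second tier.
################# First tier (slave1) config file: exec-avro.cnf
# agent1 name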
agent1.channels = c1
agent1.sources = r1
agent1.sinks = k1 k2
# set group
agent1.sinkgroups = g1
# set channel
agent1.channels.c1.type = memory
agent1.channels.c1.capacity =
agent1.channels.c1.transactionCapacity =
agent1.sources.r1.channels = c1
agent1.sources.r1.type = exec
agent1.sources.r1.command = tail -F /root/logs/.log
# set sink1
agent1.sinks.k1.channel = c1
agent1.sinks.k1.type = avro
agent1.sinks.k1.hostname = slave2
agent1.sinks.k1.port =
# set sink2
agent1.sinks.k2.channel = c1
agent1.sinks.k2.type = avro
agent1.sinks.k2.hostname = slave3
agent1.sinks.k2.port =
# set sink group
agent1.sinkgroups.g1.sinks = k1 k2
# set load balancing
agent1.sinkgroups.g1.processor.type = load_balance
agent1.sinkgroups.g1.processor.backoff = true
agent1.sinkgroups.g1.processor.selector = round_robin
agent1.sinkgroups.g1.processor.selector.maxTimeOut =
############# end ##############
################# Second tier (slave2) config file: avro-logger.cnf
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the sources
a1.sources.r1.type = avro
a1.sources.r1.channels = c1
a1.sources.r1.bind = slave2
a1.sources.r1.port =
# Describe the sinks
a1.sinks.k1.type = logger
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity =
a1.channels.c1.transactionCapacity =
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
############# slave2 end ##############
################# Second tier (slave3) config file: avro-logger.cnf; the only change is the bind host, slave3
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the sources
a1.sources.r1.type = avro
a1.sources.r1.channels = c1
a1.sources.r1.bind = slave3
a1.sources.r1.port =
# Describe the sinks
a1.sinks.k1.type = logger
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity =
a1.channels.c1.transactionCapacity =
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
############# slave3 end ##############
## First start the second-tier agents on slave2 and slave3
bin/flume-ng agent -c conf -f conf/avro-logger.cnf -n a1 -Dflume.root.logger=INFO,console
## Then start the first-tier agent on slave1
bin/flume-ng agent -c conf -f conf/exec-avro.cnf -n agent1 -Dflume.root.logger=INFO,console
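# After a successful start, the second-tier terminals show something like: CONNECTED: /192.168.112.11:
# When the first tier is later stopped, the second tier shows something like: /192.168.112.11: disconnected.
# Simulate data being written. Only the second tier shows collection activity; the first tier prints nothing.
while true; do date >>/root/logs/.log;sleep ;done
#############################################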
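# To make the round_robin selector with backoff in the load-balancing setup above more concrete,
# here is a conceptual bash sketch. It is not Flume's implementation: the names SINKS, BACKED_OFF,
# NEXT, PICKED and pick_round_robin are made up for these notes, and a real backoff also expires
# after a timeout, which this sketch leaves out. It needs bash 4 or later for the associative array.

```shell
SINKS=(k1 k2)              # sink names, in the order they are tried
declare -A BACKED_OFF=()   # sink name -> 1 while the sink is treated as failed
NEXT=0                     # index of the sink to try next

# Set PICKED to the next usable sink in rotation.
# Returns 1 when every sink is currently backed off.
pick_round_robin() {
    local tries=0 candidate
    while [ "$tries" -lt "${#SINKS[@]}" ]; do
        candidate="${SINKS[NEXT]}"
        NEXT=$(( (NEXT + 1) % ${#SINKS[@]} ))
        tries=$((tries + 1))
        if [ -z "${BACKED_OFF[$candidate]:-}" ]; then
            PICKED="$candidate"
            return 0
        fi
    done
    return 1
}

# Example: three picks in a row alternate between the two sinks.
# pick_round_robin && echo "$PICKED"
```

# Marking a sink with BACKED_OFF[k2]=1 makes the rotation skip it, which mirrors how traffic keeps
# flowing to slave2 while slave3 is down.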
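# Failover
# Only one back-end machine is working at any given time.
#############################################
# Again use three machines in a two-tier Flume setup: the first machine collects events and sends them to one of the other two; the sink with the highest priority receives the data.
# If that machine goes down, the other one takes over automatically.
scp -r flume-1.8./ slave2:/usr/local/src/
scp -r flume-1.8./ slave3:/usr/local/src/
# slave1 is the first tier; slave2 and slave3 form the second tier.
################# First tier (slave1) config file: exec-avro.cnf
# agent1 name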
agent1.channels = c1
agent1.sources = r1
agent1.sinks = k1 k2
# set group
agent1.sinkgroups = g1
# set channel
agent1.channels.c1.type = memory
agent1.channels.c1.capacity =
agent1.channels.c1.transactionCapacity =
agent1.sources.r1.channels = c1
agent1.sources.r1.type = exec
agent1.sources.r1.command = tail -F /root/logs/.log
# set sink1
agent1.sinks.k1.channel = c1
agent1.sinks.k1.type = avro
agent1.sinks.k1.hostname = slave2
agent1.sinks.k1.port =
# set sink2
agent1.sinks.k2.channel = c1
agent1.sinks.k2.type = avro
agent1.sinks.k2.hostname = slave3
agent1.sinks.k2.port =
# set sink group
agent1.sinkgroups.g1.sinks = k1 k2
# set failover
agent1.sinkgroups.g1.processor.type = failover
agent1.sinkgroups.g1.processor.priority.k1 =
agent1.sinkgroups.g1.processor.priority.k2 =
agent1.sinkgroups.g1.processor.maxpenalty =
############# end ##############
################# Second tier (slave2) config file: avro-logger.cnf
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the sources
a1.sources.r1.type = avro
a1.sources.r1.channels = c1
a1.sources.r1.bind = slave2
a1.sources.r1.port =
# Describe the sinks
a1.sinks.k1.type = logger
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity =
a1.channels.c1.transactionCapacity =
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
############# slave2 end ##############
################# Second tier (slave3) config file: avro-logger.cnf; the only change is the bind host, slave3
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the sources
a1.sources.r1.type = avro
a1.sources.r1.channels = c1
a1.sources.r1.bind = slave3
a1.sources.r1.port =
# Describe the sinks
a1.sinks.k1.type = logger
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity =
a1.channels.c1.transactionCapacity =
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
############# slave3 end ##############
## First start the second-tier agents on slave3 and slave2
bin/flume-ng agent -c conf -f conf/avro-logger.cnf -n a1 -Dflume.root.logger=INFO,console
## Then start the first-tier agent on slave1
bin/flume-ng agent -c conf -f conf/exec-avro.cnf -n agent1 -Dflume.root.logger=INFO,console
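# After a successful start, the second-tier terminals show something like: CONNECTED: /192.168.112.11:
# When the first tier is later stopped, the second tier shows something like: /192.168.112.11: disconnected.
# Simulate data being written. Only slave2 in the second tier shows collection activity; the first tier prints nothing, and slave3 stays on standby.
while true; do date >>/root/logs/.log;sleep ;done
# As soon as slave2 stops, slave3 takes over automatically and keeps receiving.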
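# The failover behaviour seen above (slave2 works, slave3 takes over when slave2 stops) comes down
# to "use the live sink with the highest priority". The bash sketch below illustrates that rule only;
# it is not Flume's code. pick_failover, the name:priority:state argument format, and the priority
# values 10 and 5 in the example are all made up for these notes.

```shell
# pick_failover "name:priority:state" ...
# Print the name of the sink that is "up" and has the largest priority value.
# Returns 1 (and prints nothing) when no sink is up.
pick_failover() {
    local best="" best_prio="" entry name prio state
    for entry in "$@"; do
        IFS=: read -r name prio state <<< "$entry"
        [ "$state" = "up" ] || continue
        if [ -z "$best" ] || [ "$prio" -gt "$best_prio" ]; then
            best="$name"
            best_prio="$prio"
        fi
    done
    [ -n "$best" ] && echo "$best"
}

# Example: while both sinks are up the higher priority wins; once k1 is down, k2 is chosen.
pick_failover k1:10:up k2:5:up
pick_failover k1:10:down k2:5:up
```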
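An additional exercise:
################################################################
# Case study:
# Two log servers, A and B, produce logs in real time; the main types are access.log, nginx.log, and web.log.
# Requirement: collect the three kinds of logs from A and B onto machine C, then store them in HDFS,
# with each log type placed in its own HDFS directory.
################################################################
### slave1, slave2, and slave3 play the roles of A, B, and C respectively
### The config file for A and B, exec_source_avro_sink.conf, is essentially the same on both; only the hostname differs
# Name the components on this agent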
a1.sources = r1 r2 r3
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /root/logs1/access.log
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = static
a1.sources.r1.interceptors.i1.key = type
a1.sources.r1.interceptors.i1.value = access
a1.sources.r2.type = exec
a1.sources.r2.command = tail -F /root/logs1/nginx.log
a1.sources.r2.interceptors = i2
a1.sources.r2.interceptors.i2.type = static
a1.sources.r2.interceptors.i2.key = type
a1.sources.r2.interceptors.i2.value = nginx
a1.sources.r3.type = exec
a1.sources.r3.command = tail -F /root/logs1/web.log
a1.sources.r3.interceptors = i3
a1.sources.r3.interceptors.i3.type = static
a1.sources.r3.interceptors.i3.key = type
a1.sources.r3.interceptors.i3.value = web
# Describe the sink: send events to the next-tier host
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = slave3
a1.sinks.k1.port =
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity =
a1.channels.c1.transactionCapacity =
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sources.r2.channels = c1
a1.sources.r3.channels = c1
a1.sinks.k1.channel = c1
### end ###
### Config file for C: avro_source_hdfs_sink.conf
# Define the names of the agent's source, channel, and sink
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Define the source
a1.sources.r1.type = avro
a1.sources.r1.bind = slave3
a1.sources.r1.port =
# Add a timestamp interceptor
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = org.apache.flume.interceptor.TimestampInterceptor$Builder
# Define the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity =
a1.channels.c1.transactionCapacity =
# Define the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://master:9000/source/logs/%{type}/%Y%m%d
a1.sinks.k1.hdfs.filePrefix = events
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.writeFormat = Text
# Timestamp source: left disabled because the timestamp interceptor above already sets the timestamp header
# a1.sinks.k1.hdfs.useLocalTimeStamp = true
# Do not roll files by event count
a1.sinks.k1.hdfs.rollCount =
# Do not roll files by time
a1.sinks.k1.hdfs.rollInterval =
# Roll files by size
a1.sinks.k1.hdfs.rollSize =
# Number of events written to HDFS per batch
a1.sinks.k1.hdfs.batchSize =
# Number of threads Flume uses for HDFS operations (create, write, etc.)
a1.sinks.k1.hdfs.threadsPoolSize =
# Timeout for HDFS operations
a1.sinks.k1.hdfs.callTimeout =
# Wire up source, channel, and sink
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
### end ###
## First start the second-tier agent on slave3 (machine C)
bin/flume-ng agent -c conf -f conf/avro_source_hdfs_sink.conf --name a1 -Dflume.root.logger=DEBUG,console
## Then start the first-tier agents on slave1 and slave2 (machines A and B)
bin/flume-ng agent -c conf -f conf/exec_source_avro_sink.conf --name a1 -Dflume.root.logger=DEBUG,console
# After a successful start, slave3 shows something like: CONNECTED: /192.168.112.11:
# Simulate data being written (run each loop in its own terminal).
while true; do echo "access.. `date` " >>/root/logs1/access.log;sleep ;done
while true; do echo "nginx.. `date` " >>/root/logs1/nginx.log;sleep ;done
while true; do echo "web.. `date` " >>/root/logs1/web.log;sleep ;done
# Check on HDFS that the logs were collected successfully.
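# Why each log type lands in its own directory: the static interceptor on A and B puts a "type"
# header on every event, and the HDFS sink path on C contains %{type} plus a date pattern. The bash
# sketch below only mimics that substitution so the resulting directory names are easy to predict.
# expand_hdfs_path is made up for these notes, it formats the date in UTC for repeatability
# (the sink itself may use a different time zone), and it relies on GNU date as found on Linux.

```shell
# expand_hdfs_path TEMPLATE TYPE EPOCH_SECONDS
# Replace %{type} with TYPE and %Y%m%d with the UTC date of EPOCH_SECONDS, then print the result.
expand_hdfs_path() {
    local template="$1" type="$2" epoch="$3"
    local type_key='%{type}' day_key='%Y%m%d' day
    day=$(date -u -d "@$epoch" +%Y%m%d) || return 1
    # Quoting the key variables makes bash treat them as literal text, not as patterns.
    template=${template//"$type_key"/$type}
    template=${template//"$day_key"/$day}
    printf '%s\n' "$template"
}

# Example: where an "access" event stamped right now would be written.
expand_hdfs_path 'hdfs://master:9000/source/logs/%{type}/%Y%m%d' access "$(date +%s)"
```

# With the three type values used in this exercise, the directories to look for under
# /source/logs are access, nginx, and web.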
Today's exercise is complete, and it worked.
一.三元表达式 三元运算符:就是if...else...的语法糖但是只支持只有一条if...else...语句的判断 原: cmd = input('cmd:') if cmd.isdigit(): ...