Flume + HDFS + Hive Log Collection System
Recently I have been responsible for product log instrumentation and collection at my company, and built a log collection system based on Flume + HDFS + Hive.
I. Log collection system architecture
Here is a simple diagram of the log collection system architecture. As it shows, Flume plays both the agent and the collector roles, while HDFS handles persistent storage of the data.
The setup described here is a demo: it uses a single flume_collector and stores data only in HDFS. A highly available log collection system would of course need multiple Flume collectors for load balancing and fault tolerance.

II. Log generation
1. log4j configuration: a new file is rolled every minute; if the file grows beyond 5 MB within that minute, another file is created.
<!-- Product analytics log, rolled by the minute -->
<RollingRandomAccessFile name="RollingFile_product_minute"
        fileName="${STAT_LOG_HOME}/${SERVER_NAME}_product.log"
        filePattern="${STAT_LOG_HOME}/${SERVER_NAME}_product.log.%d{yyyy-MM-dd-HH-mm}-%i">
    <PatternLayout charset="UTF-8"
            pattern="%d{yyyy-MM-dd HH:mm:ss.SSS} %level - %msg%xEx%n" />
    <Policies>
        <TimeBasedTriggeringPolicy interval="1" modulate="true" />
        <SizeBasedTriggeringPolicy size="${EVERY_FILE_SIZE}" />
    </Policies>
    <Filters>
        <ThresholdFilter level="INFO" onMatch="ACCEPT" onMismatch="NEUTRAL" />
    </Filters>
</RollingRandomAccessFile>
The rolled files are named as follows.
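Going by the filePattern above, a rolled file name would look like this (an illustrative example, not taken from the original listing; the server name is hypothetical):
myserver_product.log.2016-11-30-09-18-1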

2. Log content
Each record is JSON. The top-level fields, in order, are: tableName, logRequest, timestamp, statBody, logResponse, resultCode, resultFailMsg.
2016-11-30 09:18:21.916 INFO - {
    "tableName": "ReportView",
    "logRequest": {
        ***
    },
    "timestamp": 1480468701432,
    "statBody": {
        ***
    },
    "logResponse": {
        ***
    },
    "resultCode": 1,
    "resultFailMsg": ""
}
III. Flume configuration
For the virtual machine setup, see my earlier post: http://www.cnblogs.com/xckk/p/6000881.html
For the Hadoop setup, see another post of mine: http://www.cnblogs.com/xckk/p/6124553.html
The Flume layout here is:
centos1: flume agent
centos2: flume collector
1. Flume agent configuration (conf file)
a1.sources = skydataSource
a1.channels = skydataChannel
a1.sinks = skydataSink
a1.sources.skydataSource.type = spooldir
a1.sources.skydataSource.channels = skydataChannel
# log directory
a1.sources.skydataSource.spoolDir = /opt/flumeSpool
a1.sources.skydataSource.fileHeader = true
# Once a file is fully processed it gets a .COMPLETED suffix, and the live .log file is rolled every minute; both .log and .COMPLETED files are ignored here
a1.sources.skydataSource.ignorePattern=([^_]+)|(.*(\.log)$)|(.*(\.COMPLETED)$)
a1.sources.skydataSource.basenameHeader=true
a1.sources.skydataSource.deserializer.maxLineLength=
# Custom interceptor: splits the JSON source log into fields and adds a timestamp header for the hdfsSink downstream; the interceptor code is shown later
a1.sources.skydataSource.interceptors=i1
a1.sources.skydataSource.interceptors.i1.type=com.skydata.flume_interceptor.HiveLogInterceptor2$Builder
a1.sinks.skydataSink.type = avro
a1.sinks.skydataSink.channel = skydataChannel
a1.sinks.skydataSink.hostname = centos2
a1.sinks.skydataSink.port =
# If deflate compression is configured here, the flume collector side must configure the matching decompression
a1.sinks.skydataSink.compression-type=deflate
a1.channels.skydataChannel.type=memory
a1.channels.skydataChannel.capacity=
a1.channels.skydataChannel.transactionCapacity=
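To make the ignorePattern concrete: with the log4j setup above, the file currently being written and the files Flume has already consumed are skipped, and only rolled files are ingested. Illustrative names (not taken from the original post):
myserver_product.log                      (still being written - ignored)
myserver_product.log.2016-11-30-09-18-1   (rolled - picked up by the spooldir source)
myserver_product.log.2016-11-30-09-18-1.COMPLETED   (already processed - ignored)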
2. Flume collector configuration
a1.sources = avroSource
a1.channels = memChannel
a1.sinks = hdfsSink
a1.sources.avroSource.type = avro
a1.sources.avroSource.channels = memChannel
a1.sources.avroSource.bind=centos2
a1.sources.avroSource.port=
# Must match the compression setting on the flume agent side
a1.sources.avroSource.compression-type=deflate
a1.sinks.hdfsSink.type = hdfs
a1.sinks.hdfsSink.channel = memChannel
# skydata_hive_log is the Hive table, partitioned by day (dt=year-month-day)
a1.sinks.hdfsSink.hdfs.path=hdfs://centos1:9000/flume/skydata_hive_log/dt=%Y-%m-%d
a1.sinks.hdfsSink.hdfs.batchSize=
a1.sinks.hdfsSink.hdfs.fileType=DataStream
a1.sinks.hdfsSink.hdfs.writeFormat=Text
a1.sinks.hdfsSink.hdfs.rollSize=
a1.sinks.hdfsSink.hdfs.rollCount=
a1.sinks.hdfsSink.hdfs.rollInterval=
a1.channels.memChannel.type=memory
a1.channels.memChannel.capacity=
a1.channels.memChannel.transactionCapacity=
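Because the interceptor (section V) sets a timestamp header on each event, the hdfsSink can expand dt=%Y-%m-%d from that header. For an event processed on 2016-11-30, the data would land under a path like the following (the FlumeData file suffix is illustrative, not from the original listing):
hdfs://centos1:9000/flume/skydata_hive_log/dt=2016-11-30/FlumeData.1480468701432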
IV. Hive table creation and partitioning
1. Creating the Hive table
After running the CREATE TABLE statement below in Hive, a new skydata_hive_log directory appears under hdfs://centos1:9000/flume/ (the statement includes a LOCATION clause).
\u0001 is the delimiter Hive uses to split fields; opened with vim on Linux it shows up as ^A.
Because the raw logs are JSON, they have to be converted into \u0001-separated fields and partitioned by dt. This conversion is done by the custom Flume interceptor described later.
CREATE TABLE `skydata_hive_log`(
  `tableName` string,
  `logRequest` string,
  `timestamp` bigint,
  `statBody` string,
  `logResponse` string,
  `resultCode` int,
  `resultFailMsg` string
)
PARTITIONED BY (`dt` string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\u0001'
STORED AS
  INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 'hdfs://centos1:9000/flume/skydata_hive_log';
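Once partitions exist and data has landed, filtering on the partition column makes Hive scan only that day's directory. A minimal example query (the date and limit are illustrative):
SELECT tableName, resultCode, dt FROM skydata_hive_log WHERE dt = '2016-11-30' LIMIT 10;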
2. Hive table partitions
# Pre-create partitions for a window of days around today
# (the loop bounds were stripped in the original post; -10..10 is a placeholder)
for ((i=-10; i<=10; i++))
do
  dt=$(date -d "$(date +%F) ${i} days" +%Y-%m-%d)
  echo date=$dt
  hive -e "ALTER TABLE skydata_hive_log ADD PARTITION(dt='${dt}')" >> logs/init_skydata_hive_log.out 2>> logs/init_skydata_hive_log.err
done
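The created partitions can then be verified in Hive with a quick check (my addition, not part of the original post):
SHOW PARTITIONS skydata_hive_log;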
V. Custom Flume interceptor
Create a new Maven project; the code of the interceptor HiveLogInterceptor2 is as follows.
package com.skydata.flume_interceptor;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;
import org.apache.flume.interceptor.TimestampInterceptor.Constants;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import com.alibaba.fastjson.JSONObject;
import com.google.common.base.Charsets;
import com.google.common.base.Joiner;
public class HiveLogInterceptor2 implements Interceptor
{
    private static Logger logger = LoggerFactory.getLogger(HiveLogInterceptor2.class);

    public static final String HIVE_SEPARATOR = "\001";

    public void close()
    {
        // nothing to release
    }

    public void initialize()
    {
        // nothing to initialize
    }

    public Event intercept(Event event)
    {
        String originalLog = new String(event.getBody(), Charsets.UTF_8);
        try
        {
            String log = parseLog(originalLog);
            // Set the timestamp header that hdfsSink uses to resolve dt=%Y-%m-%d
            long now = System.currentTimeMillis();
            Map<String, String> headers = event.getHeaders();
            headers.put(Constants.TIMESTAMP, Long.toString(now));
            event.setBody(log.getBytes());
        } catch (Throwable throwable)
        {
            logger.error(("error when intercepting log [ " + originalLog + " ] "), throwable);
            return null;
        }
        return event;
    }

    public List<Event> intercept(List<Event> list)
    {
        List<Event> events = new ArrayList<Event>();
        for (Event event : list)
        {
            Event interceptedEvent = this.intercept(event);
            if (interceptedEvent != null)
            {
                events.add(interceptedEvent);
            }
        }
        return events;
    }

    private static String parseLog(String log)
    {
        List<String> logFileds = new ArrayList<String>();
        String dt = log.substring(0, 10);
        String keyStr = "INFO - ";
        int index = log.indexOf(keyStr);
        String content = "";
        if (index != -1)
        {
            content = log.substring(index + keyStr.length(), log.length());
        }
        // Normalize line endings, which differ across operating systems
        content = content.replaceAll("\r", "");
        content = content.replaceAll("\n", "\\\\" + System.getProperty("line.separator"));
        JSONObject jsonObj = JSONObject.parseObject(content);
        String tableName = jsonObj.getString("tableName");
        String logRequest = jsonObj.getString("logRequest");
        String timestamp = jsonObj.getString("timestamp");
        String statBody = jsonObj.getString("statBody");
        String logResponse = jsonObj.getString("logResponse");
        String resultCode = jsonObj.getString("resultCode");
        String resultFailMsg = jsonObj.getString("resultFailMsg");
        // Collect the fields in table-column order, then append the dt partition value
        logFileds.add(tableName);
        logFileds.add(logRequest);
        logFileds.add(timestamp);
        logFileds.add(statBody);
        logFileds.add(logResponse);
        logFileds.add(resultCode);
        logFileds.add(resultFailMsg);
        logFileds.add(dt);
        return Joiner.on(HIVE_SEPARATOR).join(logFileds);
    }

    public static class Builder implements Interceptor.Builder
    {
        public Interceptor build()
        {
            return new HiveLogInterceptor2();
        }

        public void configure(Context arg0)
        {
        }
    }
}
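A minimal sketch (my addition, not part of the original project) of how a single raw log line flows through the interceptor. It lives in the same package as the class above and reuses the sample record layout from section II:

package com.skydata.flume_interceptor;

import java.nio.charset.StandardCharsets;
import org.apache.flume.Event;
import org.apache.flume.event.EventBuilder;

public class HiveLogInterceptorDemo
{
    public static void main(String[] args)
    {
        // Sample record in the same layout as section II (values shortened)
        String rawLog = "2016-11-30 09:18:21.916 INFO - {\"tableName\":\"ReportView\","
                + "\"logRequest\":{},\"timestamp\":1480468701432,\"statBody\":{},"
                + "\"logResponse\":{},\"resultCode\":1,\"resultFailMsg\":\"\"}";
        Event event = EventBuilder.withBody(rawLog, StandardCharsets.UTF_8);
        Event out = new HiveLogInterceptor2().intercept(event);
        // The body is now \001-separated fields ending with the dt column,
        // and the timestamp header is what hdfsSink uses to expand dt=%Y-%m-%d
        System.out.println(new String(out.getBody(), StandardCharsets.UTF_8));
        System.out.println(out.getHeaders());
    }
}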
Add the following to pom.xml, package the interceptor project with Maven, and copy the resulting jar plus its dependency jars into the ${flume-agent}/lib directory.
<build>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-dependency-plugin</artifactId>
<configuration>
<outputDirectory>
${project.build.directory}
</outputDirectory>
</configuration>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-dependency-plugin</artifactId>
<executions>
<execution>
<id>copy-dependencies</id>
<phase>prepare-package</phase>
<goals>
<goal>copy-dependencies</goal>
</goals>
<configuration>
<outputDirectory>${project.build.directory}/lib</outputDirectory>
<overWriteReleases>true</overWriteReleases>
<overWriteSnapshots>true</overWriteSnapshots>
<overWriteIfNewer>true</overWriteIfNewer>
</configuration>
</execution>
</executions>
</plugin>
</plugins>
</build>
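With the build configuration above, the build-and-copy step might look like this (a sketch; FLUME_HOME and the jar locations are assumptions based on the outputDirectory settings, adjust to your install):
mvn clean package
cp target/*.jar ${FLUME_HOME}/lib/
cp target/lib/*.jar ${FLUME_HOME}/lib/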
The interceptor splits each log record with the "\001" delimiter. A processed record looks like the following; ^A is "\001":
ReportView^A{"request":{},"requestBody":{"detailInfos":[],"flag":"","reportId":7092,"pageSize":0,"searchs":[],"orders":[],"pageNum":1}}^A1480468701432^A{"sourceId":22745,"reportId":7092,"projectId":29355,"userId":2532}^A{"responseBody":{"statusCodeValue":200,"httpHeaders":{},"body":{"msg":"请求成功","httpCode":200,"timestamp":1480468701849},"statusCode":"OK"},"response":{}}^A1^A^A2016-11-30
At this point the Flume + HDFS + Hive configuration is complete.
The data can then be analyzed with MapReduce or HQL.
VI. Startup and results
1. Start Hadoop HDFS
See my earlier post on building a Hadoop 1.2 cluster: http://www.cnblogs.com/xckk/p/6124553.html
2. Start flume_collector and flume_agent. Since the flume-ng command takes many arguments, I wrote a small startup script:
start-Flume.sh
#!/bin/bash
# Stop any Flume process that is already running
# (the kill signal and redirects were partly lost in the original post; -9 is an assumption)
jps -l | grep org.apache.flume.node.Application | awk '{print $1}' | xargs kill -9 >/dev/null 2>&1
cd "$(dirname "$0")"
cd ..
nohup bin/flume-ng agent --conf conf --conf-file conf/flume-conf.properties --name a1 >/dev/null 2>&1 &
3. Check the data on HDFS
The collected logs have been uploaded to HDFS:
[root@centos1 bin]# rm -rf FlumeData..tmp
[root@centos1 bin]# hadoop fs -ls /flume/skydata_hive_log/dt=--/
Found 3 items
-rw-r--r-- root supergroup -- : /flume/skydata_hive_log/dt=--/FlumeData..tmp
-rw-r--r-- root supergroup -- : /flume/skydata_hive_log/dt=--/FlumeData.
-rw-r--r-- root supergroup -- : /flume/skydata_hive_log/dt=--/FlumeData.
[root@centos1 bin]#
4. Start Hive and query the data; Hive can now read the data loaded onto HDFS:
[root@centos1 lib]# hive
Logging initialized using configuration in file:/root/apache-hive-1.2.-bin/conf/hive-log4j.properties
hive> select * from skydata_hive_log limit ;
OK
ReportView {"request":{},"requestBody":{"detailInfos":[],"flag":"","reportId":,"pageSize":,"searchs":[],"orders":[],"pageNum":}} {"sourceId":,"reportId":,"projectId":,"userId":} {"responseBody":{"statusCodeValue":,"httpHeaders":{},"body":{"msg":"请求成功","httpCode":,"timestamp":},"statusCode":"OK"},"response":{}} --
ReportDesignResult {"request":{},"requestBody":{"sourceId":,"detailInfos":[{"colName":"月份","flag":"","reportId":,"colCode":"col_2_22745","pageSize":,"type":"","pageNum":,"rcolCode":"col_25538","colType":"string","formula":"","id":,"position":"row","colId":,"dorder":,"pColName":"月份","pRcolCode":"col_25538"},{"colName":"综合利率(合计)","flag":"","reportId":,"colCode":"col_11_22745","pageSize":,"type":"","pageNum":,"rcolCode":"sum_col_25539","colType":"number","formula":"sum","id":,"position":"group","colId":,"dorder":,"pColName":"综合利率","pRcolCode":"col_25539"}],"flag":"bar1","reportId":,"reportName":"iiiissszzzV","pageSize":,"searchs":[],"orders":[],"pageNum":,"projectId":}} {"reportType":"bar1","sourceId":,"reportId":,"num":,"usedFields":"月份$$综合利率(合计)$$","projectId":,"userId":} {"responseBody":{"statusCodeValue":,"httpHeaders":{},"body":{"msg":"请求成功","reportId":,"httpCode":,"timestamp":},"statusCode":"OK"},"response":{}} --
Time taken: 2.212 seconds, Fetched: 2 row(s)
hive>
VII. Common problems and fixes
1. FATAL: Spool Directory source skydataSource: { spoolDir: /opt/flumeSpool }: Uncaught exception in SpoolDirectorySource thread. Restart or reconfigure Flume to continue processing.
java.nio.charset.MalformedInputException: Input length = 1
Possible causes:
1) Character encoding: log files under spoolDir must be UTF-8.
2) With the Spooling Directory Source, a file must never be read and written at the same time; add the following to the conf file:
a1.sources.skydataSource.ignorePattern=([^_]+)|(.*(\.log)$)|(.*(\.COMPLETED)$)
2. Logs arrive in the HDFS directory, but the Hive table returns no data. For example, hdfs://centos1:9000/flume/skydata_hive_log/dt=2016-12-01/ contains files,
yet select * from skydata_hive_log returns nothing.
Possible causes:
1) The partitions were never created. Even though Flume is configured with a1.sinks.hdfsSink.hdfs.path=hdfs://centos1:9000/flume/skydata_hive_log/dt=%Y-%m-%d, Hive cannot read the files landing on HDFS until the corresponding partitions exist in the table.
Fix: create the partitions first; put the following in an executable shell script:
# Pre-create partitions for a window of days around today
# (the loop bounds were stripped in the original post; -10..10 is a placeholder)
for ((i=-10; i<=10; i++))
do
  dt=$(date -d "$(date +%F) ${i} days" +%Y-%m-%d)
  echo date=$dt
  hive -e "ALTER TABLE skydata_hive_log ADD PARTITION(dt='${dt}')" >> logs/init_skydata_hive_log.out 2>> logs/init_skydata_hive_log.err
done
2) The file on HDFS may still be a .tmp file that the HDFS sink is writing to. Hive can only read finalized files, not .tmp files.
Fix: wait until the configured roll interval elapses (or the roll size is reached) and the .tmp file is renamed to a normal log file on HDFS, then query Hive again and the data will show up.
By 秀才坤坤 (xckk).
Please credit this post when reposting.
Original: http://www.cnblogs.com/xckk/p/6125838.html