HDFS Scribe Integration (Repost)
It is finally here: you can configure the open-source log aggregator, Scribe, to log data directly into the Hadoop Distributed File System (HDFS).
Many Web 2.0 companies have to deploy a set of costly filers to capture the weblogs generated by their applications; there is currently no other option, because the write rate of this stream is huge. The Hadoop-Scribe integration allows this write load to be distributed across a set of commodity machines, reducing the total cost of the infrastructure.
The challenge was to make HDFS real-timeish in behaviour. Scribe uses libhdfs, the C interface to the HDFS client. There were various bugs in libhdfs that needed to be fixed first. Then came the FileSystem API. One of the major issues was that the FileSystem API caches FileSystem handles and always returns the same handle when called from multiple threads, with no reference counting of the handle. This caused problems for Scribe, because Scribe is highly multi-threaded. A new API, FileSystem.newInstance(), was introduced to support Scribe.
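To illustrate the difference, here is a minimal Java sketch; FileSystem.get() is the cached call and FileSystem.newInstance() the uncached one (the namenode address is made up):

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class HandleDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        URI nn = URI.create("hdfs://namenode:9000"); // illustrative address

        // get() serves a shared handle out of a cache: two threads asking
        // for the same URI receive the same object, and a close() by one
        // thread breaks the handle for everyone else.
        FileSystem shared1 = FileSystem.get(nn, conf);
        FileSystem shared2 = FileSystem.get(nn, conf);
        System.out.println(shared1 == shared2); // true: same cached handle

        // newInstance() bypasses the cache, so each Scribe thread can own
        // and close a private handle without affecting the others.
        FileSystem priv1 = FileSystem.newInstance(nn, conf);
        FileSystem priv2 = FileSystem.newInstance(nn, conf);
        System.out.println(priv1 == priv2); // false: independent handles

        priv1.close(); // priv2 and the cached handle remain usable
        priv2.close();
        shared1.close();
    }
}
```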
Making the HDFS write code path more real-time was painful. Various timeouts and settings in HDFS were hardcoded and needed to be changed to allow the application to fail fast. At the bottom of this blog post I am attaching the settings that we have currently configured to make HDFS writes very real-timeish. The last of the JIRAs, HADOOP-2757, is in the pipeline to be committed to Hadoop trunk very soon.
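The attached settings themselves are not reproduced in this repost. As a rough sketch of what such fail-fast tuning looks like, the hdfs-site.xml fragment below uses real 0.20-era client-side property names, but the values are illustrative and are not the author's attached list:

```xml
<!-- hdfs-site.xml, client side: a fail-fast sketch. Property names are
     real 0.20-era knobs; the values are illustrative only. -->
<configuration>
  <property>
    <name>ipc.client.connect.max.retries</name>
    <value>2</value> <!-- stop retrying a dead namenode quickly (default 10) -->
  </property>
  <property>
    <name>dfs.socket.timeout</name>
    <value>10000</value> <!-- socket read timeout in ms (default 60000) -->
  </property>
  <property>
    <name>dfs.datanode.socket.write.timeout</name>
    <value>10000</value> <!-- socket write timeout in ms (default 480000) -->
  </property>
  <property>
    <name>dfs.client.block.write.retries</name>
    <value>1</value> <!-- give up on a failing write pipeline sooner (default 3) -->
  </property>
</configuration>
```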
What about the namenode being a single point of failure? That is acceptable in a warehouse-type application but cannot be tolerated by a realtime application. Scribe typically aggregates click logs from a set of webservers, and losing *all* click-log data for a website for 10 minutes or so (the minimum time for a namenode restart) is not acceptable. The solution is to configure two overlapping clusters on the same hardware: run two separate namenodes N1 and N2 on two different machines, run one set of datanode processes on all slave machines reporting to N1, and a second set of datanode processes on the same slave machines reporting to N2. The two datanode instances on a single slave machine share the same data directories. This configuration makes HDFS highly available for writes, as sketched below.
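As a sketch of how the two overlapping clusters might be laid out, the fragment below shows the datanode instance that reports to N1. Property names are 0.20-era; all hosts, ports, and paths are illustrative:

```xml
<!-- hadoop-site.xml for the datanode instance reporting to N1. The
     instance reporting to N2 uses the same dfs.data.dir, but a
     different fs.default.name and different ports. -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://n1.example.com:9000</value> <!-- this instance's namenode -->
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/data/dfs</value> <!-- shared by both datanode instances -->
  </property>
  <property>
    <name>dfs.datanode.address</name>
    <value>0.0.0.0:50010</value> <!-- N2's instance would use, e.g., 50110 -->
  </property>
  <property>
    <name>dfs.datanode.ipc.address</name>
    <value>0.0.0.0:50020</value>
  </property>
  <property>
    <name>dfs.datanode.http.address</name>
    <value>0.0.0.0:50075</value>
  </property>
</configuration>
```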
The highly-available-for-writes HDFS configuration is also needed for software upgrades on the cluster. We can shut down one of the overlapping HDFS clusters, upgrade it to the new Hadoop software, and put it back online before starting the same process on the second HDFS cluster.
What were the main changes needed in Scribe? Scribe already had the ability to buffer data when it is unable to write to the configured storage; the default behaviour is to replay this buffer back to the storage once it is online again. Scribe was extended to support no-buffer-replay when the primary storage comes back online. Scribe-hdfs is configured to write data to a cluster N1, and if N1 fails it writes data to cluster N2. Scribe treats N1 and N2 as two equivalent primary stores.
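A minimal sketch of what such a store layout can look like in scribe's configuration format, assuming the buffer store's replay_buffer option and the file store's fs_type=hdfs support (hosts and paths are made up):

```
# scribe store sketch: primary is HDFS cluster N1, secondary is N2.
# replay_buffer=no means data written to N2 is not copied back to N1
# when N1 recovers, so the two clusters act as equivalent stores.
<store>
category=default
type=buffer
replay_buffer=no
retry_interval=30

<primary>
type=file
fs_type=hdfs
file_path=hdfs://n1.example.com:9000/scribedata
create_symlink=no
</primary>

<secondary>
type=file
fs_type=hdfs
file_path=hdfs://n2.example.com:9000/scribedata
create_symlink=no
</secondary>
</store>
```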
Source: http://hadoopblog.blogspot.hk/2009/06/hdfs-scribe-integration.html