HDFS Architecture Notes

 1、Moving Computation is Cheaper than Moving Data

  A computation requested by an application is much more efficient if it is executed near the data it operates on. This is especially true when the size of the data set is huge. This minimizes network congestion and increases the overall throughput of the system. The assumption is that it is often better to migrate the computation closer to where the data is located rather than moving the data to where the application is running. HDFS provides interfaces for applications to move themselves closer to where the data is located.

 2、Safemode

  On startup, the NameNode enters a special state called Safemode. Replication of data blocks does not occur when the NameNode is in the Safemode state. The NameNode receives Heartbeat and Blockreport messages from the DataNodes. A Blockreport contains the list of data blocks that a DataNode is hosting. Each block has a specified minimum number of replicas. A block is considered safely replicated when the minimum number of replicas of that data block has checked in with the NameNode. After a configurable percentage of safely replicated data blocks checks in with the NameNode (plus an additional 30 seconds), the NameNode exits the Safemode state. It then determines the list of data blocks (if any) that still have fewer than the specified number of replicas. The NameNode then replicates these blocks to other DataNodes.

 3、The Persistence of File System Metadata

  FsImage、EditLog.

 4、Staging

  HDFS client caches the file data into a temporary local file. Application writes are transparently redirected to this temporary local file. When the local file accumulates data worth over one HDFS block size, the client contacts the NameNode.

 5、Replication Pipelining

  When a client is writing data to an HDFS file, its data is first written to a local file as explained in the previous section. Suppose the HDFS file has a replication factor of three. When the local file accumulates a full block of user data, the client retrieves a list of DataNodes from the NameNode. This list contains the DataNodes that will host a replica of that block. The client then flushes the data block to the first DataNode. The first DataNode starts receiving the data in small portions (4 KB), writes each portion to its local repository and transfers that portion to the second DataNode in the list. The second DataNode, in turn starts receiving each portion of the data block, writes that portion to its repository and then flushes that portion to the third DataNode. Finally, the third DataNode writes the data to its local repository. Thus, a DataNode can be receiving data from the previous one in the pipeline and at the same time forwarding data to the next one in the pipeline. Thus, the data is pipelined from one DataNode to the next.

 6、File Deletes and Undeletes

  When a file is deleted by a user or an application, it is not immediately removed from HDFS. Instead, HDFS first renames it to a file in the /trashdirectory.  The file can be restored quickly as long as it remains in /trash. A file remains in /trash for a configurable amount of time. After the expiry of its life in /trash, the NameNode deletes the file from the HDFS namespace. The deletion of a file causes the blocks associated with the file to be freed. Note that there could be an appreciable time delay between the time a file is deleted by a user and the time of the corresponding increase in free space in HDFS.

  Reference:http://hadoop.apache.org/docs/r1.2.1/hdfs_design.html

HDFS Architecture Notes的更多相关文章

  1. Hadoop官方文档翻译——HDFS Architecture 2.7.3

    HDFS Architecture HDFS Architecture(HDFS 架构) Introduction(简介) Assumptions and Goals(假设和目标) Hardware ...

  2. 【转载】Hadoop官方文档翻译——HDFS Architecture 2.7.3

    HDFS Architecture HDFS Architecture(HDFS 架构) Introduction(简介) Assumptions and Goals(假设和目标) Hardware ...

  3. HDFS Architecture

    http://hadoop.apache.org/docs/r2.9.0/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html Introduction Ha ...

  4. API Management Architecture Notes

    Kong/Tyk/Zuul/strongloop/Ambassador/Gravitee IBM Reference Architecture for API Management: https:// ...

  5. HDFS 与 GFS 的设计差异

    后端分布式系列」前面关于 HDFS 的一些文章介绍了它的整体架构和一些关键部件的设计实现要点. 我们知道 HDFS 最早是根据 GFS(Google File System)的论文概念模型来设计实现的 ...

  6. HDFS 异常处理与恢复

    在前面的文章 <HDFS DataNode 设计实现解析>中我们对文件操作进行了描述,但并未展开讲述其中涉及的异常错误处理与恢复机制.本文将深入探讨 HDFS 文件操作涉及的错误处理与恢复 ...

  7. HDFS Client 设计实现解析

    前面对 HDFS NameNode 和 DataNode 的架构设计实现要点做了介绍,本文对 HDFS 最后一个主要构成组件 Client 做进一步解析. 流式读取 HDFS Client 为客户端应 ...

  8. HDFS DataNode 设计实现解析

    前文分析了 NameNode,本文进一步解析 DataNode 的设计和实现要点. 文件存储 DataNode 正如其名是负责存储文件数据的节点.HDFS 中文件的存储方式是将文件按块(block)切 ...

  9. HDFS NameNode 设计实现解析

    接前文 分布式存储-HDFS 架构解析,我们总体分析了 HDFS 架构的主要构成组件包括:NameNode.DataNode 和 Client.本文首先进一步解析 HDFS NameNode 的设计和 ...

随机推荐

  1. Chrome即将封杀Google Earth、Google Talk等插件

    昨日,Chrome安全工程师Justin Schuh在官方博客中写道,到明年一月份,谷歌将封杀一系列基于NPAPI框架标准的浏览器插件.其中包括谷歌地球(Google Earth).Google Ta ...

  2. 集成学习之Boosting —— Gradient Boosting实现

    Gradient Boosting的一般算法流程 初始化: \(f_0(x) = \mathop{\arg\min}\limits_\gamma \sum\limits_{i=1}^N L(y_i, ...

  3. Java判断String类型变量是否可以转换数字类型

    正则表达式 首先要import java.util.regex.Pattern 和 java.util.regex.Matcher public boolean isNumeric(String st ...

  4. UML类图中的各种箭头代表的含义(转自:http://www.cnblogs.com/damsoft/archive/2016/10/24/5993602.html)

    1.UML简介Unified Modeling Language (UML)又称统一建模语言或标准建模语言. 简单说就是以图形方式表现模型,根据不同模型进行分类,在UML 2.0中有13种图,以下是他 ...

  5. LNMP环境下独立安装Mysql5.7.18 并对数据库文件进行本地物理迁移 (需暂停数据库服务方式)

    前几天读研时候的同学要我帮忙给解决一个问题,就是Redhat服务器下面安装了LNMP,并且由于分区的划分不当导致MySQL数据库中存放数据库的盘区内空间被急剧消耗,由于该应用主要是数据分析及备份所用, ...

  6. OpenCV - Linux(Ubuntu 16.04)中安装OpenCV + OpenCV_Contrib

    近两个月来接触了Linux系统,在老板的建议下翻了Ubuntu的牌子,我安装的版本是16.04,用习惯之后感觉蛮好的,比Windows要强.好啦,废话不说啦,下面开始说在Ubuntu中安装OpemCV ...

  7. POI2015题解

    POI2015题解 吐槽一下为什么POI2015开始就成了破烂波兰文题目名了啊... 咕了一道3748没写打表题没什么意思,还剩\(BZOJ\)上的\(14\)道题. [BZOJ3746][POI20 ...

  8. html页面控制字体大小的js代码

    dom对象控制显示文章字体大小的js代码 <head> <script type="text/javascript"> function check(siz ...

  9. streamsets 数据流设计

    streamsets 支持branch(分支)&& merge(合并)模式的数据流 branch 数据流 如下图: 我们可以根据数据包含的字段进行拆分,不同的数据流处理自己关注的数据 ...

  10. 获取bing带swim的网址列表

    需求背景: 应老婆要求,搜集带有swim关键字的网站.实现过程: 使用requests模块通过bing接口搜索swim关键,将返回内容按需求进行处理,得到网站列表. 注:代码比较拙,老司机就不要弄废时 ...