HDFS Architecture Notes
【HDFS Architecture Notes】
1、Moving Computation is Cheaper than Moving Data
A computation requested by an application is much more efficient if it is executed near the data it operates on. This is especially true when the size of the data set is huge. This minimizes network congestion and increases the overall throughput of the system. The assumption is that it is often better to migrate the computation closer to where the data is located rather than moving the data to where the application is running. HDFS provides interfaces for applications to move themselves closer to where the data is located.
2、Safemode
On startup, the NameNode enters a special state called Safemode. Replication of data blocks does not occur when the NameNode is in the Safemode state. The NameNode receives Heartbeat and Blockreport messages from the DataNodes. A Blockreport contains the list of data blocks that a DataNode is hosting. Each block has a specified minimum number of replicas. A block is considered safely replicated when the minimum number of replicas of that data block has checked in with the NameNode. After a configurable percentage of safely replicated data blocks checks in with the NameNode (plus an additional 30 seconds), the NameNode exits the Safemode state. It then determines the list of data blocks (if any) that still have fewer than the specified number of replicas. The NameNode then replicates these blocks to other DataNodes.
3、The Persistence of File System Metadata
FsImage、EditLog.
4、Staging
HDFS client caches the file data into a temporary local file. Application writes are transparently redirected to this temporary local file. When the local file accumulates data worth over one HDFS block size, the client contacts the NameNode.
5、Replication Pipelining
When a client is writing data to an HDFS file, its data is first written to a local file as explained in the previous section. Suppose the HDFS file has a replication factor of three. When the local file accumulates a full block of user data, the client retrieves a list of DataNodes from the NameNode. This list contains the DataNodes that will host a replica of that block. The client then flushes the data block to the first DataNode. The first DataNode starts receiving the data in small portions (4 KB), writes each portion to its local repository and transfers that portion to the second DataNode in the list. The second DataNode, in turn starts receiving each portion of the data block, writes that portion to its repository and then flushes that portion to the third DataNode. Finally, the third DataNode writes the data to its local repository. Thus, a DataNode can be receiving data from the previous one in the pipeline and at the same time forwarding data to the next one in the pipeline. Thus, the data is pipelined from one DataNode to the next.
6、File Deletes and Undeletes
When a file is deleted by a user or an application, it is not immediately removed from HDFS. Instead, HDFS first renames it to a file in the /trashdirectory. The file can be restored quickly as long as it remains in /trash. A file remains in /trash for a configurable amount of time. After the expiry of its life in /trash, the NameNode deletes the file from the HDFS namespace. The deletion of a file causes the blocks associated with the file to be freed. Note that there could be an appreciable time delay between the time a file is deleted by a user and the time of the corresponding increase in free space in HDFS.
Reference:http://hadoop.apache.org/docs/r1.2.1/hdfs_design.html
HDFS Architecture Notes的更多相关文章
- Hadoop官方文档翻译——HDFS Architecture 2.7.3
HDFS Architecture HDFS Architecture(HDFS 架构) Introduction(简介) Assumptions and Goals(假设和目标) Hardware ...
- 【转载】Hadoop官方文档翻译——HDFS Architecture 2.7.3
HDFS Architecture HDFS Architecture(HDFS 架构) Introduction(简介) Assumptions and Goals(假设和目标) Hardware ...
- HDFS Architecture
http://hadoop.apache.org/docs/r2.9.0/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html Introduction Ha ...
- API Management Architecture Notes
Kong/Tyk/Zuul/strongloop/Ambassador/Gravitee IBM Reference Architecture for API Management: https:// ...
- HDFS 与 GFS 的设计差异
后端分布式系列」前面关于 HDFS 的一些文章介绍了它的整体架构和一些关键部件的设计实现要点. 我们知道 HDFS 最早是根据 GFS(Google File System)的论文概念模型来设计实现的 ...
- HDFS 异常处理与恢复
在前面的文章 <HDFS DataNode 设计实现解析>中我们对文件操作进行了描述,但并未展开讲述其中涉及的异常错误处理与恢复机制.本文将深入探讨 HDFS 文件操作涉及的错误处理与恢复 ...
- HDFS Client 设计实现解析
前面对 HDFS NameNode 和 DataNode 的架构设计实现要点做了介绍,本文对 HDFS 最后一个主要构成组件 Client 做进一步解析. 流式读取 HDFS Client 为客户端应 ...
- HDFS DataNode 设计实现解析
前文分析了 NameNode,本文进一步解析 DataNode 的设计和实现要点. 文件存储 DataNode 正如其名是负责存储文件数据的节点.HDFS 中文件的存储方式是将文件按块(block)切 ...
- HDFS NameNode 设计实现解析
接前文 分布式存储-HDFS 架构解析,我们总体分析了 HDFS 架构的主要构成组件包括:NameNode.DataNode 和 Client.本文首先进一步解析 HDFS NameNode 的设计和 ...
随机推荐
- Nginx笔记02-nginx常用参数配置说明
nginx的主配置文件是nginx.conf,这里主要针对这个文件进行说明 1.主配置文件nginx.conf 2.nginx配置文件的结构 从上面的配置文件中我们可以总结出nginx配置文件的基 ...
- 【LeetCode 232_数据结构_队列_实现】Implement Queue using Stacks
class Queue { public: // Push element x to the back of queue. void push(int x) { while (!nums.empty( ...
- erhai系统使用_web
使用前说明1.安装mysql数据库,安装数据库管理器EMS(SQL Manager Lite for MySQL),将数据库导入数据库管理器: 注意对配置文件my.ini的修改.2.启动resin W ...
- 判断一棵树是否为二叉搜索树(二叉排序树) python
输入一棵树,判断这棵树是否为二叉搜索树.首先要知道什么是排序二叉树,二叉排序树是这样定义的,二叉排序树或者是一棵空树,或者是具有下列性质的二叉树: (1)若左子树不空,则左子树上所有结点的值均小于它的 ...
- LambdaMART简介——基于Ranklib源码(一 lambda计算)
学习Machine Learning,阅读文献,看各种数学公式的推导,其实是一件很枯燥的事情.有的时候即使理解了数学推导过程,也仍然会一知半解,离自己写程序实现,似乎还有一道鸿沟.所幸的是,现在很多主 ...
- Packer 基本试用
安装 使用mac 系统 https://www.packer.io/downloads.html 配置环境变量 可选 sudo nano ~/.bash_profile export PATH=$PA ...
- visual studio内置“iis”组件提取及二次开发
简介 visual studio安装后会自带小型的“iis”服务器,本文就简单提取一下这个组件,自己做一个小型“iis”服务器吧.先来说用途吧(废话可绕过),比如在服务器上没有安装iis,或者给客户演 ...
- Java-Runoob-高级教程-实例-字符串:05. Java 实例 - 字符串反转
ylbtech-Java-Runoob-高级教程-实例-字符串:05. Java 实例 - 字符串反转 1.返回顶部 1. Java 实例 - 字符串反转 Java 实例 以下实例演示了如何使用 J ...
- [html][javascript]父子窗体传值
父窗体 <script type="text/javascript"> newwindow = window.open("b1.html",&quo ...
- Samba服务创建共享文件系统
Linux 系统中的Samba Linux系统中的Samba服务器又提供了另外一种技术来弥补这种安全性的不足的技术,那就是采用账户映射方式为Samba服务器提供虚拟账户(不与Linux系统中的用户账户 ...