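HDFS stores multiple replicas of each Block of a file and provides fault tolerance: when a replica is lost or a node goes down, the missing replicas are restored automatically. By default, 3 replicas are stored.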

Replica placement policy in 2.8.x and earlier

From the official documentation:

https://hadoop.apache.org/docs/r2.8.0/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html#Data_Replication

For the common case, when the replication factor is three, HDFS’s placement policy is to put one replica on one node in the local rack, another on a different node in the local rack, and the last on a different node in a different rack. This policy cuts the inter-rack write traffic which generally improves write performance. The chance of rack failure is far less than that of node failure; this policy does not impact data reliability and availability guarantees. However, it does reduce the aggregate network bandwidth used when reading data since a block is placed in only two unique racks rather than three. With this policy, the replicas of a file do not evenly distribute across the racks. One third of replicas are on one node, two thirds of replicas are on one rack, and the other third are evenly distributed across the remaining racks. This policy improves write performance without compromising data reliability or read performance.

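First replica: placed on the DataNode that is uploading the file. If the write is submitted from outside the cluster, a node is picked at random, avoiding nodes whose disks are too slow or whose CPUs are too busy.

Second replica: placed on a different node in the same rack as the first replica.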

Third replica: placed on a different node in a different rack, as stated in the documentation quoted above.

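Additional replicas: placed on random nodes, while keeping the number of replicas per rack below an upper limit, which is basically (replicas - 1) / racks + 2.

Because the NameNode does not allow a DataNode to hold more than one replica of the same block, the maximum number of replicas that can be created is the total number of DataNodes at that time.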
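The two limits above can be worked through with a small calculation. This is only an illustrative sketch: the function names are hypothetical, and it assumes the division in the formula is integer division.

```python
def max_replicas_per_rack(replicas: int, racks: int) -> int:
    """Per-rack upper limit from the text: (replicas - 1) / racks + 2.

    Assumption: the division is integer (floor) division.
    """
    if replicas < 1 or racks < 1:
        raise ValueError("replicas and racks must be positive")
    return (replicas - 1) // racks + 2


def effective_replication(requested: int, live_datanodes: int) -> int:
    """Cap the replica count: a DataNode holds at most one replica of a block."""
    return min(requested, live_datanodes)


if __name__ == "__main__":
    # (replicas, racks) pairs chosen only for illustration.
    for replicas, racks in [(3, 2), (5, 2), (10, 3)]:
        limit = max_replicas_per_rack(replicas, racks)
        print(f"replicas={replicas} racks={racks} -> per-rack limit {limit}")
    # Requesting 5 replicas on a 4-node cluster yields at most 4.
    print(effective_replication(5, 4))
```

For example, with 5 replicas spread over 2 racks the limit is (5 - 1) / 2 + 2 = 4 replicas per rack, and a cluster with only 4 DataNodes can hold at most 4 replicas of a block no matter how many are requested.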

Replica placement policy in 2.9.x and later, including 3.x

From the official documentation:

https://hadoop.apache.org/docs/r2.9.0/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html#Data_Replication

For the common case, when the replication factor is three, HDFS’s placement policy is to put one replica on the local machine if the writer is on a datanode, otherwise on a random datanode, another replica on a node in a different (remote) rack, and the last on a different node in the same remote rack. This policy cuts the inter-rack write traffic which generally improves write performance. The chance of rack failure is far less than that of node failure; this policy does not impact data reliability and availability guarantees. However, it does reduce the aggregate network bandwidth used when reading data since a block is placed in only two unique racks rather than three. With this policy, the replicas of a file do not evenly distribute across the racks. One third of replicas are on one node, two thirds of replicas are on one rack, and the other third are evenly distributed across the remaining racks. This policy improves write performance without compromising data reliability or read performance.

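First replica: placed on the DataNode that is uploading the file. If the write is submitted from outside the cluster, a node is picked at random, avoiding nodes whose disks are too slow or whose CPUs are too busy.

Second replica: placed on a node in a different rack from the first replica.

Third replica: placed on a different node in the same rack as the second replica.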

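Additional replicas: placed on random nodes, while keeping the number of replicas per rack below an upper limit, which is basically (replicas - 1) / racks + 2.

Because the NameNode does not allow a DataNode to hold more than one replica of the same block, the maximum number of replicas that can be created is the total number of DataNodes at that time.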
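The three-replica rule for 2.9.x and 3.x can be sketched as a toy simulation. This is not the HDFS implementation: the `topology` dictionary, the function name, and the node names are hypothetical, the load checks on disk and CPU are left out, and the sketch simply raises an error where a real cluster would fall back to other nodes.

```python
import random
from typing import Dict, List, Optional


def place_three_replicas(
    topology: Dict[str, List[str]],
    writer: Optional[str] = None,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Pick three DataNodes following the rule described in the text.

    topology maps a rack name to the DataNodes in that rack (hypothetical).
    """
    rng = rng or random.Random()
    node_to_rack = {n: r for r, nodes in topology.items() for n in nodes}

    # First replica: the writer's node if it is a DataNode, else a random node.
    first = writer if writer in node_to_rack else rng.choice(sorted(node_to_rack))
    local_rack = node_to_rack[first]

    # Second replica: a node on a different (remote) rack. The rack must hold
    # at least two nodes so the third replica can share it.
    remote_racks = sorted(
        r for r, nodes in topology.items() if r != local_rack and len(nodes) >= 2
    )
    if not remote_racks:
        raise ValueError("need a remote rack with at least two DataNodes")
    remote_rack = rng.choice(remote_racks)
    second = rng.choice(topology[remote_rack])

    # Third replica: a different node on the same rack as the second replica.
    third = rng.choice([n for n in topology[remote_rack] if n != second])
    return [first, second, third]


if __name__ == "__main__":
    racks = {"rack1": ["dn1", "dn2"], "rack2": ["dn3", "dn4"], "rack3": ["dn5", "dn6"]}
    print(place_three_replicas(racks, writer="dn1"))
```

When the writer is `dn1`, the first replica stays on `dn1` and the other two land on two distinct nodes of a single remote rack, so the block spans exactly two racks, matching the quoted documentation.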


