1. Impact of the number of regions

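Generally, a smaller number of regions makes a cluster run more smoothly. The official documentation notes that roughly 100 regions per RegionServer gives the best results, for the following reasons:
1) MSLAB (MemStore-local allocation buffer) is an HBase feature, enabled by default, that helps prevent heap fragmentation and reduces full GC problems. However, it requires 2 MB per MemStore (each column family has one MemStore write buffer per region). With 2 column families per region and 1,000 regions in total, that is 1000 * 2 * 2 MB = 4,000 MB, or about 3.9 GB of heap, even before any data is stored.
2) With many regions there are also many MemStores. When their combined size reaches the RegionServer-level limit, a flush is forced, which can noticeably affect user requests and may block updates on that RegionServer.
3) HMaster has to spend a lot of time assigning and moving regions, and too many regions add load on ZooKeeper.
4) MapReduce jobs that read from HBase can end up with too many map tasks, because by default each region corresponds to one map task.
So if a single region has many MemStores and most of them are written to frequently, each flush tends to be expensive. For this reason it is also recommended to keep the number of column families as small as possible when designing tables.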
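The MSLAB arithmetic in point 1 can be checked with a small sketch. The function name and parameters below are hypothetical, chosen only for illustration; the 2 MB per MemStore figure is the one quoted above.

```python
def mslab_baseline_mb(regions, families_per_region, mslab_mb_per_memstore=2):
    """Heap reserved by MSLAB before any data is written, in MB.

    Assumes one MemStore per column family per region and the 2 MB
    per-MemStore figure quoted in the text.
    """
    memstores = regions * families_per_region
    return memstores * mslab_mb_per_memstore


overhead_mb = mslab_baseline_mb(regions=1000, families_per_region=2)
print(f"{overhead_mb} MB, about {overhead_mb / 1024:.1f} GB")
```

Halving the number of column families halves this baseline, which is one reason to keep the column family count small.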
 
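Formula for estimating the number of regions per RegionServer:
((RS Xmx) * hbase.regionserver.global.memstore.size) / (hbase.hregion.memstore.flush.size * (# column families))
For example, for a RegionServer with 16 GB of memory and default settings (memstore fraction 0.4, flush size 128 MB, one column family), 16384 MB * 0.4 / 128 MB gives about 51 active regions.
For write-heavy workloads, hbase.regionserver.global.memstore.size can be raised moderately so that the RegionServer can hold more regions.
In summary, choose a reasonable number of regions based on the write load, typically between 20 and 200 per RegionServer. This improves cluster stability, removes many sources of uncertainty, and improves read and write performance. Monitor whether the total size of all MemStores on a RegionServer reaches the upper limit (hbase.regionserver.global.memstore.upperLimit * hbase_heapsize, 40% of the JVM heap by default; from HBase 0.98 on the parameter is named hbase.regionserver.global.memstore.size). Exceeding it can lead to undesirable consequences such as an unresponsive server or compaction storms.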
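As a minimal sketch, the formula above can be written as a small helper. The function and argument names are hypothetical; the inputs correspond to the RegionServer heap (Xmx), hbase.regionserver.global.memstore.size, hbase.hregion.memstore.flush.size, and the number of column families.

```python
import math


def starting_region_count(heap_mb, memstore_fraction, flush_size_mb, column_families):
    """Starting point for the number of actively written regions per RegionServer."""
    memstore_budget_mb = heap_mb * memstore_fraction   # heap share given to all MemStores
    per_region_mb = flush_size_mb * column_families    # one MemStore per column family
    return math.floor(memstore_budget_mb / per_region_mb)


# 16 GB heap, 40% memstore fraction, 128 MB flush size, one column family
print(starting_region_count(16384, 0.4, 128, 1))
```

With two column families per table, the same server supports roughly half as many active regions, since each region then carries two MemStores.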
 

2. Impact of region size

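In HBase, data is first written to the MemStore. Once the MemStore reaches 128 MB (depending on configuration), it is flushed to disk and becomes a StoreFile. When the number of StoreFiles exceeds a configurable threshold, a compaction is started to merge them into a single StoreFile, which has some impact on cluster performance. When the merged StoreFile grows larger than hbase.hregion.max.filesize, a split is triggered and the region is divided into two regions.
1) When hbase.hregion.max.filesize is relatively small, splits are triggered more often, and the overall service can become unstable.
2) When hbase.hregion.max.filesize is relatively large, a region goes a long time without splitting, so the same region is more likely to go through many compactions. This lowers performance and stability, and average throughput drops somewhat as a result.
In summary, hbase.hregion.max.filesize should be neither too large nor too small. In practice, under highly concurrent production load, a size of 5-10 GB has worked best. For HBase tables in critical scenarios, disable automatic major compaction and run major_compact during off-peak hours instead. This reduces splits and significantly improves cluster performance and throughput.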
 

169.2.2. Number of regions per RS - upper bound

In production scenarios, where you have a lot of data, you are normally concerned with the maximum number of regions you can have per server. See too many regions for a technical discussion of the subject. Basically, the maximum number of regions is mostly determined by memstore memory usage. Each region has its own memstores; these grow up to a configurable size, usually in the 128-256 MB range (see hbase.hregion.memstore.flush.size). One memstore exists per column family (so there’s only one per region if there’s one CF in the table). The RS dedicates some fraction of total memory to its memstores (see hbase.regionserver.global.memstore.size). If this memory is exceeded (too much memstore usage), it can cause undesirable consequences such as an unresponsive server or compaction storms. A good starting point for the number of regions per RS (assuming one table) is:

((RS memory) * (total memstore fraction)) / ((memstore size)*(# column families))

This formula is pseudo-code. Here are two formulas using the actual tunable parameters, first for HBase 0.98+ and second for HBase 0.94.x.

HBase 0.98.x
((RS Xmx) * hbase.regionserver.global.memstore.size) / (hbase.hregion.memstore.flush.size * (# column families))
HBase 0.94.x
((RS Xmx) * hbase.regionserver.global.memstore.upperLimit) / (hbase.hregion.memstore.flush.size * (# column families))

If a given RegionServer has 16 GB of RAM, then with default settings the formula works out to 16384*0.4/128 ~ 51 regions per RS as a starting point. The formula can be extended to multiple tables; if they all have the same configuration, just use the total number of families.

This number can be adjusted; the formula above assumes all your regions are filled at approximately the same rate. If only a fraction of your regions are going to be actively written to, you can divide the result by that fraction to get a larger region count. Then, even if all regions are written to, all region memstores are not filled evenly, and eventually jitter appears even if they are (due to limited number of concurrent flushes). Thus, one can have as many as 2-3 times more regions than the starting point; however, increased numbers carry increased risk.
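The adjustment described above can be sketched as follows. The helper names are hypothetical, and the 2-3x multipliers are taken directly from the paragraph above rather than derived.

```python
def adjusted_region_count(baseline, active_fraction):
    """Scale the starting point when only a fraction of regions is actively written."""
    if not 0 < active_fraction <= 1:
        raise ValueError("active_fraction must be in (0, 1]")
    return int(baseline / active_fraction)


def upper_range(baseline):
    """The 2-3x headroom mentioned in the text, returned as (low, high)."""
    return baseline * 2, baseline * 3


baseline = 51  # starting point for a 16 GB RegionServer with default settings
print(adjusted_region_count(baseline, 0.5))  # half of the regions take writes
print(upper_range(baseline))
```

These figures are ceilings to approach with care, since a higher region count carries increased risk.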

For write-heavy workload, memstore fraction can be increased in configuration at the expense of block cache; this will also allow one to have more regions.

169.2.3. Number of regions per RS - lower bound

HBase scales by having regions across many servers. Thus if you have 2 regions for 16 GB of data, on a 20-node cluster your data will be concentrated on just a few machines and nearly the entire cluster will be idle. This really can’t be stressed enough, since a common problem is loading 200 MB of data into HBase and then wondering why your awesome 10-node cluster isn’t doing anything.

On the other hand, if you have a very large amount of data, you may also want to go for a larger number of regions to avoid having regions that are too large.

169.2.4. Maximum region size

For large tables in production scenarios, maximum region size is mostly limited by compactions: very large compactions, especially major ones, can degrade cluster performance. Currently, the recommended maximum region size is 10-20 GB, and 5-10 GB is optimal. For the older 0.90.x codebase, the upper bound of region size is about 4 GB, with a default of 256 MB.

The size at which the region is split into two is generally configured via hbase.hregion.max.filesize; for details, see arch.region.splits.

If you cannot estimate the size of your tables well, when starting off, it’s probably best to stick to the default region size, perhaps going smaller for hot tables (or manually split hot regions to spread the load over the cluster), or go with larger region sizes if your cell sizes tend to be largish (100k and up).

In HBase 0.98, an experimental stripe compactions feature was added that allows for larger regions, especially for log data. See ops.stripe.

169.2.5. Total data size per region server

According to the above numbers for region size and number of regions per region server, in an optimistic estimate 10 GB x 100 regions per RS will give up to 1 TB served per region server, which is in line with some of the reported multi-PB use cases. However, it is important to think about the data vs cache size ratio at the RS level. With 1 TB of data per server and a 10 GB block cache, only 1% of the data will be cached, which may barely cover all block indices.
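The data-to-cache ratio can be estimated with a rough sketch like the one below. The function names are hypothetical, and sizes are in GB using the round figures from the text (1 TB treated as 1000 GB).

```python
def data_per_server_gb(region_size_gb, regions_per_server):
    """Optimistic amount of data served by one RegionServer, in GB."""
    return region_size_gb * regions_per_server


def cached_fraction(block_cache_gb, data_gb):
    """Share of the served data that fits in the block cache."""
    return block_cache_gb / data_gb


data_gb = data_per_server_gb(region_size_gb=10, regions_per_server=100)
print(f"{data_gb} GB served, {cached_fraction(10, data_gb):.0%} cached")
```

A low cached share suggests that most reads will go to disk, so it may be worth revisiting either the region count per server or the block cache size.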

 
