Online backup of HBase tables with the CopyTable tool
CopyTable is a simple Apache HBase utility that, unsurprisingly, can be used for copying individual tables within an HBase cluster or from one HBase cluster to another. In this blog post, we’ll talk about what this tool is, why you would want to use it, how
to use it, and some common configuration caveats.
Use cases:
CopyTable is at its core an Apache Hadoop MapReduce job that uses the standard HBase Scan read-path interface to read records from an individual table and writes them to another table (possibly on a separate cluster) using the standard HBase Put write-path
interface. It can be used for many purposes:
- Internal copy of a table (Poor man’s snapshot)
- Remote HBase instance backup
- Incremental HBase table copies
- Partial HBase table copies and HBase table schema changes
Assumptions and limitations:
The CopyTable tool has some basic assumptions and limitations. First, if being used in the multi-cluster situation, both clusters must be online and the target instance needs to have the target table present with the same column families defined as the source
table.
Since the tool uses standard scans and puts, the target cluster doesn’t have to have the same number of nodes or regions. In fact, it can have different numbers of tables, different numbers of region servers, and could have completely different region split
boundaries. Since we are copying entire tables, you can use performance optimization settings like setting larger scanner caching values for more efficiency. Using the put interface also means that copies can be made between clusters of different minor versions
(0.90.4 -> 0.90.6, CDH3u3 -> CDH3u4) or versions that are wire compatible (0.92.1 -> 0.94.0).
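As a rough illustration of the scanner caching tweak mentioned above, the sketch below assembles a CopyTable command line with tuning options passed as Hadoop generic -D options. It assumes that your CopyTable build accepts generic options placed directly after the class name; the helper name copytable_tuned_cmd and the caching value 400 are hypothetical. The helper only prints the command so it can be reviewed before anything is run.

```shell
# Hypothetical helper: prints (does not run) a CopyTable command with tuning options.
# Assumption: this CopyTable build accepts Hadoop generic -D options, and they
# must come before the tool's own arguments.
copytable_tuned_cmd() {
  caching="$1"   # rows fetched per scanner round trip
  shift          # remaining arguments are passed through to CopyTable
  printf '%s\n' "hbase org.apache.hadoop.hbase.mapreduce.CopyTable -Dhbase.client.scanner.caching=${caching} -Dmapred.map.tasks.speculative.execution=false $*"
}

# Example: review the command for an internal copy of tableOrig.
copytable_tuned_cmd 400 --new.name=tableCopy tableOrig
```

Speculative execution is switched off in this sketch so that a map task is not run twice against the same rows; whether that matters for your workload is a judgment call.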
Finally, HBase only provides row-level ACID guarantees. This means that while a CopyTable job is running, rows may be inserted or updated concurrently, and each of these concurrent edits will be either completely included or completely excluded. While individual rows will be consistent, there
are no guarantees about the consistency, causality, or order of puts across different rows.
Internal copy of a table (Poor man’s snapshot)
Versions of HBase up to and including the most recent 0.94.x versions do not support table snapshotting. Despite HBase’s ACID limitations, CopyTable can be used as a naive snapshotting mechanism that makes a physical copy of a particular table.
Let’s say that we have a table, tableOrig with column-families cf1 and cf2. We want to copy all its data to tableCopy. We need to first create tableCopy with the same column families:
srcCluster$ echo "create 'tableCopy', 'cf1', 'cf2'" | hbase shell
We can then copy the table under a new name on the same HBase instance:
srcCluster$ hbase org.apache.hadoop.hbase.mapreduce.CopyTable --new.name=tableCopy tableOrig
This starts an MR job that will copy the data.
Remote HBase instance backup
Let’s say we want to copy data to another cluster. This could be a one-off backup, a periodic job or could be for bootstrapping for cross-cluster replication. In this example, we’ll have two separate clusters: srcCluster and dstCluster.
In this multi-cluster case, CopyTable is a push process — your source will be the HBase instance your current hbase-site.xml refers to and the added arguments point to the destination cluster and table. This also assumes that all of the MR TaskTrackers can
access all the HBase and ZK nodes in the destination cluster. This mechanism for configuration also means that you could run this as a job on a remote cluster by overriding the hbase/mr configs to use settings from any accessible remote cluster and specify
the ZK nodes in the destination cluster. This could be useful if you wanted to copy data from an HBase cluster with lower SLAs and didn’t want to run MR jobs on them directly.
You will use the --peer.adr setting to specify the destination cluster’s ZK ensemble (i.e. the cluster you are copying to). For this we need the ZK quorum’s IP and port as well as the HBase root ZK node for our HBase instance. Let’s say one of these machines
is dstClusterZK (listed in hbase.zookeeper.quorum) and that we are using the default ZK client port 2181 (hbase.zookeeper.property.clientPort) and the default ZK znode parent /hbase (zookeeper.znode.parent). (Note: If you had two HBase instances using the
same ZK, you’d need a different zookeeper.znode.parent for each cluster.)
# create new tableOrig on destination cluster
dstCluster$ echo "create 'tableOrig', 'cf1', 'cf2'" | hbase shell
# on source cluster run copy table with destination ZK quorum specified using --peer.adr
# WARNING: In older versions, you are not alerted about any typo in these arguments!
srcCluster$ hbase org.apache.hadoop.hbase.mapreduce.CopyTable --peer.adr=dstClusterZK:2181:/hbase tableOrig
Note that you can use the --new.name argument together with --peer.adr to copy to a differently named table on the dstCluster.
# create new tableCopy on destination cluster
dstCluster$ echo "create 'tableCopy', 'cf1', 'cf2'" | hbase shell
# on source cluster run copy table with destination --peer.adr and --new.name arguments.
srcCluster$ hbase org.apache.hadoop.hbase.mapreduce.CopyTable --peer.adr=dstClusterZK:2181:/hbase --new.name=tableCopy tableOrig
This will copy data from tableOrig on the srcCluster to the dstCluster’s tableCopy table.
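The --peer.adr value used above is made of three parts: the ZK quorum, the client port, and the znode parent. A minimal sketch of a helper that joins them follows; the function name peer_adr is hypothetical, and its defaults simply mirror the default port 2181 and znode parent /hbase mentioned earlier.

```shell
# Hypothetical helper: joins quorum, client port and znode parent into the
# quorum:port:znode form that --peer.adr expects.
peer_adr() {
  quorum="$1"            # one host, or a comma-separated list of ZK hosts
  port="${2:-2181}"      # hbase.zookeeper.property.clientPort
  znode="${3:-/hbase}"   # zookeeper.znode.parent
  printf '%s:%s:%s\n' "$quorum" "$port" "$znode"
}

# Print (do not run) the copy command for review.
echo "hbase org.apache.hadoop.hbase.mapreduce.CopyTable --peer.adr=$(peer_adr dstClusterZK) --new.name=tableCopy tableOrig"
```

Reading the three values from the destination cluster's hbase-site.xml, rather than typing them by hand, is one way to reduce the risk of the silent typos warned about above.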
Incremental HBase table copies
Once you have a copy of a table on a destination cluster, how do you copy new data that is later written to the source cluster?
Naively, you could run the CopyTable job again and copy over the entire table. However, CopyTable provides a more efficient incremental
copy mechanism that copies only the rows updated within a specified window of time from the srcCluster to the backup dstCluster. Thus, after the initial copy, you could have a periodic cron job that copies only the previous hour's data from srcCluster
to the dstCluster.
This is done by specifying the --starttime and --endtime arguments. Times are specified as decimal milliseconds since the Unix epoch.
# WARNING: In older versions, you are not alerted about any typo in these arguments!
# copy from beginning of time until timeEnd
# NOTE: Must include start time for end time to be respected. start time cannot be 0.
srcCluster$ hbase org.apache.hadoop.hbase.mapreduce.CopyTable ... --starttime=1 --endtime=timeEnd ...
# Copy starting from and including timeStart until the end of time.
srcCluster$ hbase org.apache.hadoop.hbase.mapreduce.CopyTable ... --starttime=timeStart ...
# Copy entries starting at timeStart (inclusive) and ending at timeEnd (exclusive).
srcCluster$ hbase org.apache.hadoop.hbase.mapreduce.CopyTable ... --starttime=timeStart --endtime=timeEnd
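For the hourly cron job described above, the time window has to be computed in epoch milliseconds. The sketch below shows one way to do that; the helper name prev_hour_window_ms is hypothetical, and the script only prints the resulting command. It picks the most recent full hour, so a run at 10:07 would cover 09:00 (inclusive) to 10:00 (exclusive).

```shell
# Hypothetical helper: given an instant in epoch seconds, prints "start end"
# in epoch milliseconds for the last full hour before that instant.
prev_hour_window_ms() {
  now_s="$1"
  end_s=$(( now_s - now_s % 3600 ))   # round down to the top of the hour
  start_s=$(( end_s - 3600 ))         # one hour earlier
  printf '%s %s\n' "$(( start_s * 1000 ))" "$(( end_s * 1000 ))"
}

window=$(prev_hour_window_ms "$(date +%s)")
start_ms=${window% *}   # text before the space
end_ms=${window#* }     # text after the space

# Print (do not run) the incremental copy command for review.
echo "hbase org.apache.hadoop.hbase.mapreduce.CopyTable --starttime=${start_ms} --endtime=${end_ms} --peer.adr=dstClusterZK:2181:/hbase tableOrig"
```

Because the start is inclusive and the end exclusive, back-to-back hourly windows computed this way neither overlap nor leave gaps.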
Partial HBase table copies and HBase table schema changes
By default, CopyTable will copy all column families from matching rows. CopyTable also provides options for copying data from specific column families only. This could be useful for copying original source data while excluding derived-data column families that are
added by follow-on processing.
By adding these arguments we only copy data from the specified column families.
- --families=srcCf1
- --families=srcCf1,srcCf2
Starting from 0.92.0 you can copy while changing the column family name:
- --families=srcCf1:dstCf1
- copy from srcCf1 to dstCf1
- --families=srcCf1:dstCf1,dstCf2,srcCf3:dstCf3
- copy from srcCf1 to dstCf1, copy dstCf2 to dstCf2 (no rename), and srcCf3 to dstCf3
Please note that dstCf* must be present in the dstCluster table!
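The family mappings above can be assembled from a list rather than typed as one string. A minimal sketch, assuming a hypothetical helper named families_arg: each argument is either a plain family name (copied without renaming) or a src:dst pair.

```shell
# Hypothetical helper: joins its arguments with commas into a --families option.
# Each argument is "family" (no rename) or "srcFamily:dstFamily" (rename).
families_arg() {
  spec=""
  for mapping in "$@"; do
    spec="${spec:+$spec,}$mapping"   # add a comma only between entries
  done
  printf '%s\n' "--families=$spec"
}

# Print (do not run) a partial copy that renames two families and keeps one.
echo "hbase org.apache.hadoop.hbase.mapreduce.CopyTable $(families_arg srcCf1:dstCf1 dstCf2 srcCf3:dstCf3) --new.name=tableCopy tableOrig"
```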
Starting from 0.94.0, new options are offered to copy delete markers and to include a limited number of overwritten versions. Previously, if a row was deleted in the source cluster, the delete would not be copied; instead, a stale version of that row would
remain in the destination cluster. These options take advantage of some of the 0.94.0 release’s advanced features.
- --versions=vers
- where vers is the number of cell versions to copy (default is 1, i.e. the latest only)
- --all.cells
- also copy delete markers and deleted cells
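Put together, the two 0.94.0 options might be used as in the sketch below. The helper name copytable_with_history_cmd and the version count 3 are hypothetical; the helper only prints the command.

```shell
# Hypothetical helper: prints (does not run) a CopyTable command that copies up to
# N versions per cell and also carries delete markers and deleted cells across.
# Assumption: source and destination run HBase 0.94.0 or later.
copytable_with_history_cmd() {
  versions="$1"   # number of cell versions to copy
  shift           # remaining arguments are passed through to CopyTable
  printf '%s\n' "hbase org.apache.hadoop.hbase.mapreduce.CopyTable --versions=${versions} --all.cells $*"
}

copytable_with_history_cmd 3 --peer.adr=dstClusterZK:2181:/hbase tableOrig
```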
Common Pitfalls
The HBase client in the 0.90.x, 0.92.x, and 0.94.x versions always uses zoo.cfg if it is in the classpath, even if an hbase-site.xml file specifies other ZooKeeper quorum configuration settings. This “feature” causes a problem common in CDH3 HBase because its
packages default to including a directory where zoo.cfg lives in HBase’s classpath. This can lead, and has led, to frustration when trying to use CopyTable (HBASE-4614). The workaround is to exclude the zoo.cfg file from your HBase’s classpath and to specify
ZooKeeper configuration properties in your hbase-site.xml file. See http://hbase.apache.org/book.html#zookeeper
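To check whether a stray zoo.cfg is visible to HBase, one option is to walk the classpath directories. The sketch below assumes a hypothetical helper named find_zoo_cfg that takes a colon-separated classpath string; it inspects directory entries only, not jar contents or wildcard entries.

```shell
# Hypothetical helper: prints the path of every zoo.cfg found in the directory
# entries of a colon-separated classpath string.
find_zoo_cfg() {
  printf '%s\n' "$1" | tr ':' '\n' | while IFS= read -r entry; do
    if [ -d "$entry" ] && [ -f "$entry/zoo.cfg" ]; then
      printf '%s\n' "$entry/zoo.cfg"
    fi
  done
}

# Typical use, assuming the hbase launcher is on PATH and supports "classpath":
#   find_zoo_cfg "$(hbase classpath)"
```

If the helper prints nothing, no directory on the classpath contains a zoo.cfg, and the ZooKeeper settings in hbase-site.xml should be the ones that take effect.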
Conclusion
CopyTable provides simple but effective disaster recovery insurance for HBase 0.90.x (CDH3) deployments. In conjunction with the replication feature found and supported in CDH4’s HBase 0.92.x-based release, CopyTable’s incremental features become less valuable,
but its core functionality is important for bootstrapping a replicated table. While more advanced features such as HBase snapshots (HBASE-50) may aid with disaster recovery once they are implemented, CopyTable will still be a useful tool for the HBase administrator.