Using Tungsten to Replicate MySQL Data to Hadoop

Background

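Many databases are running in production, and the back end needs a data warehouse for analyzing user behavior. MySQL and the Hadoop platform are currently the popular choices for these two roles.

The problem is how to synchronize the production MySQL data to Hadoop in near real time so that it can be analyzed. This article shows how to do that with tungsten-replicator.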

Environment

tungsten-replicator depends on Ruby and RubyGems, so install them first:

yum install ruby

yum install rubygems

gem install json

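The json gem may fail to download because of network restrictions (the GFW). In that case, download the gem file manually and install it from the local copy: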

yum install ruby-devel
gem install --local json-xxx.gem
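Before running the installer, it can help to confirm that the prerequisites are in place. The following is a minimal sketch, not part of Tungsten itself; the helper name missing_commands is made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: print every command from the argument list
# that cannot be found on PATH.
missing_commands() {
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || printf '%s\n' "$cmd"
    done
}

missing=$(missing_commands ruby gem)
if [ -n "$missing" ]; then
    echo "missing prerequisites:" $missing
elif ruby -rjson -e 'exit 0' 2>/dev/null; then
    # "ruby -rjson" only succeeds when the json module can be loaded.
    echo "ruby, gem and the json module are available"
else
    echo "ruby and gem found, but the json module does not load"
fi
```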

Install MySQL, listening at 192.168.12.223:3306, and set up the required database privileges.

Install Hadoop 2.4; the HDFS address is 192.168.12.221:9000.

Configuration

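First, on the MySQL machine, change to the tungsten-replicator directory, run the following install command, and then start Tungsten. Commands such as trepctl and thl show the state of the service.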

./tools/tpm install mysql1 --master=192.168.12.223 --install-directory=/user/app/tungsten/mysql1 --datasource-mysql-conf=/user/data/mysql_data/my-3306.cnf --replication-user=stats --replication-password=stats_dh5 --enable-heterogenous-master=true --net-ssh-option=port=20460  --property=replicator.filter.pkey.addColumnsToDeletes=true --property=replicator.filter.pkey.addPkeyToInserts=true

mysql1/tungsten/cluster-home/bin/startall
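As a rough sketch of the status check on the master, the commands below assume that the replicator binaries live under tungsten/tungsten-replicator/bin inside the install directory given to tpm; adjust TUNGSTEN_HOME if your layout differs.

```shell
#!/bin/sh
# Install directory taken from the tpm command above; the bin path below
# is an assumption about the layout created by the installer.
TUNGSTEN_HOME=/user/app/tungsten/mysql1
BIN="$TUNGSTEN_HOME/tungsten/tungsten-replicator/bin"

if [ -x "$BIN/trepctl" ]; then
    "$BIN/trepctl" services   # list the replication services and their state
    "$BIN/trepctl" status     # detailed status of the service
    "$BIN/thl" info           # summary of the transaction history log
else
    echo "trepctl not found under $BIN; adjust TUNGSTEN_HOME"
fi
```

A healthy master normally reports the state ONLINE in the trepctl status output.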

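Next, on the Hadoop machine, change to the tungsten-replicator directory, run the following install command, and then start Tungsten. Again, commands such as trepctl and thl show the state of the service.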

./tools/tpm install hadoop1 --batch-enabled=true --batch-load-language=js --batch-load-template=hadoop --datasource-type=file --install-directory=/user/app/tungsten/hadoop1 --java-file-encoding=UTF8 --java-user-timezone=GMT --master=192.168.12.223 --members=192.168.12.221 --property=replicator.datasource.applier.csvType=hive --property=replicator.stage.q-to-dbms.blockCommitInterval=1s --property=replicator.stage.q-to-dbms.blockCommitRowCount=1000 --skip-validation-check=DatasourceDBPort --skip-validation-check=DirectDatasourceDBPort --skip-validation-check=HostsFileCheck --skip-validation-check=InstallerMasterSlaveCheck --skip-validation-check=ReplicationServicePipelines --rmi-port=25550

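On the Hadoop file system, check that directories for the replicated MySQL databases have been created under the staging directory, for example: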

└── user
......
......
    └── tungsten
        └── staging
            └── hadoop1
                └── db1
                    ├── x1
                    │   ├── x1-14.csv
                    │   └── x1-3.csv
                    └── x2
                        ├── x2-115.csv
                        ├── x2-15.csv
                        ├── x2-16.csv
                        ├── x2-17.csv
                        └── x2-18.csv
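A listing like the one above can be produced with the HDFS shell. This is a minimal sketch that assumes the staging root is /user/tungsten/staging, as the tree suggests; staging_path is a hypothetical helper used only to build the path.

```shell
#!/bin/sh
# Hypothetical helper: build the HDFS staging path for a replication
# service and a schema (the root directory is an assumption).
staging_path() {
    printf '/user/tungsten/staging/%s/%s\n' "$1" "$2"
}

dir=$(staging_path hadoop1 db1)
if command -v hadoop >/dev/null 2>&1; then
    # Recursively list the CSV files written by the applier.
    hadoop fs -ls -R "$dir" || echo "could not list $dir"
else
    echo "hadoop not on PATH; would list $dir"
fi
```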

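Finally, the staging data has to be merged into Hive: the Hive table definitions must be created and the data made queryable from Hive. This is done with the load-reduce-check script from the continuent-tools-hadoop tools. Before using it, set up the Hive environment variables and start the Hive service on port 10000. Then copy the following jar files into the lib-ext directory of bristlecone: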

 cp -v /user/app/hive/apache-hive-0.13.1-bin/lib/hive-jdbc-0.13.1.jar /user/app/tungsten/hadoop1/tungsten/bristlecone/lib-ext/

 cp -v /user/app/hive/apache-hive-0.13.1-bin/lib/hive-exec-0.13.1.jar /user/app/tungsten/hadoop1/tungsten/bristlecone/lib-ext/

 cp -v /user/app/hive/apache-hive-0.13.1-bin/lib/hive-service-0.13.1.jar /user/app/tungsten/hadoop1/tungsten/bristlecone/lib-ext/

 cp -v /user/app/hive/apache-hive-0.13.1-bin/lib/httpclient-4.2.5.jar /user/app/tungsten/hadoop1/tungsten/bristlecone/lib-ext/

 cp -v /user/app/hive/apache-hive-0.13.1-bin/lib/commons-httpclient-3.0.1.jar /user/app/tungsten/hadoop1/tungsten/bristlecone/lib-ext/

 cp -v /user/app/hive/apache-hive-0.13.1-bin/lib/httpcore-4.2.5.jar /user/app/tungsten/hadoop1/tungsten/bristlecone/lib-ext/

 cp -v /user/app/hadoop/hadoop-2.4.0-onenode/share/hadoop/common/hadoop-common-2.4.0.jar /user/app/tungsten/hadoop1/tungsten/bristlecone/lib-ext/

 cp -v /user/app/hadoop/hadoop-2.4.0-onenode/share/hadoop/common/lib/slf4j-* /user/app/tungsten/hadoop1/tungsten/bristlecone/lib-ext/

Then run one of the following commands.

For the first run, or whenever tables have been added or a table structure has changed:

./bin/load-reduce-check -v -U jdbc:mysql:thin://192.168.12.223:3306/ -u stats -p stats_dh5 --schema db1 --service=hadoop1 -r /user/app/tungsten/hadoop1  --no-compare

If the table structures have not changed and the data only needs to be reloaded, run:

./bin/load-reduce-check -v -U jdbc:mysql:thin://192.168.12.223:3306/ -u stats -p stats_dh5 --schema db1 --service=hadoop1 -r /user/app/tungsten/hadoop1  --no-base-ddl --no-staging-ddl --no-meta

To only compare the data (note that the compare step seems to be very slow):

./bin/load-reduce-check -v -U jdbc:mysql:thin://192.168.12.223:3306/ -u stats -p stats_dh5 --schema db1 --service=hadoop1 -r /user/app/tungsten/hadoop1  --no-base-ddl --no-staging-ddl --no-meta --no-materialize
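The three invocations differ only in their trailing flags, so they can be wrapped in a small helper. The sketch below is one possible arrangement; the mode names full, reload and compare and the function lrc_flags are invented for this example, while the flags themselves are the ones used above. It only prints the resulting command (a dry run).

```shell
#!/bin/sh
# Hypothetical helper: map a mode name to the load-reduce-check flags
# used in the three commands above.
lrc_flags() {
    case "$1" in
        full)    echo "--no-compare" ;;
        reload)  echo "--no-base-ddl --no-staging-ddl --no-meta" ;;
        compare) echo "--no-base-ddl --no-staging-ddl --no-meta --no-materialize" ;;
        *)       echo "unknown mode: $1" >&2; return 1 ;;
    esac
}

# Change the mode here; remove the leading "echo" to actually run the tool.
mode=reload
echo ./bin/load-reduce-check -v -U jdbc:mysql:thin://192.168.12.223:3306/ \
    -u stats -p stats_dh5 --schema db1 --service=hadoop1 \
    -r /user/app/tungsten/hadoop1 $(lrc_flags "$mode")
```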

References

tungsten-replicator-3.0.pdf, section 3.4, Deploying MySQL to Hadoop Replication

https://github.com/continuent/continuent-tools-hadoop
