To let MapReduce access relational databases (MySQL, Oracle) directly, Hadoop provides two classes: DBInputFormat and DBOutputFormat. DBInputFormat reads database table data into HDFS, and DBOutputFormat writes the result set produced by MapReduce back into a database table.

Running the MapReduce job may fail with java.io.IOException: com.mysql.jdbc.Driver, which usually means the program cannot find the MySQL driver jar. The fix is to make sure every tasktracker can find the jar when it runs the MapReduce program.

There are two ways to add the jar:

(1) Copy the jar into ${HADOOP_HOME}/lib on every node and restart the cluster. This works, but it is the more primitive approach.

(2) a) Upload the jar to the cluster: hadoop fs -put mysql-connector-java-5.1.0-bin.jar /hdfsPath/

b) Before submitting the job, add the statement: DistributedCache.addFileToClassPath(new Path("/hdfsPath/mysql-connector-java-5.1.0-bin.jar"), conf);
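In a job driver, option (2) looks roughly like this (a sketch: the class name DriverWithDriverJar is made up for illustration, and the HDFS jar path must match wherever the connector was uploaded with hadoop fs -put):

```java
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

public class DriverWithDriverJar {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(DriverWithDriverJar.class);
        // Make the MySQL driver jar on HDFS part of every task's classpath
        // before the job is submitted.
        DistributedCache.addFileToClassPath(
                new Path("/hdfsPath/mysql-connector-java-5.1.0-bin.jar"), conf);
        // ... the rest of the job configuration and JobClient.runJob(conf) ...
    }
}
```

The jar only has to be uploaded to HDFS once; every job that needs the driver can then reference the same path.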

Storing MySQL table data in HDFS

Create the MySQL table and initialize it with data:

DROP TABLE IF EXISTS `wu_testhadoop`;
CREATE TABLE `wu_testhadoop` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `title` varchar(255) DEFAULT NULL,
  `content` varchar(255) DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=3 DEFAULT CHARSET=utf8;

-- ----------------------------
-- Records of wu_testhadoop
-- ----------------------------
INSERT INTO `wu_testhadoop` VALUES ('1', '123', '122312');
INSERT INTO `wu_testhadoop` VALUES ('2', '123', '123456');

Defining how Hadoop accesses the data

Once the MySQL table exists, we need to define the rules by which Hadoop accesses it.

Hadoop provides the org.apache.hadoop.io.Writable interface, a simple and efficient serialization protocol built on DataInput and DataOutput.

For database access Hadoop also provides the org.apache.hadoop.mapred.lib.db.DBWritable interface: its write method sets values on a PreparedStatement, and its readFields method binds column values from a database ResultSet to the object's fields.

Both interfaces are used as follows (the examples come from the Javadoc):

writable

public class MyWritable implements Writable {
    // Some data
    private int counter;
    private long timestamp;

    public void write(DataOutput out) throws IOException {
        out.writeInt(counter);
        out.writeLong(timestamp);
    }

    public void readFields(DataInput in) throws IOException {
        counter = in.readInt();
        timestamp = in.readLong();
    }

    public static MyWritable read(DataInput in) throws IOException {
        MyWritable w = new MyWritable();
        w.readFields(in);
        return w;
    }
}

DBWritable

public class MyWritable implements Writable, DBWritable {
    // Some data
    private int counter;
    private long timestamp;

    // Writable#write() implementation
    public void write(DataOutput out) throws IOException {
        out.writeInt(counter);
        out.writeLong(timestamp);
    }

    // Writable#readFields() implementation
    public void readFields(DataInput in) throws IOException {
        counter = in.readInt();
        timestamp = in.readLong();
    }

    public void write(PreparedStatement statement) throws SQLException {
        statement.setInt(1, counter);
        statement.setLong(2, timestamp);
    }

    public void readFields(ResultSet resultSet) throws SQLException {
        counter = resultSet.getInt(1);
        timestamp = resultSet.getLong(2);
    }
}

The record class corresponding to the database table

package com.wyg.hadoop.mysql.bean;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapred.lib.db.DBWritable;

public class DBRecord implements Writable, DBWritable {
    private int id;
    private String title;
    private String content;

    public int getId() {
        return id;
    }

    public void setId(int id) {
        this.id = id;
    }

    public String getTitle() {
        return title;
    }

    public void setTitle(String title) {
        this.title = title;
    }

    public String getContent() {
        return content;
    }

    public void setContent(String content) {
        this.content = content;
    }

    @Override
    public void readFields(ResultSet set) throws SQLException {
        this.id = set.getInt("id");
        this.title = set.getString("title");
        this.content = set.getString("content");
    }

    @Override
    public void write(PreparedStatement pst) throws SQLException {
        pst.setInt(1, id);
        pst.setString(2, title);
        pst.setString(3, content);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        this.id = in.readInt();
        this.title = Text.readString(in);
        this.content = Text.readString(in);
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(this.id);
        Text.writeString(out, this.title);
        Text.writeString(out, this.content);
    }

    @Override
    public String toString() {
        return this.id + " " + this.title + " " + this.content;
    }
}

Implementing Map/Reduce

package com.wyg.hadoop.mysql.mapper;

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

import com.wyg.hadoop.mysql.bean.DBRecord;

@SuppressWarnings("deprecation")
public class DBRecordMapper extends MapReduceBase implements Mapper<LongWritable, DBRecord, LongWritable, Text> {

    @Override
    public void map(LongWritable key, DBRecord value,
            OutputCollector<LongWritable, Text> collector, Reporter reporter)
            throws IOException {
        collector.collect(new LongWritable(value.getId()), new Text(value.toString()));
    }
}

Testing the Hadoop-MySQL connection and storing the data in HDFS

package com.wyg.hadoop.mysql.db;

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityReducer;
import org.apache.hadoop.mapred.lib.db.DBConfiguration;
import org.apache.hadoop.mapred.lib.db.DBInputFormat;

import com.wyg.hadoop.mysql.bean.DBRecord;
import com.wyg.hadoop.mysql.mapper.DBRecordMapper;

public class DBAccess {
    public static void main(String[] args) throws IOException {
        JobConf conf = new JobConf(DBAccess.class);
        conf.setOutputKeyClass(LongWritable.class);
        conf.setOutputValueClass(Text.class);
        conf.setInputFormat(DBInputFormat.class);

        Path path = new Path("hdfs://192.168.44.129:9000/user/root/dbout");
        FileOutputFormat.setOutputPath(conf, path);

        DBConfiguration.configureDB(conf, "com.mysql.jdbc.Driver",
                "jdbc:mysql://your-ip:3306/your-database", "username", "password");
        String[] fields = {"id", "title", "content"};
        DBInputFormat.setInput(conf, DBRecord.class, "wu_testhadoop",
                null, "id", fields);

        conf.setMapperClass(DBRecordMapper.class);
        conf.setReducerClass(IdentityReducer.class);
        JobClient.runJob(conf);
    }
}

Run the program; the output is as follows:

15/08/11 16:46:18 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
15/08/11 16:46:18 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
15/08/11 16:46:18 WARN mapred.JobClient: No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
15/08/11 16:46:19 INFO mapred.JobClient: Running job: job_local_0001
15/08/11 16:46:19 INFO mapred.MapTask: numReduceTasks: 1
15/08/11 16:46:19 INFO mapred.MapTask: io.sort.mb = 100
15/08/11 16:46:19 INFO mapred.MapTask: data buffer = 79691776/99614720
15/08/11 16:46:19 INFO mapred.MapTask: record buffer = 262144/327680
15/08/11 16:46:19 INFO mapred.MapTask: Starting flush of map output
15/08/11 16:46:19 INFO mapred.MapTask: Finished spill 0
15/08/11 16:46:19 INFO mapred.TaskRunner: Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting
15/08/11 16:46:19 INFO mapred.LocalJobRunner:
15/08/11 16:46:19 INFO mapred.TaskRunner: Task 'attempt_local_0001_m_000000_0' done.
15/08/11 16:46:19 INFO mapred.LocalJobRunner:
15/08/11 16:46:19 INFO mapred.Merger: Merging 1 sorted segments
15/08/11 16:46:19 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 48 bytes
15/08/11 16:46:19 INFO mapred.LocalJobRunner:
15/08/11 16:46:19 INFO mapred.TaskRunner: Task:attempt_local_0001_r_000000_0 is done. And is in the process of commiting
15/08/11 16:46:19 INFO mapred.LocalJobRunner:
15/08/11 16:46:19 INFO mapred.TaskRunner: Task attempt_local_0001_r_000000_0 is allowed to commit now
15/08/11 16:46:19 INFO mapred.FileOutputCommitter: Saved output of task 'attempt_local_0001_r_000000_0' to hdfs://192.168.44.129:9000/user/root/dbout
15/08/11 16:46:19 INFO mapred.LocalJobRunner: reduce > reduce
15/08/11 16:46:19 INFO mapred.TaskRunner: Task 'attempt_local_0001_r_000000_0' done.
15/08/11 16:46:20 INFO mapred.JobClient: map 100% reduce 100%
15/08/11 16:46:20 INFO mapred.JobClient: Job complete: job_local_0001
15/08/11 16:46:20 INFO mapred.JobClient: Counters: 14
15/08/11 16:46:20 INFO mapred.JobClient: FileSystemCounters
15/08/11 16:46:20 INFO mapred.JobClient: FILE_BYTES_READ=34606
15/08/11 16:46:20 INFO mapred.JobClient: FILE_BYTES_WRITTEN=69844
15/08/11 16:46:20 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=30
15/08/11 16:46:20 INFO mapred.JobClient: Map-Reduce Framework
15/08/11 16:46:20 INFO mapred.JobClient: Reduce input groups=2
15/08/11 16:46:20 INFO mapred.JobClient: Combine output records=0
15/08/11 16:46:20 INFO mapred.JobClient: Map input records=2
15/08/11 16:46:20 INFO mapred.JobClient: Reduce shuffle bytes=0
15/08/11 16:46:20 INFO mapred.JobClient: Reduce output records=2
15/08/11 16:46:20 INFO mapred.JobClient: Spilled Records=4
15/08/11 16:46:20 INFO mapred.JobClient: Map output bytes=42
15/08/11 16:46:20 INFO mapred.JobClient: Map input bytes=2
15/08/11 16:46:20 INFO mapred.JobClient: Combine input records=0
15/08/11 16:46:20 INFO mapred.JobClient: Map output records=2
15/08/11 16:46:20 INFO mapred.JobClient: Reduce input records=2

At the same time a new dbout directory appears in HDFS; the files inside hold the rows from the database table. TextOutputFormat writes each record as the key, a tab, then the value, so the contents look like this:

1	1 123 122312
2	2 123 123456

Importing HDFS data into MySQL

Writing HDFS files into MySQL also needs the DBRecord class above as a helper, since all database operations go through DBInputFormat and DBOutputFormat.

First we need to define the map and reduce implementations (the map parses the HDFS files; the reduce takes the map output and emits it for the database writer):

package com.wyg.hadoop.mysql.mapper;

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

import com.wyg.hadoop.mysql.bean.DBRecord;

public class WriteDB {
    // The map phase: parse one line of the HDFS file into a DBRecord
    public static class Map extends MapReduceBase implements Mapper<Object, Text, Text, DBRecord> {
        private final static DBRecord one = new DBRecord();
        private Text word = new Text();

        @Override
        public void map(Object key, Text value,
                OutputCollector<Text, DBRecord> output, Reporter reporter)
                throws IOException {
            // Each line looks like "1<TAB>1 123 122312": the reduce key,
            // a tab, then the space-separated id, title and content.
            String line = value.toString();
            String[] infos = line.split(" ");
            String id = infos[0].split("\t")[1];
            one.setId(new Integer(id));
            one.setTitle(infos[1]);
            one.setContent(infos[2]);
            word.set(id);
            output.collect(word, one);
        }
    }

    public static class Reduce extends MapReduceBase implements
            Reducer<Text, DBRecord, DBRecord, Text> {
        @Override
        public void reduce(Text key, Iterator<DBRecord> values,
                OutputCollector<DBRecord, Text> collector, Reporter reporter)
                throws IOException {
            DBRecord record = values.next();
            collector.collect(record, new Text());
        }
    }
}

Testing the import of HDFS data into the database

package com.wyg.hadoop.mysql.db;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.lib.db.DBConfiguration;
import org.apache.hadoop.mapred.lib.db.DBOutputFormat;

import com.wyg.hadoop.mysql.bean.DBRecord;
import com.wyg.hadoop.mysql.mapper.WriteDB;

public class DBInsert {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WriteDB.class);

        // Set the input and output formats
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(DBOutputFormat.class);

        // Without the following settings the job does not run, although
        // the examples found online omit them.
        conf.setMapOutputKeyClass(Text.class);
        conf.setMapOutputValueClass(DBRecord.class);
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(DBRecord.class);

        // Set the Map and Reduce classes
        conf.setMapperClass(WriteDB.Map.class);
        conf.setReducerClass(WriteDB.Reduce.class);

        // Set the input directory
        FileInputFormat.setInputPaths(conf, new Path("hdfs://192.168.44.129:9000/user/root/dbout"));

        // Set up the database connection
        DBConfiguration.configureDB(conf, "com.mysql.jdbc.Driver",
                "jdbc:mysql://database-ip:3306/your-database", "username", "password");
        String[] fields = {"id", "title", "content"};
        DBOutputFormat.setOutput(conf, "wu_testhadoop", fields);

        JobClient.runJob(conf);
    }
}

The test output is as follows:

15/08/11 18:10:15 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
15/08/11 18:10:15 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
15/08/11 18:10:15 WARN mapred.JobClient: No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
15/08/11 18:10:15 INFO mapred.FileInputFormat: Total input paths to process : 1
15/08/11 18:10:15 INFO mapred.JobClient: Running job: job_local_0001
15/08/11 18:10:15 INFO mapred.FileInputFormat: Total input paths to process : 1
15/08/11 18:10:15 INFO mapred.MapTask: numReduceTasks: 1
15/08/11 18:10:15 INFO mapred.MapTask: io.sort.mb = 100
15/08/11 18:10:15 INFO mapred.MapTask: data buffer = 79691776/99614720
15/08/11 18:10:15 INFO mapred.MapTask: record buffer = 262144/327680
15/08/11 18:10:15 INFO mapred.MapTask: Starting flush of map output
15/08/11 18:10:16 INFO mapred.MapTask: Finished spill 0
15/08/11 18:10:16 INFO mapred.TaskRunner: Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting
15/08/11 18:10:16 INFO mapred.LocalJobRunner: hdfs://192.168.44.129:9000/user/root/dbout/part-00000:0+30
15/08/11 18:10:16 INFO mapred.TaskRunner: Task 'attempt_local_0001_m_000000_0' done.
15/08/11 18:10:16 INFO mapred.LocalJobRunner:
15/08/11 18:10:16 INFO mapred.Merger: Merging 1 sorted segments
15/08/11 18:10:16 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 40 bytes
15/08/11 18:10:16 INFO mapred.LocalJobRunner:
15/08/11 18:10:16 INFO mapred.TaskRunner: Task:attempt_local_0001_r_000000_0 is done. And is in the process of commiting
15/08/11 18:10:16 INFO mapred.LocalJobRunner: reduce > reduce
15/08/11 18:10:16 INFO mapred.TaskRunner: Task 'attempt_local_0001_r_000000_0' done.
15/08/11 18:10:16 INFO mapred.JobClient: map 100% reduce 100%
15/08/11 18:10:16 INFO mapred.JobClient: Job complete: job_local_0001
15/08/11 18:10:16 INFO mapred.JobClient: Counters: 14
15/08/11 18:10:16 INFO mapred.JobClient: FileSystemCounters
15/08/11 18:10:16 INFO mapred.JobClient: FILE_BYTES_READ=34932
15/08/11 18:10:16 INFO mapred.JobClient: HDFS_BYTES_READ=60
15/08/11 18:10:16 INFO mapred.JobClient: FILE_BYTES_WRITTEN=70694
15/08/11 18:10:16 INFO mapred.JobClient: Map-Reduce Framework
15/08/11 18:10:16 INFO mapred.JobClient: Reduce input groups=2
15/08/11 18:10:16 INFO mapred.JobClient: Combine output records=0
15/08/11 18:10:16 INFO mapred.JobClient: Map input records=2
15/08/11 18:10:16 INFO mapred.JobClient: Reduce shuffle bytes=0
15/08/11 18:10:16 INFO mapred.JobClient: Reduce output records=2
15/08/11 18:10:16 INFO mapred.JobClient: Spilled Records=4
15/08/11 18:10:16 INFO mapred.JobClient: Map output bytes=34
15/08/11 18:10:16 INFO mapred.JobClient: Map input bytes=30
15/08/11 18:10:16 INFO mapred.JobClient: Combine input records=0
15/08/11 18:10:16 INFO mapred.JobClient: Map output records=2
15/08/11 18:10:16 INFO mapred.JobClient: Reduce input records=2

I truncated the table before this test; after the run you can see two rows were added to the database.

Running the job a second time fails, which is expected: we assign the id column explicitly when importing. If id were omitted, rows could be appended indefinitely.
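One way to make repeated runs append instead of fail (a sketch, not part of the original code; the class name DBInsertAppendOnly is hypothetical) is to leave id out of the column list so MySQL's AUTO_INCREMENT assigns it:

```java
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.db.DBOutputFormat;

public class DBInsertAppendOnly {
    // By listing only title and content, the INSERT statement generated by
    // DBOutputFormat omits id, so MySQL's AUTO_INCREMENT assigns a fresh
    // key on every run and repeated imports keep appending rows.
    public static void configureOutput(JobConf conf) {
        String[] fields = {"title", "content"};
        DBOutputFormat.setOutput(conf, "wu_testhadoop", fields);
        // DBRecord.write(PreparedStatement) must then bind the same two
        // columns in the same order: setString(1, title); setString(2, content);
    }
}
```

Whether you want append-only behavior or an exact copy of the source ids depends on the use case; for a one-off migration, keeping the original id (as the article does) is the safer choice.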

Source code download

The source code has been uploaded; the download address is download.csdn.net/detail/wuyinggui10000/8974585
