1. WordCount (word counting)

A classic example of applying the MapReduce programming model.

1.1 Description

Given a sequence of words (or other data items), output the number of times each one occurs.

1.2 Sample

 a is b is not c
b is a is not d

1.3 Output

 a: 2
b: 2
c: 1
d: 1
is: 4
not: 2

1.4 Solution

 /**
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.apache.hadoop.examples;

import java.io.File;
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCount {

    // The map output <key, value> is <word, 1>, i.e. <Text, IntWritable>.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        // The value is an int wrapped in an IntWritable.
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                // Each whitespace-delimited token is treated as one word.
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // The reduce input is <word, list of 1s>, i.e. <Text, Iterable<IntWritable>>.
    // The reduce output is <word, sum of the list>, i.e. <Text, IntWritable>.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: wordcount <in> <out>");
            System.exit(2);
        }
        // Delete the output folder if it already exists.
        judgeFileExist(otherArgs[1]);
        // This constructor is deprecated in Hadoop 2.x, where
        // Job.getInstance(conf, "word count") is the replacement.
        Job job = new Job(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        // Summing is associative, so the reducer can also serve as the combiner.
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }

    // Delete a folder and the files under it.
    // java.io.File only sees the local file system, so this helps only
    // when the output path is a local directory.
    public static void judgeFileExist(String path) {
        File file = new File(path);
        if (file.exists()) {
            deleteFileDir(file);
        }
    }

    public static void deleteFileDir(File path) {
        if (path.isDirectory()) {
            String[] files = path.list();
            if (files != null) {
                for (int i = 0; i < files.length; i++) {
                    deleteFileDir(new File(path, files[i]));
                }
            }
        }
        path.delete();
    }
}
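
To see what the mapper, the shuffle, and the reducer each contribute, the job can be imitated in a single process without Hadoop. The sketch below is only an illustration under that assumption: the class `WordCountSketch` and its methods `map`, `shuffle`, `reduce`, and `count` are hypothetical names that are not part of the solution above, and a `TreeMap` stands in for the framework's sorted grouping of keys.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Single-process imitation of the WordCount job; no Hadoop classes are used.
class WordCountSketch {

    // "Map" step: emit one (word, 1) pair per whitespace-delimited token.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            pairs.add(new AbstractMap.SimpleEntry<>(itr.nextToken(), 1));
        }
        return pairs;
    }

    // "Shuffle" step: group the values by key, with keys kept in sorted order.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                   .add(pair.getValue());
        }
        return grouped;
    }

    // "Reduce" step: sum the list of 1s collected for each word.
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int sum = 0;
            for (int value : entry.getValue()) {
                sum += value;
            }
            counts.put(entry.getKey(), sum);
        }
        return counts;
    }

    // Run the three steps in order over all input lines.
    static Map<String, Integer> count(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        return reduce(shuffle(pairs));
    }

    public static void main(String[] args) {
        Map<String, Integer> counts =
                count(Arrays.asList("a is b is not c", "b is a is not d"));
        counts.forEach((word, n) -> System.out.println(word + ": " + n));
    }
}
```

Running `main` on the two sample lines from section 1.2 prints one `word: count` line per distinct word, in sorted key order.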

2. Data deduplication

2.1 Description

Remove the duplicates from a given series of data lines and output the distinct lines.

2.2 Sample

 3-1 a
3-2 b
3-3 c
3-4 d
3-5 a
3-6 b
3-7 c
3-3 c
3-1 b
3-2 a
3-3 b
3-4 d
3-5 a
3-6 c
3-7 d
3-3 c

2.3 Output

 3-1 a
3-1 b
3-2 a
3-2 b
3-3 b
3-3 c
3-4 d
3-5 a
3-6 b
3-6 c
3-7 c
3-7 d

2.4 Solution

 /**
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.apache.hadoop.examples;

import java.io.File;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCount {

    // The last type parameter of Mapper (the output value type) is Text.
    public static class Map extends Mapper<Object, Text, Text, Text> {
        public static Text lineWords = new Text();

        // The map output is <Text, Text>. Only the existence of the key
        // matters, so the value can be anything.
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            lineWords = value;
            context.write(lineWords, new Text("")); // <Text, Text>
        }
    }

    // Equal lines arrive as one key, so writing each key once removes duplicates.
    public static class Reduce extends Reducer<Text, Text, Text, Text> {
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            context.write(key, new Text(""));
        }
    }

    public static void main(String args[])
            throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: Data Deduplication <in> <out>");
            System.exit(2);
        }
        // Delete the output folder if it already exists.
        judgeFileExist(otherArgs[1]);
        // This constructor is deprecated in Hadoop 2.x, where
        // Job.getInstance(conf, "Data Dup") is the replacement.
        Job job = new Job(conf, "Data Dup");
        job.setJarByClass(WordCount.class);
        // Set the mapper, combiner, and reducer classes.
        job.setMapperClass(Map.class);
        job.setCombinerClass(Reduce.class);
        job.setReducerClass(Reduce.class);
        // Set the output key and value types.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        // Set the input and output directories.
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }

    // Delete a folder and the files under it.
    // java.io.File only sees the local file system, so this helps only
    // when the output path is a local directory.
    public static void judgeFileExist(String path) {
        File file = new File(path);
        if (file.exists()) {
            deleteFileDir(file);
        }
    }

    public static void deleteFileDir(File path) {
        if (path.isDirectory()) {
            String[] files = path.list();
            if (files != null) {
                for (int i = 0; i < files.length; i++) {
                    deleteFileDir(new File(path, files[i]));
                }
            }
        }
        path.delete();
    }
}
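
Deduplication works here because the shuffle groups all equal keys into a single reduce call, so a reducer that writes each key once produces the distinct lines. The following single-process sketch illustrates that idea without Hadoop. The class `DedupSketch` and its method `dedup` are hypothetical names introduced for illustration, and the `TreeMap` is an assumed stand-in for the framework's sorted grouping.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Single-process imitation of the deduplication job; no Hadoop classes are used.
class DedupSketch {

    static List<String> dedup(List<String> lines) {
        // Map + shuffle: each whole line is the key and the value is an empty
        // placeholder; grouping by key collapses identical lines.
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String line : lines) {
            grouped.computeIfAbsent(line, k -> new ArrayList<>()).add("");
        }
        // Reduce: write every key exactly once and ignore its values.
        return new ArrayList<>(grouped.keySet());
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("3-1 a", "3-2 b", "3-1 a", "3-1 b");
        for (String line : dedup(input)) {
            System.out.println(line);
        }
    }
}
```

The values carry no information in this design, which is why the solution above can emit an empty `Text` for every record.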

3. Data sorting

3.1 Description

Sort the data from several files, where each file holds one data item (an integer) per line.

3.2 Sample


3.3 Output


3.4 Solution

 /**
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.apache.hadoop.example;

import java.io.File;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class dataSort {

    // Map: parse each line into an int and emit it as the key, so that the
    // shuffle sorts the numbers. The value 1 is only a placeholder.
    public static class map extends Mapper<Object, Text, IntWritable, IntWritable> {
        private static IntWritable data = new IntWritable();
        String lineWords = new String();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            lineWords = value.toString().trim();
            if (lineWords.isEmpty()) {
                return; // skip blank lines, which Integer.parseInt would reject
            }
            data.set(Integer.parseInt(lineWords));
            context.write(data, new IntWritable(1));
        }
    }

    // Reduce: keys arrive in ascending order; write <line number, number>
    // once per occurrence so that duplicates are preserved.
    public static class reduce
            extends Reducer<IntWritable, IntWritable, IntWritable, IntWritable> {
        private static IntWritable lineNum = new IntWritable(1);

        public void reduce(IntWritable key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            for (IntWritable val : values) {
                context.write(lineNum, key);
                lineNum = new IntWritable(lineNum.get() + 1);
            }
        }
    }

    public static void main(String args[])
            throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: dataSort <in> <out>");
            System.exit(2);
        }
        // Delete the output folder if it already exists.
        judgeFileExist(otherArgs[1]);
        // This constructor is deprecated in Hadoop 2.x, where
        // Job.getInstance(conf, "Data Sort") is the replacement.
        Job job = new Job(conf, "Data Sort");
        job.setJarByClass(dataSort.class);
        // Set the mapper and reducer classes. No combiner is set: this reducer
        // turns <number, 1> into <line number, number>, so running it as a
        // combiner would replace the sort key before the shuffle.
        job.setMapperClass(map.class);
        job.setReducerClass(reduce.class);
        // Set the output key and value types.
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(IntWritable.class);
        // Set the input and output directories.
        // The global order relies on a single reduce task (the default).
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }

    // Delete a folder and the files under it.
    // java.io.File only sees the local file system, so this helps only
    // when the output path is a local directory.
    public static void judgeFileExist(String path) {
        File file = new File(path);
        if (file.exists()) {
            deleteFileDir(file);
        }
    }

    public static void deleteFileDir(File path) {
        if (path.isDirectory()) {
            String[] files = path.list();
            if (files != null) {
                for (int i = 0; i < files.length; i++) {
                    deleteFileDir(new File(path, files[i]));
                }
            }
        }
        path.delete();
    }
}
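
The sort itself is done by the shuffle: the mapper makes each number the key, the framework hands keys to the reducer in ascending order, and the reducer only numbers the lines. A single-process sketch of that flow, without Hadoop, could look like the following. The class `SortSketch`, its method `sort`, and the input values in `main` are all made up for illustration, since the post gives no sample data for this section.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Single-process imitation of the sorting job; no Hadoop classes are used.
class SortSketch {

    // Returns {lineNumber, value} pairs in ascending order of value.
    static List<int[]> sort(List<String> lines) {
        // Map + shuffle: the parsed number is the key; the TreeMap keeps keys
        // sorted and records how many times each number occurred.
        TreeMap<Integer, Integer> grouped = new TreeMap<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // skip blank lines
            }
            grouped.merge(Integer.parseInt(trimmed), 1, Integer::sum);
        }
        // Reduce: emit one numbered output line per occurrence of each key.
        List<int[]> out = new ArrayList<>();
        int lineNum = 1;
        for (Map.Entry<Integer, Integer> entry : grouped.entrySet()) {
            for (int i = 0; i < entry.getValue(); i++) {
                out.add(new int[] {lineNum++, entry.getKey()});
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical input, as if read from several files.
        List<String> input = Arrays.asList("5", "2", "", "2", "13");
        for (int[] row : sort(input)) {
            System.out.println(row[0] + "\t" + row[1]);
        }
    }
}
```

Because the reduce step changes the key from the number to its line number, it cannot be reused as a combiner the way the summing reducer in WordCount can.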
