MapReduce Examples

1. WordCount (counting words)

The classic example of the MapReduce programming model.

1.1 Description

Given a series of words (or other data items), output the number of times each one occurs.

1.2 Sample

 a is b is not c

 b is a is not d

1.3 Output

 a: 2

 b: 2

 c: 1

 d: 1

 is: 4

 not: 2
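
Before looking at the Hadoop solution, it may help to trace the three stages (map, shuffle, reduce) by hand. The following is a minimal plain-Java sketch, not Hadoop code: the class name WordCountSketch and its methods are made up for illustration, and a TreeMap stands in for the shuffle that groups and sorts the keys.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical helper class: simulates the WordCount stages without any Hadoop classes.
class WordCountSketch {

    // Map stage: emit a <word, 1> pair for every whitespace-separated token.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            pairs.add(new SimpleEntry<String, Integer>(itr.nextToken(), 1));
        }
        return pairs;
    }

    // Shuffle stage: group the values by key; the TreeMap keeps the keys sorted.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce stage: sum the list of ones collected for each word.
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
            int sum = 0;
            for (int one : group.getValue()) {
                sum += one;
            }
            counts.put(group.getKey(), sum);
        }
        return counts;
    }

    // Runs all three stages over the given input lines.
    static Map<String, Integer> count(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        return reduce(shuffle(pairs));
    }

    public static void main(String[] args) {
        // The two sample lines from section 1.2.
        Map<String, Integer> counts = count(Arrays.asList("a is b is not c", "b is a is not d"));
        counts.forEach((word, n) -> System.out.println(word + ": " + n));
    }
}
```

The map and reduce methods mirror what TokenizerMapper and IntSumReducer do in the solution below; only the shuffle, which Hadoop performs itself, is written out explicitly here.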

1.4 Solution

 /**

  *  Licensed under the Apache License, Version 2.0 (the "License");

  *  you may not use this file except in compliance with the License.

  *  You may obtain a copy of the License at

  *

  *      http://www.apache.org/licenses/LICENSE-2.0

  *

  *  Unless required by applicable law or agreed to in writing, software

  *  distributed under the License is distributed on an "AS IS" BASIS,

  *  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.

  *  See the License for the specific language governing permissions and

  *  limitations under the License.

  */    

 package org.apache.hadoop.examples;

 import java.io.File;

 import java.io.IOException;

 import java.util.StringTokenizer;

 import org.apache.hadoop.conf.Configuration;

 import org.apache.hadoop.fs.Path;

 import org.apache.hadoop.io.IntWritable;

 import org.apache.hadoop.io.Text;

 import org.apache.hadoop.mapreduce.Job;

 import org.apache.hadoop.mapreduce.Mapper;

 import org.apache.hadoop.mapreduce.Reducer;

 import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

 import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

 import org.apache.hadoop.util.GenericOptionsParser;

 public class WordCount {

   //The map output <key,value> is <word, 1>, i.e. <Text,IntWritable>

   public static class TokenizerMapper

        extends Mapper<Object, Text, Text, IntWritable>{

     //The value is an int wrapped as an IntWritable

     private final static IntWritable one = new IntWritable(1);

     private Text word = new Text();

     public void map(Object key, Text value, Context context

                     ) throws IOException, InterruptedException {

       StringTokenizer itr = new StringTokenizer(value.toString());

       while (itr.hasMoreTokens()) {

         word.set(itr.nextToken());//each token is one word; StringTokenizer splits on whitespace

         context.write(word, one);

       }

     }

   }

   //The reduce input <key,value> is <word, list of the 1s emitted for that word>;

   //summing that list gives the count, so the output is <Text,IntWritable>

   public static class IntSumReducer

        extends Reducer<Text,IntWritable,Text,IntWritable> {

     private IntWritable result = new IntWritable();

     public void reduce(Text key, Iterable<IntWritable> values,

                        Context context

                        ) throws IOException, InterruptedException {

       int sum = 0;

       for (IntWritable val : values) {

         sum += val.get();

       }

       result.set(sum);

       context.write(key, result);

     }

   }

   public static void main(String[] args) throws Exception {

     Configuration conf = new Configuration();

     String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();

     if (otherArgs.length != 2) {

       System.err.println("Usage: wordcount <in> <out>");

       System.exit(2);

     }

     //Delete the output directory if it already exists

     judgeFileExist(otherArgs[1]);

     Job job = new Job(conf, "word count");

     job.setJarByClass(WordCount.class);

     job.setMapperClass(TokenizerMapper.class);

     job.setCombinerClass(IntSumReducer.class);

     job.setReducerClass(IntSumReducer.class);

     job.setOutputKeyClass(Text.class);

     job.setOutputValueClass(IntWritable.class);

     FileInputFormat.addInputPath(job, new Path(otherArgs[0]));

     FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));

     System.exit(job.waitForCompletion(true) ? 0 : 1);

   }

   //Delete a directory together with the files under it

   public static void judgeFileExist(String path){

       File file = new File(path);

       if( file.exists() ){

           deleteFileDir(file);

       }

   }

   public static void deleteFileDir(File path){

       if( path.isDirectory() ){

           String[] files = path.list();

           for( int i=0;i<files.length;i++ ){

               deleteFileDir( new File(path,files[i]) );

           }

       }

       path.delete();

   }

 }
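
The solution above registers IntSumReducer as both the combiner and the reducer. That works because addition is associative and commutative: summing per-task partial sums gives the same totals as summing the raw ones. A small plain-Java sketch of that equivalence follows; the class CombinerSketch and its method names are hypothetical, and each input line is assumed to be handled by its own map task.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration (no Hadoop classes) of a summing reducer used as a combiner.
class CombinerSketch {

    // Counts the words of one line; stands in for one map task followed by a local combine.
    static Map<String, Integer> mapAndCombine(String line) {
        Map<String, Integer> partial = new TreeMap<>();
        for (String word : line.trim().split("\\s+")) {
            partial.merge(word, 1, Integer::sum);
        }
        return partial;
    }

    // Reduce over pre-combined input: add up the partial sums from every map task.
    static Map<String, Integer> reduceCombined(List<String> lines) {
        Map<String, Integer> totals = new TreeMap<>();
        for (String line : lines) {
            mapAndCombine(line).forEach((word, n) -> totals.merge(word, n, Integer::sum));
        }
        return totals;
    }

    // Reduce over raw map output: every <word, 1> pair is summed directly.
    static Map<String, Integer> reduceRaw(List<String> lines) {
        Map<String, Integer> totals = new TreeMap<>();
        for (String line : lines) {
            for (String word : line.trim().split("\\s+")) {
                totals.merge(word, 1, Integer::sum);
            }
        }
        return totals;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("a is b is not c", "b is a is not d");
        System.out.println(reduceCombined(lines));
        // Both paths are expected to agree for a summing reducer.
        System.out.println(reduceRaw(lines).equals(reduceCombined(lines)));
    }
}
```

The same reasoning does not carry over to reducers whose output differs in meaning from their input, so reusing a reducer as a combiner is worth checking case by case.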

2. Data Deduplication

2.1 Description

Given a series of records, remove the duplicates and output each distinct record once.

2.2 Sample

 3-1 a

 3-2 b

 3-3 c

 3-4 d

 3-5 a

 3-6 b

 3-7 c

 3-3 c

 3-1 b

 3-2 a

 3-3 b

 3-4 d

 3-5 a

 3-6 c

 3-7 d

 3-3 c

2.3 Output

 3-1 a

 3-1 b

 3-2 a

 3-2 b

 3-3 b

 3-3 c

 3-4 d

 3-5 a

 3-6 b

 3-6 c

 3-7 c

 3-7 d
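
The idea behind the solution is that the shuffle groups identical keys, so if every input line is emitted as a key, each distinct line reaches reduce exactly once. A minimal plain-Java sketch of that idea, assuming a TreeMap as a stand-in for the shuffle (DedupSketch is a made-up name, not part of the Hadoop solution):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper class: simulates deduplication by key grouping, without Hadoop.
class DedupSketch {

    static List<String> dedup(List<String> lines) {
        // Map + shuffle: each line becomes a key with an empty value;
        // equal keys fall into the same group and the TreeMap keeps keys sorted.
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String line : lines) {
            grouped.computeIfAbsent(line, k -> new ArrayList<>()).add("");
        }
        // Reduce: write each key once and ignore its values.
        return new ArrayList<>(grouped.keySet());
    }

    public static void main(String[] args) {
        // The sample records from section 2.2.
        List<String> sample = Arrays.asList(
                "3-1 a", "3-2 b", "3-3 c", "3-4 d", "3-5 a", "3-6 b", "3-7 c", "3-3 c",
                "3-1 b", "3-2 a", "3-3 b", "3-4 d", "3-5 a", "3-6 c", "3-7 d", "3-3 c");
        for (String line : dedup(sample)) {
            System.out.println(line);
        }
    }
}
```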

2.4 Solution

 /**

  *  Licensed under the Apache License, Version 2.0 (the "License");

  *  you may not use this file except in compliance with the License.

  *  You may obtain a copy of the License at

  *

  *      http://www.apache.org/licenses/LICENSE-2.0

  *

  *  Unless required by applicable law or agreed to in writing, software

  *  distributed under the License is distributed on an "AS IS" BASIS,

  *  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.

  *  See the License for the specific language governing permissions and

  *  limitations under the License.

  */    

 package org.apache.hadoop.examples;

 import java.io.File;

 import java.io.IOException;

 import java.util.StringTokenizer;

 import org.apache.hadoop.conf.Configuration;

 import org.apache.hadoop.fs.Path;

 import org.apache.hadoop.io.IntWritable;

 import org.apache.hadoop.io.Text;

 import org.apache.hadoop.mapreduce.Job;

 import org.apache.hadoop.mapreduce.Mapper;

 import org.apache.hadoop.mapreduce.Reducer;

 import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

 import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

 import org.apache.hadoop.util.GenericOptionsParser;

 public class WordCount {

  public static class Map extends Mapper<Object,Text,Text,Text>{//the last type parameter (the map output value type) is Text

      public static Text lineWords= new Text();

      //The map output is <Text,Text>; only the presence of the key matters, so the value can be anything

      public void map(Object key,Text value,Context context)

              throws IOException, InterruptedException{

          lineWords = value;

          context.write(lineWords, new Text(""));//<Text,Text>

      }

  }

  public static class Reduce extends Reducer<Text,Text,Text,Text>{

      public void reduce(Text key,Iterable<Text> values,Context context)

              throws IOException, InterruptedException{

          context.write(key,new Text(""));

      }

  }

  public static void main(String args[])

          throws IOException, ClassNotFoundException, InterruptedException{

      Configuration conf = new Configuration();

      String[] otherArgs = new GenericOptionsParser(conf,args).getRemainingArgs();

      if( otherArgs.length!=2 ){

          System.err.println("Usage: Data Deduplication <in> <out>");

          System.exit(2);

      }

      //Delete the output directory if it already exists

      judgeFileExist(otherArgs[1]);

      Job job = new Job(conf,"Data Dup");

      job.setJarByClass(WordCount.class);

      //Set the map, combine and reduce classes

      job.setMapperClass(Map.class);

      job.setCombinerClass(Reduce.class);

      job.setReducerClass(Reduce.class);

      //Set the output key and value types

      job.setOutputKeyClass(Text.class);

      job.setOutputValueClass(Text.class);

      //Set the input and output directories

      FileInputFormat.addInputPath(job, new Path(otherArgs[0]));

      FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));

      System.exit(job.waitForCompletion(true) ? 0 : 1);

  }

   //Delete a directory together with the files under it

   public static void judgeFileExist(String path){

       File file = new File(path);

       if( file.exists() ){

           deleteFileDir(file);

       }

   }

   public static void deleteFileDir(File path){

       if( path.isDirectory() ){

           String[] files = path.list();

           for( int i=0;i<files.length;i++ ){

               deleteFileDir( new File(path,files[i]) );

           }

       }

       path.delete();

   }

 }
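
Both solutions so far clear the output directory with a hand-written recursive java.io.File helper, which only touches the local file system. As an alternative sketch under the same local-file-system assumption, the helper could be written with java.nio.file; the class name LocalOutputCleaner is hypothetical. For an output path on HDFS, Hadoop's own FileSystem API would be needed instead, which this sketch does not cover.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper: removes a local output directory before a job is submitted.
class LocalOutputCleaner {

    // Deletes dir and everything below it; does nothing if dir does not exist.
    static void deleteRecursively(Path dir) throws IOException {
        if (!Files.exists(dir)) {
            return;
        }
        List<Path> paths;
        try (Stream<Path> walk = Files.walk(dir)) {
            // Reverse order puts children before their parent directory.
            paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (Path p : paths) {
            Files.delete(p);
        }
    }

    public static void main(String[] args) throws IOException {
        // Build a throwaway directory tree, then remove it.
        Path out = Files.createTempDirectory("mr-output");
        Files.write(out.resolve("part-r-00000"), "a\t2\n".getBytes());
        deleteRecursively(out);
        System.out.println("still exists: " + Files.exists(out));
    }
}
```

Unlike File.delete(), which returns false silently, Files.delete throws an IOException when a path cannot be removed, so a failed cleanup is noticed before the job starts.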

3. Data Sorting

3.1 Description

Sort the data spread across multiple files; each file holds one value per line.
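
The approach relies on the shuffle sorting keys: if map emits each number as the key, reduce receives the numbers in ascending order and only has to number them. A plain-Java sketch of that flow follows. The input values are invented for illustration, SortSketch is a hypothetical name, and a single reduce task is assumed so that the key order is global.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper class: simulates sort-by-shuffle without any Hadoop classes.
class SortSketch {

    static List<String> sort(List<List<String>> files) {
        // Map + shuffle: each line is parsed into an integer key and emitted with a 1;
        // the TreeMap orders the keys and records how many 1s each key received.
        Map<Integer, Integer> grouped = new TreeMap<>();
        for (List<String> file : files) {
            for (String line : file) {
                grouped.merge(Integer.parseInt(line.trim()), 1, Integer::sum);
            }
        }
        // Reduce: for every occurrence of a key, emit <line number, key>.
        List<String> out = new ArrayList<>();
        int lineNum = 1;
        for (Map.Entry<Integer, Integer> group : grouped.entrySet()) {
            for (int i = 0; i < group.getValue(); i++) {
                out.add(lineNum + "\t" + group.getKey());
                lineNum++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Two made-up input files, one value per line.
        List<String> file1 = Arrays.asList("2", "32", "654", "32");
        List<String> file2 = Arrays.asList("5", "26", "2");
        for (String line : sort(Arrays.asList(file1, file2))) {
            System.out.println(line);
        }
    }
}
```

Duplicates are kept: a value that appears twice in the input is written twice, each time with its own line number.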

3.2 Sample

3.3 Output

3.4 Solution

 /**

  *  Licensed under the Apache License, Version 2.0 (the "License");

  *  you may not use this file except in compliance with the License.

  *  You may obtain a copy of the License at

  *

  *      http://www.apache.org/licenses/LICENSE-2.0

  *

  *  Unless required by applicable law or agreed to in writing, software

  *  distributed under the License is distributed on an "AS IS" BASIS,

  *  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.

  *  See the License for the specific language governing permissions and

  *  limitations under the License.

  */    

 package org.apache.hadoop.example;

 import java.io.File;

 import java.io.IOException;

 import org.apache.hadoop.conf.Configuration;

 import org.apache.hadoop.fs.Path;

 import org.apache.hadoop.io.IntWritable;

 import org.apache.hadoop.io.Text;

 import org.apache.hadoop.mapreduce.Job;

 import org.apache.hadoop.mapreduce.Mapper;

 import org.apache.hadoop.mapreduce.Reducer;

 import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

 import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

 import org.apache.hadoop.util.GenericOptionsParser;

 public class dataSort{

     public static class map extends Mapper<Object,Text,IntWritable,IntWritable>{

         private static IntWritable data = new IntWritable();

         String lineWords = new String();

         //map

         public void map(Object key,Text value,Context context)

                 throws IOException, InterruptedException{

             lineWords = value.toString();

             data.set(Integer.parseInt(lineWords));

             context.write(data,new IntWritable(1));

         }

     }

     public static class reduce extends Reducer<IntWritable, IntWritable,IntWritable,IntWritable>{

         private static IntWritable lineNum = new IntWritable(1);

         public void reduce(IntWritable key,Iterable<IntWritable> values,Context context)

                 throws IOException, InterruptedException{

             for(IntWritable val:values){

                 context.write(lineNum,key);

                 lineNum = new IntWritable(lineNum.get()+1);

             }

         }

     }

     public static void main(String args[])

             throws IOException, ClassNotFoundException, InterruptedException{

         Configuration conf = new Configuration();

          String[] otherArgs = new GenericOptionsParser(conf,args).getRemainingArgs();

          if( otherArgs.length!=2 ){

              System.err.println("Usage: Data Sort <in> <out>");

              System.exit(2);

          }

          //Delete the output directory if it already exists

          judgeFileExist(otherArgs[1]);

          Job job = new Job(conf,"Data Sort");

          job.setJarByClass(dataSort.class);

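          //Set the map and reduce classes. No combiner is registered: reduce emits

          //<line number, value> instead of <value, 1>, so running it as a combiner

          //would replace the data keys with line numbers before the shuffle.

          job.setMapperClass(map.class);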

          job.setReducerClass(reduce.class);

          //Set the output key and value types

          job.setOutputKeyClass(IntWritable.class);

          job.setOutputValueClass(IntWritable.class);

          //Set the input and output directories

          FileInputFormat.addInputPath(job, new Path(otherArgs[0]));

          FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));

          System.exit(job.waitForCompletion(true) ? 0 : 1);

     }

     //Delete a directory together with the files under it

       public static void judgeFileExist(String path){

           File file = new File(path);

           if( file.exists() ){

               deleteFileDir(file);

           }

       }

       public static void deleteFileDir(File path){

           if( path.isDirectory() ){

               String[] files = path.list();

               for( int i=0;i<files.length;i++ ){

                   deleteFileDir( new File(path,files[i]) );

               }

           }

           path.delete();

       }

 }
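
One point worth checking in this job is the combiner. A reducer like the one above, which rewrites <value, 1> into <line number, value>, is not safe to register as a combiner: a combiner runs on map output before the shuffle, so the real reducer would then receive line numbers as keys and sort those instead of the data. The sketch below simulates this in plain Java with made-up numbers; the class SortCombinerPitfall and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration (no Hadoop classes) of a rank-assigning reducer used as a combiner.
class SortCombinerPitfall {

    // Rank-assigning reduce step: sort the <key, value> pairs by key (the shuffle),
    // then emit <rank, key> for every pair. The incoming value is ignored.
    static List<int[]> rankReduce(List<int[]> pairs) {
        List<int[]> sorted = new ArrayList<>(pairs);
        sorted.sort(Comparator.comparingInt((int[] p) -> p[0]));
        List<int[]> out = new ArrayList<>();
        int rank = 1;
        for (int[] pair : sorted) {
            out.add(new int[] {rank, pair[0]});
            rank++;
        }
        return out;
    }

    // Second column of each pair, i.e. what would be written as the sorted data.
    static List<Integer> values(List<int[]> pairs) {
        List<Integer> column = new ArrayList<>();
        for (int[] pair : pairs) {
            column.add(pair[1]);
        }
        return column;
    }

    public static void main(String[] args) {
        // Map output for the made-up input lines 32, 5, 654: <value, 1>.
        List<int[]> mapOutput = Arrays.asList(
                new int[] {32, 1}, new int[] {5, 1}, new int[] {654, 1});

        // Reducer only: the data comes out sorted.
        System.out.println(values(rankReduce(mapOutput)));

        // Same step run as combiner and then as reducer: the data is replaced by ranks.
        System.out.println(values(rankReduce(rankReduce(mapOutput))));
    }
}
```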
