Hadoop MapReduce Programming API Primer Series: Analysis of Multiple Output Formats in MapReduce (19)

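Without further ado, here is the code.

Suppose we have a file of email addresses. We want to count how many times each address occurs and, grouped by email provider (the part between "@" and the next "."), write the results to separate output files.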
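Before reading the MapReduce listing, it can help to see the intended result in plain Java. The following is only a minimal sketch and not part of the original code: the class name `EmailGroupingSketch` and its methods are hypothetical, and everything runs in memory without Hadoop. It counts each address and groups the counts by provider, which is roughly what the job below produces as separate files.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch of what the job computes; no Hadoop involved.
class EmailGroupingSketch {

    // Returns the text between '@' and the next '.', or null if the address is malformed.
    static String providerOf(String email) {
        int at = email.indexOf('@');
        int dot = email.indexOf('.', at + 1);
        if (at < 0 || dot < 0 || dot == at + 1) {
            return null;
        }
        return email.substring(at + 1, dot);
    }

    // Builds provider -> (address -> count); malformed addresses are skipped.
    static Map<String, Map<String, Integer>> group(List<String> emails) {
        Map<String, Map<String, Integer>> byProvider = new TreeMap<>();
        for (String email : emails) {
            String provider = providerOf(email);
            if (provider == null) {
                continue;
            }
            byProvider.computeIfAbsent(provider, p -> new TreeMap<>())
                      .merge(email, 1, Integer::sum);
        }
        return byProvider;
    }

    public static void main(String[] args) {
        List<String> emails = List.of(
                "zss1984@126.com", "294522652@qq.com",
                "lixinyu23@qq.com", "294522652@qq.com");
        // One line per provider, mirroring one output file per provider in the job.
        group(emails).forEach((provider, counts) ->
                System.out.println(provider + " -> " + counts));
    }
}
```

In the MapReduce version the outer map corresponds to the set of output files and the inner map to the lines written into each file.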

Code version 1

 package zhouls.bigdata.myMapReduce.Email;

 import java.io.IOException;

 import org.apache.hadoop.conf.Configuration;

 import org.apache.hadoop.conf.Configured;

 import org.apache.hadoop.fs.FileSystem;

 import org.apache.hadoop.fs.Path;

 import org.apache.hadoop.io.IntWritable;

 import org.apache.hadoop.io.LongWritable;

 import org.apache.hadoop.io.Text;

 import org.apache.hadoop.mapreduce.Job;

 import org.apache.hadoop.mapreduce.Mapper;

 import org.apache.hadoop.mapreduce.Reducer;

 import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

 import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

 import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

 import org.apache.hadoop.util.Tool;

 import org.apache.hadoop.util.ToolRunner;

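 // Writing to multiple files with MultipleOutputs; reference blog: http://www.cnblogs.com/codeOfLife/p/5452902.html

 //    The MultipleOutputs class can write data to multiple files whose names are derived from the output keys and values or from an arbitrary string.

 //  This lets each reducer (or each mapper in a map-only job) create more than one file. Map output files are named name-m-nnnnn and reduce output files name-r-nnnnn,

 //  where name is an arbitrary name set by the program and nnnnn is an integer part number (starting from 0). The part number ensures that outputs from different parts (mappers or reducers) do not collide when they share the same name.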

 public class Email extends Configured implements Tool {

     public static class MailMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

         private final static IntWritable one = new IntWritable(1);

         @Override

         protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {

             context.write(value, one);

         }

     }

     public static class MailReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

         private IntWritable result = new IntWritable();

         private MultipleOutputs< Text, IntWritable> multipleOutputs;

         @Override

         protected void setup(Context context) throws IOException ,InterruptedException{

             multipleOutputs = new MultipleOutputs<Text, IntWritable>(context);

         }

         protected void reduce(Text Key, Iterable<IntWritable> Values,Context context) throws IOException, InterruptedException {

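             int begin = Key.toString().indexOf("@");// indexOf returns the index at which the substring first occurs in the String, or -1 if it is absent.

             int end = Key.toString().indexOf(".", begin + 1);// Same method, but searching from just after the "@" so that a "." in the local part is not picked up; the result is stored in end.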

 //            Key.toString().indexOf(ch)

 //            Key.toString().indexOf(str)

 //            Key.toString().indexOf(ch, fromIndex)

 //            Key.toString().indexOf(str, fromIndex)

 //            Key.toString().intern()

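 //            Java offers four methods for locating a substring within a string:

 //            1. int indexOf(String str): returns the index of the first occurrence of the specified substring.

 //            2. int indexOf(String str, int startIndex): returns the index of the first occurrence of the specified substring, searching forward from the specified index.

 //            3. int lastIndexOf(String str): returns the index of the last (rightmost) occurrence of the specified substring.

 //            4. int lastIndexOf(String str, int startIndex): returns the index of the last occurrence of the specified substring, searching backward from the specified index.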

             if (begin < 0 || begin >= end) {// skip keys that have no "@" or no "." after it

                 return;

             }

             // Extract the email provider, e.g. qq

             String name = Key.toString().substring(begin+1, end);

 //                        String.substring(start, end) includes the character at start and excludes the character at end.

             int sum = 0;

             for (IntWritable value : Values) {

                 sum += value.get();

             }

             result.set(sum);

             multipleOutputs.write(Key, result, name);

                         // Here we use multipleOutputs.write(Text key, IntWritable value, String baseOutputPath).

 //            MultipleOutputs provides three overloads of write (they are methods, not constructors):

 //            multipleOutputs.write(String namedOutput, K key, V value);

 //            multipleOutputs.write(Text key, IntWritable value, String baseOutputPath);

 //            multipleOutputs.write(String namedOutput, K key, V value, String baseOutputPath);

 //            The file-naming scheme (name-m-nnnnn for map output, name-r-nnnnn for reduce output, with nnnnn a part number starting from 0) is described in the comment above the class.

         }

         @Override

         protected void cleanup(Context context) throws IOException ,InterruptedException{

             multipleOutputs.close();

         }

     }

     public int run(String[] args) throws Exception {

         Configuration conf = getConf();// use the configuration supplied by ToolRunner

         Path mypath = new Path(args[1]);

         FileSystem hdfs = mypath.getFileSystem(conf);// the FileSystem that holds the output path

         if (hdfs.isDirectory(mypath)) {// delete a stale output directory so the job can run again

             hdfs.delete(mypath, true);

         }

         Job job = Job.getInstance(conf);// create a new job

         job.setJarByClass(Email.class);// main class

         FileInputFormat.addInputPath(job, new Path(args[0]));// input path

         FileOutputFormat.setOutputPath(job, new Path(args[1]));// output path

         job.setMapperClass(MailMapper.class);// Mapper

         job.setReducerClass(MailReducer.class);// Reducer

         job.setOutputKeyClass(Text.class);// output key type

         job.setOutputValueClass(IntWritable.class);// output value type

         return job.waitForCompletion(true) ? 0 : 1;// non-zero exit code if the job fails

     }

     public static void main(String[] args) throws Exception {

         String[] args0 = {

                 "hdfs://HadoopMaster:9000/inputData/multipleOutputFormats/mail.txt",

                 "hdfs://HadoopMaster:9000/outData/MultipleOutputFormats/" };

         int ec = ToolRunner.run(new Configuration(), new Email(), args0);

         System.exit(ec);

     }

 }
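The file-naming convention mentioned in the comments (name-r-nnnnn, with a base output path that may or may not contain "/") can be mimicked in a few lines of plain Java. This is only an illustrative sketch: `OutputNameSketch` and `fileName` are hypothetical names, the code does not call Hadoop, and it assumes an uncompressed text output with no file extension.

```java
// Mimics the naming convention described in the comments; this is not Hadoop's own code.
class OutputNameSketch {

    // taskType is "m" for map output or "r" for reduce output;
    // partition is the part number of the task that writes the file.
    static String fileName(String baseOutputPath, String taskType, int partition) {
        // The part number is zero-padded to five digits.
        return String.format("%s-%s-%05d", baseOutputPath, taskType, partition);
    }

    public static void main(String[] args) {
        // No "/" in the base path: the file sits directly in the job output directory.
        System.out.println(fileName("qq", "r", 0));                     // prints qq-r-00000
        // With "/": the leading components become subdirectories of the output directory.
        System.out.println(fileName("029070-99999/1901/part", "r", 3)); // prints 029070-99999/1901/part-r-00003
    }
}
```

Because the part number differs per reducer, two reducers that write under the same base name still end up with different files.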

Code version 2

 package zhouls.bigdata.myMapReduce.Email;

 import java.io.IOException;

 import org.apache.hadoop.conf.Configuration;

 import org.apache.hadoop.conf.Configured;

 import org.apache.hadoop.fs.FileSystem;

 import org.apache.hadoop.fs.Path;

 import org.apache.hadoop.io.IntWritable;

 import org.apache.hadoop.io.LongWritable;

 import org.apache.hadoop.io.Text;

 import org.apache.hadoop.mapreduce.Job;

 import org.apache.hadoop.mapreduce.Mapper;

 import org.apache.hadoop.mapreduce.Reducer;

 import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

 import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

 import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

 import org.apache.hadoop.util.Tool;

 import org.apache.hadoop.util.ToolRunner;

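 // Suppose we have a file of email addresses. We want to count how many times each address occurs and, grouped by email provider, write the results to separate output files.

 /* Sample input:

 wolys@21cn.com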

 zss1984@126.com

 294522652@qq.com

 simulateboy@163.com

 zhoushigang_123@163.com

 sirenxing424@126.com

 lixinyu23@qq.com

 chenlei1201@gmail.com

 370433835@qq.com

 cxx0409@126.com

 viv093@sina.com

 q62148830@163.com

 65993266@qq.com

 summeredison@sohu.com

 zhangbao-autumn@163.com

 diduo_007@yahoo.com.cn

 fxh852@163.com

 Output files:

 /out/163-r-00000

 /out/126-r-00000

 /out/21cn-r-00000

 /out/gmail-r-00000

 /out/qq-r-00000

 /out/sina-r-00000

 /out/sohu-r-00000

 /out/yahoo-r-00000

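 /out/part-r-00000 (the default output file; empty here because the reducer writes every record through MultipleOutputs)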

 */

 public class Email extends Configured implements Tool{

     public static class MailMapper extends Mapper<LongWritable, Text, Text, IntWritable>{

         private final static IntWritable one = new IntWritable(1);// constant count of 1, emitted for every record

         @Override

         protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {

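             context.write(value, one);// emit (value, one): value is k2 (the email address) and one is v2

 //            Equivalent to: context.write(new Text(value), new IntWritable(one.get()));

 //            The input key is the byte offset of the line by default (with the default TextInputFormat); a different input format changes this.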

         }

     }

 //    Steps for writing results to multiple files or directories with MultipleOutputs:

 //    see the blog post http://tydldd.iteye.com/blog/2053867

     public static class MailReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

         private IntWritable result = new IntWritable();

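         private MultipleOutputs<Text, IntWritable> multipleOutputs;// MultipleOutputs writes results to multiple files or directories

 //        Because MultipleOutputs writes the job's results, its type parameters are the result types k3 and v3, hence MultipleOutputs<Text, IntWritable> here.

         // Create the object; the following is boilerplate, nothing to worry about

         @Override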

         protected void setup(Context context) throws IOException ,InterruptedException{

             multipleOutputs = new MultipleOutputs<Text, IntWritable>(context);

         }

         protected void reduce(Text Key, Iterable<IntWritable> Values,Context context) throws IOException, InterruptedException{

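         // Example key: 294522652@qq.com

             int begin = Key.toString().indexOf("@");// indexOf() returns the position of the first occurrence of the given string, so begin is 9

             int end = Key.toString().indexOf(".", begin + 1);// first "." after the "@" (a "." in the local part is not picked up), so end is 12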

             if (begin < 0 || begin >= end) {// skip keys that have no "@" or no "." after it

                 return;

             }

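             // Extract the email provider, e.g. qq

             String name = Key.toString().substring(begin+1, end);// substring() extracts the characters from begin+1 up to but not including end; here substring(10, 12) yields "qq"

             int sum = 0;

             for (IntWritable value : Values) {// count with an enhanced for loop, which hands each element of Iterable<IntWritable> Values to value in turn

                 sum += value.get();// read the int held by the IntWritable and add it to the int sum

             }

             result.set(sum);// the total count, i.e. how many times an address such as wolys@21cn.com occurred

             multipleOutputs.write(Key, result, name);// write Key and result to the output file whose base name is name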

              /*

               * http://www.cnblogs.com/codeOfLife/p/5452902.html

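              * The third parameter of multipleOutputs.write(key, value, baseOutputPath) gives the base path of the output, relative to the output directory specified for the job.

              * If baseOutputPath contains no file separator "/", the output file is named baseOutputPath-r-nnnnn (name-r-nnnnn);

              * if it contains "/", for example baseOutputPath="029070-99999/1901/part", the output file is 029070-99999/1901/part-r-nnnnn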

              */

         }

         // Close the object; the following is boilerplate, nothing to worry about

         @Override

         protected void cleanup(Context context) throws IOException ,InterruptedException{

             multipleOutputs.close();

         }

     }

     public int run(String[] arg0) throws Exception{

         Configuration conf = getConf();// use the configuration supplied by ToolRunner

         Path mypath = new Path(arg0[1]);// index 1 is the output path

         FileSystem hdfs = mypath.getFileSystem(conf);// the FileSystem that holds the output path

         if (hdfs.isDirectory(mypath)) {// delete a stale output directory so the job can run again

             hdfs.delete(mypath, true);

         }

         Job job = Job.getInstance(conf);// create a new job

         job.setJarByClass(Email.class);// main class

         job.setMapperClass(MailMapper.class);// Mapper

         job.setReducerClass(MailReducer.class);// Reducer

         job.setOutputKeyClass(Text.class);// output key type

         job.setOutputValueClass(IntWritable.class);// output value type

         FileInputFormat.addInputPath(job, new Path(arg0[0]));// input path

         FileOutputFormat.setOutputPath(job, new Path(arg0[1]));// output path

         return job.waitForCompletion(true) ? 0 : 1;// non-zero exit code if the job fails

     }

     public static void main(String[] args) throws Exception{

         // Cluster paths

 //        String[] args0 = { "hdfs://HadoopMaster:9000/email/email.txt",

 //                 "hdfs://HadoopMaster:9000/out/email"};

 // Local paths

         String[] args0 = { "./data/email/email.txt",

                  "out/email/"};            

         int ec = ToolRunner.run( new Configuration(), new Email(), args0);

         System.exit(ec);

     }

 }
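The index arithmetic in the reducer is easy to check outside Hadoop. The sketch below is hypothetical (the class `ProviderIndexSketch`, both method names, and the address `first.last@gmail.com` are made up for illustration). It compares looking for the first "." anywhere in the string with looking for the first "." after the "@"; the two differ only for addresses whose local part contains a dot.

```java
// Compares two ways of locating the "." that ends the provider name.
class ProviderIndexSketch {

    // Uses the first "." anywhere in the address.
    static String firstDot(String email) {
        int begin = email.indexOf("@");
        int end = email.indexOf(".");
        if (begin < 0 || begin >= end) {
            return null; // the address would be dropped
        }
        return email.substring(begin + 1, end);
    }

    // Uses the first "." that follows the "@".
    static String dotAfterAt(String email) {
        int begin = email.indexOf("@");
        int end = email.indexOf(".", begin + 1);
        if (begin < 0 || end < 0) {
            return null; // the address would be dropped
        }
        return email.substring(begin + 1, end);
    }

    public static void main(String[] args) {
        // "@" is at index 9 and the first "." at index 12, so substring(10, 12) is "qq".
        System.out.println(firstDot("294522652@qq.com"));       // prints qq
        System.out.println(dotAfterAt("294522652@qq.com"));     // prints qq
        // A dot in the local part comes before the "@", so the first variant gives up.
        System.out.println(firstDot("first.last@gmail.com"));   // prints null
        System.out.println(dotAfterAt("first.last@gmail.com")); // prints gmail
    }
}
```

For the sample data in this post both variants give the same provider names, since none of the addresses has a dot before the "@".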
