mapreduce (六) MapReduce实现去重 NullWritable的使用

习题来源：http://www.cnblogs.com/xia520pi/archive/2012/06/04/2534533.html
file1

2012-3-1 a

2012-3-2 b

2012-3-3 c

2012-3-4 d

2012-3-5 a

2012-3-6 b

2012-3-7 c

2012-3-3 c

file2

2012-3-1 b

2012-3-2 a

2012-3-3 b

2012-3-4 d

2012-3-5 a

2012-3-6 c

2012-3-7 d

2012-3-3 c

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;

import org.apache.hadoop.fs.Path;

import org.apache.hadoop.io.NullWritable;

import org.apache.hadoop.io.Text;

import org.apache.hadoop.mapreduce.Job;

import org.apache.hadoop.mapreduce.Mapper;

import org.apache.hadoop.mapreduce.Reducer;

import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MyDedup {

    public static class LineNullMapper extends Mapper<Object, Text, Text, NullWritable>{

        public void map(Object key, Text value, Context context) throws IOException, InterruptedException{

            context.write(value, NullWritable.get());

        }

    }

    public static class SortReducer extends Reducer<Text, NullWritable, Text, NullWritable>{

        public void reduce(Text key, Iterable<NullWritable> values, Context context) throws IOException, InterruptedException{

            context.write(key, NullWritable.get());

        }

    }

如果把Iterable<NullWritable> values 替换为 NullWritable values 如果是不用Iterable迭代器的话，则不进行分组么？
结果是：只排序了并没有完成去重
2012-3-1 a

2012-3-1 b

2012-3-2 a

2012-3-2 b

2012-3-3 b

2012-3-3 c

2012-3-3 c

2012-3-3 c

2012-3-4 d

2012-3-4 d

2012-3-5 a

2012-3-5 a

2012-3-6 b

2012-3-6 c

2012-3-7 c

2012-3-7 d

public static void main(String[] args) throws Exception {

        String dir_in = "hdfs://localhost:9000/in_dedup";

        String dir_out = "hdfs://localhost:9000/out_dedup";

        Path in = new Path(dir_in);

        Path out = new Path(dir_out);

        Configuration conf = new Configuration();

        Job sortJob = new Job(conf, "my_dedup");

        sortJob.setJarByClass(MyDedup.class);

        sortJob.setInputFormatClass(TextInputFormat.class);

        sortJob.setMapperClass(LineNullMapper.class);

        sortJob.setCombinerClass(SortReducer.class);

        //countJob.setPartitionerClass(HashPartitioner.class);

        sortJob.setMapOutputKeyClass(Text.class);

        sortJob.setMapOutputValueClass(NullWritable.class);

        FileInputFormat.addInputPath(sortJob, in);

        sortJob.setReducerClass(SortReducer.class);

        // countJob.setNumReduceTasks(1);

        sortJob.setOutputKeyClass(Text.class);

        sortJob.setOutputValueClass(NullWritable.class);

        //countJob.setOutputFormatClass(SequenceFileOutputFormat.class);

        FileOutputFormat.setOutputPath(sortJob, out);

        sortJob.waitForCompletion(true);

    }

}

运行结果：
2012-3-1 a

2012-3-1 b

2012-3-2 a

2012-3-2 b

2012-3-3 b

2012-3-3 c

2012-3-4 d

2012-3-5 a

2012-3-6 b

2012-3-6 c

2012-3-7 c

2012-3-7 d

mapreduce (六) MapReduce实现去重 NullWritable的使用的更多相关文章

Hadoop阅读笔记（二）——利用MapReduce求平均数和去重
前言:圣诞节来了,我怎么能虚度光阴呢?!依稀记得,那一年,大家互赠贺卡,短短几行字,字字融化在心里:那一年,大家在水果市场,寻找那些最能代表自己心意的苹果香蕉梨,摸着冰冷的水果外皮,内心早已滚烫.这一 ...
mapreduce (五) MapReduce实现倒排索引修改版 combiner是把同一个机器上的多个map的结果先聚合一次
(总感觉上一篇的实现有问题)http://www.cnblogs.com/i80386/p/3444726.html combiner是把同一个机器上的多个map的结果先聚合一次现重新实现一个: 思路 ...
mapreduce (二) MapReduce实现倒排索引(一) combiner是把同一个机器上的多个map的结果先聚合一次
1 思路:0.txt MapReduce is simple1.txt MapReduce is powerfull is simple2.txt Hello MapReduce bye MapRed ...
MapReduce编程：单词去重
编程实现单词去重要用到NullWritable类型. NullWritable: NullWritable 是一种特殊的Writable 类型,由于它的序列化是零长度的,所以没有字节被写入流或从流中读 ...
实验六 MapReduce实验：二次排序
实验指导: 6.1 实验目的基于MapReduce思想,编写SecondarySort程序. 6.2 实验要求要能理解MapReduce编程思想,会编写MapReduce版本二次排序程序,然后将其执行 ...
hadoop —— MapReduce例子（数据去重）
参考:http://eric-gcm.iteye.com/blog/1807468 例子1: 概要:数据去重描述:将file1.txt.file2.txt中的数据合并到一个文件中的同时去掉重复的内容 ...
MapReduce(一) mapreduce基础入门
一.mapreduce入门 1.什么是mapreduce 首先让我们来重温一下 hadoop 的四大组件:HDFS:分布式存储系统MapReduce:分布式计算系统YARN: hadoop 的资源调度 ...
mapreduce (四) MapReduce实现Grep+sort
1.txt dong xi cheng xi dong cheng wo ai beijing tian an men qiche dong dong dong 2.txt dong xi cheng ...
mapreduce (三) MapReduce实现倒排索引(二)
hadoop api http://hadoop.apache.org/docs/r1.0.4/api/org/apache/hadoop/mapreduce/Reducer.html 改变一下需求: ...

随机推荐

Myeclipse2013 SVN安装方法
1. 打开Help下的Install from Site 2. 弹出窗口,如下图: 3. 点击Add标签,如图: 在对话框Name输入Svn, URL中输入:http://subclipse.tigr ...
未能从程序集“WindowsBase, Version=3.0.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35“ 中加载“System.Windows.SplashSceen”
通过添加windowsbase.dll,可以解决这个问题,你可以在自己的电脑上找到这个文件,地址是:C:\Program Files\Reference Assemblies\Microsoft\Fr ...
[React + Mobx] Mobx and React intro: syncing the UI with the app state using observable and observer
Applications are driven by state. Many things, like the user interface, should always be consistent ...
linux的文件系统及节点表
linux的文件系统及节点表一 linux的文件系统1 我们都知道当我们安装linux时会首先给系统分区,然后我们会把分区格式化成EXT3格式的文件系统.那么在linux系统中还有没有其他的文件系 ...
PHP编译安装出错configure: error: mcrypt.h not found. Please reinstall libmcrypt的解决办法
1.下载libmcrypt wget http://jaist.dl.sourceforge.net/project/mcrypt/Libmcrypt/2.5.8/libmcrypt-2.5.8.ta ...
linux中history命令使用与配置
history中设置显示命令的执行时间 vi /root/.bashrc HISTTIMEFORMAT="%Y-%M-%D %H:%M:%S" export HISTTIMEFOR ...
poj 3565 ants
/* poj 3565 递归分治还有用KM的做法这里写的分治按紫书上的方法不过那里说的有点冗杂了可以简化一下首先为啥可以分治也就是分成子问题解决只要有一个集合黑白的个数相等就一定能 ...
（转）兼容主流浏览器的CSS透明代码
透明往往能产生不错的网页视觉效果下面是兼容主流浏览器的CSS透明代码.transparent_class { filter:alpha(opacity=50); -moz-opacity:0.5; - ...
原生JS+tween.js模仿微博发布效果
转载请注明出处:http://www.cnblogs.com/zhangmingze/p/4816865.html 1.先看效果吧,有效果才有动力: 2.html结构: <!DOCTYPE ht ...
转 sqlserver字段描述相关操作sql
可以自己查询系统表: SELECT o.name AS tableName, c.name AS columnName, p.[value] AS Description FROM sysproper ...

mapreduce (六) MapReduce实现去重 NullWritable的使用

mapreduce (六) MapReduce实现去重 NullWritable的使用的更多相关文章

随机推荐

热门专题