mapreduce (4): Implementing Grep + sort with MapReduce

1.txt

dong xi cheng
xi dong cheng
wo ai beijing
tian an men
qiche
dong
dong
dong

2.txt

dong xi cheng
xi dong cheng
wo ai beijing
tian an men
qiche
dong
dong
dong

import java.io.IOException;
import java.util.Random;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.map.InverseMapper;
import org.apache.hadoop.mapreduce.lib.map.RegexMapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;
import org.apache.hadoop.mapreduce.lib.reduce.LongSumReducer;

public class IGrep {

    public static void main(String[] args) throws IOException,
            ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        String dir_in = "hdfs://localhost:9000/input_grep";
        String dir_out = "hdfs://localhost:9000/output_grep";
        String reg = ".ng"; // matches any three-character substring that ends in "ng"
        conf.set(RegexMapper.PATTERN, reg);
        conf.setInt(RegexMapper.GROUP, 0); // group 0 = the whole match

        Path in = new Path(dir_in);
        // Intermediate output of the grep job, deleted in the finally block.
        Path tmp = new Path("grep-temp-"
                + Integer.toString(new Random().nextInt(Integer.MAX_VALUE)));
        Path out = new Path(dir_out);

        try {
            // Job 1: count the occurrences of every regex match.
            // (new Job(...) is deprecated in Hadoop 2.x in favor of Job.getInstance(conf, name).)
            Job grepJob = new Job(conf, "grep-search");
            grepJob.setJarByClass(IGrep.class);
            grepJob.setInputFormatClass(TextInputFormat.class);
            grepJob.setMapperClass(RegexMapper.class);
            grepJob.setCombinerClass(LongSumReducer.class);
            grepJob.setPartitionerClass(HashPartitioner.class);
            grepJob.setMapOutputKeyClass(Text.class);
            grepJob.setMapOutputValueClass(LongWritable.class);
            FileInputFormat.addInputPath(grepJob, in);
            grepJob.setReducerClass(LongSumReducer.class);
            // grepJob.setNumReduceTasks(1);
            grepJob.setOutputFormatClass(SequenceFileOutputFormat.class);
            grepJob.setOutputKeyClass(Text.class);
            grepJob.setOutputValueClass(LongWritable.class);
            FileOutputFormat.setOutputPath(grepJob, tmp);
            grepJob.waitForCompletion(true);

            // Job 2: swap (match, count) to (count, match) and sort by count.
            Job sortJob = new Job(conf, "grep-sort");
            sortJob.setJarByClass(IGrep.class);
            sortJob.setInputFormatClass(SequenceFileInputFormat.class);
            sortJob.setMapperClass(InverseMapper.class);
            FileInputFormat.addInputPath(sortJob, tmp);
            sortJob.setNumReduceTasks(1); // a single reducer gives a globally sorted output
            sortJob.setSortComparatorClass(LongWritable.DecreasingComparator.class); // descending order
            FileOutputFormat.setOutputPath(sortJob, out);
            sortJob.waitForCompletion(true);
        } finally {
            FileSystem.get(conf).delete(tmp, true);
        }
    }
}
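
The map phase of the first job is easy to try outside Hadoop. The sketch below is plain Java and only imitates what RegexMapper is understood to do with the configured pattern and group: run `find()` over each input line and emit one key per non-overlapping match. The class name `RegexMapSketch` and the method `matches` are made up for this illustration and are not Hadoop APIs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of Hadoop.
class RegexMapSketch {

    // Returns every non-overlapping match of the regex in one input line,
    // i.e. the keys a grep-style mapper would emit with a count of 1 each.
    static List<String> matches(String regex, int group, String line) {
        List<String> keys = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(line);
        while (m.find()) {
            keys.add(m.group(group)); // group 0 is the whole match
        }
        return keys;
    }

    public static void main(String[] args) {
        System.out.println(matches(".ng", 0, "dong xi cheng")); // [ong, eng]
        System.out.println(matches(".ng", 0, "wo ai beijing")); // [ing]
        System.out.println(matches(".ng", 0, "tian an men"));   // []
    }
}
```

Note that ".ng" is not anchored, so it matches inside longer words: "dong" contributes "ong", not "dong".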

Output:
10    ong
4    eng
2    ing
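
These counts can be checked by hand: each file contains "ong" 5 times, "eng" twice and "ing" once, and there are two files. A minimal local sketch of the whole pipeline (count the matches, then list them by descending count) is shown here in plain Java, without Hadoop. The class `LocalGrepSort` and its method `grepSort` are hypothetical names used only for this check; the sketch mirrors the two jobs in spirit, not their actual implementation.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical local stand-in for the grep + sort jobs; not a Hadoop class.
class LocalGrepSort {

    static List<String> grepSort(String regex, List<String> lines) {
        // "grep" step: sum a count of 1 for every regex match.
        Pattern pattern = Pattern.compile(regex);
        Map<String, Long> counts = new HashMap<>();
        for (String line : lines) {
            Matcher m = pattern.matcher(line);
            while (m.find()) {
                counts.merge(m.group(0), 1L, Long::sum);
            }
        }
        // "sort" step: order by count, largest first, and print count before match.
        List<Map.Entry<String, Long>> entries = new ArrayList<>(counts.entrySet());
        entries.sort(Map.Entry.<String, Long>comparingByValue(Comparator.reverseOrder()));
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Long> e : entries) {
            out.add(e.getValue() + "\t" + e.getKey());
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> file = List.of("dong xi cheng", "xi dong cheng", "wo ai beijing",
                "tian an men", "qiche", "dong", "dong", "dong");
        List<String> all = new ArrayList<>(file); // contents of 1.txt
        all.addAll(file);                         // 2.txt has the same contents
        grepSort(".ng", all).forEach(System.out::println);
    }
}
```

The order of matches that share the same count is not defined in this sketch; the sample input has no such ties.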
