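Writing a Hadoop global sort in Java

Preface:

I have never been comfortable with Java; until now I used Hadoop Streaming and wrote the mapper or reducer as an executable in C or Python. Some tasks, such as a total sort, are awkward to handle with Streaming, so here I write the map-reduce job in Hadoop's native language instead.

Development environment: Eclipse on Windows. Add the Hadoop JARs to the project, package the program, and copy it to a Linux virtual machine to run.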

Input data (two fields per line, separated by a tab)

haha	10
haha	9
haha	100
haha	1
haha	1
haha	2
haha	3
haha	1000
haha	1000
haha	999
haha	888
haha	10000

Output data, merged with cat part*-* > o.txt

haha	1
haha	1
haha	2
haha	3
haha	9
haha	10
haha	100
haha	888
haha	999
haha	1000
haha	1000
haha	10000

Code: MyMapper

package com.globalsort;

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Swaps each "name<TAB>number" record so that the number becomes the key;
// the framework then sorts records by that key within each partition.
public class MyMapper extends Mapper<LongWritable, Text, LongWritable, Text> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] segs = value.toString().split("\t");
        // Skip records that do not have exactly two tab-separated fields.
        if (segs.length != 2) {
            return;
        }
        // Parse as long to match the LongWritable key type.
        long newval = Long.parseLong(segs[1]);
        context.write(new LongWritable(newval), new Text(segs[0]));
    }
}
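The parsing step in the mapper can be tried without a cluster. The following is only a minimal sketch: the class LineParseDemo and its parse method are hypothetical names of mine, not part of the job, and it uses Long.parseLong on the assumption that the number should match the LongWritable key type.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

// Hypothetical helper for illustration; not part of the original job.
final class LineParseDemo {

    // Follows the mapper's split-and-swap idea: "name<TAB>number" becomes (number, name).
    // Returns null when the line does not have exactly two tab-separated fields.
    static Map.Entry<Long, String> parse(String line) {
        String[] segs = line.split("\t");
        if (segs.length != 2) {
            return null;
        }
        return new SimpleEntry<>(Long.parseLong(segs[1]), segs[0]);
    }

    public static void main(String[] args) {
        System.out.println(parse("haha\t10")); // well-formed record
        System.out.println(parse("haha 10"));  // no tab, so the record is skipped
    }
}
```

A line separated by spaces instead of a tab is dropped silently, which is worth remembering if the job produces less output than expected.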

Custom reducer

package com.globalsort;

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Swaps key and value back so each output line reads "name<TAB>number".
public class MyReducer extends Reducer<LongWritable, Text, Text, LongWritable> {

    @Override
    protected void reduce(LongWritable key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // Emit one line per value so that duplicate numbers are all kept.
        for (Text value : values) {
            context.write(value, key);
        }
    }
}

Custom partitioner

package com.globalsort;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Range partitioner: smaller keys go to lower-numbered partitions, so the
// part files, read in order, form one sorted sequence. The cut points 100 and
// 1000 are hard-coded for the sample data and assume three reduce tasks.
public class MyPartitioner extends Partitioner<LongWritable, Text> {

    @Override
    public int getPartition(LongWritable key, Text value, int numPartitions) {
        long tmp = key.get();
        if (tmp <= 100) {
            return 0 % numPartitions;
        } else if (tmp <= 1000) {
            return 1 % numPartitions;
        } else {
            return 2 % numPartitions;
        }
    }
}
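Why this produces a global order can be seen in a small simulation that needs no Hadoop at all. This is a rough sketch under my own assumptions: RangePartitionDemo, partitionFor, and simulate are made-up names, and sorting each list stands in for the per-reducer sort that the framework performs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical, Hadoop-free simulation; class and method names are illustrative only.
final class RangePartitionDemo {

    // Same thresholds as MyPartitioner.getPartition.
    static int partitionFor(long key, int numPartitions) {
        if (key <= 100) {
            return 0 % numPartitions;
        } else if (key <= 1000) {
            return 1 % numPartitions;
        }
        return 2 % numPartitions;
    }

    // Route keys to partitions, sort each partition, then concatenate in partition order,
    // which is what "cat part*-*" does with the reducers' output files.
    static List<Long> simulate(List<Long> keys, int numPartitions) {
        List<List<Long>> parts = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            parts.add(new ArrayList<>());
        }
        for (long k : keys) {
            parts.get(partitionFor(k, numPartitions)).add(k);
        }
        List<Long> out = new ArrayList<>();
        for (List<Long> p : parts) {
            Collections.sort(p);
            out.addAll(p);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Long> keys = Arrays.asList(10L, 9L, 100L, 1L, 1L, 2L, 3L,
                1000L, 1000L, 999L, 888L, 10000L);
        System.out.println(simulate(keys, 3));
        System.out.println(simulate(keys, 2));
    }
}
```

With three partitions the concatenation is fully sorted. With two, the modulo folds the third range back onto partition 0, so 10000 lands ahead of 888 and the global order is lost; the thresholds only work when the job really runs three reduce tasks.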

Runner

package com.globalsort;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
// The two codec imports are used only by the commented-out compression settings.
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class GlobalSortMain implements Tool {

    private Configuration conf;

    @Override
    public Configuration getConf() {
        return conf;
    }

    @Override
    public void setConf(Configuration conf) {
        this.conf = conf;
    }

    @Override
    public int run(String[] args) throws Exception {
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        // Three arguments are expected; <in> and <out> are read from positions 1 and 2.
        if (otherArgs.length != 3) {
            System.err.println("Usage: expects 3 arguments, with <in> and <out> as the last two");
            return 2; // stop here instead of failing later on a missing argument
        }
        Job job = configureJob(otherArgs);
        return job.waitForCompletion(true) ? 0 : 1;
    }

    private Job configureJob(String[] args) throws IOException {
        conf.set("mapred.job.priority", "VERY_HIGH");
        // Optional compression of map output and job output, left disabled:
        // conf.setBoolean("mapred.compress.map.output", true);
        // conf.setClass("mapred.map.output.compression.codec", GzipCodec.class, CompressionCodec.class);
        // conf.setBoolean("mapred.compress.reduce.output", true);
        // conf.setClass("mapred.reduce.output.compression.codec", GzipCodec.class, CompressionCodec.class);

        // This constructor is deprecated in Hadoop 2.x in favor of Job.getInstance(conf, name).
        Job job = new Job(conf, "global sort liuyu");
        job.setJarByClass(GlobalSortMain.class);
        job.setMapperClass(MyMapper.class);
        job.setReducerClass(MyReducer.class);
        job.setPartitionerClass(MyPartitioner.class);
        // Must match the three ranges defined in MyPartitioner.
        job.setNumReduceTasks(3);
        job.setMapOutputKeyClass(LongWritable.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[1]));
        FileOutputFormat.setOutputPath(job, new Path(args[2]));
        return job;
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Propagate the job's result as the process exit code.
        System.exit(ToolRunner.run(conf, new GlobalSortMain(), args));
    }
}
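Once the part files are merged into o.txt, it is easy to confirm that the result really is globally ordered. A minimal sketch, assuming each output line is "name<TAB>number" as in the sample output; SortedOutputCheck and isGloballySorted are hypothetical names, not part of the job.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical checker for the merged output; names are illustrative only.
final class SortedOutputCheck {

    // Returns true when the numeric second column never decreases from line to line.
    static boolean isGloballySorted(List<String> lines) {
        long previous = Long.MIN_VALUE;
        for (String line : lines) {
            String[] segs = line.split("\t");
            if (segs.length != 2) {
                continue; // skip blank or malformed lines
            }
            long current = Long.parseLong(segs[1]);
            if (current < previous) {
                return false;
            }
            previous = current;
        }
        return true;
    }

    public static void main(String[] args) {
        List<String> merged = Arrays.asList("haha\t1", "haha\t2", "haha\t9", "haha\t10");
        System.out.println(isGloballySorted(merged));
    }
}
```

For a real file, the list could come from java.nio.file.Files.readAllLines(Paths.get("o.txt")) instead of the hard-coded sample.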
