Hadoop: Sorting with Avro

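The previous example used the Avro framework to find the maximum value in the data. This example uses Avro to sort the data instead. The input is the same sample as before, and the output is written as text (Avro output would work as well).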

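1. Set the sort order directly in the Avro schema.

Avro compares records field by field in declaration order, and each field sorts ascending unless it declares an "order" attribute. With the schema below, records therefore sort by year ascending and then by temperature descending.

dataRecord.avsc, placed in the resources directory: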

{
    "type":"record",
    "name":"WeatherRecord",
    "doc":"A weather reading",
    "fields":[
        {"name":"year","type":"int"},
        {"name":"temperature","type":"int","order":"descending"}
    ]
}
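To make the effect of the "order" attribute concrete, here is a minimal sketch of the ordering this schema asks for. It uses a plain Java Comparator as a stand-in, not Avro's own comparison code, and the Reading class and the sample values are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Plain-Java stand-in for the ordering the schema requests; this is not Avro's comparator.
class SortOrderSketch {

    // Hypothetical value type mirroring the two schema fields.
    static final class Reading {
        final int year;
        final int temperature;

        Reading(int year, int temperature) {
            this.year = year;
            this.temperature = temperature;
        }

        @Override
        public String toString() {
            return year + "\t" + temperature;
        }
    }

    // Fields are compared in declaration order: year ascending (the default),
    // then temperature descending (mirroring "order":"descending").
    static final Comparator<Reading> SCHEMA_ORDER =
            Comparator.comparingInt((Reading r) -> r.year)
                    .thenComparing(Comparator.comparingInt((Reading r) -> r.temperature).reversed());

    // Returns a sorted copy and leaves the input list untouched.
    static List<Reading> sorted(List<Reading> input) {
        List<Reading> copy = new ArrayList<>(input);
        copy.sort(SCHEMA_ORDER);
        return copy;
    }

    public static void main(String[] args) {
        List<Reading> sample = Arrays.asList(
                new Reading(1950, 0),
                new Reading(1949, 78),
                new Reading(1950, 22),
                new Reading(1949, 111));
        // Prints 1949/111, 1949/78, 1950/22, 1950/0 (one tab-separated pair per line).
        for (Reading r : sorted(sample)) {
            System.out.println(r);
        }
    }
}
```

In the real job no such comparator is written by hand: the map output key schema carries the ordering, and the shuffle applies it.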

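The original constants class, which can now also load the schema from the resource file:

import java.io.IOException;
import java.io.InputStream;

import org.apache.avro.Schema;

public class AvroSchemas {
    private Schema currentSchema;

    // Constant from the previous example. In this example the mapper loads the
    // schema from the resource file instead of using this constant.
    public static final Schema SCHEMA = new Schema.Parser().parse("{\n" +
            "\t\"type\":\"record\",\n" +
            "\t\"name\":\"WeatherRecord\",\n" +
            "\t\"doc\":\"A weather reading\",\n" +
            "\t\"fields\":[\n" +
            "\t\t{\"name\":\"year\",\"type\":\"int\"},\n" +
            "\t\t{\"name\":\"temperature\",\"type\":\"int\",\"order\":\"descending\"}\n" +
            "\t]\t\n" +
            "}");

    public AvroSchemas() throws IOException {
        // Read the Avro schema from the resource file. The leading "/" resolves the
        // name against the classpath root, where files from the resources directory
        // end up, rather than against this class's package.
        try (InputStream in = getClass().getResourceAsStream("/dataRecord.avsc")) {
            this.currentSchema = new Schema.Parser().parse(in);
        }
    }

    public Schema getCurrentSchema() {
        return currentSchema;
    }
}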

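2. Mapper

import java.io.IOException;

import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.mapred.AvroKey;
import org.apache.avro.mapred.AvroValue;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// RecordParser is the project's own line parser and is not shown here.
public class AvroMapper extends Mapper<LongWritable,Text,AvroKey<GenericRecord>,AvroValue<GenericRecord>> {
    private RecordParser parser = new RecordParser();
//    private GenericRecord record = new GenericData.Record(AvroSchemas.SCHEMA);
    private AvroSchemas schema;
    private GenericRecord record;

    public AvroMapper() throws IOException {
        schema = new AvroSchemas();
        record = new GenericData.Record(schema.getCurrentSchema());
    }

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        parser.parse(value.toString());
        if (parser.isValid()) {
            record.put("year", parser.getYear());
            record.put("temperature", parser.getData());
            // The whole record is the key, so the shuffle sorts by the schema's field order.
            // The same record is also passed as the value so the reducer can read it back.
            context.write(new AvroKey<>(record), new AvroValue<>(record));
        }
    }
}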

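3. Reducer

import java.io.IOException;

import org.apache.avro.generic.GenericRecord;
import org.apache.avro.mapred.AvroKey;
import org.apache.avro.mapred.AvroValue;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

// IntPair is the project's own Writable pair class and is not shown here.
public class AvroReducer extends Reducer<AvroKey<GenericRecord>,AvroValue<GenericRecord>,IntPair,NullWritable> {
    // Multiple output files: in this example, one file per year.
    private MultipleOutputs<IntPair,NullWritable> multipleOutputs;

    /**
     * Called once at the start of the task.
     *
     * @param context the task context
     */
    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        multipleOutputs = new MultipleOutputs<>(context);
    }

    @Override
    protected void reduce(AvroKey<GenericRecord> key, Iterable<AvroValue<GenericRecord>> values, Context context) throws IOException, InterruptedException {
        // Sorting is already done during the shuffle, so the reducer only writes the data out.
        for (AvroValue<GenericRecord> value : values) {
            GenericRecord record = value.datum();
            // Write to a per-year file, using the year as the base output name.
            multipleOutputs.write(new IntPair((Integer) record.get("year"), (Integer) record.get("temperature")), NullWritable.get(), record.get("year").toString());
//            context.write(new IntPair((Integer) record.get("year"),(Integer)(record.get("temperature"))),NullWritable.get());
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // Close MultipleOutputs so that every per-year file is flushed and closed.
        multipleOutputs.close();
    }
}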
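The routing done by the reducer can be pictured with a small plain-Java simulation. This is only a sketch: it does not use Hadoop, it assumes the records arrive already sorted by the shuffle, and it assumes each output line is the year and temperature separated by a tab (the real line format depends on IntPair.toString(), which is not shown here). The class name and sample values are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simulates writing already-sorted readings into one named output per year.
class PerYearOutputSketch {

    // Each input row is {year, temperature}. The returned map goes from the base
    // output name (the year) to the lines written under it, in arrival order.
    static Map<String, List<String>> route(int[][] sortedReadings) {
        Map<String, List<String>> files = new LinkedHashMap<>();
        for (int[] reading : sortedReadings) {
            String baseName = Integer.toString(reading[0]);
            // Assumed line format: year, tab, temperature.
            String line = reading[0] + "\t" + reading[1];
            files.computeIfAbsent(baseName, k -> new ArrayList<>()).add(line);
        }
        return files;
    }

    public static void main(String[] args) {
        int[][] sorted = {{1949, 111}, {1949, 78}, {1950, 22}, {1950, 0}};
        for (Map.Entry<String, List<String>> file : route(sorted).entrySet()) {
            System.out.println(file.getKey() + " -> " + file.getValue());
        }
    }
}
```

In the actual job, MultipleOutputs appends a task suffix to the base name, so with a single reducer the file for 1949 would be named 1949-r-00000. Because the reducer only appends records in the order it receives them, each per-year file keeps the descending temperature order produced by the shuffle.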

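4. Job

import org.apache.avro.mapreduce.AvroJob;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class AvroSort extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = getConf();
        conf.set("mapreduce.job.ubertask.enable","true");
        Job job = Job.getInstance(conf,"Avro sort");
        job.setJarByClass(AvroSort.class);

        // Set the Avro schemas of the map output key and value through AvroJob
        // rather than through Job. AvroSchemas.SCHEMA defines the same schema as
        // dataRecord.avsc, so it matches the records the mapper emits.
        AvroJob.setMapOutputKeySchema(job, AvroSchemas.SCHEMA);
        AvroJob.setMapOutputValueSchema(job, AvroSchemas.SCHEMA);
//        AvroJob.setOutputKeySchema(job,AvroSchemas.SCHEMA);

        job.setMapperClass(AvroMapper.class);
        job.setReducerClass(AvroReducer.class);

        job.setInputFormatClass(TextInputFormat.class);
//        job.setOutputFormatClass(AvroKeyOutputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        Path outPath = new Path(args[1]);
        FileSystem fileSystem = outPath.getFileSystem(conf);
        // Delete the output path if it already exists, so the job can be rerun.
        if (fileSystem.exists(outPath)) {
            fileSystem.delete(outPath, true);
        }

        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new AvroSort(), args);
        System.exit(exitCode);
    }
}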
