Hadoop: Sorting with Avro

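The previous example used the Avro framework to find the maximum value in the data. This example uses Avro to sort the data. The input is the same sample as before, and the output is plain text (Avro output would also work).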

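1. Set the sort order directly in the Avro schema. Avro compares records field by field in declaration order, so with the schema below records sort by year ascending (the default order) and then by temperature descending.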

dataRecord.avsc, placed in the resources directory:

{
    "type": "record",
    "name": "WeatherRecord",
    "doc": "A weather reading",
    "fields": [
        {"name": "year", "type": "int"},
        {"name": "temperature", "type": "int", "order": "descending"}
    ]
}
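The ordering this schema asks for can be mimicked in plain Java, which may help when checking expected results by hand. The sketch below is not Avro code: the `Reading` class, the `SchemaOrderSketch` class, and the sample values are made up for illustration, and the comparator simply mirrors "year ascending, then temperature descending".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for a WeatherRecord; not part of the original example.
final class Reading {
    final int year;
    final int temperature;

    Reading(int year, int temperature) {
        this.year = year;
        this.temperature = temperature;
    }

    @Override
    public String toString() {
        return year + "\t" + temperature;
    }
}

class SchemaOrderSketch {
    // Year ascending (the default order), then temperature descending,
    // mirroring the field order and the "order" attribute in dataRecord.avsc.
    static final Comparator<Reading> SCHEMA_ORDER =
            Comparator.comparingInt((Reading r) -> r.year)
                    .thenComparing(Comparator.comparingInt((Reading r) -> r.temperature).reversed());

    // Returns a sorted copy and leaves the input list untouched.
    static List<Reading> sorted(List<Reading> input) {
        List<Reading> copy = new ArrayList<>(input);
        copy.sort(SCHEMA_ORDER);
        return copy;
    }

    public static void main(String[] args) {
        List<Reading> sample = Arrays.asList(   // made-up values
                new Reading(1950, 22), new Reading(1949, 78),
                new Reading(1950, -11), new Reading(1949, 111));
        sorted(sample).forEach(System.out::println);
    }
}
```

In the real job no comparator is written by hand: the shuffle uses the key schema to compare the serialized records.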

The original constants class, now also able to load the schema from the resource file:

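import java.io.IOException;

import org.apache.avro.Schema;

public class AvroSchemas {

    private Schema currentSchema;

    // Constant kept from the previous example. The mapper in this example does not
    // use it and loads the schema from the resource file instead.
    public static final Schema SCHEMA = new Schema.Parser().parse("{\n" +
            "\t\"type\":\"record\",\n" +
            "\t\"name\":\"WeatherRecord\",\n" +
            "\t\"doc\":\"A weather reading\",\n" +
            "\t\"fields\":[\n" +
            "\t\t{\"name\":\"year\",\"type\":\"int\"},\n" +
            "\t\t{\"name\":\"temperature\",\"type\":\"int\",\"order\":\"descending\"}\n" +
            "\t]\t\n" +
            "}");

    public AvroSchemas() throws IOException {
        Schema.Parser parser = new Schema.Parser();
        // Read the Avro schema from the resource file. A name without a leading "/"
        // is resolved relative to this class's package; use "/dataRecord.avsc" if the
        // class lives in a package and the file sits at the root of resources.
        this.currentSchema = parser.parse(getClass().getResourceAsStream("dataRecord.avsc"));
    }

    public Schema getCurrentSchema() {
        return currentSchema;
    }
}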
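`getResourceAsStream` returns `null` when the resource cannot be found, which is likely to surface later as a harder-to-read error inside the parser. A small guard can make the failure explicit. This is only a sketch: `ResourceLoader` and `open` are hypothetical names that do not appear in the original code, and the leading slash in the demo assumes the file sits at the classpath root.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper; not part of the original AvroSchemas class.
class ResourceLoader {
    // Opens a classpath resource and fails with a clear message when it is missing.
    // A name starting with "/" is resolved from the classpath root; a name without
    // it is resolved relative to the package of the given class.
    static InputStream open(Class<?> owner, String name) throws IOException {
        InputStream in = owner.getResourceAsStream(name);
        if (in == null) {
            throw new FileNotFoundException("Classpath resource not found: " + name);
        }
        return in;
    }

    public static void main(String[] args) {
        // Assumes dataRecord.avsc is at the classpath root when run inside the project.
        try (InputStream in = open(ResourceLoader.class, "/dataRecord.avsc")) {
            System.out.println("Schema resource found");
        } catch (IOException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

With such a helper, the constructor could call `parser.parse(ResourceLoader.open(getClass(), "/dataRecord.avsc"))` instead of passing the stream directly.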

2. Mapper

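import java.io.IOException;

import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.mapred.AvroKey;
import org.apache.avro.mapred.AvroValue;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class AvroMapper extends Mapper<LongWritable, Text, AvroKey<GenericRecord>, AvroValue<GenericRecord>> {

    // RecordParser is not shown in this post.
    private RecordParser parser = new RecordParser();
    // private GenericRecord record = new GenericData.Record(AvroSchemas.SCHEMA);
    private AvroSchemas schema;
    private GenericRecord record;

    public AvroMapper() throws IOException {
        schema = new AvroSchemas();
        record = new GenericData.Record(schema.getCurrentSchema());
    }

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        parser.parse(value.toString());
        if (parser.isValid()) {
            record.put("year", parser.getYear());
            record.put("temperature", parser.getData());
            // The whole record is used as the key, so the shuffle sorts it by the schema's order.
            context.write(new AvroKey<>(record), new AvroValue<>(record));
        }
    }
}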

3. Reducer

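import java.io.IOException;

import org.apache.avro.generic.GenericRecord;
import org.apache.avro.mapred.AvroKey;
import org.apache.avro.mapred.AvroValue;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

// IntPair is not shown in this post.
public class AvroReducer extends Reducer<AvroKey<GenericRecord>, AvroValue<GenericRecord>, IntPair, NullWritable> {

    // Multiple output files; in this example, one file per year.
    private MultipleOutputs<IntPair, NullWritable> multipleOutputs;

    /**
     * Called once at the start of the task.
     */
    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        multipleOutputs = new MultipleOutputs<>(context);
    }

    @Override
    protected void reduce(AvroKey<GenericRecord> key, Iterable<AvroValue<GenericRecord>> values, Context context) throws IOException, InterruptedException {
        // Sorting is done during the shuffle, so the reducer only writes the data out.
        for (AvroValue<GenericRecord> value : values) {
            GenericRecord record = value.datum();
            // Multiple outputs: the year is the base file name, giving one file per year.
            multipleOutputs.write(new IntPair((Integer) record.get("year"), (Integer) record.get("temperature")), NullWritable.get(), record.get("year").toString());
            // context.write(new IntPair((Integer) record.get("year"), (Integer) record.get("temperature")), NullWritable.get());
        }
    }

    /**
     * Called once at the end of the task. MultipleOutputs must be closed here,
     * otherwise the per-year files may not be flushed.
     */
    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        multipleOutputs.close();
    }
}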
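To see what the reducer receives and writes, the flow can be simulated without Hadoop. The sketch below is a plain-Java illustration, not MapReduce code: it assumes the records already arrive in shuffle order and routes each one to a bucket keyed by year, which is the role the third argument of `multipleOutputs.write` plays above. The class name, the int-array representation, and the sample values are all made up.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java illustration only; class name and data are hypothetical.
class PerYearOutputSketch {
    // Each input element is {year, temperature}, already in shuffle order.
    // Returns one list of temperatures per year, keeping arrival order,
    // much like one output file per year.
    static Map<String, List<Integer>> route(List<int[]> sortedRecords) {
        Map<String, List<Integer>> files = new LinkedHashMap<>();
        for (int[] record : sortedRecords) {
            String baseName = Integer.toString(record[0]); // year as the base file name
            files.computeIfAbsent(baseName, k -> new ArrayList<>()).add(record[1]);
        }
        return files;
    }

    public static void main(String[] args) {
        List<int[]> sorted = new ArrayList<>();
        sorted.add(new int[] {1949, 111}); // made-up values
        sorted.add(new int[] {1949, 78});
        sorted.add(new int[] {1950, 22});
        sorted.add(new int[] {1950, -11});
        route(sorted).forEach((year, temps) -> System.out.println(year + " -> " + temps));
    }
}
```

In the real job, MultipleOutputs typically appends the task type and partition number to the base name, so a reducer output file would look like `1949-r-00000`.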

4. Job

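import org.apache.avro.mapreduce.AvroJob;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class AvroSort extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = getConf();
        conf.set("mapreduce.job.ubertask.enable", "true");

        Job job = Job.getInstance(conf, "Avro sort");
        job.setJarByClass(AvroSort.class);

        // Set the Avro key and value schemas through AvroJob rather than through Job.
        AvroJob.setMapOutputKeySchema(job, AvroSchemas.SCHEMA);
        AvroJob.setMapOutputValueSchema(job, AvroSchemas.SCHEMA);
        // AvroJob.setOutputKeySchema(job, AvroSchemas.SCHEMA);

        job.setMapperClass(AvroMapper.class);
        job.setReducerClass(AvroReducer.class);

        job.setInputFormatClass(TextInputFormat.class);
        // job.setOutputFormatClass(AvroKeyOutputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        Path outPath = new Path(args[1]);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, outPath);

        // Delete the output path if it already exists.
        FileSystem fileSystem = outPath.getFileSystem(conf);
        if (fileSystem.exists(outPath)) {
            fileSystem.delete(outPath, true);
        }

        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new AvroSort(), args);
        System.exit(exitCode);
    }
}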
