Top 10 IDs based on their value

First, set the number of reducers to 1 (for example with job.setNumReduceTasks(1)). Having each map task output every key/value pair is wasteful. Instead, each map task can output only its own top 10 IDs by value, so less data is written to disk and transferred to the reducer. To find the top 10 for a map task, we need to iterate over the whole split. In the map function, we collect each ID/value pair, add it to a data structure that keeps its entries sorted, such as a red-black tree, and keep only the top 10. In the cleanup function, we output the result.
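
The "keep only the top 10" step can be sketched in plain Java as follows. This is a minimal illustration, not code from the original post: the class name TopNTracker and its methods are hypothetical, and it assumes IDs are strings and values are longs. It uses a TreeSet, which is backed by a red-black tree, and orders entries by value and then by ID so that two IDs with the same value do not overwrite each other.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical helper, not part of Hadoop: keeps only the `limit` largest entries.
class TopNTracker {
    // Order by value, then by ID, so two IDs with the same value can coexist.
    private static final Comparator<Map.Entry<String, Long>> BY_VALUE_THEN_ID =
        (a, b) -> {
            int byValue = Long.compare(a.getValue(), b.getValue());
            return byValue != 0 ? byValue : a.getKey().compareTo(b.getKey());
        };

    private final int limit;
    // TreeSet is backed by a red-black tree (TreeMap).
    private final TreeSet<Map.Entry<String, Long>> best = new TreeSet<>(BY_VALUE_THEN_ID);

    TopNTracker(int limit) {
        this.limit = limit;
    }

    // Add one ID/value pair, then evict the smallest entry if over the limit.
    void offer(String id, long value) {
        best.add(new AbstractMap.SimpleImmutableEntry<String, Long>(id, value));
        if (best.size() > limit) {
            best.pollFirst();
        }
    }

    // Current entries, largest value first.
    List<Map.Entry<String, Long>> descending() {
        return new ArrayList<>(best.descendingSet());
    }

    public static void main(String[] args) {
        TopNTracker tracker = new TopNTracker(3);
        tracker.offer("a", 5);
        tracker.offer("b", 9);
        tracker.offer("c", 1);
        tracker.offer("d", 9);
        tracker.offer("e", 7);
        System.out.println(tracker.descending());
    }
}
```

Because the set never holds more than limit + 1 entries, memory use stays constant no matter how large the split is, and each insertion costs O(log limit).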

// Hadoop code that drives a reduce task (the map task's run() has the same shape).
// Note that cleanup() is called once, after all input has been processed.
public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    try {
        while (context.nextKey()) {
            reduce(context.getCurrentKey(), context.getValues(), context);
        }
    } finally {
        cleanup(context);
    }
}

In the map task, nothing is emitted from the map function itself; the sorted IDs are written in the cleanup function.
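
The original map task code is not reproduced here, so the following is only a sketch of the pattern in plain Java, without any Hadoop classes. The class TopTenMapperSketch, its output list (which stands in for context.write), and the tab-separated "id<TAB>value" input format are all assumptions made for illustration.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Plain-Java stand-in for a map task; the names here are hypothetical.
class TopTenMapperSketch {
    private static final int LIMIT = 10;

    // Sorted by value, then by ID, so equal values do not collide.
    private final TreeSet<Map.Entry<String, Long>> top = new TreeSet<>((a, b) -> {
        int byValue = Long.compare(a.getValue(), b.getValue());
        return byValue != 0 ? byValue : a.getKey().compareTo(b.getKey());
    });

    // Stands in for context.write(...); assumed output format is "id<TAB>value".
    private final List<String> output = new ArrayList<>();

    // Called once per record; only updates the local top 10, emits nothing.
    void map(String line) {
        String[] parts = line.split("\t");
        top.add(new AbstractMap.SimpleImmutableEntry<String, Long>(
            parts[0], Long.parseLong(parts[1])));
        if (top.size() > LIMIT) {
            top.pollFirst(); // drop the current smallest entry
        }
    }

    // Called once after the last record; emits the local top 10, largest first.
    void cleanup() {
        for (Map.Entry<String, Long> e : top.descendingSet()) {
            output.add(e.getKey() + "\t" + e.getValue());
        }
    }

    // Mirrors the shape of run(): per-record calls, then cleanup in finally.
    List<String> run(List<String> split) {
        try {
            for (String line : split) {
                map(line);
            }
        } finally {
            cleanup();
        }
        return output;
    }

    public static void main(String[] args) {
        List<String> split = new ArrayList<>();
        for (int i = 0; i < 25; i++) {
            split.add("id" + i + "\t" + i);
        }
        new TopTenMapperSketch().run(split).forEach(System.out::println);
    }
}
```

In a real job the same three methods would live in a Mapper subclass, and cleanup would write to the context instead of a list.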

The reduce task follows similar logic. (Note: there is only one reducer.)
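
A sketch of the reduce side, again in plain Java with hypothetical names. It assumes the single reducer sees the records from every map task as one stream of "id<TAB>value" strings (in Hadoop this is commonly arranged by having all mappers emit under one shared key) and applies the same bounded sorted set to get the global top 10.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Plain-Java stand-in for the single reduce task; the names are hypothetical.
class TopTenReducerSketch {
    // Merge the per-mapper candidates and keep the `limit` largest overall.
    static List<String> reduce(Iterable<String> values, int limit) {
        TreeSet<Map.Entry<String, Long>> top = new TreeSet<>((a, b) -> {
            int byValue = Long.compare(a.getValue(), b.getValue());
            return byValue != 0 ? byValue : a.getKey().compareTo(b.getKey());
        });
        for (String line : values) {
            String[] parts = line.split("\t");
            top.add(new AbstractMap.SimpleImmutableEntry<String, Long>(
                parts[0], Long.parseLong(parts[1])));
            if (top.size() > limit) {
                top.pollFirst(); // drop the current smallest entry
            }
        }
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Long> e : top.descendingSet()) {
            result.add(e.getKey() + "\t" + e.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        // Candidates from two imaginary map tasks, concatenated.
        List<String> values = Arrays.asList("a\t5", "b\t9", "c\t1", "d\t7");
        System.out.println(reduce(values, 2));
    }
}
```

With M map tasks the reducer handles at most 10 × M records, which is why a single reducer is enough here.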

Reference: https://www.youtube.com/watch?v=Bj6-maOjB8M
