Wordcount -- MapReduce example -- Mapper

Mapper maps input key/value pairs into intermediate key/value pairs.

E.g.

Input: (docID, doc)

Output: (term, 1)
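
To make the (docID, doc) to (term, 1) mapping concrete, here is a minimal plain-Java sketch that runs without Hadoop. The class name MapSketch, the method mapDocument, and the sample document are illustrative assumptions, not part of the Hadoop API; a real Mapper emits its pairs through a Context instead of returning a list.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;

// Hypothetical stand-in for the map step; this is not Hadoop code.
class MapSketch {

    // Turns one (docID, doc) input pair into a list of (term, 1) pairs.
    static List<Map.Entry<String, Integer>> mapDocument(String docId, String doc) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(doc);
        while (itr.hasMoreTokens()) {
            // The docID is ignored: word count only needs the terms.
            pairs.add(new SimpleEntry<>(itr.nextToken(), 1));
        }
        return pairs;
    }

    public static void main(String[] args) {
        // Prints one pair per token, in document order.
        for (Map.Entry<String, Integer> p : mapDocument("doc1", "to be or not to be")) {
            System.out.println("(" + p.getKey() + ", " + p.getValue() + ")");
        }
    }
}
```

Repeated terms are emitted once per occurrence; adding the 1s up per term happens later, outside the mapper.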

Mapper Class Prototype:

Mapper<Object, Text, Text, IntWritable>

// Object:: INPUT_KEY

// Text:: INPUT_VALUE

// Text:: OUTPUT_KEY

// IntWritable:: OUTPUT_VALUE

Special Data Types for Mapper

IntWritable

A serializable, comparable wrapper for an int value.

Example:

private final static IntWritable one = new IntWritable(1);

Text

A serializable and comparable object for strings, handled at the byte level. It stores text in UTF-8 encoding.

Example:

private Text word = new Text();
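
Because Text holds UTF-8 bytes rather than Java's UTF-16 chars, the byte count of a string can differ from String.length(). The plain-Java sketch below shows that difference using only the JDK; the class name Utf8Demo and the sample word are assumptions for illustration, and no Hadoop classes are involved.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo of UTF-8 byte length versus Java char length.
class Utf8Demo {

    // Number of bytes the string occupies once encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Encode to UTF-8 bytes and decode back, as a serializer would.
    static String roundTrip(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String s = "na\u00efve";               // five characters, one of them non-ASCII
        System.out.println(s.length());        // prints 5
        System.out.println(utf8Length(s));     // prints 6
        System.out.println(roundTrip(s).equals(s)); // prints true
    }
}
```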

Hadoop defines its own classes for common data types.

-- All "values" must implement the Writable interface;

-- All "keys" must implement the WritableComparable interface, because keys are sorted before they reach the reducer.
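
Hadoop's Writable contract consists of two methods, write(DataOutput) and readFields(DataInput), and WritableComparable adds compareTo. The sketch below mirrors that contract in plain Java so that it runs without Hadoop on the classpath. The class IntBox and its roundTrip helper are hypothetical stand-ins, not Hadoop's IntWritable.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical stand-in that mirrors the Writable/WritableComparable contract.
class IntBox implements Comparable<IntBox> {
    private int value;

    IntBox() { }
    IntBox(int value) { this.value = value; }

    int get() { return value; }

    // Serialize the field to a binary stream (the "Writable" half).
    public void write(DataOutput out) throws IOException {
        out.writeInt(value);
    }

    // Overwrite this object's state from a binary stream.
    public void readFields(DataInput in) throws IOException {
        value = in.readInt();
    }

    // Ordering is what allows a type to be used as a sortable key.
    @Override
    public int compareTo(IntBox other) {
        return Integer.compare(value, other.value);
    }

    // Helper: serialize, then deserialize into a fresh object.
    static IntBox roundTrip(IntBox original) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            original.write(new DataOutputStream(bytes));
            IntBox copy = new IntBox();
            copy.readFields(new DataInputStream(
                    new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip(new IntBox(42)).get()); // prints 42
    }
}
```

A value type only needs the first two methods; a key type also needs the ordering.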

Map Method for Mapper

Method header

public void map(Object key, Text value, Context context

               ) throws IOException, InterruptedException

// Object key:: the input key, declared with the INPUT_KEY type;

// Text value:: the input value, declared with the INPUT_VALUE type;

// Context context:: the object through which the mapper emits its output (key, value) pairs.

Tokenization

// Use Java's built-in StringTokenizer to split the input value (the document text) into whitespace-separated words:

StringTokenizer itr = new StringTokenizer(value.toString());
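
With the single-argument constructor, StringTokenizer splits on whitespace only (space, tab, newline, carriage return, form feed), so punctuation and letter case are left untouched. The short demo below shows the consequence for word counting; the class name TokenizeDemo and the sample sentence are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical demo of default StringTokenizer behaviour.
class TokenizeDemo {

    // Collects the whitespace-separated tokens of the given text.
    static List<String> tokens(String text) {
        List<String> result = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(text);
        while (itr.hasMoreTokens()) {
            result.add(itr.nextToken());
        }
        return result;
    }

    public static void main(String[] args) {
        // Runs of whitespace are skipped; punctuation stays attached.
        System.out.println(tokens("Hello, world!\thello  World"));
        // prints [Hello,, world!, hello, World]
    }
}
```

Under this tokenization, "Hello," and "hello" count as different terms. Normalizing case or stripping punctuation would be an extra step that this example does not perform.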

Building (key, value) pairs

// Loop over all words:

while (itr.hasMoreTokens()) {

  // convert built-in String back to Text:

  word.set(itr.nextToken());

  // emit the (key, value) pair through the Context:

  context.write(word, one);

}

Map Method Summary

The map method writes a series of (key, value) pairs to the Mapper.Context object, which collects them as the mapper's output.

  public void map(Object key, Text value, Context context

                  ) throws IOException, InterruptedException {

    StringTokenizer itr = new StringTokenizer(value.toString());

    while (itr.hasMoreTokens()) {

      word.set(itr.nextToken());

      context.write(word, one);

    }

  }

Overview of Mapper Class

public static class TokenizerMapper
     extends Mapper<Object, Text, Text, IntWritable> {

  // Constant value 1, emitted once per token.
  private final static IntWritable one = new IntWritable(1);

  // Reusable output key, reset for every token.
  private Text word = new Text();

  @Override
  public void map(Object key, Text value, Context context
                  ) throws IOException, InterruptedException {
    // Split the input text into whitespace-separated tokens.
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, one);  // emit (term, 1)
    }
  }
}
