We've seen the internals of MapReduce in the last post. Now we can make a little change to the WordCount and create a JAR for being executed by Hadoop. 
If we look at the result of the WordCount we ran before, the lines of the file are only split by space, and thus all other punctuation characters and symbols remain attached to the words and in some way "invalidate" the count. For example, we can see some of these wrong values:

KING 1
KING'S 1
KING. 5
King 6
King, 4
King. 2
King; 1

The word is always king but sometimes it appears in upper case, sometimes in lower case, or with a punctuation character after it. So we'd like to update the behaviour of the WordCount program to count all the occurrences of any word, aside from the punctuation and other symbols. In order to do so, we have to modify the code of the mapper, since it's here that we get the data from the file and split it. 
If we look at the code of mapper:

public static class TokenizerMapper extends Mapper<object, text,="" intwritable=""> {

      private final static IntWritable one = new IntWritable(1);
private Text word = new Text(); public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
StringTokenizer itr = new StringTokenizer(value.toString());
while (itr.hasMoreTokens()) {
word.set(itr.nextToken());
context.write(word, one);
}
}
}

we see that the StringTokenizer takes the line as the parameter; before doing that, we remove all the symbols from the line using a RegExp that maps each of these symbols into a space:

String cleanLine = value.toString().toLowerCase().replaceAll("[_|$#<>\\[\\]\\*/\\\\,;,.\\-:()?!\"']", " ");

that simply says "if you see any of these character _, |, $, #, <, >, [, ], *, /, \, ,, ;, ., -, :, ,(, ), ?, !, ", ' transform it into a space". All the backslashes are needed for correctly escaping the characters. Then we trim the token to avoid empty tokens:

itr.nextToken().trim()

So now the code looks like this (the updates to the original file are printed in bold):

public static class TokenizerMapper extends Mapper<object, text,="" intwritable=""> {

 private final static IntWritable one = new IntWritable(1);
private Text word = new Text(); @Override
public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
String cleanLine = value.toString().toLowerCase().replaceAll("[_|$#<>\\^=\\[\\]\\*/\\\\,;,.\\-:()?!\"']", " ");
StringTokenizer itr = new StringTokenizer(cleanLine);
while (itr.hasMoreTokens()) {
word.set(itr.nextToken().trim());
context.write(word, one);
}
}
}

The complete code of the class is available on my github repository.

We now want to create the JAR to be executed; to do it, we have to go to the output directory of our wordcount file (the directory that contains the .class files) and create a manifest filenamed manifest_file that contains something like this:

Manifest-Version: 1.0
Main-Class: samples.wordcount.WordCount

in which we tell the JAR which class to execute at startup. More details in Java official documentation. Note that there's no need to add classpath info, because the JAR will be run by Hadoop that already has a classpath that contains all needed libraries. 
Now we can launch the command:

$ jar cmf manifest_file wordcount.jar .

that creates a JAR named wordcount.jar that contains all classes starting from this directory (the . parameter) e using the content of manifest_file to create a Manifest. 
We can run the batch as we saw before. Looking at the result file we can check that there are no more symbols and punctuation characters; the occurrences of the word king is now the sum of all occurrences we found before:

king 20

which is what we were looking for.

from: http://andreaiacono.blogspot.com/2014/02/creating-jar-for-hadoop.html

为Hadoop创建JAR包文件Creating a JAR for Hadoop的更多相关文章

  1. 有引用外部jar包时(J2SE)生成jar文件

    一.工程没有引用外部jar包时(J2SE) 选中工程---->右键,Export...--->Java--->选择JAR file--->next-->选择jar fil ...

  2. jmeter,badboy,jar包文件 常数吞吐量计时器?

    badboy录制脚本 1.按f2 红色开始录制 URL输入:https://www.so.com/ 2.搜索框输入zxw 回车键搜索 3.选中关键字(刮例如zxw软件——>tools——> ...

  3. spring原理案例-基本项目搭建 02 spring jar包详解 spring jar包的用途

    Spring4 Jar包详解 SpringJava Spring AOP: Spring的面向切面编程,提供AOP(面向切面编程)的实现 Spring Aspects: Spring提供的对Aspec ...

  4. Eclipse用Runnable JAR file方式打jar包,并用该jar包进行二次开发

    目录: 1.eclipse创建Java项目(带jar包的) 2. eclipse用Export的Runnable JAR file方式打jar包(带jar包的) 打jar包 1)class2json1 ...

  5. 多个jar包合并成一个jar包(ant)

    https://blog.csdn.net/gzl003csdn/article/details/53539133 多个jar包合并成一个jar 使用Apache的Ant是一个基于Java的生成工具. ...

  6. spring boot:在项目中引入第三方外部jar包集成为本地jar包(spring boot 2.3.2)

    一,为什么要集成外部jar包? 不是所有的第三方库都会上传到mvnrepository, 这时我们如果想集成它的第三方库,则需要直接在项目中集成它们的jar包, 在操作上还是很简单的, 这里用luos ...

  7. 多个jar包合并成一个jar包的办法

    步骤: 1.将多个JAR包使用压缩软件打开,并将全包名的类拷贝到一个临时目录地下. 2.cmd命令到该临时目录下,此时会有很多.class文件,其中需要带完整包路径 3.执行 jar -cvfM te ...

  8. Ant打包可运行的Jar包(加入第三方jar包)

    本章介绍使用ant打包可运行的Jar包. 打包jar包最大的问题在于如何加入第三方jar包使得jar文件可以直接运行.以下用一个实例程序进行说明. 程序结构: 关键代码: package com.al ...

  9. 【Maven学习】Maven打包生成普通jar包、可运行jar包、包含所有依赖的jar包

    http://blog.csdn.net/u013177446/article/details/54134394 ******************************************* ...

随机推荐

  1. jquery 修改样式

    //显示待办数字 function showdb(url,ID) {   jQuery.get(url,function(data,status){ if(!isNaN(data)) {  if(da ...

  2. 自定义mvc验证特性,手机号号段老增加,给自定义一个RegularExpress

    public class PhoneExpressionAttribute: RegularExpressionAttribute, IClientValidatable { public Phone ...

  3. 洛谷P4151 [WC2011] 最大XOR和路径 [线性基,DFS]

    题目传送门 最大XOR和路径 格式难调,题面就不放了. 分析: 一道需要深刻理解线性基的题目. 好久没打过线性基的题了,一开始看到这题还是有点蒙逼的,想了几种方法全被否定了.还是看了大佬的题解才会做的 ...

  4. 2017/11/20 Leetcode 日记

    2017/11/14 Leetcode 日记 442. Find All Duplicates in an Array Given an array of integers, 1 ≤ a[i] ≤ n ...

  5. FastReport.Net使用:[34]小册子报表(奇偶页)

    打印一份小册子类型的报表,能实现如下要求: ●单独的封面,目录,报表内容,背面 ●奇偶页不同的页边距 ●奇偶页不同的页面/页脚 下面的例子将用到以上3点. 1.奇偶页的实现主要通过报表控件对象的Pri ...

  6. FastReport.Net使用:[23]图表(Chart)控件

    图表基本设置 1.拖放一个图表控件到报表设计界面中. 2.右键菜单“编辑”或者双击图表进入图表编辑器 3.将原有的簇状柱状图删除,添加圆环图 4.绑定数据源,并且指定X,Y轴数据. X轴数据为科目名称 ...

  7. HDU 1213(并查集)

    How Many Tables Time Limit: 2000/1000 MS (Java/Others)    Memory Limit: 65536/32768 K (Java/Others)T ...

  8. 【BZOJ 1923】1923: [Sdoi2010]外星千足虫 (高斯消元异或 | BITSET用法)

    1923: [Sdoi2010]外星千足虫 Description Input 第一行是两个正整数 N, M. 接下来 M行,按顺序给出 Charles 这M次使用“点足机”的统计结果.每行 包含一个 ...

  9. BZOJ4599[JLoi2016&LNoi2016]成绩比较(dp+拉格朗日插值)

    这个题我们首先可以dp,f[i][j]表示前i个科目恰好碾压了j个人的方案数,然后进行转移.我们先不考虑每个人的分数,先只关心和B的相对大小关系.我们设R[i]为第i科比B分数少的人数,则有f[i][ ...

  10. 我的OI生涯 第七章 终篇

    11.10日. 我们TSOI再次来到了熟悉的燕山大学,只不过这次是真刀真枪的干了. 望着那座熟悉的小桥,身边的好友不知此行过后还有多少. 下午才到,出人意外的是这次没有住燕大宾馆而是选择了熟悉的格林豪 ...