Hadoop MapReduce programming example: preliminary cleaning and filtering of system logs
When I first started with Hadoop, I assumed I had to set up a Hadoop cluster before I could learn MapReduce (MR) programming. That is not actually necessary. Of course, if you have the machines available, installing and configuring your own cluster is the best way to understand how everything works, but today we will demonstrate how to write and debug a MapReduce program without installing Hadoop at all.
Before we start, let's review how MapReduce works.
A Hadoop cluster consists of two kinds of nodes: DataNodes, which store the data itself, and the NameNode, which stores the metadata about that data. When a MapReduce job starts, the input is first read from the cluster's file system through the InputFormat and divided into splits according to the configured split size (by default one block, 128 MB). A RecordReader (RR) then iterates over each split, typically line by line, and feeds the records to the map function, which processes them according to your code and emits key-value pairs. Between map and reduce, three things happen: the Combiner (an optional map-side pre-aggregation, essentially a local reduce), the Partitioner (which decides how the map output is distributed across the reducers), and Shuffle & Sort (which groups and sorts the map output by key and transfers it to the reducer chosen by the partitioner). Finally, each reduce processes its input according to your code and writes the result, again as key-value pairs.
Note:
The format of the map input key-value pairs is determined by the InputFormat. With the default TextInputFormat, each line is processed as one record: the key is the byte offset of the line from the start of the file, and the value is the text of the line.
The format of the map output key-value pairs must match the format of the reduce input key-value pairs.
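To make these notes concrete, here is a minimal mapper skeleton (a sketch only; the class name LineMapper is made up for illustration and is not part of the program below) showing the types that the default TextInputFormat supplies:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// With TextInputFormat the input key is the line's byte offset (LongWritable)
// and the input value is the line's text (Text). The two output types, Text
// and IntWritable, must match the reducer's input key-value types.
public class LineMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // key   = byte offset of this line within the file
        // value = the full text of this line
        context.write(value, new IntWritable(1)); // emit the line with a count of 1
    }
}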
To get a feel for the overall flow, walk through how WordCount processes its input.
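For example, tracing a single input line "hello world hello bye" through the stages:

    map input:       (0, "hello world hello bye")
    map output:      (hello, 1) (world, 1) (hello, 1) (bye, 1)
    shuffle & sort:  bye -> [1]   hello -> [1, 1]   world -> [1]
    reduce output:   (bye, 1) (hello, 2) (world, 1)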
Now let's start writing our MapReduce program locally.
First, download a Hadoop release from the official site (this article uses Hadoop 2.6.0). There is no need to install it; we only need the jars inside the package.
Download link: https://archive.apache.org/dist/hadoop/common/ (pick whichever version suits you; the most downloaded one is fine).
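If your project is built with Maven, an alternative to adding the jars by hand is to declare the Hadoop client dependency instead; a sketch for 2.6.0 (adjust the version to match your download):

<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-client</artifactId>
  <version>2.6.0</version>
</dependency>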
Here is the MapReduce code:
package loganalysis;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCount {

  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
    private String imei = "";
    private String areacode = "";
    private String responsedata = "";
    private String requesttime = "";
    private String requestip = "";

    // The map input key-value format is determined by the InputFormat. With the
    // default TextInputFormat each line is one record: the key is the byte offset
    // of the line within the file and the value is the text of the line.
    // The map output key-value format must match the reduce input key-value format.
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {

      String line = value.toString();

      // Locate each field by name; searching from offset 21 skips the leading
      // "yyyy-MM-dd HH:mm:ss " timestamp at the start of every log line.
      int areai   = line.indexOf("areacode", 21);
      int imeii   = line.indexOf("imei", 21);
      int redatai = line.indexOf("responsedata", 21);
      int retimei = line.indexOf("requesttime", 21);
      int reipi   = line.indexOf("requestip", 21);

      // For each field: skip past the field name plus the quote/colon characters
      // that follow it, then cut the value off at its closing delimiter.
      if (areai == -1) {
        areacode = "";
      } else {
        areacode = line.substring(areai + 11);
        int len2 = areacode.indexOf("\"");
        areacode = (len2 <= 1) ? "" : areacode.substring(0, len2);
      }

      if (imeii == -1) {
        imei = "";
      } else {
        // imei sits inside the escaped requestinfo JSON, so its value ends at a backslash.
        imei = line.substring(imeii + 9);
        int len2 = imei.indexOf("\\");
        imei = (len2 <= 1) ? "" : imei.substring(0, len2);
      }

      if (redatai == -1) {
        responsedata = "";
      } else {
        responsedata = line.substring(redatai + 15);
        int len2 = responsedata.indexOf("\"");
        responsedata = (len2 <= 1) ? "" : responsedata.substring(0, len2);
      }

      if (retimei == -1) {
        requesttime = "";
      } else {
        requesttime = line.substring(retimei + 14);
        int len2 = requesttime.indexOf("\"");
        requesttime = (len2 <= 1) ? "" : requesttime.substring(0, len2);
      }

      if (reipi == -1) {
        requestip = "";
      } else {
        requestip = line.substring(reipi + 12);
        int len2 = requestip.indexOf("\"");
        requestip = (len2 <= 1) ? "" : requestip.substring(0, len2);
      }

      // Only emit a record when every field was successfully extracted.
      if (!imei.isEmpty() && !areacode.isEmpty() && !responsedata.isEmpty()
          && !requesttime.isEmpty() && !requestip.isEmpty()) {
        String wd = imei + "\t" + areacode + "\t" + responsedata + "\t"
            + requesttime + "\t" + requestip;
        word.set(wd);
        context.write(word, one);
      }
    }
  }

  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // For a real cluster run, take the paths from the command line instead:
    // String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    String[] otherArgs = new String[]{"/Users/mac/tmp/inputmr", "/Users/mac/tmp/output1"};
    if (otherArgs.length != 2) {
      System.err.println("Usage: wordcount <in> <out>");
      System.exit(2);
    }
    Job job = Job.getInstance(conf);
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
Apart from JDK 1.7, all of the jars used above come from the share directory inside the Hadoop package.
If you are not sure which jars you need, simply add every jar under share/hadoop/ to your project.
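For reference, in the 2.6.0 binary package the relevant jars are typically spread across these directories (each with a lib subdirectory holding its dependencies):

    share/hadoop/common      and share/hadoop/common/lib
    share/hadoop/hdfs        and share/hadoop/hdfs/lib
    share/hadoop/mapreduce   and share/hadoop/mapreduce/lib
    share/hadoop/yarn        and share/hadoop/yarn/lib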
Note: my machine is a Mac Pro. If you are on Windows, the paths need to be adjusted by adding a "file:///" prefix (for example file:///D:/tmp/input and file:///D:/tmp/output). Use forward slashes, since a bare backslash such as \t would be interpreted as an escape character inside a Java string literal:
String[] otherArgs = new String[]{"file:///D:/tmp/input", "file:///D:/tmp/output"};
The core of this program is the map function: it extracts a few key fields from each system log line, joins them with tabs, and emits the result as the key, with the value always set to 1 to make later counting straightforward. Since this is only an example, just a handful of fields are extracted; a real job would handle many more.
Here is the test system log:
- 2016-04-18 16:00:00 {"areacode":"浙江省丽水市","countAll":0,"countCorrect":0,"datatime":"4134362","logid":"201604181600001184409476","requestinfo":"{\"sign\":\"4\",\"timestamp\":\"1460966390499\",\"remark\":\"4\",\"subjectPro\":\"123456\",\"interfaceUserName\":\"12345678900987654321\",\"channelno\":\"100\",\"imei\":\"12345678900987654321\",\"subjectNum\":\"13989589062\",\"imsi\":\"12345678900987654321\",\"queryNum\":\"13989589062\"}","requestip":"36.16.128.234","requesttime":"2016-04-18 16:00:00","requesttype":"0","responsecode":"010005","responsedata":"无查询结果"}
- 2016-04-18 16:00:00 {"areacode":"宁夏银川市","countAll":0,"countCorrect":0,"datatime":"4715990","logid":"201604181600001858043208","requestinfo":"{\"sign\":\"4\",\"timestamp\":\"1460966400120\",\"remark\":\"4\",\"subjectPro\":\"123456\",\"interfaceUserName\":\"12345678900987654321\",\"channelno\":\"1210\",\"imei\":\"A0000044ABFD25\",\"subjectNum\":\"15379681917\",\"imsi\":\"460036951451601\",\"queryNum\":\"\"}","requestip":"115.168.93.87","requesttime":"2016-04-18 16:00:00","requesttype":"0","responsecode":"010005","responsedata":"无查询结果","userAgent":"ZTE-Me/Mobile"}
- 2016-04-18 16:00:00 {"areacode":"黑龙江省哈尔滨市","countAll":0,"countCorrect":0,"datatime":"5369561","logid":"201604181600001068429609","requestinfo":"{\"interfaceUserName\":\"12345678900987654321\",\"queryNum\":\"\",\"timestamp\":\"1460966400139\",\"sign\":\"4\",\"imsi\":\"460030301212545\",\"imei\":\"35460207765269\",\"subjectNum\":\"55588237\",\"subjectPro\":\"123456\",\"remark\":\"4\",\"channelno\":\"2100\"}","requestip":"42.184.41.180","requesttime":"2016-04-18 16:00:00","requesttype":"0","responsecode":"010005","responsedata":"无查询结果"}
- 2016-04-18 16:00:00 {"areacode":"浙江省丽水市","countAll":0,"countCorrect":0,"datatime":"4003096","logid":"201604181600001648238807","requestinfo":"{\"sign\":\"4\",\"timestamp\":\"1460966391025\",\"remark\":\"4\",\"subjectPro\":\"123456\",\"interfaceUserName\":\"12345678900987654321\",\"channelno\":\"100\",\"imei\":\"12345678900987654321\",\"subjectNum\":\"13989589062\",\"imsi\":\"12345678900987654321\",\"queryNum\":\"13989589062\"}","requestip":"36.16.128.234","requesttime":"2016-04-18 16:00:00","requesttype":"0","responsecode":"010005","responsedata":"无查询结果"}
- 2016-04-18 16:00:00 {"areacode":"广西南宁市","countAll":0,"countCorrect":0,"datatime":"4047993","logid":"201604181600001570024205","requestinfo":"{\"sign\":\"4\",\"timestamp\":\"1460966382871\",\"remark\":\"4\",\"subjectPro\":\"123456\",\"interfaceUserName\":\"12345678900987654321\",\"channelno\":\"1006\",\"imei\":\"A000004853168C\",\"subjectNum\":\"07765232589\",\"imsi\":\"460031210400007\",\"queryNum\":\"13317810717\"}","requestip":"219.159.72.3","requesttime":"2016-04-18 16:00:00","requesttype":"0","responsecode":"010005","responsedata":"无查询结果"}
- 2016-04-18 16:00:00 {"areacode":"海南省五指山市","countAll":0,"countCorrect":0,"datatime":"5164117","logid":"201604181600001227842048","requestinfo":"{\"sign\":\"4\",\"timestamp\":\"1460966399159\",\"remark\":\"4\",\"subjectPro\":\"123456\",\"interfaceUserName\":\"12345678900987654321\",\"channelno\":\"1017\",\"imei\":\"A000005543AFB7\",\"subjectNum\":\"089836329061\",\"imsi\":\"460036380954376\",\"queryNum\":\"13389875751\"}","requestip":"140.240.171.71","requesttime":"2016-04-18 16:00:00","requesttype":"0","responsecode":"010005","responsedata":"无查询结果"}
- 2016-04-18 16:00:00 {"areacode":"山西省","countAll":0,"countCorrect":0,"datatime":"14075772","logid":"201604181600001284030648","requestinfo":"{\"sign\":\"4\",\"timestamp\":\"1460966400332\",\"remark\":\"4\",\"subjectPro\":\"123456\",\"interfaceUserName\":\"12345678900987654321\",\"channelno\":\"1006\",\"imei\":\"A000004FE0218A\",\"subjectNum\":\"03514043633\",\"imsi\":\"460037471517070\",\"queryNum\":\"\"}","requestip":"1.68.5.227","requesttime":"2016-04-18 16:00:00","requesttype":"0","responsecode":"010005","responsedata":"无查询结果"}
- 2016-04-18 16:00:00 {"areacode":"四川省","countAll":0,"countCorrect":0,"datatime":"6270982","logid":"201604181600001173504863","requestinfo":"{\"sign\":\"4\",\"timestamp\":\"1460966398896\",\"remark\":\"4\",\"subjectPro\":\"123456\",\"interfaceUserName\":\"12345678900987654321\",\"channelno\":\"100\",\"imei\":\"12345678900987654321\",\"subjectNum\":\"13666231300\",\"imsi\":\"12345678900987654321\",\"queryNum\":\"13666231300\"}","requestip":"182.144.66.97","requesttime":"2016-04-18 16:00:00","requesttype":"0","responsecode":"010005","responsedata":"无查询结果"}
- 2016-04-18 16:00:00 {"areacode":"浙江省","countAll":0,"countCorrect":0,"datatime":"4198522","logid":"201604181600001390637240","requestinfo":"{\"sign\":\"4\",\"timestamp\":\"1460966399464\",\"remark\":\"4\",\"subjectPro\":\"123456\",\"interfaceUserName\":\"12345678900987654321\",\"channelno\":\"100\",\"imei\":\"12345678900987654321\",\"subjectNum\":\"05533876327\",\"imsi\":\"12345678900987654321\",\"queryNum\":\"05533876327\"}","requestip":"36.23.9.49","requesttime":"2016-04-18 16:00:00","requesttype":"0","responsecode":"000000","responsedata":"操作成功"}
- 2016-04-18 16:00:00 {"areacode":"江苏省连云港市","countAll":0,"countCorrect":0,"datatime":"4408097","logid":"201604181600001249944032","requestinfo":"{\"sign\":\"4\",\"timestamp\":\"1460966395908\",\"remark\":\"4\",\"subjectPro\":\"123456\",\"interfaceUserName\":\"12345678900987654321\",\"channelno\":\"100\",\"imei\":\"12345678900987654321\",\"subjectNum\":\"18361451463\",\"imsi\":\"12345678900987654321\",\"queryNum\":\"18361451463\"}","requestip":"58.223.4.210","requesttime":"2016-04-18 16:00:00","requesttype":"0","responsecode":"010005","responsedata":"无查询结果"}
Finally, the results of the run:
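The original screenshot is not reproduced here, but for the sample log above each output line should be the tab-joined key followed by the number of matching log records. The first and fourth sample records share the same imei, areacode, responsedata, requesttime and requestip, so they should collapse into one line with a count of 2; roughly:

    12345678900987654321    浙江省丽水市    无查询结果    2016-04-18 16:00:00    36.16.128.234    2
    A0000044ABFD25          宁夏银川市      无查询结果    2016-04-18 16:00:00    115.168.93.87    1
    ...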