Cleaning Network Packet Data with MapReduce

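The cleaned data can be loaded directly into Hive, or fed to further MapReduce programs, to compute statistics on network traffic. The current implementation covers a fairly simple case: counting HTTP GET requests.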

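The first MapReduce job extracts the timestamp and the hexadecimal packet data and puts them on a single line. (This requires special handling of multi-line records within MapReduce's key-value model, which is a point worth noting.)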

Two main problems came up:

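A packet consists of a timestamp, a brief summary of the header, and the detailed header data. The original goal was to place one packet's timestamp and its detailed hexadecimal data (which spans many lines) on a single line, in order. In plain Java, reading the file line by line, this is easy to do.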
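The post does not show its input, so the following is only a sketch of the line types involved, assuming a tcpdump-style text dump. The sample lines, the class name `PacketLineDemo`, and both patterns are hypothetical and are not taken from the original jobs.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PacketLineDemo {
    // Timestamp such as 11:08:56.149361 (hours 00-23, literal dot before the microseconds).
    static final Pattern TIME =
        Pattern.compile("([01][0-9]|2[0-3]):[0-5][0-9]:[0-5][0-9]\\.[0-9]{6}");
    // One hex dump line (assumed layout): an offset label followed by 4-digit hex groups.
    static final Pattern HEX =
        Pattern.compile("0x[0-9a-f]{4}:\\s+((?:[0-9a-f]{4}\\s?)+)");

    /** Labels a line as a timestamp line, a hex line, or anything else. */
    static String classify(String line) {
        Matcher t = TIME.matcher(line);
        if (t.find()) {
            return "time " + t.group();
        }
        Matcher h = HEX.matcher(line);
        if (h.find()) {
            return "hex " + h.group(1).trim();
        }
        return "other";
    }

    public static void main(String[] args) {
        // Hypothetical sample of one packet: a summary line followed by hex lines.
        String[] lines = {
            "11:08:56.149361 IP 10.0.0.1.51234 > 10.0.0.2.80: Flags [P.], length 16",
            "\t0x0000:  4500 0044 1c46 4000",
            "\t0x0010:  4006 b1e6 0a00 0001"
        };
        for (String line : lines) {
            System.out.println(classify(line));
        }
    }
}
```

A sequential reader can simply append every hex line to the most recent timestamp line. The difficulty described next comes from MapReduce handing these lines to `map` one at a time.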

Given how MapReduce processes key-value pairs, I first considered two approaches:

(1) Use the timestamp line's key as the reference: every line of a packet gets the same key as its timestamp line.

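However, each map call processes only one line, and reduce only groups lines that share a key. Between map and reduce there is also a shuffle and sort phase, which sorts by key but makes no guarantee about the order of the values within a key, so the values for one key arrive out of order. (The shuffle is my guess at the cause; it may also be that machines closer to the reducer finish first and send their output straight to it, first come, first served.)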

(2) Give every line an increasing key.

Then no two lines share a key, so they cannot be merged onto one line.

The final solution:

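(3) Use the timestamp line's key as the reference, so all lines of a packet share that key, but also prefix each hexadecimal line with an increasing id before joining the lines. The joined line is out of order, but because every piece carries its own id, it only needs to be re-sorted. Neat!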
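The idea in (3) can be sketched in plain Java, without Hadoop. The class `TagAndResortDemo` and its helper names are hypothetical, and `Collections.shuffle` merely stands in for the unordered values a reducer receives.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

public class TagAndResortDemo {
    /** Prefixes each hex line with a running id, in the spirit of the first job. */
    static List<String> tag(List<String> hexLines) {
        List<String> tagged = new ArrayList<>();
        int id = 0;
        for (String line : hexLines) {
            id++;
            tagged.add("id:" + id + " " + line);
        }
        return tagged;
    }

    /** Reads the numeric id back out of a tagged line such as "id:12 4500 0044". */
    static int idOf(String tagged) {
        return Integer.parseInt(tagged.substring(3, tagged.indexOf(' ')));
    }

    /** Sorts by numeric id, strips the tags, and joins the hex lines in order. */
    static String restore(List<String> tagged) {
        List<String> copy = new ArrayList<>(tagged);
        copy.sort(Comparator.comparingInt(TagAndResortDemo::idOf));
        StringBuilder sb = new StringBuilder();
        for (String s : copy) {
            sb.append(s.substring(s.indexOf(' ') + 1)).append(' ');
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        List<String> tagged = tag(Arrays.asList("4500 0044", "1c46 4000", "4006 b1e6"));
        Collections.shuffle(tagged); // stand-in for the arbitrary value order seen by reduce
        System.out.println(restore(tagged));
    }
}
```

The ids are compared as numbers rather than as strings; a string comparison would place `id:10` before `id:9`.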

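The second MapReduce job sorts the hexadecimal data and complements the first job. With that, the cleaning is complete, and the hex value at any position can be used to analyze the data.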

The third MapReduce job counts the HTTP GET requests that were sent.

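// Job 1: give every line the key of the packet it belongs to, then join the lines per packet.
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

  // The members below belong to the job's driver class, which is not shown.
  // Both counters live in the mapper JVM, so the scheme assumes that a single
  // map task reads the capture in order.
  static int id = 1;     // packet counter, advanced on every timestamp line
  static int hexId = 1;  // running id of hex lines, used later to restore their order

  public static class TokenizerMapper
       extends Mapper<Object, Text, IntWritable, Text> {

    // Timestamp such as 11:08:56.149361 (hours 00-23, literal dot before the microseconds).
    private static final Pattern TIME_PATTERN =
        Pattern.compile("([01][0-9]|2[0-3]):([0-5][0-9]):([0-5][0-9])\\.[0-9]{6}");
    // A run of four-character groups, each followed by a space.
    // Stricter alternative: "0x[0-9]{4}:  ([A-Za-z0-9]{4} )+"
    private static final Pattern HEX_PATTERN = Pattern.compile(" ([A-Za-z0-9]{4} )+");

    private final IntWritable packetId = new IntWritable();
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      String line = value.toString();

      // Timestamp line: start a new packet and emit the time under its key.
      Matcher matchTime = TIME_PATTERN.matcher(line);
      while (matchTime.find()) {
        id = id + 1;
        word.set("time: " + matchTime.group() + " ");
        packetId.set(id);
        context.write(packetId, word);
      }

      // Hex line: reuse the current packet's key and prefix an increasing id.
      // The match starts with a space, so the output is "id:<n>", three spaces,
      // then the hex groups; job 2 matches exactly this layout.
      Matcher matchHex = HEX_PATTERN.matcher(line);
      while (matchHex.find()) {
        hexId = hexId + 1;
        word.set("id:" + hexId + "  " + matchHex.group());
        packetId.set(id);
        context.write(packetId, word);
      }
    }
  }

  public static class IntSumReducer
       extends Reducer<IntWritable, Text, IntWritable, Text> {

    private final Text result = new Text();

    @Override
    public void reduce(IntWritable key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      // Concatenate all pieces of one packet; they arrive in no guaranteed order.
      StringBuilder sum = new StringBuilder();
      for (Text val : values) {
        sum.append(val.toString());
      }
      result.set(sum.toString());
      context.write(key, result);
    }
  }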

// Job 2: restore the order of the hex lines inside each packet line written by job 1.
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, Text> {

    // "time: " followed by a timestamp such as 11:08:56.149361
    private static final Pattern TIME_PATTERN =
        Pattern.compile("time: ([01][0-9]|2[0-3]):([0-5][0-9]):([0-5][0-9])\\.[0-9]{6}");
    // One tagged hex line as written by job 1: "id:<n>", three spaces, hex groups.
    // Stricter alternative for the raw dump: "0x[0-9]{4}:  ([A-Za-z0-9]{4} )+"
    private static final Pattern HEX_PATTERN =
        Pattern.compile("id:([0-9])+   ([A-Za-z0-9]{4} )+");
    private static final Pattern ID_PATTERN = Pattern.compile("id:([0-9])+");              // sequence number of a hex line
    private static final Pattern HEX_VALUE_PATTERN = Pattern.compile("([A-Za-z0-9]{4} )+"); // hex groups of a line

    private final Text timeKey = new Text();
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      String line = value.toString();

      // Extract the timestamp and use it as the output key.
      Matcher matchTime = TIME_PATTERN.matcher(line);
      while (matchTime.find()) {
        String temptime = matchTime.group();
        // Skip the "time: " prefix. The end index length() - 1 also drops the last
        // microsecond digit, leaving a 14-character key; job 3's offsets
        // (timelen = 15, counting the tab separator) appear to rely on that width.
        String time = temptime.substring(6, temptime.length() - 1);
        timeKey.set(time);
      }

      // Collect the tagged hex lines. Bar is not shown in the original; it holds an id
      // and a hex string, and Collections.sort requires it to implement Comparable.
      List<Bar> list = new ArrayList<Bar>();
      Matcher matchHex = HEX_PATTERN.matcher(line);
      while (matchHex.find()) {
        Bar bar = new Bar();
        String hexline = matchHex.group();

        Matcher matchId = ID_PATTERN.matcher(hexline);
        while (matchId.find()) {
          bar.setId(matchId.group().substring(3)); // strip the "id:" prefix
        }

        // Keep the last match: an id of four or more digits can also match this
        // pattern, but the hex groups always come last.
        Matcher matchValue = HEX_VALUE_PATTERN.matcher(hexline);
        while (matchValue.find()) {
          bar.setHexValue(matchValue.group());
        }
        list.add(bar);
      }

      // Sort the hex lines and join them into one string.
      Collections.sort(list);
      StringBuilder buffer = new StringBuilder();
      for (Bar bar : list) {
        buffer.append(bar.getHexValue());
      }
      word.set(buffer.toString());
      context.write(timeKey, word);
    }
  }

  public static class IntSumReducer
       extends Reducer<Text, Text, Text, Text> {

    @Override
    public void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      // Identity reduce: pass every sorted packet line through unchanged.
      for (Text val : values) {
        context.write(key, val);
      }
    }
  }
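The second job refers to a `Bar` class that the post does not include. A minimal sketch of what it might look like follows, assuming it only needs the four accessors used in the mapper and a numeric ordering by id; the field names and the `main` method are illustrative, not the author's code.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class Bar implements Comparable<Bar> {
    private String id;        // sequence number of the hex line, kept as text
    private String hexValue;  // the hex groups of that line

    public String getId() { return id; }
    public void setId(String id) { this.id = id; }
    public String getHexValue() { return hexValue; }
    public void setHexValue(String hexValue) { this.hexValue = hexValue; }

    /** Orders by numeric id so that "9" sorts before "10". */
    @Override
    public int compareTo(Bar other) {
        return Long.compare(Long.parseLong(id), Long.parseLong(other.id));
    }

    public static void main(String[] args) {
        List<Bar> list = new ArrayList<>();
        String[][] rows = { {"10", "4006 b1e6 "}, {"9", "4500 0044 "} };
        for (String[] row : rows) {
            Bar bar = new Bar();
            bar.setId(row[0]);
            bar.setHexValue(row[1]);
            list.add(bar);
        }
        Collections.sort(list);
        StringBuilder sb = new StringBuilder();
        for (Bar bar : list) {
            sb.append(bar.getHexValue());
        }
        System.out.println(sb); // hex lines joined in id order
    }
}
```

Because the id is stored as a `String`, comparing it lexicographically would misorder lines once the counter passes 9, which is why the comparison parses it first.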

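// Job 3: count HTTP GET requests in the cleaned output of job 2.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

	public static class TokenizerMapper extends
			Mapper<Object, Text, Text, IntWritable> {

		private final static IntWritable one = new IntWritable(1);
		private final Text word = new Text("sumGet");

		@Override
		public void map(Object key, Text value, Context context)
				throws IOException, InterruptedException {
			int timelen = 15;                 // width of the time key plus its separator
			int getStart = timelen + 20 * 5;  // hex group 20 (0-based); a group is 4 digits plus a space
			int getEnd = getStart + 4;        // same range as timelen+20*5 .. timelen+21*5-1
			String strline = value.toString();
			// Require the whole 4-character group so substring cannot run past the end of the line.
			if (strline.length() >= getEnd) { // ||hexValue[20].equals("4854")
				String getPos = strline.substring(getStart, getEnd);
				if (getPos.equals("4745")) {  // 0x47 0x45 is ASCII "GE", the start of "GET"
					context.write(word, one);
				}
			}
		}
	}

	public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

		private final IntWritable result = new IntWritable();

		@Override
		public void reduce(Text key, Iterable<IntWritable> values, Context context)
				throws IOException, InterruptedException {
			// Sum the 1s emitted for every GET request.
			int sum = 0;
			for (IntWritable val : values) {
				sum += val.get();
			}
			result.set(sum);
			context.write(key, result);
		}
	}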
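The offset arithmetic in the third job can be checked on its own. The sketch below assumes the layout implied by the code: a 14-character time key, a tab, then hex groups of four digits plus a space. My reading, which the post does not confirm, is that group 20 is the first payload group because a 20-byte IP header and a 20-byte TCP header without options occupy groups 0 to 19. The class `GetOffsetDemo` and the sample line are hypothetical.

```java
public class GetOffsetDemo {
    static final int TIME_LEN = 15;      // 14-character time key plus the tab separator
    static final int GROUP_LEN = 5;      // four hex digits plus one space
    static final int PAYLOAD_GROUP = 20; // assumed: 40 header bytes = 20 two-byte groups

    /** True if the group at the assumed payload position is 0x4745, ASCII "GE". */
    static boolean startsWithGet(String line) {
        int start = TIME_LEN + PAYLOAD_GROUP * GROUP_LEN; // 15 + 100 = 115
        int end = start + 4;
        return line.length() >= end && line.substring(start, end).equals("4745");
    }

    /** Builds a sample line: time key, tab, 20 filler groups, then the payload groups. */
    static String sampleLine(String payloadGroups) {
        StringBuilder sb = new StringBuilder("11:08:56.14936\t");
        for (int i = 0; i < PAYLOAD_GROUP; i++) {
            sb.append("0000 ");
        }
        return sb.append(payloadGroups).toString();
    }

    public static void main(String[] args) {
        System.out.println(startsWithGet(sampleLine("4745 5420 ")));  // payload begins with "GET "
        System.out.println(startsWithGet(sampleLine("4854 5450 ")));  // payload begins with "HTTP"
    }
}
```

If the capture included link-layer headers or TCP options, the payload would start at a different group, so `PAYLOAD_GROUP` would need to change accordingly.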
