Cleaning network packet data with MapReduce

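The cleaned data can be loaded straight into Hive or fed to further MapReduce programs to compute statistics on the network traffic. The statistic implemented so far is a fairly simple one: a count of HTTP GET requests.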

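First MapReduce job: extract the timestamp and the hexadecimal packet data of each packet and put them on one line. (This requires handling several input lines under one MapReduce key, which needs special treatment and is worth noting.)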

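Two main problems came up:

In the capture, each packet consists of a timestamp, a brief summary of the header, and the detailed header data. The original goal was to place one packet's timestamp and its detailed hexadecimal data (which is spread over many lines) on a single line, in order. In plain Java, reading the file line by line, this is easy to implement.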

Given how MapReduce processes key-value pairs, two approaches came to mind at first:

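(1) Key on the timestamp line: every line that belongs to a packet gets the same key as that packet's timestamp line.

However, map handles only one line per call, and reduce only groups lines that share a key. Between map and reduce there is also the shuffle and sort phase, which orders the keys but not the values within a key. (That is probably the cause; it may also be that output from machines closer to the reducer arrives first and is processed first.) Either way, the values under one key arrive out of order.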

(2) Give every line its own increasing key.

Then no two lines share a key, so they can never be merged into one line.

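The solution finally adopted:

(3) Key on the timestamp line as in (1), so that all lines of a packet share a key, but also prefix each hexadecimal line with an increasing id before everything is merged into one line. The merged line is out of order, but every fragment carries its own id, so a later pass simply re-sorts them. Neat!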
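The idea in (3) can be tried outside Hadoop. The following is a minimal sketch in plain Java, not the author's code: the class name `ReorderById` and the sample fragments are made up for illustration, and the fragment layout ("id:", a number, three spaces, then four-character hex groups each followed by a space) is an assumption based on what the first job's mapper writes.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: restores the order of id-tagged hex fragments.
final class ReorderById {
    // Assumed layout: "id:<n>" + three spaces + hex groups, each followed by a space.
    private static final Pattern TAGGED =
            Pattern.compile("id:([0-9]+)   ((?:[A-Za-z0-9]{4} )+)");

    /** Sorts the fragments by numeric id and concatenates their hex groups. */
    static String reassemble(List<String> fragments) {
        // TreeMap keeps its keys sorted, so iteration follows the numeric id.
        Map<Integer, String> byId = new TreeMap<>();
        for (String fragment : fragments) {
            Matcher m = TAGGED.matcher(fragment);
            if (m.find()) {
                byId.put(Integer.parseInt(m.group(1)), m.group(2));
            }
        }
        StringBuilder out = new StringBuilder();
        for (String hex : byId.values()) {
            out.append(hex);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Fragments arrive in arbitrary order, as they would from the reducer.
        List<String> shuffled = Arrays.asList("id:10   0003 0004 ", "id:9   0001 0002 ");
        System.out.println(reassemble(shuffled)); // prints 0001 0002 0003 0004
    }
}
```

Comparing the ids as numbers matters here: compared as strings, "10" would sort before "9".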

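Second MapReduce job: sort the hexadecimal fragments by id. This completes the work of the first job. At this point the cleaning is done, and the hex data at any position can be examined to analyze the traffic.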

Third MapReduce job: count the HTTP GET requests that were sent.

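// Job 1: tag every line of a packet with the packet's id and merge them into one line.
// The outer class name is a placeholder; the original post shows only its members.
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class PacketMerge {

  // Counters shared by all map() calls in one JVM. They only give consistent
  // ids when the whole input is read, in order, by a single map task.
  static int id = 1;
  static int hexId = 1;

  public static class TokenizerMapper
       extends Mapper<Object, Text, IntWritable, Text> {

    // Timestamp such as 11:08:56.149361 (hours 00-23, literal dot before the microseconds).
    private static final Pattern PATTERN_TIME = Pattern.compile(
        "([01][0-9]|2[0-3]):([0-5][0-9]):([0-5][0-9])\\.[0-9]{6}");
    // A run of four-character hex groups, each followed by a space.
    // Earlier attempt: "0x[0-9]{4}:  ([A-Za-z0-9]{4} )+"
    private static final Pattern PATTERN_HEX = Pattern.compile(" ([A-Za-z0-9]{4} )+");

    private final IntWritable one = new IntWritable(2);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      String line = value.toString();

      // Match the timestamp: a timestamp line starts a new packet, so bump the packet id.
      Matcher matchTime = PATTERN_TIME.matcher(line);
      while (matchTime.find()) {
        String time = "time: " + matchTime.group() + " ";
        id = id + 1;
        word.set(time);
        one.set(id);
        context.write(one, word);
      }

      // Match the hex data: it inherits the current packet id as key
      // and gets its own running id so that it can be re-sorted later.
      Matcher matchHex = PATTERN_HEX.matcher(line);
      while (matchHex.find()) {
        String hex = " " + matchHex.group();
        hexId = hexId + 1;
        hex = "id:" + hexId + " " + hex;
        word.set(hex);
        one.set(id);
        context.write(one, word);
      }
    }
  }

  public static class IntSumReducer
       extends Reducer<IntWritable, Text, IntWritable, Text> {

    private final Text result = new Text();

    @Override
    public void reduce(IntWritable key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      // Concatenate everything that shares a packet id, in arrival order.
      StringBuilder sum = new StringBuilder();
      for (Text val : values) {
        sum.append(val.toString());
      }
      result.set(sum.toString());
      context.write(key, result);
    }
  }
}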
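The two patterns used by the first job can be checked without a cluster. The sketch below is only an illustration: the class name `PatternCheck` and the sample lines are invented, and the input is assumed to look like a tcpdump hex dump. It also shows two pitfalls: an hour pattern of the form `[0-2][0-4]` rejects hours such as 19, and the hex pattern drops the last group of a line when that group is not followed by a space.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical test harness for the timestamp and hex patterns.
final class PatternCheck {
    // Hour part restricted to 00-23, dot escaped.
    static final Pattern TIME =
            Pattern.compile("([01][0-9]|2[0-3]):([0-5][0-9]):([0-5][0-9])\\.[0-9]{6}");
    // Narrower variant whose hour part only accepts digits 0-4 in second place.
    static final Pattern NARROW_TIME =
            Pattern.compile("([0-2][0-4]):([0-5][0-9]):([0-5][0-9]).[0-9]{6}");
    // Hex groups of four characters, each followed by a space.
    static final Pattern HEX = Pattern.compile(" ([A-Za-z0-9]{4} )+");

    /** Returns the first match of the pattern in the line, or null if there is none. */
    static String firstMatch(Pattern pattern, String line) {
        Matcher m = pattern.matcher(line);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Sample lines are assumptions about the capture format.
        String header = "19:08:56.149361 IP 10.0.0.1.51000 > 10.0.0.2.80: Flags [P.]";
        String dump = "\t0x0000:  4500 003c 1c46 4000";
        System.out.println(firstMatch(TIME, header));
        System.out.println(firstMatch(NARROW_TIME, header));
        System.out.println(firstMatch(HEX, dump));
    }
}
```

If the real capture has no trailing space after the last hex group of a line, the hex pattern would need an optional final group to avoid losing data.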

// Job 2: sort the hex fragments of each merged line by their id.
// The outer class name is a placeholder. Bar is the author's helper class and is not
// shown in the post; Collections.sort requires it to implement Comparable<Bar>.
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class HexSort {

  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, Text> {

    // "time: 11:08:56.149361" as written by job 1.
    private static final Pattern PATTERN_TIME = Pattern.compile(
        "time: ([01][0-9]|2[0-3]):([0-5][0-9]):([0-5][0-9])\\.[0-9]{6}");
    // "id:<n>", three spaces, then the hex groups, exactly as written by job 1.
    // Group 1 is the fragment id, group 2 the hex text of that fragment.
    private static final Pattern PATTERN_HEX = Pattern.compile(
        "id:([0-9]+)   ((?:[A-Za-z0-9]{4} )+)");

    private final Text one = new Text();
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      String line = value.toString();

      // Match the timestamp and use it as the output key. The "time: " prefix
      // and the last digit are cut off, which leaves 14 characters.
      // If a line has no timestamp, the key of the previous line is reused.
      Matcher matchTime = PATTERN_TIME.matcher(line);
      while (matchTime.find()) {
        String temptime = matchTime.group();
        String time = temptime.substring(6, temptime.length() - 1);
        one.set(time);
      }

      // Collect the hex fragments together with their ids.
      List<Bar> list = new ArrayList<Bar>();
      Matcher matchHex = PATTERN_HEX.matcher(line);
      while (matchHex.find()) {
        Bar bar = new Bar();
        bar.setId(matchHex.group(1));       // running id of this hex line
        bar.setHexValue(matchHex.group(2)); // the hex line itself
        list.add(bar);
      }

      // Sort by id and join the fragments back into one string.
      Collections.sort(list);
      StringBuilder buffer = new StringBuilder();
      for (Bar bar : list) {
        buffer.append(bar.getHexValue());
      }
      word.set(buffer.toString());
      context.write(one, word);
    }
  }

  public static class IntSumReducer
       extends Reducer<Text, Text, Text, Text> {

    @Override
    public void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      // Identity reduce: pass every sorted line through unchanged.
      for (Text val : values) {
        context.write(key, val);
      }
    }
  }
}
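The second job's mapper relies on a `Bar` class that the post does not show. The following is a guess at what such a class might look like, not the author's code: only `setId`, `setHexValue`, `getHexValue` and the fact that `Collections.sort` is called on a list of `Bar` come from the mapper; the field names, `getId` and the numeric comparison are assumptions.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical reconstruction of the helper used by the second job's mapper.
final class Bar implements Comparable<Bar> {
    private String id;        // running id of the hex line, kept as text
    private String hexValue;  // hex groups of that line

    public void setId(String id) { this.id = id; }
    public String getId() { return id; }
    public void setHexValue(String hexValue) { this.hexValue = hexValue; }
    public String getHexValue() { return hexValue; }

    // Compare ids as numbers so that "9" sorts before "10".
    @Override
    public int compareTo(Bar other) {
        return Integer.compare(Integer.parseInt(id), Integer.parseInt(other.id));
    }

    public static void main(String[] args) {
        List<Bar> list = new ArrayList<>();
        for (String id : new String[] {"10", "9", "2"}) {
            Bar bar = new Bar();
            bar.setId(id);
            bar.setHexValue("hex" + id + " ");
            list.add(bar);
        }
        Collections.sort(list);
        for (Bar bar : list) {
            System.out.print(bar.getHexValue()); // prints hex2 hex9 hex10
        }
        System.out.println();
    }
}
```

If the original `compareTo` compared the ids as strings, fragments with ids of different lengths would come out in the wrong order, so the numeric comparison is the safer choice.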

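// Job 3: count the lines whose hex group at index 20 is "4745" (ASCII "GE"), i.e. HTTP GET.
// The outer class name is a placeholder; the original post shows only its members.
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class GetCount {

  public static class TokenizerMapper extends
      Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text("sumGet");

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      // Presumably the 14-character key from job 2 plus the tab that
      // TextOutputFormat writes between key and value.
      int timelen = 15;
      // Each hex group takes 5 characters ("xxxx "); skip the first 20 groups.
      int getStart = timelen + 20 * 5;
      int getEnd = getStart + 4;
      String strline = value.toString();
      // Check against the end of the substring; testing only against getStart
      // can still throw StringIndexOutOfBoundsException on short lines.
      if (strline.length() >= getEnd) {
        String getPos = strline.substring(getStart, getEnd);
        // "4745" is "GE"; "4854" ("HT") would mark an HTTP response instead.
        if (getPos.equals("4745")) {
          context.write(word, one);
        }
      }
    }
  }

  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      // Sum the 1s emitted by the mapper.
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }
}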
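The constants in the third job are easier to follow with the arithmetic spelled out. The sketch below is an illustration under assumptions, not the author's code: the class name `GetOffset` and the sample line are invented, and it is assumed that the dump starts at a 20-byte IPv4 header followed by a 20-byte TCP header without options, so that the payload begins at byte 40, which is hex group 20 (two bytes per group).

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper that mirrors the offset arithmetic of the GET counter.
final class GetOffset {
    /** Hex dump group (four hex digits) for the first two characters of an ASCII token. */
    static String firstGroup(String ascii) {
        byte[] b = ascii.getBytes(StandardCharsets.US_ASCII);
        return String.format("%02x%02x", b[0], b[1]);
    }

    /** True if the hex group at groupIndex, after a prefix of prefixLen characters, is "4745". */
    static boolean startsWithGet(String line, int prefixLen, int groupIndex) {
        int start = prefixLen + groupIndex * 5; // each group is "xxxx " = 5 characters
        int end = start + 4;
        return line.length() >= end && line.substring(start, end).equals("4745");
    }

    public static void main(String[] args) {
        System.out.println(firstGroup("GET")); // prints 4745
        // Made-up line: 14-character key, a tab, 20 header groups, then the payload.
        StringBuilder line = new StringBuilder("11:08:56.14936\t");
        for (int i = 0; i < 20; i++) {
            line.append("0000 ");
        }
        line.append("4745 5420 ");
        System.out.println(startsWithGet(line.toString(), 15, 20)); // prints true
    }
}
```

A fixed group index only works while both headers have their minimum size. TCP options, which are common on real connections, move the payload further back, so a more robust version would read the header lengths from the packet itself.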
