Big Data Security Part One: Introducing PacketPig
Series Introduction
Packetloop CTO Michael Baker (@cloudjunky) made a big splash when he presented ‘Finding Needles in Haystacks (the Size of Countries)’ at Blackhat Europe earlier this year. The paper outlines a toolkit based on Apache Pig, called Packetpig (@packetpig, available on GitHub), for doing network security monitoring and intrusion detection analysis on full packet captures using Hadoop.
In this series of posts, we’re going to introduce Big Data Security and explore using Packetpig on real full packet captures to understand and analyze networks. In this post, Michael will introduce big data security in the form of full data capture, Packetpig and Packetloop.
Introducing Packetpig
Intrusion detection is the analysis of network traffic to detect intruders on your network. Most intrusion detection systems (IDS) look for signatures of known attacks and identify them in real-time. Packetpig is different. Packetpig analyzes full packet captures – that is, logs of every single packet sent across your network – after the fact. In contrast to existing IDS systems, this means that using Hadoop on full packet captures, Packetpig can detect ‘zero day’ or unknown exploits on historical data as new exploits are discovered. Which is to say that Packetpig can determine whether intruders are already in your network, for how long, and what they’ve stolen or abused.
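The retrospective idea is simple to sketch: because every packet is retained, a newly published signature can be run over the entire history of the network. The record fields and byte signature below are illustrative only, not Packetpig's actual schema:

```python
# Hypothetical sketch of retrospective detection: when a new exploit
# signature is published, re-scan historical capture records for it.

# Historical "packet" records: (timestamp, src_ip, payload)
history = [
    (1001, "10.0.0.5", b"GET /index.html"),
    (1002, "10.0.0.9", b"\x90\x90\x90EXPLOIT-XYZ"),
    (1003, "10.0.0.5", b"POST /login"),
]

def rescan(records, signature):
    """Return (timestamp, src_ip) for every historical packet
    matching a newly published byte signature."""
    return [(ts, src) for ts, src, payload in records if signature in payload]

hits = rescan(history, b"EXPLOIT-XYZ")
print(hits)  # [(1002, '10.0.0.9')]
```

In Packetpig this scan is expressed as a Pig job over the capture files in HDFS or S3, so it parallelises across the cluster rather than running on one box.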
Packetpig is a Network Security Monitoring (NSM) toolset where the ‘Big Data’ is full packet captures. Like a TiVo for your network, through its integration with Snort, p0f and custom Java loaders, Packetpig does deep packet inspection, file extraction, feature extraction, operating system detection, and other deep network analysis. Packetpig’s analysis of full packet captures focuses on providing as much context as possible to the analyst – context they have never had before. This is a ‘Big Data’ opportunity.
Full Packet Capture: A Big Data Opportunity
What makes full packet capture possible is cheap storage – the driving factor behind ‘big data.’ A standard 100Mbps internet connection can be cheaply logged for months with a 3TB disk. Apache Hadoop is optimized around cheap storage and data locality: putting spindles next to processor cores. And so what better way to analyze full packet captures than with Apache Pig – a dataflow scripting interface on top of Hadoop.
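The storage claim is easy to check with back-of-the-envelope arithmetic. A saturated 100Mbps link writes about 1TB a day, so long retention depends on average utilisation being well below the line rate; the 3% figure below is an assumption for illustration:

```python
# Back-of-the-envelope retention estimate for full packet capture.
# A 100 Mbps link running flat out writes at most 12.5 MB/s.
LINK_MBPS = 100
DISK_TB = 3
UTILISATION = 0.03  # assumed 3% average utilisation; real links are rarely saturated

bytes_per_day = LINK_MBPS / 8 * 1e6 * 86400 * UTILISATION
retention_days = DISK_TB * 1e12 / bytes_per_day
print(round(retention_days))  # ~93 days, roughly three months on a 3TB disk
```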
In the enterprise today, there is no single location or system to provide a comprehensive view of a network in terms of threats, sessions, protocols and files. This information is generally distributed across domain-specific systems such as IDS correlation engines and data stores, Netflow repositories, bandwidth optimisation systems or Data Loss Prevention tools. Security Information and Event Management (SIEM) systems offer to consolidate this information, but they operate on logs – a digest or snippet of the original information. They don’t provide full-fidelity information that can be queried against an exact copy of the original incident.
Packet captures are a standard binary format for storing network data. They are cheap to perform and the data can be stored in the cloud or on low-cost disk in the Enterprise network. The length of retention can be based on the amount of data flowing through the network each day and the window of time you want to be able to peer into the past.
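Because the libpcap format is a simple, well-documented binary layout, captures can be parsed with nothing more than the standard library. A minimal sketch of reading the 24-byte global header (the byte-swapped magic 0xD4C3B2A1 would need big-endian unpacking, omitted here for brevity):

```python
import struct

# Classic pcap global header: magic, version major/minor, timezone
# offset, timestamp accuracy, snap length, link type (24 bytes).
PCAP_GLOBAL_HDR = struct.Struct("<IHHiIII")

def parse_pcap_header(buf):
    magic, vmaj, vmin, thiszone, sigfigs, snaplen, linktype = \
        PCAP_GLOBAL_HDR.unpack(buf[:24])
    if magic != 0xA1B2C3D4:
        raise ValueError("not a little-endian pcap file")
    return {"version": (vmaj, vmin), "snaplen": snaplen, "linktype": linktype}

# Build a synthetic header in place of a real capture file.
hdr = PCAP_GLOBAL_HDR.pack(0xA1B2C3D4, 2, 4, 0, 0, 65535, 1)
print(parse_pcap_header(hdr))
# {'version': (2, 4), 'snaplen': 65535, 'linktype': 1}
```

Packetpig's custom loaders do the equivalent work inside Hadoop, splitting capture files across the cluster so each mapper parses its own slice.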
Pig, Packetpig and Open Source Tools
In developing Packetpig, Packetloop wanted to provide free tools for the analysis of network packet captures spanning weeks, months or even years. Capture and storage of network data were solved problems, but no one had addressed the fundamental problem of analysis at that scale; Packetpig solves it by building on the Hadoop stack.
For us, wrapping Snort and p0f was a bit of an homage to how much security professionals value and rely on open source tools. We felt that if we didn’t offer an open source way of analysing full packet captures, we would have missed a real opportunity to pioneer in this area. We wanted it to be simple, turnkey and easy for people to take our work and expand on it. This is why Apache Pig was selected for the project.
Understanding your Network
One of the first data sets we were given to analyse was a 3TB data set from a customer: every packet in and out of their 100Mbps internet connection for six weeks. It contained approximately 500,000 attacks. Making sense of this volume of information is incredibly difficult with current tooling; even Network Security Monitoring (NSM) tools have difficulty with data at this size. However, it’s not just size and scale. No existing toolset provides the same level of context. Packetpig allows you to join together information related to threats, sessions, protocols (deep packet inspection) and files, as well as geolocation and operating system detection.
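The kind of join involved is straightforward to illustrate. In Packetpig it is written as a Pig `JOIN` across relations; the sketch below shows the same enrichment in plain Python, with made-up data and field names:

```python
# Illustrative sketch: correlating attack records with geolocation
# and OS-fingerprint data by source IP. All values are fabricated.

attacks = [
    {"src": "203.0.113.7", "signature": "SQL injection"},
    {"src": "198.51.100.2", "signature": "Port scan"},
]
geo = {"203.0.113.7": "AU", "198.51.100.2": "US"}
os_fingerprints = {"203.0.113.7": "Linux 2.6", "198.51.100.2": "Windows XP"}

# Enrich each attack with the context available for its source IP.
enriched = [
    {**a, "country": geo.get(a["src"]), "os": os_fingerprints.get(a["src"])}
    for a in attacks
]
print(enriched[0])
# {'src': '203.0.113.7', 'signature': 'SQL injection', 'country': 'AU', 'os': 'Linux 2.6'}
```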
We are currently logging all packets for a website over six months. This data set is currently around 0.6TB, and because all the packet captures are stored in S3 we can quickly scan through it. More importantly, we can run a job nightly or every 15 minutes to correlate attack information with other data from Packetpig, providing maximum context around security events.
Items of interest include:
- Detecting anomalies and intrusion signatures
- Learning the timeframe and identity of an attacker
- Triaging incidents
- “Show me packet captures I’ve never seen before.”
“Never before seen” is a powerful filter and isn’t limited to attack information. First introduced by Marcus Ranum, “never before seen” analysis can be used to rule out normal network behaviour and show only sources, attacks, and traffic flows that are truly anomalous. For example, think in terms of the outbound communications from a web server: what attacks, clients and outbound communications are new or have never been seen before? In an instant you stop looking for the normal and start looking for the abnormal, or signs of misuse.
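Ranum's filter amounts to keeping a set of everything observed so far and surfacing only new keys. A minimal sketch, using source IPs as the key (in practice the key could equally be an attack signature or a traffic flow tuple):

```python
# Minimal "never before seen" filter: yield each key only the first
# time it appears in the stream, suppressing everything already known.

def never_before_seen(stream):
    seen = set()
    for key in stream:
        if key not in seen:
            seen.add(key)
            yield key

observed = ["10.0.0.5", "10.0.0.5", "203.0.113.7", "10.0.0.5", "198.51.100.2"]
print(list(never_before_seen(observed)))
# ['10.0.0.5', '203.0.113.7', '198.51.100.2']
```

At Hadoop scale the "seen" set would live in a persistent store rather than memory, but the logic is the same: the output shrinks to only what is genuinely new.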
Agile Data
Packetloop uses the stack and iterative prototyping techniques outlined in the forthcoming book by Hortonworks’ own Russell Jurney, Agile Data (O’Reilly, March 2013). We use Hadoop, Pig, MongoDB and Cassandra to explore datasets and to encode important information into d3 visualisations. Currently we use all of these tools to aid our research before we add functionality to Packetloop. These prototypes become the palette our product is built from.