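[Python] Crawling details of every vulnerability entry on the WooYun forum

Each vulnerability entry contains:

WooYun ID, vulnerability title, affected vendor, white hat (reporter), vulnerability type, and the Rank value assigned by the vendor or the platform

The data is mainly intended for analysis:
for example, counting a vendor's vulnerabilities by type,
or assessing the abilities of individual white hats.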
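The two analyses mentioned above could be sketched roughly as follows. This is a minimal sketch, not part of the original script: it assumes each record is one line with the six fields joined by `----`, which is the format the crawler script writes. The names `parse_record`, `count_types_by_vendor`, `total_rank_by_author` and the records in `SAMPLE` are made up for illustration.

```python
from collections import Counter, defaultdict

FIELD_SEP = '----'  # separator the crawler uses between fields
FIELDS = ('id', 'title', 'vendor', 'author', 'bug_type', 'rank')


def parse_record(line):
    """Split one record line into a dict, or return None if it is malformed."""
    parts = line.rstrip('\n').split(FIELD_SEP)
    if len(parts) != len(FIELDS):
        return None  # e.g. a title that itself contains '----'
    return dict(zip(FIELDS, parts))


def count_types_by_vendor(lines):
    """Return {vendor: Counter({bug_type: count})} for all well-formed records."""
    stats = defaultdict(Counter)
    for line in lines:
        rec = parse_record(line)
        if rec is not None:
            stats[rec['vendor']][rec['bug_type']] += 1
    return stats


def total_rank_by_author(lines):
    """Sum the numeric Rank values per white hat, skipping 'Null' ranks."""
    totals = Counter()
    for line in lines:
        rec = parse_record(line)
        if rec is not None and rec['rank'].isdigit():
            totals[rec['author']] += int(rec['rank'])
    return totals


# Made-up records for illustration only; not taken from the real data set.
SAMPLE = [
    'wooyun-0000-0000001----Example SQLi----ExampleCorp----alice----SQL injection----10',
    'wooyun-0000-0000002----Example XSS----ExampleCorp----bob----XSS----5',
    'wooyun-0000-0000003----Another SQLi----ExampleCorp----alice----SQL injection----Null',
    'broken line without enough fields',
]

if __name__ == '__main__':
    print(dict(count_types_by_vendor(SAMPLE)['ExampleCorp']))
    print(dict(total_rank_by_author(SAMPLE)))
```

To run it on real output, pass an open file object instead of `SAMPLE`. The encoding to use when opening that file depends on how it was written, so treat any particular choice as an assumption.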

Data last updated: 2016/5/27
Vulnerability entries: 104,796

Data download (Baidu Netdisk):

Link: http://pan.baidu.com/s/1bpDNKOv  Password: 6y57

Crawler script:

# coding:utf-8
# author: anka9080
# version: 1.0  py3

import sys,re,time,socket
from requests import get
from queue import Queue, Empty
from threading import Thread

# Global state
COUNT = 1
START_URL = 'http://wooyun.org/bugs'
ID_DETAILS = []  # one '----'-joined record per crawled entry
ALL_ID = []
Failed_ID = []  # IDs whose details could not be parsed or written
PROXIES = []

HEADERS = {
	"Accept": "text/html,application/xhtml+xml,application/xml,application/json;q=0.9,image/webp,*/*;q=0.8",
	"Accept-Encoding": "gzip, deflate, sdch",
	"Accept-Language": "zh-CN,zh;q=0.8",
	"Cache-Control": "max-age=0",
	"Connection": "keep-alive",
	"DNT": "1",
	"Host": "wooyun.org",
	"Upgrade-Insecure-Requests": "1",
	"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2716.0 Safari/537.36"
}

class WooYunSpider(Thread):
	"""Worker thread that fetches and parses WooYun bug pages taken from a shared queue."""

	def __init__(self,queue):
		Thread.__init__(self)
		# re.S is set at compile time so that '.' also matches newlines
		self.pattern1 = re.compile(r'title>(.*?)\| WooYun.*?keywords" content="(.*?),(.*?),(.*?),wooyun',re.S)
		self.pattern2 = re.compile(r"漏洞Rank：(\d{1,3})")
		self.queue = queue
		self.start() # runs run() in the new thread

	def run(self):
		"Take one ID from the queue per iteration until the queue is empty."
		global COUNT,RES_LOG,ERR_LOG
		while True:
			try:
				id = self.queue.get(block = False)
				# requests applies no timeout unless one is passed explicitly; 10 s is an arbitrary choice
				r = get('http://wooyun.org/bugs/' + id,headers = HEADERS,timeout = 10)
				html = r.text
			except Empty:
				break
			except Exception as e:
				msg = '[ - Socket_Excpt ] Connection refused, re-queueing: ' + id
				print(msg)
				ERR_LOG.write(msg+'\n')
				self.queue.put(id)  # on failure, put this ID back into the queue
			else:
				title,comp,author,bug_type,rank = self.get_detail(html,id)
				detail = id+'----'+title+'----'+comp+'----'+author+'----'+bug_type+'----'+rank
				try: # writing may raise a gbk encoding error; record the id as failed
					RES_LOG.write(detail + '\n')
				except Exception as e:
					Failed_ID.append(id)
					msg = '[ - Encode_Excpt ] Character encoding error: ' + id
					print(msg)
					ERR_LOG.write(msg+'\n')
				ID_DETAILS.append(detail)
			# time.sleep(1)
			print('[ - info ] id: {}  count: {}  time: {:.2f}s'.format(id,COUNT,time.time() - start))
			COUNT += 1

	# Extract the vendor, bug type and related fields for the given bug ID
	def get_detail(self,html,id):
		global ERR_LOG
		try:
			# print(html)
			res = self.pattern1.search(html)
			title = res.group(1).strip()
			comp = res.group(2).strip()
			author = res.group(3).strip()
			bug_type = res.group(4).strip()
		except Exception as e:
			msg = '[ - Detail_Excpt ] Could not parse the title and related fields: ' + id
			print(msg)
			ERR_LOG.write(msg+'\n')
			Failed_ID.append(id)
			title,comp,author,bug_type,rank = 'Null','Null','Null','Null','Null'
		else:
			try:
				res2 = self.pattern2.search(html)  # rank is 'Null' if the vendor has not responded yet
				rank = res2.group(1).strip()
			except Exception as e:
				msg = '[ - Rank_Excpt ] Could not parse the Rank: ' + id
				print(msg)
				ERR_LOG.write(msg+'\n')
				rank = 'Null'
		finally:
			try:
				print (title,comp,author,bug_type,rank)
			except Exception as e:
				msg = '[ - Print_Excpt ] Character encoding error: ' + id +'::'+ str(e)
				print(msg)
				ERR_LOG.write(msg+'\n')
			return title,comp,author,bug_type,rank

class ThreadPool(object):
	def __init__(self,thread_num,id_file):
		self.queue = Queue() # queue of IDs still to be crawled
		self.threads = [] # list of worker threads
		self.add_task(id_file)
		self.init_threads(thread_num)

	def add_task(self,id_file):
		with open(id_file) as input:
			for id in input.readlines():
				self.queue.put(id.strip())

	def init_threads(self,thread_num):
		for i in range(thread_num):
			print ('[ - info :] loading threading ---> ',i)
			# time.sleep(1)
			self.threads.append(WooYunSpider(self.queue)) # the list holds the crawler threads

	def wait(self):
		for t in self.threads:
			if t.is_alive():  # isAlive() was removed in Python 3.9
				t.join()

# Timing test: three separate patterns, each searched 500 times on one page
def test():
	url = 'http://wooyun.org/bugs/wooyun-2016-0177647'
	r = get(url,headers = HEADERS)
	html = r.text
	# print type(html)
	# keywords" content="(.*?),(.*?),(.*?),wooyun  ====> vendor, white hat, bug type
	pattern1 = re.compile(r'title>(.*?)\| WooYun')
	pattern2 = re.compile(r'keywords" content="(.*?),(.*?),(.*?),wooyun')
	pattern3 = re.compile(r'漏洞Rank：(\d{1,3})')
	for x in range(500):
		res = pattern1.search(html)
		# print (res.group(1))
		res = pattern2.search(html)
		# print (res.group(1),res.group(2),res.group(3))
		res = pattern3.search(html)
		# print (res.group(1))
		x += 1
		print(x)
	# rank = res.group(4).strip()
	# print html

# Timing test: one combined pattern searched 500 times on the same page
def test2():
	url = 'http://wooyun.org/bugs/wooyun-2016-0177647'
	r = get(url,headers = HEADERS)
	html = r.text
	pattern = re.compile(r'title>(.*?)\| WooYun.*?keywords" content="(.*?),(.*?),(.*?),wooyun.*?漏洞Rank：(\d{1,3})',re.S)
	for x in range(500):
		res = pattern.search(html)
		# print (res.group(1),res.group(2),res.group(3),res.group(4),res.group(5))
		x += 1
		print(x)

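# Save the results
def save2file(filename,filename_failed_id):
	# An explicit utf-8 encoding avoids the gbk encoding errors of a platform default
	with open(filename,'w',encoding='utf-8') as output:
		for item in ID_DETAILS:
			try: # any remaining write error is ignored
				output.write(item + '\n')
			except Exception as e:
				pass
	with open(filename_failed_id,'w',encoding='utf-8') as output:
		output.write('\n'.join(Failed_ID))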

if __name__ == '__main__':
	socket.setdefaulttimeout(1)
	start = time.time()
	# test()
	# Log files; utf-8 avoids gbk encoding errors when writing Chinese text
	ERR_LOG = open('err_log.txt','w',encoding='utf-8')
	RES_LOG = open('res_log.txt','w',encoding='utf-8')
	id_file = 'id_0526.txt'
	# id_file = 'id_test.txt'
	tp = ThreadPool(20,id_file)
	tp.wait()
	save2file('id_details.txt','failed_id.txt')
	ERR_LOG.close()
	RES_LOG.close()
	end = time.time()
	print ('[ - info ] cost time :{:.2f}s'.format(end - start))
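The parsing step in `get_detail` can be tried offline with a small sketch like the one below. It reuses the two regular expressions from the script, but `extract_fields` and `SAMPLE_HTML` are hypothetical. The fragment is modeled only on what the patterns expect, not on the real WooYun page layout, which the post does not show.

```python
import re

# The same two patterns the crawler script compiles in WooYunSpider.__init__.
PATTERN_INFO = re.compile(
    r'title>(.*?)\| WooYun.*?keywords" content="(.*?),(.*?),(.*?),wooyun', re.S)
PATTERN_RANK = re.compile(r'漏洞Rank：(\d{1,3})')


def extract_fields(html):
    """Return (title, vendor, author, bug_type, rank); missing values become 'Null'."""
    info = PATTERN_INFO.search(html)
    if info is None:
        return ('Null',) * 5
    title, vendor, author, bug_type = (g.strip() for g in info.groups())
    rank_match = PATTERN_RANK.search(html)
    rank = rank_match.group(1) if rank_match else 'Null'
    return title, vendor, author, bug_type, rank


# Hypothetical page fragment, invented for illustration only.
SAMPLE_HTML = '''<html><head>
<title> Example XSS | WooYun-0000-000001 | WooYun.org </title>
<meta name="keywords" content="ExampleCorp,alice,XSS,wooyun">
</head><body><p>漏洞Rank：15</p></body></html>'''

if __name__ == '__main__':
    print(extract_fields(SAMPLE_HTML))
```

Checking the match object for `None` replaces the broad `try/except` around `res.group(...)` and leaves the `'Null'` fallbacks unchanged.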
