Learning Python Web Scraping: Bloom Filters
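Bloom filter implementation, method 1: write your own
Reference: http://www.cnblogs.com/naive/p/5815433.html
The two BloomFilter parameters are the size of the filter (the number of bits in the bit array) and the number of hash functions.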
#!/usr/bin/env python
# coding: utf-8

# 3rd party
from bitarray import bitarray
import mmh3
import scrapy


class BloomFilter(object):
    def __init__(self, size, hash_count):
        # size: number of bits; hash_count: number of hash functions
        self.bit_array = bitarray(size)
        self.bit_array.setall(0)
        self.size = size
        self.hash_count = hash_count

    def __len__(self):
        return self.size

    def __iter__(self):
        return iter(self.bit_array)

    def add(self, item):
        # The loop counter is the mmh3 seed, giving hash_count different hashes
        for ii in range(self.hash_count):
            index = mmh3.hash(item, ii) % self.size
            self.bit_array[index] = 1
        return self

    def __contains__(self, item):
        # A single unset bit proves the item was never added
        for ii in range(self.hash_count):
            index = mmh3.hash(item, ii) % self.size
            if self.bit_array[index] == 0:
                return False
        return True


class DmozSpider(scrapy.Spider):
    name = "baidu"
    allowed_domains = ["baidu.com"]
    start_urls = [
        "http://baike.baidu.com/item/%E7%BA%B3%E5%85%B0%E6%98%8E%E7%8F%A0"
    ]

    def parse(self, response):
        # fname = "/media/common/娱乐/Electronic_Design/Coding/Python/Scrapy/tutorial/tutorial/spiders/temp"
        #
        # html = response.xpath('//html').extract()[0]
        # fobj = open(fname, 'w')
        # fobj.writelines(html.encode('utf-8'))
        # fobj.close()

        bloom = BloomFilter(1000, 10)
        animals = ['dog', 'cat', 'giraffe', 'fly', 'mosquito', 'horse', 'eagle',
                   'bird', 'bison', 'boar', 'butterfly', 'ant', 'anaconda', 'bear',
                   'chicken', 'dolphin', 'donkey', 'crow', 'crocodile']
        # First insertion of animals into the bloom filter
        for animal in animals:
            bloom.add(animal)

        # Membership test for already inserted animals
        # There should not be any false negatives
        for animal in animals:
            if animal in bloom:
                print('{} is in bloom filter as expected'.format(animal))
            else:
                print('Something went terribly wrong for {}'.format(animal))
                print('FALSE NEGATIVE!')

        # Membership test for animals that were not inserted
        # There could be false positives
        other_animals = ['badger', 'cow', 'pig', 'sheep', 'bee', 'wolf', 'fox',
                         'whale', 'shark', 'fish', 'turkey', 'duck', 'dove',
                         'deer', 'elephant', 'frog', 'falcon', 'goat', 'gorilla',
                         'hawk']
        for other_animal in other_animals:
            if other_animal in bloom:
                print('{} is not in the bloom, but a false positive'.format(other_animal))
            else:
                print('{} is not in the bloom filter as expected'.format(other_animal))
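The example above hard-codes BloomFilter(1000, 10). As a rough guide to choosing those two numbers, the sketch below applies the standard Bloom filter sizing formulas, m = -n ln(p) / (ln 2)^2 for the bit count and k = (m / n) ln 2 for the hash count. It assumes Python 3, and the function names are hypothetical helpers, not part of the original post or of any library.

```python
import math


def optimal_bloom_parameters(expected_items, false_positive_rate):
    """Return (size_in_bits, hash_count) from the standard sizing formulas."""
    # m = -n * ln(p) / (ln 2)^2
    size = math.ceil(
        -expected_items * math.log(false_positive_rate) / math.log(2) ** 2
    )
    # k = (m / n) * ln 2, with at least one hash function
    hash_count = max(1, round(size / expected_items * math.log(2)))
    return size, hash_count


def estimated_false_positive_rate(size, hash_count, inserted_items):
    """Approximate false-positive probability: (1 - e^(-k*n/m))^k."""
    return (1 - math.exp(-hash_count * inserted_items / size)) ** hash_count


if __name__ == "__main__":
    # Parameters for 1000 expected items at a 0.1% false-positive target
    print(optimal_bloom_parameters(1000, 0.001))
    # Estimated rate for the example above: 1000 bits, 10 hashes, 19 animals
    print(estimated_false_positive_rate(1000, 10, 19))
```

With only 19 items in 1000 bits, the estimated false-positive rate is far below one in a million, so a run that reports no false positives for the 20 other animals is what one would expect.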
Bloom filter implementation, method 2: use pybloom
Reference: http://www.jianshu.com/p/f57187e2b5b9
#!/usr/bin/env python
# coding: utf-8

# 3rd party (on Python 3, the maintained fork is imported as pybloom_live)
from pybloom import BloomFilter
import scrapy


class DmozSpider(scrapy.Spider):
    name = "baidu"
    allowed_domains = ["baidu.com"]
    start_urls = [
        "http://baike.baidu.com/item/%E7%BA%B3%E5%85%B0%E6%98%8E%E7%8F%A0"
    ]

    def parse(self, response):
        # fname = "/media/common/娱乐/Electronic_Design/Coding/Python/Scrapy/tutorial/tutorial/spiders/temp"
        #
        # html = response.xpath('//html').extract()[0]
        # fobj = open(fname, 'w')
        # fobj.writelines(html.encode('utf-8'))
        # fobj.close()

        # bloom = BloomFilter(100, 10)
        # pybloom takes the expected capacity and the target error rate,
        # not a bit count and a hash count
        bloom = BloomFilter(capacity=1000, error_rate=0.001)
        animals = ['dog', 'cat', 'giraffe', 'fly', 'mosquito', 'horse', 'eagle',
                   'bird', 'bison', 'boar', 'butterfly', 'ant', 'anaconda', 'bear',
                   'chicken', 'dolphin', 'donkey', 'crow', 'crocodile']
        # First insertion of animals into the bloom filter
        for animal in animals:
            bloom.add(animal)

        # Membership test for already inserted animals
        # There should not be any false negatives
        for animal in animals:
            if animal in bloom:
                print('{} is in bloom filter as expected'.format(animal))
            else:
                print('Something went terribly wrong for {}'.format(animal))
                print('FALSE NEGATIVE!')

        # Membership test for animals that were not inserted
        # There could be false positives
        other_animals = ['badger', 'cow', 'pig', 'sheep', 'bee', 'wolf', 'fox',
                         'whale', 'shark', 'fish', 'turkey', 'duck', 'dove',
                         'deer', 'elephant', 'frog', 'falcon', 'goat', 'gorilla',
                         'hawk']
        for other_animal in other_animals:
            if other_animal in bloom:
                print('{} is not in the bloom, but a false positive'.format(other_animal))
            else:
                print('{} is not in the bloom filter as expected'.format(other_animal))
Output
dog is in bloom filter as expected
cat is in bloom filter as expected
giraffe is in bloom filter as expected
fly is in bloom filter as expected
mosquito is in bloom filter as expected
horse is in bloom filter as expected
eagle is in bloom filter as expected
bird is in bloom filter as expected
bison is in bloom filter as expected
boar is in bloom filter as expected
butterfly is in bloom filter as expected
ant is in bloom filter as expected
anaconda is in bloom filter as expected
bear is in bloom filter as expected
chicken is in bloom filter as expected
dolphin is in bloom filter as expected
donkey is in bloom filter as expected
crow is in bloom filter as expected
crocodile is in bloom filter as expected
badger is not in the bloom filter as expected
cow is not in the bloom filter as expected
pig is not in the bloom filter as expected
sheep is not in the bloom filter as expected
bee is not in the bloom filter as expected
wolf is not in the bloom filter as expected
fox is not in the bloom filter as expected
whale is not in the bloom filter as expected
shark is not in the bloom filter as expected
fish is not in the bloom filter as expected
turkey is not in the bloom filter as expected
duck is not in the bloom filter as expected
dove is not in the bloom filter as expected
deer is not in the bloom filter as expected
elephant is not in the bloom filter as expected
frog is not in the bloom filter as expected
falcon is not in the bloom filter as expected
goat is not in the bloom filter as expected
gorilla is not in the bloom filter as expected
hawk is not in the bloom filter as expected
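The animal lists only exercise membership tests. In a crawler, one plausible use of the same structure is remembering which URLs have already been requested. The sketch below is an assumption about that use, not code from the original post: to stay dependency-free it swaps mmh3 for salted SHA-256 from hashlib and bitarray for a plain bytearray, and the names SeenUrlFilter and filter_new_urls are hypothetical.

```python
import hashlib


class SeenUrlFilter:
    """Stdlib-only Bloom filter for remembering visited URLs (hypothetical helper)."""

    def __init__(self, size, hash_count):
        self.size = size
        self.hash_count = hash_count
        self.bits = bytearray((size + 7) // 8)  # one bit per slot, all zero

    def _indexes(self, url):
        # Salt SHA-256 with the seed number to get hash_count different hashes
        for seed in range(self.hash_count):
            data = "{}:{}".format(seed, url).encode("utf-8")
            digest = hashlib.sha256(data).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, url):
        for index in self._indexes(url):
            self.bits[index // 8] |= 1 << (index % 8)

    def __contains__(self, url):
        return all(
            self.bits[index // 8] & (1 << (index % 8))
            for index in self._indexes(url)
        )


def filter_new_urls(urls, seen):
    """Return the URLs not yet recorded in `seen`, recording each one returned."""
    fresh = []
    for url in urls:
        if url not in seen:
            seen.add(url)
            fresh.append(url)
    return fresh


if __name__ == "__main__":
    seen = SeenUrlFilter(14378, 10)
    batch = [
        "http://baike.baidu.com/a",
        "http://baike.baidu.com/b",
        "http://baike.baidu.com/a",  # duplicate inside the same batch
    ]
    print(filter_new_urls(batch, seen))
```

The trade-off is that a false positive makes the crawler skip a page it has never fetched, so the filter should be sized for the number of URLs the crawl is expected to see.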