Learning Python Crawlers: Bloom Filters
Bloom filter implementation, method 1: write your own
Reference: http://www.cnblogs.com/naive/p/5815433.html
The two BloomFilter constructor arguments are the size of the bit array and the number of hash functions.
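How to pick those two numbers: for n expected items and an acceptable false-positive rate p, the standard sizing is m = -n * ln(p) / (ln 2)^2 bits and k = (m/n) * ln 2 hash functions. A small helper makes this concrete (the name bloom_params is mine, not from the referenced post):

import math

def bloom_params(n, p):
    # Standard Bloom filter sizing: n expected items, p target false-positive rate
    m = int(math.ceil(-n * math.log(p) / (math.log(2) ** 2)))  # bits in the array
    k = max(1, int(round((m / float(n)) * math.log(2))))       # number of hash functions
    return m, k

# e.g. 1000 items at a 0.1% false-positive rate:
print(bloom_params(1000, 0.001))  # -> (14378, 10), i.e. BloomFilter(14378, 10)

By that measure, the BloomFilter(1000, 10) used below is comfortably oversized for 19 animals, but would saturate quickly on a real crawl frontier.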
#!/usr/bin/env python
# coding: utf-8
from bitarray import bitarray

# 3rd party
import mmh3
import scrapy
from BeautifulSoup import BeautifulSoup as BS
import os

ls = os.linesep


class BloomFilter(set):
    """A minimal Bloom filter backed by a bitarray and mmh3 hashes."""

    def __init__(self, size, hash_count):
        super(BloomFilter, self).__init__()
        self.bit_array = bitarray(size)
        self.bit_array.setall(0)
        self.size = size
        self.hash_count = hash_count

    def __len__(self):
        return self.size

    def __iter__(self):
        return iter(self.bit_array)

    def add(self, item):
        # Set the bit at each of hash_count positions derived from the item
        for ii in range(self.hash_count):
            index = mmh3.hash(item, ii) % self.size
            self.bit_array[index] = 1
        return self

    def __contains__(self, item):
        # The item is (possibly) present only if every hashed position is set;
        # any unset bit proves it was never added
        out = True
        for ii in range(self.hash_count):
            index = mmh3.hash(item, ii) % self.size
            if self.bit_array[index] == 0:
                out = False
        return out


class DmozSpider(scrapy.Spider):
    name = "baidu"
    allowed_domains = ["baidu.com"]
    start_urls = [
        "http://baike.baidu.com/item/%E7%BA%B3%E5%85%B0%E6%98%8E%E7%8F%A0"
    ]

    def parse(self, response):
        # fname = "/media/common/娱乐/Electronic_Design/Coding/Python/Scrapy/tutorial/tutorial/spiders/temp"
        #
        # html = response.xpath('//html').extract()[0]
        # fobj = open(fname, 'w')
        # fobj.writelines(html.encode('utf-8'))
        # fobj.close()

        bloom = BloomFilter(1000, 10)
        animals = ['dog', 'cat', 'giraffe', 'fly', 'mosquito', 'horse', 'eagle',
                   'bird', 'bison', 'boar', 'butterfly', 'ant', 'anaconda', 'bear',
                   'chicken', 'dolphin', 'donkey', 'crow', 'crocodile']
        # First insertion of animals into the bloom filter
        for animal in animals:
            bloom.add(animal)

        # Membership check for already inserted animals:
        # there should not be any false negatives
        for animal in animals:
            if animal in bloom:
                print('{} is in bloom filter as expected'.format(animal))
            else:
                print('Something went terribly wrong for {}'.format(animal))
                print('FALSE NEGATIVE!')

        # Membership check for animals that were never inserted:
        # there could be false positives
        other_animals = ['badger', 'cow', 'pig', 'sheep', 'bee', 'wolf', 'fox',
                         'whale', 'shark', 'fish', 'turkey', 'duck', 'dove',
                         'deer', 'elephant', 'frog', 'falcon', 'goat', 'gorilla',
                         'hawk']
        for other_animal in other_animals:
            if other_animal in bloom:
                print('{} is in the bloom filter, but as a false positive'.format(other_animal))
            else:
                print('{} is not in the bloom filter as expected'.format(other_animal))
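The animal demo maps directly onto what a crawler actually needs: a cheap membership test over URLs already scheduled. A minimal sketch using the BloomFilter class above (the seen and frontier names are illustrative, not from the original post):

seen = BloomFilter(100000, 10)
frontier = ['http://baidu.com/a', 'http://baidu.com/b', 'http://baidu.com/a']

for url in frontier:
    if url in seen:
        continue  # probably crawled already; a false positive would skip a fresh URL
    seen.add(url)
    print('would fetch {}'.format(url))

The trade-off is inherent to the data structure: a Bloom filter never misses a duplicate, but at high load factors it will occasionally drop a URL it has never actually seen.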
Bloom filter implementation, method 2: use pybloom
Reference: http://www.jianshu.com/p/f57187e2b5b9
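pybloom sizes its filter from a capacity and a target error rate rather than raw bit counts, and, as far as I recall (treat this as an assumption), a plain BloomFilter raises an error once you add more items than its capacity. For an open-ended crawl frontier the library also ships ScalableBloomFilter, which grows on demand while keeping the error rate bounded. A short sketch before the full spider, with parameter values as placeholders rather than tuned settings:

from pybloom import ScalableBloomFilter

# Grows automatically as items arrive, keeping the false-positive rate bounded
sbf = ScalableBloomFilter(initial_capacity=100, error_rate=0.001,
                          mode=ScalableBloomFilter.SMALL_SET_GROWTH)
for i in range(10000):
    sbf.add('http://example.com/page/{}'.format(i))  # example URLs, not real crawl data
print(len(sbf))                             # items added so far
print('http://example.com/page/1' in sbf)   # True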
#!/usr/bin/env python
# coding: utf-8
from pybloom import BloomFilter

import scrapy
from BeautifulSoup import BeautifulSoup as BS
import os

ls = os.linesep


class DmozSpider(scrapy.Spider):
    name = "baidu"
    allowed_domains = ["baidu.com"]
    start_urls = [
        "http://baike.baidu.com/item/%E7%BA%B3%E5%85%B0%E6%98%8E%E7%8F%A0"
    ]

    def parse(self, response):
        # fname = "/media/common/娱乐/Electronic_Design/Coding/Python/Scrapy/tutorial/tutorial/spiders/temp"
        #
        # html = response.xpath('//html').extract()[0]
        # fobj = open(fname, 'w')
        # fobj.writelines(html.encode('utf-8'))
        # fobj.close()

        # bloom = BloomFilter(100, 10)
        # pybloom's BloomFilter takes a capacity and a target error rate
        bloom = BloomFilter(1000, 0.001)
        animals = ['dog', 'cat', 'giraffe', 'fly', 'mosquito', 'horse', 'eagle',
                   'bird', 'bison', 'boar', 'butterfly', 'ant', 'anaconda', 'bear',
                   'chicken', 'dolphin', 'donkey', 'crow', 'crocodile']
        # First insertion of animals into the bloom filter
        for animal in animals:
            bloom.add(animal)

        # Membership check for already inserted animals:
        # there should not be any false negatives
        for animal in animals:
            if animal in bloom:
                print('{} is in bloom filter as expected'.format(animal))
            else:
                print('Something went terribly wrong for {}'.format(animal))
                print('FALSE NEGATIVE!')

        # Membership check for animals that were never inserted:
        # there could be false positives
        other_animals = ['badger', 'cow', 'pig', 'sheep', 'bee', 'wolf', 'fox',
                         'whale', 'shark', 'fish', 'turkey', 'duck', 'dove',
                         'deer', 'elephant', 'frog', 'falcon', 'goat', 'gorilla',
                         'hawk']
        for other_animal in other_animals:
            if other_animal in bloom:
                print('{} is in the bloom filter, but as a false positive'.format(other_animal))
            else:
                print('{} is not in the bloom filter as expected'.format(other_animal))
Output
dog is in bloom filter as expected
cat is in bloom filter as expected
giraffe is in bloom filter as expected
fly is in bloom filter as expected
mosquito is in bloom filter as expected
horse is in bloom filter as expected
eagle is in bloom filter as expected
bird is in bloom filter as expected
bison is in bloom filter as expected
boar is in bloom filter as expected
butterfly is in bloom filter as expected
ant is in bloom filter as expected
anaconda is in bloom filter as expected
bear is in bloom filter as expected
chicken is in bloom filter as expected
dolphin is in bloom filter as expected
donkey is in bloom filter as expected
crow is in bloom filter as expected
crocodile is in bloom filter as expected
badger is not in the bloom filter as expected
cow is not in the bloom filter as expected
pig is not in the bloom filter as expected
sheep is not in the bloom filter as expected
bee is not in the bloom filter as expected
wolf is not in the bloom filter as expected
fox is not in the bloom filter as expected
whale is not in the bloom filter as expected
shark is not in the bloom filter as expected
fish is not in the bloom filter as expected
turkey is not in the bloom filter as expected
duck is not in the bloom filter as expected
dove is not in the bloom filter as expected
deer is not in the bloom filter as expected
elephant is not in the bloom filter as expected
frog is not in the bloom filter as expected
falcon is not in the bloom filter as expected
goat is not in the bloom filter as expected
gorilla is not in the bloom filter as expected
hawk is not in the bloom filter as expected
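Seeing zero false positives above is exactly what the configuration predicts: with an error rate of 0.001 and only 20 never-inserted probes, the expected number of false positives is 20 * 0.001 = 0.02, so a clean run is by far the most likely outcome.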