Python Crawler Study: Bloom Filter

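Bloom filter implementation, method 1: write it yourself

Reference: http://www.cnblogs.com/naive/p/5815433.html

The two arguments to BloomFilter are the size of the filter (the number of bits) and the number of hash functions.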
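These two values are usually derived from the expected number of items n and an acceptable false-positive rate p. Below is a minimal sketch of the standard sizing formulas, m = -n ln p / (ln 2)^2 bits and k = (m / n) ln 2 hash functions. The function name optimal_parameters is made up for illustration and does not come from the referenced post.

```python
import math


def optimal_parameters(expected_items, false_positive_rate):
    """Return (size_in_bits, hash_count) from the textbook Bloom filter formulas."""
    # m = -n * ln(p) / (ln 2)^2
    size = math.ceil(
        -expected_items * math.log(false_positive_rate) / math.log(2) ** 2
    )
    # k = (m / n) * ln 2, with at least one hash function
    hash_count = max(1, round(size / expected_items * math.log(2)))
    return size, hash_count


size, hash_count = optimal_parameters(1000, 0.001)
print(size, hash_count)
```

For 1000 items at a 0.1% false-positive rate this gives roughly 14,400 bits and 10 hash functions.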

#!/usr/bin/env python
# coding: utf-8

# Third-party packages
import mmh3
import scrapy
from bitarray import bitarray

class BloomFilter(object):
    """A fixed-size Bloom filter backed by a bit array."""

    def __init__(self, size, hash_count):
        self.size = size                # number of bits in the filter
        self.hash_count = hash_count    # number of hash functions
        self.bit_array = bitarray(size)
        self.bit_array.setall(0)

    def __len__(self):
        # The length is the number of bits, not the number of items added.
        return self.size

    def __iter__(self):
        return iter(self.bit_array)

    def add(self, item):
        # Seeds 0..hash_count-1 give hash_count different hash functions.
        for seed in range(self.hash_count):
            index = mmh3.hash(item, seed) % self.size
            self.bit_array[index] = 1
        return self

    def __contains__(self, item):
        # A single unset bit proves the item was never added, so stop early.
        for seed in range(self.hash_count):
            index = mmh3.hash(item, seed) % self.size
            if self.bit_array[index] == 0:
                return False
        return True

class DmozSpider(scrapy.Spider):
    name = "baidu"
    allowed_domains = ["baidu.com"]
    start_urls = [
        "http://baike.baidu.com/item/%E7%BA%B3%E5%85%B0%E6%98%8E%E7%8F%A0"
    ]

    def parse(self, response):
        # fname = "/media/common/娱乐/Electronic_Design/Coding/Python/Scrapy/tutorial/tutorial/spiders/temp"
        #
        # html = response.xpath('//html').extract()[0]
        # fobj = open(fname, 'w')
        # fobj.writelines(html.encode('utf-8'))
        # fobj.close()

        # 1000 bits, 10 hash functions
        bloom = BloomFilter(1000, 10)
        animals = ['dog', 'cat', 'giraffe', 'fly', 'mosquito', 'horse', 'eagle',
                   'bird', 'bison', 'boar', 'butterfly', 'ant', 'anaconda', 'bear',
                   'chicken', 'dolphin', 'donkey', 'crow', 'crocodile']

        # First, insert the animals into the bloom filter
        for animal in animals:
            bloom.add(animal)

        # Membership test for animals that were inserted.
        # There should not be any false negatives.
        for animal in animals:
            if animal in bloom:
                print('{} is in bloom filter as expected'.format(animal))
            else:
                print('Something went terribly wrong for {}'.format(animal))
                print('FALSE NEGATIVE!')

        # Membership test for animals that were not inserted.
        # There could be false positives.
        other_animals = ['badger', 'cow', 'pig', 'sheep', 'bee', 'wolf', 'fox',
                         'whale', 'shark', 'fish', 'turkey', 'duck', 'dove',
                         'deer', 'elephant', 'frog', 'falcon', 'goat', 'gorilla',
                         'hawk']
        for other_animal in other_animals:
            if other_animal in bloom:
                print('{} is not in the bloom filter, but is a false positive'.format(other_animal))
            else:
                print('{} is not in the bloom filter as expected'.format(other_animal))
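In a crawler, a filter like this is typically used to skip URLs that have already been seen. The following is only a sketch of that idea under stated assumptions: it swaps mmh3 and bitarray for the standard library (SHA-256 with double hashing, and a bytearray) so that it runs without extra packages, and the names SeenUrls and filter_new are hypothetical, not taken from the referenced post.

```python
import hashlib


class SeenUrls:
    """Tiny Bloom filter for URL de-duplication (standard library only)."""

    def __init__(self, size, hash_count):
        self.size = size
        self.hash_count = hash_count
        self.bits = bytearray((size + 7) // 8)

    def _indexes(self, url):
        # Derive hash_count bit positions from one SHA-256 digest
        # using double hashing: h1 + i * h2 (mod size).
        digest = hashlib.sha256(url.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.size for i in range(self.hash_count)]

    def add(self, url):
        for idx in self._indexes(url):
            self.bits[idx // 8] |= 1 << (idx % 8)

    def __contains__(self, url):
        return all(
            self.bits[idx // 8] & (1 << (idx % 8)) for idx in self._indexes(url)
        )


def filter_new(urls, seen):
    """Return the URLs not seen before, recording each one as it is returned."""
    fresh = []
    for url in urls:
        if url not in seen:
            seen.add(url)
            fresh.append(url)
    return fresh


seen = SeenUrls(10000, 7)
print(filter_new(["http://a/1", "http://a/2", "http://a/1"], seen))
```

Because a Bloom filter can report false positives, a crawler built this way may occasionally skip a URL it has never visited. It never revisits a URL it has recorded.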

Bloom filter implementation, method 2: use pybloom

Reference: http://www.jianshu.com/p/f57187e2b5b9

#!/usr/bin/env python
# coding: utf-8

import scrapy
from pybloom import BloomFilter

class DmozSpider(scrapy.Spider):
    name = "baidu"
    allowed_domains = ["baidu.com"]
    start_urls = [
        "http://baike.baidu.com/item/%E7%BA%B3%E5%85%B0%E6%98%8E%E7%8F%A0"
    ]

    def parse(self, response):
        # fname = "/media/common/娱乐/Electronic_Design/Coding/Python/Scrapy/tutorial/tutorial/spiders/temp"
        #
        # html = response.xpath('//html').extract()[0]
        # fobj = open(fname, 'w')
        # fobj.writelines(html.encode('utf-8'))
        # fobj.close()

        # bloom = BloomFilter(100, 10)
        # pybloom takes the expected capacity and the target error rate,
        # not a bit count and a number of hash functions.
        bloom = BloomFilter(1000, 0.001)
        animals = ['dog', 'cat', 'giraffe', 'fly', 'mosquito', 'horse', 'eagle',
                   'bird', 'bison', 'boar', 'butterfly', 'ant', 'anaconda', 'bear',
                   'chicken', 'dolphin', 'donkey', 'crow', 'crocodile']

        # First, insert the animals into the bloom filter
        for animal in animals:
            bloom.add(animal)

        # Membership test for animals that were inserted.
        # There should not be any false negatives.
        for animal in animals:
            if animal in bloom:
                print('{} is in bloom filter as expected'.format(animal))
            else:
                print('Something went terribly wrong for {}'.format(animal))
                print('FALSE NEGATIVE!')

        # Membership test for animals that were not inserted.
        # There could be false positives.
        other_animals = ['badger', 'cow', 'pig', 'sheep', 'bee', 'wolf', 'fox',
                         'whale', 'shark', 'fish', 'turkey', 'duck', 'dove',
                         'deer', 'elephant', 'frog', 'falcon', 'goat', 'gorilla',
                         'hawk']
        for other_animal in other_animals:
            if other_animal in bloom:
                print('{} is not in the bloom filter, but is a false positive'.format(other_animal))
            else:
                print('{} is not in the bloom filter as expected'.format(other_animal))

Output

dog is in bloom filter as expected
cat is in bloom filter as expected
giraffe is in bloom filter as expected
fly is in bloom filter as expected
mosquito is in bloom filter as expected
horse is in bloom filter as expected
eagle is in bloom filter as expected
bird is in bloom filter as expected
bison is in bloom filter as expected
boar is in bloom filter as expected
butterfly is in bloom filter as expected
ant is in bloom filter as expected
anaconda is in bloom filter as expected
bear is in bloom filter as expected
chicken is in bloom filter as expected
dolphin is in bloom filter as expected
donkey is in bloom filter as expected
crow is in bloom filter as expected
crocodile is in bloom filter as expected
badger is not in the bloom filter as expected
cow is not in the bloom filter as expected
pig is not in the bloom filter as expected
sheep is not in the bloom filter as expected
bee is not in the bloom filter as expected
wolf is not in the bloom filter as expected
fox is not in the bloom filter as expected
whale is not in the bloom filter as expected
shark is not in the bloom filter as expected
fish is not in the bloom filter as expected
turkey is not in the bloom filter as expected
duck is not in the bloom filter as expected
dove is not in the bloom filter as expected
deer is not in the bloom filter as expected
elephant is not in the bloom filter as expected
frog is not in the bloom filter as expected
falcon is not in the bloom filter as expected
goat is not in the bloom filter as expected
gorilla is not in the bloom filter as expected
hawk is not in the bloom filter as expected
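Seeing no false positives is what one would expect at this load. As a rough check, the usual approximation (1 - e^(-kn/m))^k can be evaluated for the hand-written BloomFilter(1000, 10) of method 1 holding the 19 animals. This sketch assumes independent, uniformly distributed hashes, and the function name is illustrative only.

```python
import math


def expected_false_positive_rate(size, hash_count, items_added):
    """Approximate false-positive probability: (1 - e^(-k*n/m))^k."""
    fraction_of_bits_set = 1.0 - math.exp(-hash_count * items_added / float(size))
    return fraction_of_bits_set ** hash_count


# BloomFilter(1000, 10) after inserting the 19 animals
rate = expected_false_positive_rate(1000, 10, 19)
print("estimated false-positive rate: {:.1e}".format(rate))
```

The estimate is on the order of 1e-8 per lookup, so 20 lookups for animals that were never inserted are very unlikely to produce a false positive.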
