A Simple Email Crawler in Python

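I often get questions about email crawlers, a sign that people who want to scrape contact information from web pages care about this topic. In this article I show how to write a simple email crawler in Python. The crawler is simple, but you can learn a lot from the example (especially if you are new to writing crawlers).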

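I deliberately simplified the code so that the main ideas come through as clearly as possible. That way you can add your own features when you need them. Simple as it is, the code is a complete implementation that collects email addresses from the web. Note that the code in this article is written for Python 3.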

OK, let's dig in step by step. I will build the crawler piece by piece, with comments, and post the complete code at the end.

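First, import all the necessary libraries. In this example BeautifulSoup and Requests are third-party libraries, while urllib, collections and re are part of the standard library.

BeautifulSoup makes it easier to search an HTML document, and Requests makes it easier to perform web requests.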

from bs4 import BeautifulSoup
import requests
import requests.exceptions
from urllib.parse import urlsplit
from collections import deque
import re

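Next I define a collection that holds the URLs to crawl, for example http://www.huazeming.com/ . You can of course start from any pages that visibly contain email addresses, and from as many of them as you like. A plain Python list would work for this collection, but I chose a deque because it fits our needs better: we take URLs from the front and append newly found ones to the back.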

# a queue of urls to be crawled
new_urls = deque(['http://www.themoscowtimes.com/contact_us/'])
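To illustrate why a deque suits this job, here is a minimal sketch of first-in, first-out processing. The name frontier and the example.com addresses are placeholders introduced for this illustration only.

```python
from collections import deque

# A FIFO queue of URLs; the example.com addresses are placeholders.
frontier = deque(["http://example.com/a"])
frontier.append("http://example.com/b")  # enqueue at the right end
frontier.append("http://example.com/c")

order = []
while frontier:
    # popleft() removes from the left end in O(1);
    # list.pop(0) would shift every remaining element instead.
    order.append(frontier.popleft())

print(order)
```

URLs come out in the order they went in, which gives the crawler a breadth-first traversal.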

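Next, we need to remember the URLs we have already processed so that we do not process them twice. I chose a set because it guarantees that its elements are unique.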

# a set of urls that we have already crawled
processed_urls = set()

Define a set of emails to hold the addresses we collect:

# a set of crawled emails
emails = set()

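Let's start crawling! A loop keeps taking URLs from the queue and processing them until the queue is empty. As soon as we take a URL out, we add it to the set of processed URLs so that we do not forget it later.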

# process urls one by one until we exhaust the queue
while new_urls:
    # move next url from the queue to the set of processed urls
    url = new_urls.popleft()
    processed_urls.add(url)

Then we extract the base URL from the current URL, so that when we find a relative link in the document we can turn it into an absolute one.

    # extract base url and path to resolve relative links
    parts = urlsplit(url)
    base_url = "{0.scheme}://{0.netloc}".format(parts)
    # fall back to the site root when the url has no path component
    path = url[:url.rfind('/')+1] if '/' in parts.path else base_url + '/'
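As a small illustration of what urlsplit returns for the start URL used in the complete code, and of how relative links resolve against it, consider the sketch below. It uses urllib.parse.urljoin from the standard library, which is an alternative to the manual string handling in this article rather than the approach the crawler takes; the relative links passed to it are made up for the example.

```python
from urllib.parse import urlsplit, urljoin

url = "http://www.themoscowtimes.com/contact_us/index.php"
parts = urlsplit(url)

# The pieces the crawler builds base_url from.
scheme, netloc, page_path = parts.scheme, parts.netloc, parts.path
base_url = "{0.scheme}://{0.netloc}".format(parts)

# urljoin resolves root-relative, document-relative and parent-relative
# links against the page they were found on (the links are made up).
resolved_root = urljoin(url, "/about/")
resolved_rel = urljoin(url, "staff.html")
resolved_up = urljoin(url, "../news/")

print(base_url, resolved_root, resolved_rel, resolved_up, sep="\n")
```

If you swap urljoin into the crawler, a single call can replace both the base_url and the path concatenation.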

Now we fetch the page content from the web. If an error occurs, we skip the page and move on to the next one.

    # get url's content
    print("Processing %s" % url)
    try:
        response = requests.get(url)
    except (requests.exceptions.MissingSchema, requests.exceptions.ConnectionError):
        # ignore pages with errors
        continue

Once we have the page content, we find all the email addresses in it and add them to the set. We use a regular expression to extract the addresses:

    # extract all email addresses and add them into the resulting set
    new_emails = set(re.findall(r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+", response.text, re.I))
    emails.update(new_emails)
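To see what this regular expression does and does not catch, here is a small sketch that runs it on hand-written HTML fragments. The sample strings and the name EMAIL_RE are made up for this illustration.

```python
import re

# The same pattern as in the crawler, compiled once; re.I makes it case-insensitive.
EMAIL_RE = re.compile(r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+", re.I)

sample = ('Write to <a href="mailto:Editor@Example.com">Editor@Example.com</a> '
          'or sales@example.co.uk.')
# A set removes the duplicate that comes from the href and the link text.
found = set(EMAIL_RE.findall(sample))

# The pattern is deliberately loose, so file names such as
# retina image assets can slip through as false positives.
false_positive = EMAIL_RE.findall('<img src="logo@2x.png">')

print(found)
print(false_positive)
```

The second result shows why a real crawler may want an extra filter, for example rejecting matches that end in a known image extension.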

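After extracting the email addresses from the current page, we look for links to other pages in it and add them to the queue of URLs waiting to be processed. Here we use the BeautifulSoup library to parse the page's HTML.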

    # create a beautiful soup for the html document
    # (naming the parser explicitly avoids a warning about the parser being guessed)
    soup = BeautifulSoup(response.text, "html.parser")

The library's find_all method extracts elements by their HTML tag name.

    # find and process all the anchors in the document
    for anchor in soup.find_all("a"):

Some a tags in a page may not contain a URL at all, and we need to allow for that.

        # extract link url from the anchor
        link = anchor.attrs["href"] if "href" in anchor.attrs else ''
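For readers who want to try the link extraction step without installing anything, the sketch below does the same job with the standard library's html.parser. This is a swapped-in technique for illustration, not what the article's crawler uses, and the class name LinkCollector and the sample HTML are hypothetical.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag that has one."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        # attrs is a list of (name, value) pairs; anchors without
        # an href (such as named anchors) are skipped.
        href = dict(attrs).get("href")
        if href:
            self.links.append(href)


html = ('<a href="/contact">Contact</a> <a name="top">Top</a> '
        '<a href="http://example.com/">Home</a>')
collector = LinkCollector()
collector.feed(html)
print(collector.links)
```

The anchor without an href is skipped, which is the case the crawler guards against with the empty-string default.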

If the link starts with a slash, we treat it as a relative link and prepend the base URL:

        # add base url to relative links
        if link.startswith('/'):
            link = base_url + link

At this point we have a valid URL (one that starts with http). If it is not in the queue yet and has not been processed before, we add it to the queue:

        # add the new url to the queue if it's of HTTP protocol, not enqueued and not processed yet
        if link.startswith('http') and link not in new_urls and link not in processed_urls:
            new_urls.append(link)
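Note that testing membership in a deque scans it element by element, which gets slow as the queue grows. One possible refinement, sketched below under the assumption that a single extra set may track every URL ever enqueued, keeps the check at constant time. The names seen and enqueue are hypothetical and do not appear in the crawler's code.

```python
from collections import deque

new_urls = deque(["http://example.com/"])
seen = set(new_urls)  # every URL ever enqueued, processed or not


def enqueue(link):
    """Queue the link once; return True if it was added."""
    if not link.startswith("http") or link in seen:
        return False
    seen.add(link)
    new_urls.append(link)
    return True


candidates = [
    "http://example.com/a",   # new, gets queued
    "http://example.com/a",   # duplicate, rejected
    "mailto:x@example.com",   # not HTTP, rejected
    "http://example.com/",    # the start URL, already seen
]
results = [enqueue(link) for link in candidates]
print(results, list(new_urls))
```

With this approach the seen set takes over the role of both membership checks, since a URL enters it the moment it is queued.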

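OK, that's it. Here is the complete code. Note that it also resolves relative links that do not start with a slash, using the path computed earlier: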

from bs4 import BeautifulSoup
import requests
import requests.exceptions
from urllib.parse import urlsplit
from collections import deque
import re

# a queue of urls to be crawled
new_urls = deque(['http://www.themoscowtimes.com/contact_us/index.php'])

# a set of urls that we have already crawled
processed_urls = set()

# a set of crawled emails
emails = set()

# process urls one by one until we exhaust the queue
while new_urls:

    # move next url from the queue to the set of processed urls
    url = new_urls.popleft()
    processed_urls.add(url)

    # extract base url and path to resolve relative links
    parts = urlsplit(url)
    base_url = "{0.scheme}://{0.netloc}".format(parts)
    # fall back to the site root when the url has no path component
    path = url[:url.rfind('/')+1] if '/' in parts.path else base_url + '/'

    # get url's content
    print("Processing %s" % url)
    try:
        response = requests.get(url)
    except (requests.exceptions.MissingSchema, requests.exceptions.ConnectionError):
        # ignore pages with errors
        continue

    # extract all email addresses and add them into the resulting set
    new_emails = set(re.findall(r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+", response.text, re.I))
    emails.update(new_emails)

    # create a beautiful soup for the html document
    soup = BeautifulSoup(response.text, "html.parser")

    # find and process all the anchors in the document
    for anchor in soup.find_all("a"):
        # extract link url from the anchor
        link = anchor.attrs["href"] if "href" in anchor.attrs else ''

        # resolve relative links
        if link.startswith('/'):
            link = base_url + link
        elif not link.startswith('http'):
            link = path + link

        # add the new url to the queue if it was not enqueued nor processed yet
        if link not in new_urls and link not in processed_urls:
            new_urls.append(link)

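This crawler is fairly simple and leaves out some features (such as saving the email addresses to a file), but it demonstrates the basic principles of writing an email crawler. Feel free to try improving it.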
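As one possible starting point for the omitted feature, the sketch below writes a set of addresses to a text file, one per line. The function name save_emails, the file name emails.txt and the sample addresses are assumptions made for this illustration.

```python
import os
import tempfile


def save_emails(emails, filename):
    """Write the addresses to filename, one per line, sorted for stable output."""
    with open(filename, "w", encoding="utf-8") as f:
        for email in sorted(emails):
            f.write(email + "\n")


# Sample data standing in for the crawler's emails set.
emails = {"sales@example.com", "editor@example.com"}
out_path = os.path.join(tempfile.gettempdir(), "emails.txt")
save_emails(emails, out_path)

# Read the file back to check what was written.
with open(out_path, encoding="utf-8") as f:
    saved = f.read().splitlines()
print(saved)
```

In the crawler you could call such a function once after the loop ends, or periodically inside it so that a long crawl does not lose its results if interrupted.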

And of course, if you have any questions or suggestions, comments and corrections are welcome!

Original English article: A Simple Email Crawler in Python
