[scrapy] Usage Overview (4) (repost)

[Intended as a reference for beginners; experts are advised not to waste their time on this]

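In the previous article we collected a large batch of proxy IPs. This article shows how to implement a downloader middleware so that the target site is crawled through a randomly chosen proxy IP.

The target site is www.qunar.com, a travel site that is very popular right now. The target data is all of qunar's SEO pages and the SEO-related information on each page.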

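qunar does not have the robots.txt file that most sites have, so there is no ready-made list of URLs to crawl from. However, qunar's SEO pages turn out to be deployed mainly under http://www.qunar.com/routes/. That page serves as the entry point: starting from it and from every link on it that contains routes, recursively crawl all links that contain routes/.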
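The link-selection step of this crawl strategy can be sketched in plain Python, independent of scrapy. This is a minimal sketch under assumptions: the function name `select_route_links` and the `seen` set are hypothetical and not part of the original project (scrapy already filters duplicate requests on its own), and `urljoin` is used here in place of string concatenation to make relative links absolute.

```python
from urllib.parse import urljoin, urlparse

BASE_URL = "http://www.qunar.com/routes/"  # entry page named in the article


def select_route_links(hrefs, base_url=BASE_URL, seen=None):
    """Return absolute, not-yet-seen URLs whose path contains 'routes/'.

    Hypothetical helper for illustration only.
    """
    seen = set() if seen is None else seen
    selected = []
    for href in hrefs:
        absolute = urljoin(base_url, href)  # resolves relative links
        parts = urlparse(absolute)
        if parts.scheme not in ("http", "https"):
            continue  # skip javascript:, mailto:, and similar links
        if "routes/" not in parts.path:
            continue  # not an SEO page by the article's criterion
        if absolute in seen:
            continue  # already scheduled
        seen.add(absolute)
        selected.append(absolute)
    return selected
```

Passing the same `seen` set on every call keeps the recursion from revisiting pages it has already scheduled.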

Let's get started.

The target information is the site's SEO information, that is, the keywords and description meta fields in the page head.

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/topics/items.html

from scrapy.item import Item, Field

class SitemapItem(Item):
    # define the fields for your item here like:
    # name = Field()
    url = Field()
    keywords = Field()
    description = Field()

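Because we want to use proxy IPs, we need to implement our own downloader middleware. Its main job is to pick a random ip/port pair from the proxy IP file and use it as the proxy. The code is as follows: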

import random

class ProxyMiddleware(object):
    def process_request(self, request, spider):
        # The proxy list is re-read on every request; blank lines are skipped.
        with open('/home/xxx/services_runenv/crawlers/sitemap/sitemap/data/proxy_list.txt', 'r') as fd:
            data = [line.strip() for line in fd if line.strip()]
        # Each line is assumed to look like "ip,port[,...]".
        arr = random.choice(data).split(',')
        request.meta['proxy'] = 'http://%s:%s' % (arr[0], arr[1])
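The proxy selection itself can be separated from scrapy and tested in isolation. The following is a rough sketch, assuming the proxy file holds one `ip,port` entry per line (inferred from the `split(',')` call in the middleware); the helper names `parse_proxy_line`, `load_proxies`, and `pick_proxy` are hypothetical and do not appear in the original project.

```python
import random


def parse_proxy_line(line):
    """Turn one 'ip,port[,...]' line into a proxy URL, or None if malformed."""
    fields = [f.strip() for f in line.split(",")]
    if len(fields) < 2 or not fields[0] or not fields[1].isdigit():
        return None
    return "http://%s:%s" % (fields[0], fields[1])


def load_proxies(lines):
    """Parse all lines once, skipping blank and malformed entries."""
    proxies = []
    for line in lines:
        proxy = parse_proxy_line(line)
        if proxy is not None:
            proxies.append(proxy)
    return proxies


def pick_proxy(proxies, rng=random):
    """Pick one proxy at random; return None when the list is empty."""
    return rng.choice(proxies) if proxies else None
```

With helpers like these, a middleware could load the list once at start-up and call `pick_proxy` per request, instead of reopening the file each time.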

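The most important part is still the spider. It extracts all links on a page, turns every URL that meets the condition into a Request object and yields it, and at the same time extracts the page's keywords and description and yields them as an item. The code is as follows: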

# Written for Python 2 and the old scrapy API (HtmlXPathSelector, scrapy.contrib).
from scrapy.contrib.spiders import CrawlSpider
from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector
from sitemap.items import SitemapItem

class SitemapSpider(CrawlSpider):
    name = 'sitemap_spider'
    allowed_domains = ['qunar.com']
    start_urls = ['http://www.qunar.com/routes/']

    # No rules are active; links are followed by hand in parse() below.
    rules = (
        #Rule(SgmlLinkExtractor(allow=(r'http://www.qunar.com/routes/.*')), callback='parse'),
        #Rule(SgmlLinkExtractor(allow=('http:.*/routes/.*')), callback='parse'),
    )

    def parse(self, response):
        x        = HtmlXPathSelector(response)
        raw_urls = x.select("//a/@href").extract()

        # Follow every link that mentions 'routes', making relative links absolute.
        for url in raw_urls:
            if 'routes' in url:
                if 'http' not in url:
                    url = 'http://www.qunar.com' + url
                yield Request(url)

        # Extract the SEO fields; a page may lack either meta tag.
        item = SitemapItem()
        item['url']         = response.url.encode('UTF-8')
        arr_keywords        = x.select("//meta[@name='keywords']/@content").extract()
        item['keywords']    = arr_keywords[0].encode('UTF-8') if arr_keywords else ''
        arr_description     = x.select("//meta[@name='description']/@content").extract()
        item['description'] = arr_description[0].encode('UTF-8') if arr_description else ''

        yield item
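To make the extraction step concrete without needing scrapy installed, here is a small sketch that pulls the same two meta fields out of raw HTML using only the standard library. It is an illustration under assumptions, not the spider's actual mechanism: the spider uses XPath selectors, whereas this sketch swaps in `html.parser`, and the names `SeoMetaParser` and `extract_seo_meta` are hypothetical.

```python
from html.parser import HTMLParser


class SeoMetaParser(HTMLParser):
    """Collect the content of <meta name="keywords"> and <meta name="description">."""

    WANTED = ("keywords", "description")

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        name = (attrs.get("name") or "").lower()
        # Keep only the first occurrence of each wanted field.
        if name in self.WANTED and name not in self.meta:
            self.meta[name] = attrs.get("content") or ""


def extract_seo_meta(html):
    """Return a dict with 'keywords' and 'description' ('' when a tag is absent)."""
    parser = SeoMetaParser()
    parser.feed(html)
    return {name: parser.meta.get(name, "") for name in SeoMetaParser.WANTED}
```

Returning an empty string for a missing tag avoids the index error that `extract()[0]` would raise on a page without that meta tag.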

The pipeline file is fairly simple; it just stores the scraped data. The code is as follows:

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/topics/item-pipeline.html

class SitemapPipeline(object):
    def process_item(self, item, spider):
        data_path = '/home/xxx/services_runenv/crawlers/sitemap/sitemap/data/output/sitemap_data.txt'
        # One line per item: url#$#keywords#$#description
        line = str(item['url']) + '#$#' + str(item['keywords']) + '#$#' + str(item['description']) + '\n'
        # The with-block closes the file even if the write fails.
        with open(data_path, 'a') as fd:
            fd.write(line)
        return item
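The output file uses `#$#` as a field separator with one record per line, so it may help to see the format written and read back. This is a minimal sketch: `format_record` and `parse_record` are hypothetical helpers, and collapsing whitespace inside fields is an added assumption (a description containing a newline would otherwise split one record across two lines).

```python
SEPARATOR = "#$#"  # field separator used by the pipeline in the article


def format_record(url, keywords, description):
    """Build one output line, collapsing whitespace so each record stays on one line."""
    fields = [url, keywords, description]
    cleaned = [" ".join(str(f).split()) for f in fields]
    return SEPARATOR.join(cleaned) + "\n"


def parse_record(line):
    """Split one output line back into a dict (inverse of format_record)."""
    url, keywords, description = line.rstrip("\n").split(SEPARATOR, 2)
    return {"url": url, "keywords": keywords, "description": description}
```

Reading the stored data back is then one `parse_record` call per line of sitemap_data.txt.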

Finally, here is the settings.py file:

# Scrapy settings for sitemap project
#
# For simplicity, this file contains only the most important settings by
# default. All the other settings are documented here:
#
#     http://doc.scrapy.org/topics/settings.html
#
BOT_NAME = 'sitemap hello,world~!'
BOT_VERSION = '1.0'

SPIDER_MODULES = ['sitemap.spiders']
NEWSPIDER_MODULE = 'sitemap.spiders'
USER_AGENT = '%s/%s' % (BOT_NAME, BOT_VERSION)

DOWNLOAD_DELAY = 0

ITEM_PIPELINES = [
    'sitemap.pipelines.SitemapPipeline'
]

# Lower numbers run first for process_request, so ProxyMiddleware sets
# request.meta['proxy'] before HttpProxyMiddleware handles the request.
DOWNLOADER_MIDDLEWARES = {
    'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 110,
    'sitemap.middlewares.ProxyMiddleware': 100,
}

CONCURRENT_ITEMS = 128
CONCURRENT_REQUESTS = 64
CONCURRENT_REQUESTS_PER_DOMAIN = 64

LOG_ENABLED = True
LOG_ENCODING = 'utf-8'
LOG_FILE = '/home/xxx/services_runenv/crawlers/sitemap/sitemap/log/sitemap.log'
LOG_LEVEL = 'DEBUG'
LOG_STDOUT = False

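This concludes the introduction to scrapy. I have not yet worked with more complex applications; I plan to come back and study scrapy's source code after I finish reading the redis source. I hope sharing this helps those who are just getting started with scrapy.

I like everything simple and practical and reject the complicated and flashy. I am not a GEEK.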
