The following example shows how to crawl Shenzhen rental listings from Anjuke (安居客). The strategy has two stages: first collect the URL of every rental listing, then request each of those URLs and extract the fields we need from the detail pages. Be aware that after enough requests the site redirects you to a CAPTCHA page; a few ways to soften this (throttling, a realistic User-Agent) are sketched at the end of the post.

If you have not yet installed and set up Scrapy, see my other article 《快速部署网络爬虫框架scrapy》.

1. Create the projects:

  Go to your working directory and run scrapy startproject anjuke_urls

  In the same directory, run scrapy startproject anjuke_zufang

2. Create the spider files:

  Inside the anjuke_urls project, run scrapy genspider anjuke_urls https://sz.zu.anjuke.com/ (the generated spider is renamed anjuke_getUrls in the code below); create the anjuke_zufang spider in its own project the same way. The resulting layout is shown below.
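  After these commands, each project should look roughly like this (shown for anjuke_urls; anjuke_zufang is the same apart from the names):

 anjuke_urls/
     scrapy.cfg
     anjuke_urls/
         __init__.py
         items.py
         middlewares.py
         pipelines.py
         settings.py
         spiders/
             __init__.py
             anjuke_urls.py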

3. Code for the anjuke_urls spider:

  anjuke_urls.py

 # -*- coding: utf-8 -*-
 import scrapy
 from ..items import AnjukeUrlsItem


 class AnjukeGeturlsSpider(scrapy.Spider):
     name = 'anjuke_getUrls'
     start_urls = ['https://sz.zu.anjuke.com/']

     def parse(self, response):
         # Extract every rental listing link on the current page
         links = response.xpath("//div[@class='zu-itemmod']/a/@href | //div[@class='zu-itemmod ']/a/@href").extract()
         for link in links:
             # Create a fresh item for each link so every yielded item is independent
             mylink = AnjukeUrlsItem()
             mylink['url'] = link
             yield mylink
         # If the "next page" button exists, follow it and keep parsing
         # until every listing link has been collected
         if len(response.xpath("//a[@class = 'aNxt']")) != 0:
             yield scrapy.Request(response.xpath("//a[@class = 'aNxt']/@href").extract()[0], callback=self.parse)
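  If you are on Scrapy 1.4 or newer, the pagination step can also be written with response.follow, which accepts relative URLs. A roughly equivalent sketch:

 # Roughly equivalent pagination using response.follow (Scrapy >= 1.4)
 next_page = response.xpath("//a[@class = 'aNxt']/@href").extract_first()
 if next_page:
     yield response.follow(next_page, callback=self.parse)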

  items.py

 # -*- coding: utf-8 -*-

 # Define here the models for your scraped items
 #
 # See documentation in:
 # http://doc.scrapy.org/en/latest/topics/items.html

 import scrapy


 class AnjukeUrlsItem(scrapy.Item):
     # define the fields for your item here like:
     # name = scrapy.Field()
     # rental listing link
     url = scrapy.Field()

  pipelines.py

 # -*- coding: utf-8 -*-

 # Define your item pipelines here
 #
 # Don't forget to add your pipeline to the ITEM_PIPELINES setting
 # See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html


 class AnjukeUrlsPipeline(object):
     # Open link.txt for writing (the file is created if it does not exist)
     def open_spider(self, spider):
         self.linkFile = open('G:\\Python\\网络爬虫\\anjuke\\data\\link.txt', 'w', encoding='utf-8')

     # Write every collected URL to the file, one per line
     def process_item(self, item, spider):
         self.linkFile.write(item['url'] + "\n")
         return item

     # Close the file when the spider finishes
     def close_spider(self, spider):
         self.linkFile.close()
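  As a side note, for output this simple you could skip the custom pipeline and let Scrapy's built-in feed export write the file instead, for example (the output filename is just an illustration):

 scrapy crawl anjuke_getUrls -o link.csv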

  settings.py

 # -*- coding: utf-8 -*-

 # Scrapy settings for anjuke_urls project
 #
 # For simplicity, this file contains only settings considered important or
 # commonly used. You can find more settings consulting the documentation:
 #
 #     http://doc.scrapy.org/en/latest/topics/settings.html
 #     http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
 #     http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html

 BOT_NAME = 'anjuke_urls'

 SPIDER_MODULES = ['anjuke_urls.spiders']
 NEWSPIDER_MODULE = 'anjuke_urls.spiders'

 # Crawl responsibly by identifying yourself (and your website) on the user-agent
 USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'

 # Obey robots.txt rules
 ROBOTSTXT_OBEY = False

 # Configure maximum concurrent requests performed by Scrapy (default: 16)
 #CONCURRENT_REQUESTS = 32

 # Configure a delay for requests for the same website (default: 0)
 # See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
 # See also autothrottle settings and docs
 #DOWNLOAD_DELAY = 3
 # The download delay setting will honor only one of:
 #CONCURRENT_REQUESTS_PER_DOMAIN = 16
 #CONCURRENT_REQUESTS_PER_IP = 16

 # Disable cookies (enabled by default)
 #COOKIES_ENABLED = False

 # Disable Telnet Console (enabled by default)
 #TELNETCONSOLE_ENABLED = False

 # Override the default request headers:
 #DEFAULT_REQUEST_HEADERS = {
 #   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
 #   'Accept-Language': 'en',
 #}

 # Enable or disable spider middlewares
 # See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
 #SPIDER_MIDDLEWARES = {
 #    'anjuke_urls.middlewares.AnjukeUrlsSpiderMiddleware': 543,
 #}

 # Enable or disable downloader middlewares
 # See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
 #DOWNLOADER_MIDDLEWARES = {
 #    'anjuke_urls.middlewares.MyCustomDownloaderMiddleware': 543,
 #}

 # Enable or disable extensions
 # See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
 #EXTENSIONS = {
 #    'scrapy.extensions.telnet.TelnetConsole': None,
 #}

 # Configure item pipelines
 # See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
 ITEM_PIPELINES = {
     'anjuke_urls.pipelines.AnjukeUrlsPipeline': 300,
 }

 # Enable and configure the AutoThrottle extension (disabled by default)
 # See http://doc.scrapy.org/en/latest/topics/autothrottle.html
 #AUTOTHROTTLE_ENABLED = True
 # The initial download delay
 #AUTOTHROTTLE_START_DELAY = 5
 # The maximum download delay to be set in case of high latencies
 #AUTOTHROTTLE_MAX_DELAY = 60
 # The average number of requests Scrapy should be sending in parallel to
 # each remote server
 #AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
 # Enable showing throttling stats for every response received:
 #AUTOTHROTTLE_DEBUG = False

 # Enable and configure HTTP caching (disabled by default)
 # See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
 #HTTPCACHE_ENABLED = True
 #HTTPCACHE_EXPIRATION_SECS = 0
 #HTTPCACHE_DIR = 'httpcache'
 #HTTPCACHE_IGNORE_HTTP_CODES = []
 #HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
  middlewares.py
 # -*- coding: utf-8 -*-

 # Define here the models for your spider middleware
 #
 # See documentation in:
 # http://doc.scrapy.org/en/latest/topics/spider-middleware.html

 from scrapy import signals


 class AnjukeUrlsSpiderMiddleware(object):
     # Not all methods need to be defined. If a method is not defined,
     # scrapy acts as if the spider middleware does not modify the
     # passed objects.

     @classmethod
     def from_crawler(cls, crawler):
         # This method is used by Scrapy to create your spiders.
         s = cls()
         crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
         return s

     def process_spider_input(self, response, spider):
         # Called for each response that goes through the spider
         # middleware and into the spider.
         # Should return None or raise an exception.
         return None

     def process_spider_output(self, response, result, spider):
         # Called with the results returned from the Spider, after
         # it has processed the response.
         # Must return an iterable of Request, dict or Item objects.
         for i in result:
             yield i

     def process_spider_exception(self, response, exception, spider):
         # Called when a spider or process_spider_input() method
         # (from other spider middleware) raises an exception.
         # Should return either None or an iterable of Response, dict
         # or Item objects.
         pass

     def process_start_requests(self, start_requests, spider):
         # Called with the start requests of the spider, and works
         # similarly to the process_spider_output() method, except
         # that it doesn't have a response associated.
         # Must return only requests (not items).
         for r in start_requests:
             yield r

     def spider_opened(self, spider):
         spider.logger.info('Spider opened: %s' % spider.name)

4. Code for the anjuke_zufang spider:

  anjuke_zufang.py

 # -*- coding: utf-8 -*-
 import scrapy
 from ..items import AnjukeZufangItem


 class AnjukeZufangSpider(scrapy.Spider):
     name = 'anjuke_zufang'
     # Start with an empty list; it is filled from link.txt in __init__
     start_urls = []
     custom_settings = {'DOWNLOAD_DELAY': 3}

     # Populating start_urls from the links collected by the first spider is
     # required; without it this spider has nothing to crawl
     def __init__(self):
         links = open('G:\\Python\\网络爬虫\\anjuke\\data\\link.txt', encoding='utf-8')
         for line in links:
             # Strip the trailing newline; the URL cannot be requested with it
             line = line[:-1]
             self.start_urls.append(line)
         links.close()

     def parse(self, response):
         item = AnjukeZufangItem()
         # Pull the fields we need straight from the detail page
         item['roomRent'] = response.xpath('//span[@class = "f26"]/text()').extract()[0]
         item['rentMode'] = response.xpath('//div[@class="pinfo"]/div/div/div[1]/dl[2]/dd/text()').extract()[0].strip()
         item['roomLayout'] = response.xpath('//div[@class="pinfo"]/div/div/div[1]/dl[3]/dd/text()').extract()[0].strip()
         item['roomSize'] = response.xpath('//div[@class="pinfo"]/div/div/div[2]/dl[3]/dd/text()').extract()[0]
         item['LeaseMode'] = response.xpath('//div[@class="pinfo"]/div/div/div[1]/dl[4]/dd/text()').extract()[0]
         item['apartmentName'] = response.xpath('//div[@class="pinfo"]/div/div/div[1]/dl[5]/dd/a/text()').extract()[0]
         item['location1'] = response.xpath('//div[@class="pinfo"]/div/div/div[1]/dl[6]/dd/a/text()').extract()[0]
         item['location2'] = response.xpath('//div[@class="pinfo"]/div/div/div[1]/dl[6]/dd/a[2]/text()').extract()[0]
         item['floor'] = response.xpath('//div[@class="pinfo"]/div/div/div[2]/dl[5]/dd/text()').extract()[0]
         item['orientation'] = response.xpath('//div[@class="pinfo"]/div/div/div[2]/dl[4]/dd/text()').extract()[0].strip()
         item['decorationSituation'] = response.xpath('//div[@class="pinfo"]/div/div/div[2]/dl[2]/dd/text()').extract()[0]
         item['intermediaryName'] = response.xpath('//h2[@class="f16"]/text()').extract()[0]
         item['intermediaryPhone'] = response.xpath('//p[@class="broker-mobile"]/text()').extract()[0]
         item['intermediaryCompany'] = response.xpath('//div[@class="broker-company"]/p[1]/a/text()').extract()[0]
         item['intermediaryStore'] = response.xpath('//div[@class="broker-company"]/p[2]/a/text()').extract()[0]
         item['link'] = response.url
         yield item
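  Filling the class-level start_urls inside __init__ works, but reading the file lazily in start_requests() is a slightly cleaner pattern. A sketch under the same hard-coded path assumption, which could replace __init__ in AnjukeZufangSpider:

     # Alternative: yield requests straight from link.txt instead of pre-filling start_urls
     def start_requests(self):
         with open('G:\\Python\\网络爬虫\\anjuke\\data\\link.txt', encoding='utf-8') as links:
             for line in links:
                 url = line.strip()          # drop the trailing newline
                 if url:
                     yield scrapy.Request(url, callback=self.parse)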

  items.py

 # -*- coding: utf-8 -*-

 # Define here the models for your scraped items
 #
 # See documentation in:
 # http://doc.scrapy.org/en/latest/topics/items.html

 import scrapy


 class AnjukeZufangItem(scrapy.Item):
     # define the fields for your item here like:
     # name = scrapy.Field()
     # rent
     roomRent = scrapy.Field()
     # rent payment / deposit scheme
     rentMode = scrapy.Field()
     # room layout
     roomLayout = scrapy.Field()
     # floor area
     roomSize = scrapy.Field()
     # lease type (whole flat / shared)
     LeaseMode = scrapy.Field()
     # residential compound
     apartmentName = scrapy.Field()
     # location
     location1 = scrapy.Field()
     location2 = scrapy.Field()
     # floor
     floor = scrapy.Field()
     # orientation
     orientation = scrapy.Field()
     # decoration
     decorationSituation = scrapy.Field()
     # agent name
     intermediaryName = scrapy.Field()
     # agent phone
     intermediaryPhone = scrapy.Field()
     # agent company
     intermediaryCompany = scrapy.Field()
     # agent branch
     intermediaryStore = scrapy.Field()
     # listing URL
     link = scrapy.Field()

  pipelines.py

 # -*- coding: utf-8 -*-

 # Define your item pipelines here
 #
 # Don't forget to add your pipeline to the ITEM_PIPELINES setting
 # See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html


 class AnjukeZufangPipeline(object):
     def open_spider(self, spider):
         self.file = open('G:\\Python\\网络爬虫\\anjuke\\data\\租房信息.txt', 'w', encoding='utf-8')

     def process_item(self, item, spider):
         # Join the fields into one comma-separated line per listing
         self.file.write(
             item['roomRent'] + "," + item['rentMode'] + "," + item['roomLayout'] + "," + item['roomSize'] + "," +
             item['LeaseMode'] + "," + item['apartmentName'] + "," + item['location1'] + " " + item['location2'] + "," +
             item['floor'] + "," + item['orientation'] + "," + item['decorationSituation'] + "," +
             item['intermediaryName'] + "," + item['intermediaryPhone'] + "," + item['intermediaryCompany'] + "," +
             item['intermediaryStore'] + "," + item['link'] + '\n')
         return item

     # Close the output file when the spider finishes
     def close_spider(self, spider):
         self.file.close()
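  Joining the fields with commas by hand means any comma inside a field (agent names, compound names) shifts the columns. A possible alternative, not part of the original project and using a made-up output filename, is to let Python's csv module handle the quoting; register it in ITEM_PIPELINES the same way if you use it:

 import csv


 class AnjukeZufangCsvPipeline(object):
     """Hypothetical CSV variant of the pipeline; field names match the item above."""
     FIELDS = ['roomRent', 'rentMode', 'roomLayout', 'roomSize', 'LeaseMode',
               'apartmentName', 'location1', 'location2', 'floor', 'orientation',
               'decorationSituation', 'intermediaryName', 'intermediaryPhone',
               'intermediaryCompany', 'intermediaryStore', 'link']

     def open_spider(self, spider):
         # newline='' so csv.writer controls the line endings itself
         self.file = open('zufang.csv', 'w', encoding='utf-8', newline='')
         self.writer = csv.writer(self.file)
         self.writer.writerow(self.FIELDS)   # header row

     def process_item(self, item, spider):
         self.writer.writerow([item.get(field, '') for field in self.FIELDS])
         return item

     def close_spider(self, spider):
         self.file.close()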

  settings.py

 # -*- coding: utf-8 -*-

 # Scrapy settings for anjuke_zufang project. Apart from the names below, the
 # file is identical to the anjuke_urls settings shown earlier; the remaining
 # commented-out defaults generated by Scrapy are omitted here.

 BOT_NAME = 'anjuke_zufang'

 SPIDER_MODULES = ['anjuke_zufang.spiders']
 NEWSPIDER_MODULE = 'anjuke_zufang.spiders'

 # Crawl responsibly by identifying yourself (and your website) on the user-agent
 USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'

 # Obey robots.txt rules
 ROBOTSTXT_OBEY = False

 # Configure item pipelines
 # See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
 ITEM_PIPELINES = {
     'anjuke_zufang.pipelines.AnjukeZufangPipeline': 300,
 }

  middlewares.py

  This file is the unmodified template generated by scrapy startproject; it is identical to the anjuke_urls version shown earlier, except that the class is named AnjukeZufangSpiderMiddleware.

5. Run the spiders in order:

  Go into the anjuke_urls project and run scrapy crawl anjuke_getUrls

  Then go into the anjuke_zufang project and run scrapy crawl anjuke_zufang; it reads the link.txt written by the first spider, so the first crawl must have finished.
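
Finally, on the CAPTCHA redirect mentioned at the start: the simplest mitigations are slowing the crawl down and sending a realistic User-Agent (already set in both settings files). A minimal settings.py sketch; the values are assumptions to tune, not taken from the original project:

 # Throttle the crawl so the site is less likely to redirect to a CAPTCHA page
 DOWNLOAD_DELAY = 3                    # seconds between requests to the same site
 RANDOMIZE_DOWNLOAD_DELAY = True       # jitter the delay between 0.5x and 1.5x
 CONCURRENT_REQUESTS_PER_DOMAIN = 2    # keep per-domain concurrency low
 AUTOTHROTTLE_ENABLED = True           # let Scrapy adapt the delay to server load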
