Scrapy is a fun crawling framework. The basic usage is: feed it a set of starting URLs, let the spider GET those pages, then parse each page and extract whatever you are interested in.

Using it feels a lot like Django: there is a settings module, there are Field definitions, and it auto-generates a bunch of scaffolding for you.

Usage: run scrapy-admin.py startproject abc to generate a project (newer Scrapy releases use scrapy startproject abc); try it and see what gets generated.
Create a new .py file inside the spiders package and put your own spider class in it.
The spider class must define the attributes name and start_urls (very old releases used domain_name instead of name), plus an instance method parse(self, response); a bare skeleton is sketched below.
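
A bare skeleton, with placeholder names, looks roughly like this (the real, working spider I used appears further down):

from scrapy.spider import BaseSpider

class MySpider(BaseSpider):
    name = 'example'                           # required: unique spider name
    allowed_domains = ['example.com']          # optional: stay on these domains
    start_urls = ['http://www.example.com/']   # where crawling starts

    def parse(self, response):
        # default callback for every downloaded response;
        # must return/yield Items and/or further Requests
        return []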

The spider class is instantiated when Scrapy looks up our spiders, and it is found automatically by the Scrapy engine.

How a spider runs:
  1. You start by generating the initial Requests to crawl the first URLs, and specify a callback function to be called with the response downloaded from those requests.

    The first requests to perform are obtained by calling the start_requests() method which (by default) generates Request for the URLs specified in the start_urls and the parse method as callback function for the Requests.   The key to this first step is start_requests(): it builds the initial Requests from start_urls, with parse as their callback.

  2. In the callback function, you parse the response (web page) and return either Item objects, Request objects, or an iterable of both. Those Requests will also contain a callback (maybe the same) and will then be downloaded by Scrapy and then their response handled by the specified callback.  In parse you can return Requests, Items, or an iterable/generator that yields them; the returned Requests are handed to the downloader, and the cycle keeps producing more URLs and Items.

  3. In callback functions, you parse the page contents, typically using Selectors (but you can also use BeautifulSoup, lxml or whatever mechanism you prefer) and generate items with the parsed data. You can parse the page however you like; Scrapy does not care how you build the items, it merely ships an XPath-based Selector. I have seen people use lxml; I prefer BeautifulSoup, even though it is by far the slowest.

  4. Finally, the items returned from the spider will be typically persisted to a database (in some Item Pipeline) or written to a file using Feed exports. The items end up in an Item Pipeline, where you can do anything with them: store them in a database, write them to a file, and so on; a rough pipeline sketch follows this list.
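
As a rough sketch (the class name, file name, and the assumption that every field is JSON-serializable are all made up for illustration, and the pipeline still has to be registered in the project's ITEM_PIPELINES setting), an Item Pipeline is just a class with a process_item method:

import json

class JsonWriterPipeline(object):
    """Illustrative pipeline that appends every scraped item to a JSON-lines file."""

    def __init__(self):
        # a real pipeline would also close this file in close_spider()
        self.f = open('items.jl', 'w')

    def process_item(self, item, spider):
        # called for every item the spider returns; must return the item
        # so that any later pipelines can keep processing it
        self.f.write(json.dumps(dict(item)) + "\n")
        return item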

Here is the spider I wrote this month to crawl qiushibaike.com:

from scrapy.spider import BaseSpider
import random, uuid
from BeautifulSoup import BeautifulSoup as BS
from scrapy.selector import HtmlXPathSelector
from tutorial.items import TutorialItem


def getname():
    # note: uuid1().hex is an attribute, not a method
    return uuid.uuid1().hex


class JKSpider(BaseSpider):
    name = 'joke'
    allowed_domains = ["qiushibaike.com"]
    start_urls = [
        "http://www.qiushibaike.com/month?slow",
    ]

    def parse(self, response):
        root = BS(response.body)
        items = []
        x = HtmlXPathSelector(response)
        # each joke sits in a div with class "content" and a title attribute
        y = x.select("//div[@class='content' and @title]/text()").extract()
        for i in y:
            item = TutorialItem()
            item["content"] = i
            items.append(item)
        return items
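
The TutorialItem it fills comes from tutorial/items.py, which the post does not show; judging from the single "content" field used above, it is presumably nothing more than:

from scrapy.item import Item, Field

class TutorialItem(Item):
    # the only field the spider above assigns
    content = Field()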

  Scrapy comes with some useful generic spiders that you can use, to subclass your spiders from. Their aim is to provide convenient functionality for a few common scraping cases, like following all links on a site based on certain rules, crawling from Sitemaps, or parsing a XML/CSV feed.

Scrapy ships with several generic spiders that you can subclass, e.g. for crawling a whole site by following links according to rules, crawling from a Sitemap, or parsing URLs out of an XML/CSV feed.
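
Whole-site crawling, for instance, is what CrawlSpider is for. A minimal sketch, assuming the pre-1.0 contrib import paths that match the API used in this post (the allow pattern and the callback name are placeholders):

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class MyCrawlSpider(CrawlSpider):
    name = 'crawl_example'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/']

    # follow every link whose URL matches the pattern and hand the
    # downloaded page to parse_item
    rules = (
        Rule(SgmlLinkExtractor(allow=(r'/page/\d+', )),
             callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        self.log("visited %s" % response.url)
        return []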

class scrapy.spider.BaseSpider

This is the simplest spider, and the one from which every other spider must inherit from (either the ones that come bundled with Scrapy, or the ones that you write yourself). It doesn’t provide any special functionality. It just requests the given start_urls/start_requests, and calls the spider’s method parse for each of the resulting responses.

This is the base class of every spider. It has no special functionality; it simply requests start_urls/start_requests and uses parse as the callback for each resulting response.

name

A string which defines the name for this spider. The spider name is how the spider is located (and instantiated) by Scrapy, so it must be unique. However, nothing prevents you from instantiating more than one instance of the same spider. This is the most important spider attribute and it’s required.

If the spider scrapes a single domain, a common practice is to name the spider after the domain, or without the TLD. So, for example, a spider that crawls mywebsite.com would often be called mywebsite.

name must be unique, so naming the spider after the domain it crawls is the easiest choice; domains are pretty unique, after all.

allowed_domains

An optional list of strings containing domains that this spider is allowed to crawl. Requests for URLs not belonging to the domain names specified in this list won’t be followed if OffsiteMiddleware is enabled.

URLs that do not belong to these domains will not be followed, provided OffsiteMiddleware is enabled.

start_urls

A list of URLs where the spider will begin to crawl from, when no particular URLs are specified. So, the first pages downloaded will be those listed here. The subsequent URLs will be generated successively from data contained in the start URLs.

The list of URLs the spider starts crawling from; nothing more to say.

start_requests()

This method must return an iterable with the first Requests to crawl for this spider.

This method must return an iterable that yields the initial Requests.

This is the method called by Scrapy when the spider is opened for scraping when no particular URLs are specified. If particular URLs are specified, the make_requests_from_url() is used instead to create the Requests. This method is also called only once from Scrapy, so it’s safe to implement it as a generator.

This method is called when the spider is opened for scraping and no particular URLs were specified (I take that to mean passing URLs on the scrapy command line). If particular URLs are specified, make_requests_from_url() is used instead to create the Requests. Scrapy calls this method only once, so it is safe to implement it as a generator.

The default implementation uses make_requests_from_url() to generate Requests for each url in start_urls.

If you want to change the Requests used to start scraping a domain, this is the method to override. For example, if you need to start by logging in using a POST request, you could do:

By default it calls make_requests_from_url() to build a Request for each URL in start_urls. Override this method if you want to customize the initial requests, for example to log in with a POST request first:

def start_requests(self):
    return [FormRequest("http://www.example.com/login",
                        formdata={'user': 'john', 'pass': 'secret'},
                        callback=self.logged_in)]

def logged_in(self, response):
    # here you would extract links to follow and return Requests for
    # each of them, with another callback
    pass

With these two methods on your spider you can scrape data that is only visible after logging in (FormRequest is imported from scrapy.http).

make_requests_from_url(url)

A method that receives a URL and returns a Request object (or a list of Request objects) to scrape. This method is used to construct the initial requests in the start_requests() method, and is typically used to convert urls to requests.

Unless overridden, this method returns Requests with the parse() method as their callback function, and with dont_filter parameter enabled (see Request class for more info).

This is what was mentioned above: it turns a URL into a Request. Unless overridden, the resulting Request uses parse() as its callback and has dont_filter enabled.
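
If, say, you wanted every initial request to carry a custom header, overriding it could look roughly like this (the header value and spider name are made up):

from scrapy.http import Request
from scrapy.spider import BaseSpider

class HeaderSpider(BaseSpider):
    name = 'header_example'
    start_urls = ['http://www.example.com/']

    def make_requests_from_url(self, url):
        # same contract as the default implementation: one URL in, one
        # Request out, with parse() as callback and dont_filter enabled
        return Request(url, callback=self.parse, dont_filter=True,
                       headers={'User-Agent': 'my-custom-agent'})

    def parse(self, response):
        return []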

parse(response)

This is the default callback used by Scrapy to process downloaded responses, when their requests don’t specify a callback.

The parse method is in charge of processing the response and returning scraped data and/or more URLs to follow. Other Requests callbacks have the same requirements as the BaseSpider class.

This method, as well as any other Request callback, must return an iterable of Request and/or Item objects.

Parameters:

response (Response object) – the response to parse

This method must return Requests and/or Items.

log(message[, level, component])

Log a message using the scrapy.log.msg() function, automatically populating the spider argument with the name of this spider. For more information see Logging.
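
Typical usage is a one-liner inside a callback, something like:

from scrapy.spider import BaseSpider

class LoggingSpider(BaseSpider):
    name = 'logging_example'
    start_urls = ['http://www.example.com/']

    def parse(self, response):
        # goes through scrapy.log.msg() with this spider's name filled in
        self.log("Crawled %s" % response.url)
        return []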

A complete example:

from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider
from scrapy.http import Request
from myproject.items import MyItem


class MySpider(BaseSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = [
        'http://www.example.com/1.html',
        'http://www.example.com/2.html',
        'http://www.example.com/3.html',
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        # one item per <h3> on the page
        for h3 in hxs.select('//h3').extract():
            yield MyItem(title=h3)
        # follow every link and parse it with this same callback
        for url in hxs.select('//a/@href').extract():
            yield Request(url, callback=self.parse)
