[Goal] The tasks to complete are as follows:

※ Create a Scrapy project.
※ Create a Spider to crawl the site and process the data.
※ Export the scraped content from the command line.
※ Save the scraped content to a MongoDB database.
==============================================

[Preparation] Have the Scrapy framework, MongoDB, and the PyMongo library installed.

1. Create the project

[Operation] In the directory where you want to create the project, Shift + right-click and choose "Open command window here" (or cd into the desired directory from cmd), then run the command below. (Don't use "test" as the project name: a module with that name already exists on the system and Scrapy will report that it already exists.)

Note: on Linux, if you hit a permission error, prefix the command with sudo.

scrapy startproject tutorial   # creates the project; the last argument, tutorial, is the project name and will also be the folder name

[Note] The generated directory structure is as follows:

tutorial/          # top-level project folder
    tutorial/      # tutorial package: the modules the project imports
    scrapy.cfg     # configuration file used when deploying Scrapy

[Note] Files under ../tutorial/tutorial and what they do:

tutorial/
    __init__.py
    items.py        # Item definitions: the data structures to scrape
    middlewares.py  # Middleware definitions: the middlewares used while crawling
    pipelines.py    # Pipeline definitions: the data-processing pipelines
    settings.py     # project settings
    spiders/        # folder that holds the Spiders
        __init__.py

2. Create the Spider

[Operation] A Spider is a class you define yourself; Scrapy uses it to scrape content from web pages and parse the results. The class must inherit from the Spider class Scrapy provides, scrapy.Spider, and must define the Spider's name, its initial requests, and a method for handling the crawled results. A Spider can also be created from the command line. For example, to generate the quotes Spider, run the following cmd commands.

[Command explanation] Enter the tutorial folder created above, then run the genspider command; the first argument is the Spider's name, the second is the site's domain:

cd tutorial
scrapy genspider quotes quotes.toscrape.com

[Note] When this finishes, the spiders folder contains a new file, quotes.py (path: E:\a\scrapy_test\tutorial\tutorial\spiders). This is the Spider just created; its content is shown below (the URL and name here are placeholders only; the real code appears in step 4):

# -*- coding: utf-8 -*-
import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'baidu'
    allowed_domains = ['fanyi.baidu.com']
    start_urls = ['http://fanyi.baidu.com/']

    def parse(self, response):
        pass

[Note] There are three attributes here — name, allowed_domains, and start_urls — and one method, parse (a short sketch follows this list):

★ name: the unique name of the Spider within the project, used to tell different Spiders apart.

★ allowed_domains: the domains the Spider is allowed to crawl; initial or follow-up request links outside these domains are filtered out.

★ start_urls: the list of URLs the Spider crawls on startup; the initial requests are defined by it.

★ parse: a method of the Spider. By default, when a request built from a link in start_urls has finished downloading, the returned response is passed to this method as its only argument. The method is responsible for parsing the response, extracting data, or generating further requests to process.
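
The relationship between start_urls and parse can be made explicit. Below is a minimal, hypothetical sketch (not part of the generated project) showing that listing URLs in start_urls is essentially shorthand for yielding one scrapy.Request per URL with parse as the callback:

import scrapy

class MinimalSpider(scrapy.Spider):
    name = 'minimal'
    allowed_domains = ['quotes.toscrape.com']

    # Listing URLs in start_urls is shorthand for this: Scrapy calls
    # start_requests() and, by default, yields one Request per URL in
    # start_urls with self.parse as the callback.
    def start_requests(self):
        yield scrapy.Request(url='http://quotes.toscrape.com/', callback=self.parse)

    def parse(self, response):
        # response is the downloaded page for the request yielded above
        self.logger.info('Got %s (status %s)', response.url, response.status)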

3. Create the Item

[Note] An Item is a container for the scraped data and is used much like a dictionary. Compared with a plain dictionary, however, an Item adds an extra layer of protection against misspelled or undefined field names (see the sketch after the code below).
To create an Item, inherit from the scrapy.Item class and define fields of type scrapy.Field.

[Operation] Define the Item. Looking at the target site, the content we can extract is text, author, and tags. Modify items.py as follows:

import scrapy

class QuoteItem(scrapy.Item):
    text = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()
    # Three fields are defined here; we will use this Item when crawling.
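
The "protection" mentioned above can be seen by using the Item directly. A small sketch (assuming the QuoteItem defined above is in scope): assigning a field that was never declared raises a KeyError instead of silently creating a new key, which is how typos get caught:

item = QuoteItem()
item['text'] = 'some quote'      # fine: 'text' is a declared field
print(item['text'])

try:
    item['txet'] = 'oops'        # misspelled field name
except KeyError as e:
    print('Rejected undeclared field:', e)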

4. Parse the response

[Note] As seen in quotes.py, the response argument of the parse() method is the result of crawling the links in start_urls. So inside parse() we can work on the contents of the response variable directly: inspect the page source returned by the request, analyze that source further, or find links in the result that lead to the next request.

[Page analysis] The page contains both the results we want and a link to the next page, and both need to be handled. First look at the page structure, shown in Figure 13-2. Every page has multiple blocks with class quote, and each block contains text, author, and tags. So we first select all the quote blocks, then extract the content of each one.

quotes.py is written as follows (a standalone selector demo appears after the code):

# -*- coding: utf-8 -*-
import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        quotes = response.css('.quote')
        for quote in quotes:
            text = quote.css('.text::text').extract_first()
            author = quote.css('.author::text').extract_first()
            tags = quote.css('.tags .tag::text').extract()
            # or, from <meta class="keywords" itemprop="keywords" content="change,deep-thoughts,thinking,world">:
            tags = quote.css('.keywords::attr(content)').extract_first()
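
The CSS selectors above can be tried outside a full crawl. Here is a small sketch using scrapy.Selector on an inline HTML fragment modeled on a quote block (the fragment is made up for illustration); it shows the difference between extract_first(), which returns one string, and extract(), which returns a list:

from scrapy import Selector

html = '''
<div class="quote">
  <span class="text">“A witty saying proves nothing.”</span>
  <small class="author">Voltaire</small>
  <div class="tags">
    <a class="tag">wit</a>
    <a class="tag">philosophy</a>
  </div>
</div>
'''

sel = Selector(text=html)
print(sel.css('.text::text').extract_first())    # one string: the quote text
print(sel.css('.author::text').extract_first())  # 'Voltaire'
print(sel.css('.tags .tag::text').extract())     # list: ['wit', 'philosophy']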

5. Use the Item

[Note] The Item was defined above; now we use it. An Item can be thought of as a dictionary, except that it has to be instantiated before use. Assign each field of the Item from the values just parsed, then return (yield) the Item.
QuotesSpider, rewritten in quotes.py, looks like this:

# -*- coding: utf-8 -*-
import scrapy
from tutorial.items import QuoteItem

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        quotes = response.css('.quote')
        for quote in quotes:
            item = QuoteItem()
            item['text'] = quote.css('.text::text').extract_first()
            item['author'] = quote.css('.author::text').extract_first()
            item['tags'] = quote.css('.tags .tag::text').extract()
            yield item

[Code explanation] With this, all content on the first page is parsed and assigned to QuoteItem instances.

6. Follow-up requests (generating the next-page request)

[Note] To crawl the next page's content, find the relevant link on the current page and generate a request for the next page; repeating this lets us crawl the whole site.

[Analysis] Find the "Next" button on the page and inspect its source.

[Method] To construct a request we need: scrapy.Request(url=url, callback=self.parse)

Parameter 1: url, the link to request.

Parameter 2: callback, the callback function. When the request carrying this callback completes and a response is obtained, the engine passes that response to the callback as an argument. The callback parses the response or generates the next request; here the callback is the parse() function just written.

Since parse() is the method that extracts text, author, and tags, and the next page has the same structure as the page just parsed, we can reuse parse() to parse it.

[Operation] The next-page code is as follows:

next = response.css('.pager .next a::attr(href)').extract_first()
url = response.urljoin(next)
yield scrapy.Request(url=url, callback=self.parse)

[Source explanation]:

The first line uses a CSS selector to get the link to the next page, i.e. the href attribute of the a element, via the ::attr(href) syntax, and then calls extract_first() to get its value.
The second line calls urljoin(), which turns a relative URL into an absolute one. For example, if the next-page link obtained is /page/2/, urljoin() produces http://quotes.toscrape.com/page/2/.
The third line constructs a new request from the url and callback variables; the callback is still the parse() method. When this request completes, its response is processed by parse() again, producing page 2's results and then generating the request for page 2's next page, i.e. page 3.

This puts the crawler into a loop that runs until the last page. With just a few lines of code we have implemented a crawl loop that scrapes the results of every page (a small urljoin demo follows below).
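
response.urljoin() resolves a link against the current response's URL (for HTML responses it also honors a <base> tag if present). A quick sketch with the standard library's urllib.parse.urljoin, which applies the same joining rules, shows how a relative link becomes an absolute one; the example URLs follow this tutorial's site:

from urllib.parse import urljoin

# response.urljoin(next) is essentially urljoin(response.url, next)
print(urljoin('http://quotes.toscrape.com/', '/page/2/'))
# -> http://quotes.toscrape.com/page/2/
print(urljoin('http://quotes.toscrape.com/page/2/', '/page/3/'))
# -> http://quotes.toscrape.com/page/3/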

[Final quotes.py code]:

# -*- coding: utf-8 -*-
import scrapy

from tutorial.items import QuoteItem

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        quotes = response.css('.quote')
        for quote in quotes:
            item = QuoteItem()
            item['text'] = quote.css('.text::text').extract_first()
            item['author'] = quote.css('.author::text').extract_first()
            item['tags'] = quote.css('.tags .tag::text').extract()
            yield item

        next = response.css('.pager .next a::attr(href)').extract_first()
        url = response.urljoin(next)
        yield scrapy.Request(url=url, callback=self.parse)

7. Run the spider

[Before running] Remember to set the robots.txt option in settings.py to False. It defaults to True, which obeys the site's robots.txt rules and may filter out many URLs:

# Obey robots.txt rules; defaults to True, i.e. follow the site's rules, which filters some URLs out
ROBOTSTXT_OBEY = False

[Run the spider] Enter the project directory (E:\a\scrapy_test\tutorial) and run this command to start crawling:

scrapy crawl quotes

[Result] as follows:

  1. scrapy crawl quotes
  2.  
  3. 'tags': ['friendship', 'love'],
  4. 'text': '“There is nothing I would not do for those who are...'}
  5. 2019-04-26 13:53:50 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes
  6. .toscrape.com/page/6/>
  7. {'author': 'Eleanor Roosevelt',
  8. 'tags': ['attributed', 'fear', 'inspiration'],
  9. 'text': '“Do one thing every day that scares you.”'}
  10. 2019-04-26 13:53:50 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes
  11. .toscrape.com/page/6/>
  12. {'author': 'Marilyn Monroe',
  13. 'tags': ['attributed-no-source'],
  14. 'text': '“I am good, but not an angel. I do sin, but I am n...'}
  15. 2019-04-26 13:53:50 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes
  16. .toscrape.com/page/6/>
  17. {'author': 'Albert Einstein',
  18. 'tags': ['music'],
  19. 'text': '“If I were not a physicist, I would probably be a...'}
  20. 2019-04-26 13:53:50 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes
  21. .toscrape.com/page/6/>
  22. {'author': 'Haruki Murakami',
  23. 'tags': ['books', 'thought'],
  24. 'text': '“If you only read the books that everyone else is...'}
  25. 2019-04-26 13:53:50 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes
  26. .toscrape.com/page/6/>
  27. {'author': 'Alexandre Dumas fils',
  28. 'tags': ['misattributed-to-einstein'],
  29. 'text': '“The difference between genius and stupidity is: g...'}
  30. 2019-04-26 13:53:50 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes
  31. .toscrape.com/page/6/>
  32. {'author': 'Stephenie Meyer',
  33. 'tags': ['drug', 'romance', 'simile'],
  34. 'text': "“He's like a drug for you, Bella.”"}
  35. 2019-04-26 13:53:50 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes
  36. .toscrape.com/page/6/>
  37. {'author': 'Ernest Hemingway',
  38. 'tags': ['books', 'friends', 'novelist-quotes'],
  39. 'text': '“There is no friend as loyal as a book.”'}
  40. 2019-04-26 13:53:50 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes
  41. .toscrape.com/page/6/>
  42. {'author': 'Helen Keller',
  43. 'tags': ['inspirational'],
  44. 'text': '“When one door of happiness closes, another opens;...'}
  45. 2019-04-26 13:53:50 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes
  46. .toscrape.com/page/6/>
  47. {'author': 'George Bernard Shaw',
  48. 'tags': ['inspirational', 'life', 'yourself'],
  49. 'text': "“Life isn't about finding yourself. Life is about..."}
  50. 2019-04-26 13:53:50 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes
  51. .toscrape.com/page/7/> (referer: http://quotes.toscrape.com/page/6/)
  52. 2019-04-26 13:53:50 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes
  53. .toscrape.com/page/7/>
  54. {'author': 'Charles Bukowski',
  55. 'tags': ['alcohol'],
  56. 'text': "“That's the problem with drinking, I thought, as I..."}
  57. 2019-04-26 13:53:50 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes
  58. .toscrape.com/page/7/>
  59. {'author': 'Suzanne Collins',
  60. 'tags': ['the-hunger-games'],
  61. 'text': '“You don’t forget the face of the person who was y...'}
  62. 2019-04-26 13:53:50 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes
  63. .toscrape.com/page/7/>
  64. {'author': 'Suzanne Collins',
  65. 'tags': ['humor'],
  66. 'text': "“Remember, we're madly in love, so it's all right..."}
  67. 2019-04-26 13:53:50 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes
  68. .toscrape.com/page/7/>
  69. {'author': 'C.S. Lewis',
  70. 'tags': ['love'],
  71. 'text': '“To love at all is to be vulnerable. Love anything...'}
  72. 2019-04-26 13:53:50 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes
  73. .toscrape.com/page/7/>
  74. {'author': 'J.R.R. Tolkien',
  75. 'tags': ['bilbo', 'journey', 'lost', 'quest', 'travel', 'wander'],
  76. 'text': '“Not all those who wander are lost.”'}
  77. 2019-04-26 13:53:50 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes
  78. .toscrape.com/page/7/>
  79. {'author': 'J.K. Rowling',
  80. 'tags': ['live-death-love'],
  81. 'text': '“Do not pity the dead, Harry. Pity the living, and...'}
  82. 2019-04-26 13:53:50 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes
  83. .toscrape.com/page/7/>
  84. {'author': 'Ernest Hemingway',
  85. 'tags': ['good', 'writing'],
  86. 'text': '“There is nothing to writing. All you do is sit do...'}
  87. 2019-04-26 13:53:50 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes
  88. .toscrape.com/page/7/>
  89. {'author': 'Ralph Waldo Emerson',
  90. 'tags': ['life', 'regrets'],
  91. 'text': '“Finish each day and be done with it. You have don...'}
  92. 2019-04-26 13:53:50 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes
  93. .toscrape.com/page/7/>
  94. {'author': 'Mark Twain',
  95. 'tags': ['education'],
  96. 'text': '“I have never let my schooling interfere with my e...'}
  97. 2019-04-26 13:53:50 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes
  98. .toscrape.com/page/7/>
  99. {'author': 'Dr. Seuss',
  100. 'tags': ['troubles'],
  101. 'text': '“I have heard there are troubles of more than one...'}
  102. 2019-04-26 13:53:51 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes
  103. .toscrape.com/page/8/> (referer: http://quotes.toscrape.com/page/7/)
  104. 2019-04-26 13:53:51 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes
  105. .toscrape.com/page/8/>
  106. {'author': 'Alfred Tennyson',
  107. 'tags': ['friendship', 'love'],
  108. 'text': '“If I had a flower for every time I thought of you...'}
  109. 2019-04-26 13:53:51 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes
  110. .toscrape.com/page/8/>
  111. {'author': 'Charles Bukowski',
  112. 'tags': ['humor'],
  113. 'text': '“Some people never go crazy. What truly horrible l...'}
  114. 2019-04-26 13:53:51 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes
  115. .toscrape.com/page/8/>
  116. {'author': 'Terry Pratchett',
  117. 'tags': ['humor', 'open-mind', 'thinking'],
  118. 'text': '“The trouble with having an open mind, of course,...'}
  119. 2019-04-26 13:53:51 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes
  120. .toscrape.com/page/8/>
  121. {'author': 'Dr. Seuss',
  122. 'tags': ['humor', 'philosophy'],
  123. 'text': '“Think left and think right and think low and thin...'}
  124. 2019-04-26 13:53:51 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes
  125. .toscrape.com/page/8/>
  126. {'author': 'J.D. Salinger',
  127. 'tags': ['authors', 'books', 'literature', 'reading', 'writing'],
  128. 'text': '“What really knocks me out is a book that, when yo...'}
  129. 2019-04-26 13:53:51 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes
  130. .toscrape.com/page/8/>
  131. {'author': 'George Carlin',
  132. 'tags': ['humor', 'insanity', 'lies', 'lying', 'self-indulgence', 'truth'],
  133. 'text': '“The reason I talk to myself is because I’m the on...'}
  134. 2019-04-26 13:53:51 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes
  135. .toscrape.com/page/8/>
  136. {'author': 'John Lennon',
  137. 'tags': ['beatles',
  138. 'connection',
  139. 'dreamers',
  140. 'dreaming',
  141. 'dreams',
  142. 'hope',
  143. 'inspirational',
  144. 'peace'],
  145. 'text': "“You may say I'm a dreamer, but I'm not the only o..."}
  146. 2019-04-26 13:53:51 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes
  147. .toscrape.com/page/8/>
  148. {'author': 'W.C. Fields',
  149. 'tags': ['humor', 'sinister'],
  150. 'text': '“I am free of all prejudice. I hate everyone equal...'}
  151. 2019-04-26 13:53:51 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes
  152. .toscrape.com/page/8/>
  153. {'author': 'Ayn Rand',
  154. 'tags': [],
  155. 'text': "“The question isn't who is going to let me; it's w..."}
  156. 2019-04-26 13:53:51 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes
  157. .toscrape.com/page/8/>
  158. {'author': 'Mark Twain',
  159. 'tags': ['books', 'classic', 'reading'],
  160. 'text': "“′Classic′ - a book which people praise and don't..."}
  161. 2019-04-26 13:53:52 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes
  162. .toscrape.com/page/9/> (referer: http://quotes.toscrape.com/page/8/)
  163. 2019-04-26 13:53:52 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes
  164. .toscrape.com/page/9/>
  165. {'author': 'Albert Einstein',
  166. 'tags': ['mistakes'],
  167. 'text': '“Anyone who has never made a mistake has never tri...'}
  168. 2019-04-26 13:53:52 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes
  169. .toscrape.com/page/9/>
  170. {'author': 'Jane Austen',
  171. 'tags': ['humor', 'love', 'romantic', 'women'],
  172. 'text': "“A lady's imagination is very rapid; it jumps from..."}
  173. 2019-04-26 13:53:52 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes
  174. .toscrape.com/page/9/>
  175. {'author': 'J.K. Rowling',
  176. 'tags': ['integrity'],
  177. 'text': '“Remember, if the time should come when you have t...'}
  178. 2019-04-26 13:53:52 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes
  179. .toscrape.com/page/9/>
  180. {'author': 'Jane Austen',
  181. 'tags': ['books', 'library', 'reading'],
  182. 'text': '“I declare after all there is no enjoyment like re...'}
  183. 2019-04-26 13:53:52 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes
  184. .toscrape.com/page/9/>
  185. {'author': 'Jane Austen',
  186. 'tags': ['elizabeth-bennet', 'jane-austen'],
  187. 'text': '“There are few people whom I really love, and stil...'}
  188. 2019-04-26 13:53:52 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes
  189. .toscrape.com/page/9/>
  190. {'author': 'C.S. Lewis',
  191. 'tags': ['age', 'fairytales', 'growing-up'],
  192. 'text': '“Some day you will be old enough to start reading...'}
  193. 2019-04-26 13:53:52 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes
  194. .toscrape.com/page/9/>
  195. {'author': 'C.S. Lewis',
  196. 'tags': ['god'],
  197. 'text': '“We are not necessarily doubting that God will do...'}
  198. 2019-04-26 13:53:52 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes
  199. .toscrape.com/page/9/>
  200. {'author': 'Mark Twain',
  201. 'tags': ['death', 'life'],
  202. 'text': '“The fear of death follows from the fear of life....'}
  203. 2019-04-26 13:53:52 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes
  204. .toscrape.com/page/9/>
  205. {'author': 'Mark Twain',
  206. 'tags': ['misattributed-mark-twain', 'truth'],
  207. 'text': '“A lie can travel half way around the world while...'}
  208. 2019-04-26 13:53:52 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes
  209. .toscrape.com/page/9/>
  210. {'author': 'C.S. Lewis',
  211. 'tags': ['christianity', 'faith', 'religion', 'sun'],
  212. 'text': '“I believe in Christianity as I believe that the s...'}
  213. 2019-04-26 13:53:52 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes
  214. .toscrape.com/page/10/> (referer: http://quotes.toscrape.com/page/9/)
  215. 2019-04-26 13:53:52 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes
  216. .toscrape.com/page/10/>
  217. {'author': 'J.K. Rowling',
  218. 'tags': ['truth'],
  219. 'text': '“The truth." Dumbledore sighed. "It is a beautiful...'}
  220. 2019-04-26 13:53:52 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes
  221. .toscrape.com/page/10/>
  222. {'author': 'Jimi Hendrix',
  223. 'tags': ['death', 'life'],
  224. 'text': "“I'm the one that's got to die when it's time for..."}
  225. 2019-04-26 13:53:52 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes
  226. .toscrape.com/page/10/>
  227. {'author': 'J.M. Barrie',
  228. 'tags': ['adventure', 'love'],
  229. 'text': '“To die will be an awfully big adventure.”'}
  230. 2019-04-26 13:53:52 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes
  231. .toscrape.com/page/10/>
  232. {'author': 'E.E. Cummings',
  233. 'tags': ['courage'],
  234. 'text': '“It takes courage to grow up and become who you re...'}
  235. 2019-04-26 13:53:52 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes
  236. .toscrape.com/page/10/>
  237. {'author': 'Khaled Hosseini',
  238. 'tags': ['life'],
  239. 'text': '“But better to get hurt by the truth than comforte...'}
  240. 2019-04-26 13:53:52 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes
  241. .toscrape.com/page/10/>
  242. {'author': 'Harper Lee',
  243. 'tags': ['better-life-empathy'],
  244. 'text': '“You never really understand a person until you co...'}
  245. 2019-04-26 13:53:52 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes
  246. .toscrape.com/page/10/>
  247. {'author': "Madeleine L'Engle",
  248. 'tags': ['books',
  249. 'children',
  250. 'difficult',
  251. 'grown-ups',
  252. 'write',
  253. 'writers',
  254. 'writing'],
  255. 'text': '“You have to write the book that wants to be writt...'}
  256. 2019-04-26 13:53:52 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes
  257. .toscrape.com/page/10/>
  258. {'author': 'Mark Twain',
  259. 'tags': ['truth'],
  260. 'text': '“Never tell the truth to people who are not worthy...'}
  261. 2019-04-26 13:53:52 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes
  262. .toscrape.com/page/10/>
  263. {'author': 'Dr. Seuss',
  264. 'tags': ['inspirational'],
  265. 'text': "“A person's a person, no matter how small.”"}
  266. 2019-04-26 13:53:52 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes
  267. .toscrape.com/page/10/>
  268. {'author': 'George R.R. Martin',
  269. 'tags': ['books', 'mind'],
  270. 'text': '“... a mind needs books as a sword needs a whetsto...'}
  271. 2019-04-26 13:53:52 [scrapy.dupefilters] DEBUG: Filtered duplicate request: <GET
  272. http://quotes.toscrape.com/page/10/> - no more duplicates will be shown (see DU
  273. PEFILTER_DEBUG to show all duplicates)
  274. 2019-04-26 13:53:52 [scrapy.core.engine] INFO: Closing spider (finished)
  275. 2019-04-26 13:53:52 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
  276. {'downloader/request_bytes': 2870,
  277. 'downloader/request_count': 11,
  278. 'downloader/request_method_count/GET': 11,
  279. 'downloader/response_bytes': 24812,
  280. 'downloader/response_count': 11,
  281. 'downloader/response_status_count/200': 10,
  282. 'downloader/response_status_count/404': 1,
  283. 'dupefilter/filtered': 1,
  284. 'finish_reason': 'finished',
  285. 'finish_time': datetime.datetime(2019, 4, 26, 5, 53, 52, 939800),
  286. 'item_scraped_count': 100,
  287. 'log_count/DEBUG': 112,
  288. 'log_count/INFO': 9,
  289. 'request_depth_max': 10,
  290. 'response_received_count': 11,
  291. 'robotstxt/request_count': 1,
  292. 'robotstxt/response_count': 1,
  293. 'robotstxt/response_status_count/404': 1,
  294. 'scheduler/dequeued': 10,
  295. 'scheduler/dequeued/memory': 10,
  296. 'scheduler/enqueued': 10,
  297. 'scheduler/enqueued/memory': 10,
  298. 'start_time': datetime.datetime(2019, 4, 26, 5, 53, 38, 767800)}
  299. 2019-04-26 13:53:52 [scrapy.core.engine] INFO: Spider closed (finished)
  300.  
  301. C:\Users\Administrator\Desktop\py\scrapytutorial>

[Explanation] of the output:

a) Scrapy first prints its version number and the name of the project being started.
b) Next it prints the settings in settings.py that have been overridden.
c) Then it prints the Middlewares and Pipelines in use. Middlewares are enabled by default and can be changed in settings.py.
d) Pipelines are empty by default and can likewise be configured in settings.py; they are covered below.
e) Then come the scraping results for each page: the spider parses and turns pages at the same time until all content has been scraped, and then stops.
f) Finally, Scrapy prints statistics for the whole crawl, such as bytes requested, request count, response count, and the finish reason.

The whole Scrapy program ran successfully. With very little code we crawled an entire site's content, which is far more concise than writing the crawler piece by piece.

8.1 Saving to a file (JSON, CSV, XML, pickle, marshal)

[Note] After running Scrapy we only saw the output in the console. To save the results, use Scrapy's Feed Exports, which make it easy to write the scraped output to a file.

[Operation a] To save the results above as a JSON file, run the following command:

# cmd command; when it finishes, a quotes.json file appears in the directory with the results
scrapy crawl quotes -o quotes.json

[Operation b] You can also output one Item per line of JSON. The output suffix is jl, short for JSON Lines:

# cmd command: export each item as one line of JSON, spelling 1
scrapy crawl quotes -o quotes.jl

# -----or-------

# cmd command: export each item as one line of JSON, spelling 2
scrapy crawl quotes -o quotes.jsonlines

[Operation c] Other output formats:

1. Many other output formats are supported, such as csv, xml, pickle, and marshal, as well as remote destinations such as FTP and S3 (a settings-based alternative is sketched after the commands below).
2. You can also implement other outputs by defining a custom ItemExporter.

[Operation] The commands below produce csv, xml, pickle, and marshal output, and FTP remote output respectively:

# cmd commands: other output formats
scrapy crawl quotes -o quotes.csv
scrapy crawl quotes -o quotes.xml
scrapy crawl quotes -o quotes.pickle
scrapy crawl quotes -o quotes.marshal

# FTP output needs a valid user name, password, host, and output path, otherwise it fails.
# Other file formats over FTP work the same way as in the block above.
scrapy crawl quotes -o ftp://user:pass@ftp.example.com/path/to/quotes.csv
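
As an alternative to the -o flag, newer Scrapy versions (2.1 and later) also let exports be configured in settings.py through the FEEDS setting; a minimal sketch, assuming such a version is installed:

# settings.py (sketch): write every crawl's items to quotes.json as JSON
FEEDS = {
    'quotes.json': {
        'format': 'json',
    },
}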

8.2 Other, more complex output (using pipelines.py to write to a database)

[Introduction] With the Feed Exports Scrapy provides, we can easily write scraped results to a file, which is enough for small projects. For more complex output, such as writing to a database, use an Item Pipeline.

[Goal] For more involved operations, such as saving the results to a MongoDB database or filtering out certain Items, we can define an Item Pipeline.

[How it works] An Item Pipeline is the project's pipeline. When an Item is generated, it is automatically sent to the Item Pipeline for processing. Item Pipelines are commonly used to:
§ Clean HTML data.
§ Validate scraped data and check the scraped fields.
§ Find and drop duplicate content (see the sketch after the method description below).
§ Save scraped results to a database.

[Method to implement] It is simple: define a class and implement a process_item() method. Once the Item Pipeline is enabled, this method is called automatically:

process_item() must return a dict or Item object containing the data, or raise a DropItem exception.

process_item() takes two parameters. The first is item: every Item generated by the Spider is passed in as this argument.

The second is spider, the Spider instance itself.
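
As an illustration of the duplicate-filtering use case listed above (this class is not part of the tutorial project), here is a minimal sketch of a pipeline that drops repeated quotes by remembering the text values it has already seen:

from scrapy.exceptions import DropItem

class DuplicatesPipeline(object):
    def __init__(self):
        self.seen_texts = set()

    def process_item(self, item, spider):
        # Drop any item whose text has been seen before; otherwise pass it on.
        if item['text'] in self.seen_texts:
            raise DropItem('Duplicate quote found: %s' % item['text'])
        self.seen_texts.add(item['text'])
        return item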

[Hands-on] Next we implement an Item Pipeline that trims Items whose text is longer than 50 characters, and save the results to MongoDB.

[Step 1] Modify the project's [ pipelines.py ] file. The content generated earlier by the command line can be deleted; add a TextPipeline class as follows:

import pymongo
from scrapy.exceptions import DropItem

# newly written part
class TextPipeline(object):
    def __init__(self):
        self.limit = 50

    def process_item(self, item, spider):
        if item['text']:
            if len(item['text']) > self.limit:
                item['text'] = item['text'][0:self.limit].rstrip() + '...'
            return item
        else:
            raise DropItem('Missing Text')

[Code explanation]: The constructor sets the length limit to 50, and process_item() takes item and spider as parameters. The method first checks whether the item's text field exists; if it does not, it raises a DropItem exception. If it does exist, it checks whether the length exceeds 50; if so, it truncates the text, appends an ellipsis, and returns the item (a quick check follows).
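
To see the truncation behavior without running a crawl, the pipeline can be called directly; a small, hypothetical check in which a plain dict stands in for an Item (both support ['text'] access):

pipeline = TextPipeline()
item = {'text': 'x' * 60, 'author': 'someone', 'tags': []}
result = pipeline.process_item(item, spider=None)
print(result['text'])       # the first 50 characters followed by '...'
print(len(result['text']))  # 53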

[Step 2] Next, store the processed items in MongoDB by defining another Pipeline. In the same pipelines.py, implement another class, MongoPipeline, as follows:

class MongoPipeline(object):
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DB')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def process_item(self, item, spider):
        name = item.__class__.__name__
        # insert_one() replaces the older insert(), which newer PyMongo versions no longer provide
        self.db[name].insert_one(dict(item))
        return item

    def close_spider(self, spider):
        self.client.close()

[Step 3] The MongoPipeline class implements several other methods defined by the Item Pipeline API:

§ from_crawler: a class method, marked with @classmethod, that works as a form of dependency injection. Its parameter is crawler, through which every entry of the global configuration can be read. In the global settings.py we define MONGO_URI and MONGO_DB to give the address and database name MongoDB needs for the connection; after reading the configuration, the method returns an instance of the class. So this method exists mainly to pull configuration out of settings.py.
§ open_spider: called when the Spider is opened. Here it performs some initialization.
§ close_spider: called when the Spider is closed. Here it closes the database connection.

The most important method, process_item(), performs the database insert. With the TextPipeline and MongoPipeline classes defined, we need to enable them in [ settings.py ] and also define the MongoDB connection information. Add the following:

ITEM_PIPELINES = {
    'tutorial.pipelines.TextPipeline': 300,
    'tutorial.pipelines.MongoPipeline': 400,
}
MONGO_URI = 'localhost'
MONGO_DB = 'tutorial'

[Code explanation] ITEM_PIPELINES is a dict whose keys are the Pipeline class paths and whose values are the calling priorities: the smaller the number, the earlier the corresponding Pipeline is called.

[Step 4] Run the crawl again with the following command:

# re-run the crawl
scrapy crawl quotes

[Code explanation] When the crawl finishes, MongoDB contains a tutorial database with a QuoteItem collection, as shown in the figure:

[Note] Long text values have been truncated with a trailing ellipsis, short ones are unchanged, and the author and tags fields have been stored as well. A quick way to verify is shown in the sketch below.
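
A minimal sketch for checking the stored data with PyMongo, assuming MongoDB is running locally and using the database name from this section ('tutorial'); the collection name is the Item class name, so 'QuoteItem' here (or 'TutorialItem' if you use the full example code at the end):

import pymongo

client = pymongo.MongoClient('localhost', 27017)
db = client['tutorial']
print(db['QuoteItem'].count_documents({}))   # how many items were stored
for doc in db['QuoteItem'].find().limit(3):  # peek at a few documents
    print(doc['author'], '-', doc['text'])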

[Conclusion] This completes the simplest use of Scrapy to crawl the quotes site; more complex features require further study.

Full example code

quote.py

# -*- coding: utf-8 -*-
import scrapy
from tutorial.items import TutorialItem

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        quotes = response.css(".quote")
        for quote in quotes:
            item = TutorialItem()
            item['text'] = quote.css(".text::text").extract_first()
            item['author'] = quote.css('.author::text').extract_first()

            # tags = quote.css('.tags .tag::text').extract()
            # or from <meta class="keywords" itemprop="keywords" content="change,deep-thoughts,thinking,world">
            item['tags'] = quote.css('.keywords::attr(content)').extract_first()
            yield item
            # print(text, author, tags)
        # next page
        next = response.css('.next > a::attr(href)').extract_first()
        # urljoin() joins the current URL (http://quotes.toscrape.com/) with the value of next to form the full URL
        url = response.urljoin(next)
        yield scrapy.Request(url=url, callback=self.parse)

settings.py

# -*- coding: utf-8 -*-

# Scrapy settings for tutorial project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://doc.scrapy.org/en/latest/topics/settings.html
# https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
# https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'tutorial'

SPIDER_MODULES = ['tutorial.spiders']
NEWSPIDER_MODULE = 'tutorial.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'tutorial (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'tutorial.middlewares.TutorialSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'tutorial.middlewares.TutorialDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#    'tutorial.pipelines.TutorialPipeline': 300,
#}
ITEM_PIPELINES = {
    'tutorial.pipelines.TutorialPipeline': 300,
    'tutorial.pipelines.MongoPipeline': 400,
}
MONGO_URI = 'localhost'
MONGO_DB = 'tutorial2'

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import pymongo
from scrapy.exceptions import DropItem

class TutorialPipeline(object):
    def __init__(self):
        self.limit = 50

    def process_item(self, item, spider):
        if item['text']:
            if len(item['text']) > self.limit:
                item['text'] = item['text'][0:self.limit].rstrip() + "..."
            return item
        else:
            raise DropItem("text was missing!")

class MongoPipeline(object):
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DB')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def process_item(self, item, spider):
        name = item.__class__.__name__
        # insert_one() replaces the older insert(), which newer PyMongo versions no longer provide
        self.db[name].insert_one(dict(item))
        return item

    def close_spider(self, spider):
        self.client.close()

items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy

class TutorialItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    text = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()
    # pass

main.py

# -*- coding: utf-8 -*-
__author__ = 'pasaulis'
# Run the command line from inside the program to make debugging possible:
# e.g. set a breakpoint in jobbole.py and the run will stop there.

from scrapy.cmdline import execute
import sys, os

# Get the current directory (e.g. E:\a\scrapy_test\Aticle) so the cmd command below
# does not have to locate the directory itself
sys.path.append(os.path.dirname(os.path.abspath(__file__)))
# Check that the directory was picked up correctly
# print(os.path.dirname(os.path.abspath(__file__)))

# Run the spider via the command "scrapy crawl quotes"
execute(["scrapy", "crawl", "quotes"])
