<Scrapy crawler> Scraping quotes.toscrape.com

1. Create the Scrapy project

In a command prompt, enter:

scrapy startproject quote

cd quote

2. Write items.py (this works like a template: the data to be scraped is defined here)

import scrapy


class QuoteItem(scrapy.Item):
    # define the fields for your item here like:
    text = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()

3. Create the spider file

In a command prompt, enter:

scrapy genspider myspider quotes.toscrape.com

4. Write myspider.py (receive the response and process the data)

# -*- coding: utf-8 -*-
import scrapy

from quote.items import QuoteItem


class MyspiderSpider(scrapy.Spider):
    name = 'myspider'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # Each quote on the page sits in its own <div class="quote">
        for each in response.xpath('//div[@class="quote"]'):
            item = QuoteItem()
            item['text'] = each.xpath('./span/text()').extract_first()
            item['author'] = each.xpath('.//small/text()').extract_first()
            tags = each.xpath('.//a[@class="tag"]/text()').extract()
            # A list cannot be stored in a MySQL column directly, so join it into a str
            item['tags'] = '/'.join(tags)
            yield item

        # The last page has no "next" link, so only follow it when it exists
        next_page = response.xpath('//li[@class="next"]/a/@href').extract_first()
        if next_page is not None:
            url = response.urljoin(next_page)
            yield scrapy.Request(url=url, callback=self.parse)
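The pagination step can be sketched on its own with the standard library. This is only an illustration, assuming that the "next" link on this site is a relative href such as /page/2/ and that resolving it against the current page URL behaves like urllib.parse.urljoin; the helper name next_page_url is hypothetical and not part of Scrapy.

```python
from urllib.parse import urljoin


def next_page_url(current_url, next_href):
    """Return the absolute URL of the next page, or None when there is none.

    next_href is the href of the "next" link, or None when the page
    has no such link (for example, on the last page).
    """
    if not next_href:
        return None
    return urljoin(current_url, next_href)


# A relative href is resolved against the page it was found on
print(next_page_url('http://quotes.toscrape.com/', '/page/2/'))
# No "next" link means there is nothing left to request
print(next_page_url('http://quotes.toscrape.com/page/10/', None))
```

Returning None for a missing link gives the caller a simple condition for stopping the crawl instead of an IndexError from indexing an empty list.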

5. Write pipelines.py (store the data)

Storing to MySQL

import pymysql.cursors


class QuotePipeline(object):
    def __init__(self):
        # Connect to the local "quotes" database
        self.connect = pymysql.connect(
            host='localhost',
            user='root',
            password='',
            database='quotes',
            charset='utf8',
        )
        self.cursor = self.connect.cursor()

    def process_item(self, item, spider):
        item = dict(item)
        # The values are passed separately so the driver escapes them
        sql = 'insert into quote(text,author,tags) values(%s,%s,%s)'
        self.cursor.execute(sql, (item['text'], item['author'], item['tags']))
        self.connect.commit()
        return item

    def close_spider(self, spider):
        # Release the cursor and the connection when the spider finishes
        self.cursor.close()
        self.connect.close()
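The pipeline above inserts into a table named quote, which has to exist before the crawl starts. The original does not show the schema, so the following is only a sketch: the column names come from the INSERT statement, while the id column, the column types and lengths, and the helper name ensure_quote_table are assumptions.

```python
# Hypothetical schema: only the column names text, author and tags
# come from the INSERT statement; types and lengths are assumptions.
CREATE_QUOTE_TABLE = """
CREATE TABLE IF NOT EXISTS quote (
    id INT AUTO_INCREMENT PRIMARY KEY,
    text TEXT NOT NULL,
    author VARCHAR(255) NOT NULL,
    tags VARCHAR(255)
) DEFAULT CHARSET = utf8
"""


def ensure_quote_table(connection):
    """Create the quote table using a DB-API connection such as pymysql's."""
    cursor = connection.cursor()
    try:
        cursor.execute(CREATE_QUOTE_TABLE)
        connection.commit()
    finally:
        cursor.close()
```

One possible place to call it is right after pymysql.connect(...) in the pipeline's __init__, so the table is created on the first run.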

Improved version:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

import pymysql.cursors


class QuotePipeline(object):
    def __init__(self):
        self.connect = pymysql.connect(
            host='localhost',
            user='root',
            password='',
            database='quotes',
            charset='utf8',
        )
        self.cursor = self.connect.cursor()

    def process_item(self, item, spider):
        item = dict(item)
        table = 'quote'
        # Build the column list and one %s placeholder per field from the item itself
        keys = ','.join(item.keys())
        values = ','.join(['%s'] * len(item))
        sql = 'insert into {table}({keys}) values({values})'.format(table=table, keys=keys, values=values)
        try:
            self.cursor.execute(sql, tuple(item.values()))
            self.connect.commit()
            print("Successful!")
        except pymysql.MySQLError as e:
            # Undo the failed insert and keep the crawl running
            print("Failed!", e)
            self.connect.rollback()
        return item

    def close_spider(self, spider):
        self.cursor.close()
        self.connect.close()
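The way the improved version derives the SQL statement from the item can be tried without a database. The sketch below isolates that step; the function name build_insert and the sample values are made up for illustration.

```python
def build_insert(table, item):
    """Build a parameterized INSERT statement and its values from a dict.

    Table and column names are interpolated as-is, so they must come from
    trusted code (here, the Item field names). The values themselves are
    left as %s placeholders for the database driver to escape.
    """
    keys = list(item.keys())
    columns = ','.join(keys)
    placeholders = ','.join(['%s'] * len(keys))
    sql = 'insert into {table}({columns}) values({placeholders})'.format(
        table=table, columns=columns, placeholders=placeholders)
    params = tuple(item[key] for key in keys)
    return sql, params


sql, params = build_insert('quote', {'text': 'Hello', 'author': 'Someone', 'tags': 'a/b'})
print(sql)
print(params)
```

Because the statement follows the item's keys, adding a field to QuoteItem only requires a matching column in the table, not a change to the pipeline.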

Storing to MongoDB

1. Set two properties in settings.py

MONGO_URI = 'localhost'
MONGO_DB = 'study'

# Enable only one pipeline at a time
ITEM_PIPELINES = {
    # 'quote.pipelines.QuotePipeline': 300,
    'quote.pipelines.MongoPipeline': 300,
}

2. pipelines.py

import pymongo


class MongoPipeline(object):
    # Collection name
    collection = 'student'

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # Read the connection details from settings.py
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DB'),
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # Insert into MongoDB (insert_one replaces the deprecated insert)
        self.db[self.collection].insert_one(dict(item))
        return item
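The from_crawler classmethod is what connects the two values in settings.py to the pipeline's constructor. The sketch below imitates that wiring without Scrapy or MongoDB; StubSettings, StubCrawler and ConfiguredPipeline are hypothetical stand-ins, not Scrapy classes.

```python
class StubSettings(dict):
    """Hypothetical stand-in for crawler.settings; only .get() is used."""


class StubCrawler:
    """Hypothetical stand-in for the crawler object passed to from_crawler."""

    def __init__(self, settings):
        self.settings = settings


class ConfiguredPipeline:
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # Same shape as MongoPipeline.from_crawler: settings in, instance out
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DB'),
        )


crawler = StubCrawler(StubSettings(MONGO_URI='localhost', MONGO_DB='study'))
pipeline = ConfiguredPipeline.from_crawler(crawler)
print(pipeline.mongo_uri, pipeline.mongo_db)
```

Keeping the connection details in settings.py rather than in the class body means the pipeline can point at another database without editing pipelines.py.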

6. Write settings.py (set headers, pipelines, etc.)

robots.txt protocol

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

headers

DEFAULT_REQUEST_HEADERS = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    # 'Accept-Language': 'en',
}

pipelines

ITEM_PIPELINES = {
    'quote.pipelines.QuotePipeline': 300,
}

7. Run the crawler

In a command prompt, enter:

scrapy crawl myspider

