About the spider: inside this framework it is written the same way whether or not a database is used.

# -*- coding: utf-8 -*-
import scrapy
from copy import deepcopy

from yang_guan.items import YangGuanItem


class YgSpider(scrapy.Spider):
    name = 'yg'
    allowed_domains = ['huanqiu.com']
    start_urls = ['http://www.huanqiu.com/']

    def parse(self, response):
        # top-level page; the first callback must be parse, which receives the start_urls responses
        item = YangGuanItem()
        # item = {}
        class_news_urls_li = response.xpath(".//div[@class='navCon']/ul/li/a")
        print(class_news_urls_li)
        for class_news_url in class_news_urls_li:
            item["class_tittle"] = class_news_url.xpath("./text()").extract_first()
            print(item)
            new_url = class_news_url.xpath("./@href").extract_first()
            print(new_url)
            yield scrapy.Request(
                new_url,
                callback=self.second_class,
                meta={"item": deepcopy(item)},  # requests run concurrently, so pass a deep copy of item
            )

    def second_class(self, response):
        # second-level page
        item = response.meta["item"]
        print(response.url)
        second_urls = response.xpath(".//div/h2/em")
        for second_url in second_urls:
            second_news_url = second_url.xpath("./a/@href").extract_first()
            yield scrapy.Request(
                second_news_url,
                callback=self.parse_detail_analyze,
                meta={"item": deepcopy(item)},
            )

    def parse_detail_analyze(self, response):
        # third level: scrape the detail fields, e.g. http://china.huanqiu.com/leaders/
        item = response.meta["item"]
        li_list = response.xpath("//ul[@class='listPicBox']/li")
        for li in li_list:
            # item = YangGuanItem()
            item["title"] = li.xpath("./h3/a/text()").extract_first()
            item["img_url"] = li.xpath("./a/img/@src").extract_first()
            item["detail"] = li.xpath("./h5/text()").extract_first()
            yield item
        # pagination: request the next page and parse it with the same callback
        next_url = response.xpath(".//div[@class='pageBox']/div/a[last()]/@href").extract_first()
        yield scrapy.Request(
            next_url,
            callback=self.parse_detail_analyze,
            meta={"item": response.meta["item"]},
        )
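To run the spider you can use the Scrapy command line directly, or a small runner script like this minimal sketch (the file name start.py is my own choice, not part of the project scaffold, and it assumes it sits in the project root next to scrapy.cfg):

# start.py - minimal runner sketch
from scrapy.cmdline import execute

if __name__ == "__main__":
    # equivalent to running `scrapy crawl yg` in a terminal
    execute(["scrapy", "crawl", "yg"])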

About the pipeline setup in pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import pymongo


class YangGuanPipeline(object):
    def __init__(self):
        # open a MongoDB connection
        client = pymongo.MongoClient('127.0.0.1', 27017)
        # select the database: scrapy_huan_qiu
        db = client['scrapy_huan_qiu']
        # select the collection used for storage
        self.post = db['zong_huan_qiu']
        print("*" * 100)

    def process_item(self, item, spider):
        post_item = dict(item)
        # insert() is deprecated in pymongo; insert_one does the same for a single document
        self.post.insert_one(post_item)
        return item
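After a crawl you can check what actually landed in MongoDB. A minimal sanity-check sketch, assuming mongod is running locally on the default port, pymongo 3.7+ is installed, and the spider has already stored some documents:

import pymongo

client = pymongo.MongoClient('127.0.0.1', 27017)
collection = client['scrapy_huan_qiu']['zong_huan_qiu']

print(collection.count_documents({}))        # number of stored items
print(collection.find_one({}, {'_id': 0}))   # peek at one stored item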

Settings in settings.py

# -*- coding: utf-8 -*-

# Scrapy settings for yang_guan project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

# if I remember correctly, this is the IP proxy pool
PROXIES = [
    {'ip_port': '111.11.228.75:80', 'user_pass': ''},
    {'ip_port': '120.198.243.22:80', 'user_pass': ''},
    {'ip_port': '111.8.60.9:8123', 'user_pass': ''},
    {'ip_port': '101.71.27.120:80', 'user_pass': ''},
    {'ip_port': '122.96.59.104:80', 'user_pass': ''},
    {'ip_port': '122.224.249.122:8088', 'user_pass': ''},
]

BOT_NAME = 'yang_guan'

SPIDER_MODULES = ['yang_guan.spiders']
NEWSPIDER_MODULE = 'yang_guan.spiders'

# LOG_LEVEL = "WARNING"

# Crawl responsibly by identifying yourself (and your website) on the user-agent
# identify as a real browser to reduce the chance of being blocked
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36'

# Obey robots.txt rules
# do not obey robots.txt
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
# COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'yang_guan.middlewares.YangGuanSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'yang_guan.middlewares.YangGuanDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
# enable the pipeline; note that without the items definition nothing could be saved to the database
ITEM_PIPELINES = {
    'yang_guan.pipelines.YangGuanPipeline': 300,
}

# debug level and log file output
# LOG_FILE = "dg.log"
# LOG_LEVEL = "DEBUG"

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
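The PROXIES list above is only data; nothing in the default project reads it until a downloader middleware picks a proxy and sets request.meta['proxy']. A minimal sketch of such a middleware, assuming the class name RandomProxyMiddleware and this wiring are added by hand (they are not part of the original post):

# yang_guan/middlewares.py - sketch of a middleware that consumes PROXIES
import base64
import random


class RandomProxyMiddleware(object):
    def __init__(self, proxies):
        self.proxies = proxies

    @classmethod
    def from_crawler(cls, crawler):
        # read the PROXIES list defined in settings.py
        return cls(crawler.settings.get('PROXIES', []))

    def process_request(self, request, spider):
        proxy = random.choice(self.proxies)
        request.meta['proxy'] = 'http://' + proxy['ip_port']
        if proxy['user_pass']:
            # basic auth for proxies that require credentials
            creds = base64.b64encode(proxy['user_pass'].encode()).decode()
            request.headers['Proxy-Authorization'] = 'Basic ' + creds

It would then need to be enabled in DOWNLOADER_MIDDLEWARES, for example {'yang_guan.middlewares.RandomProxyMiddleware': 543}.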

About the item definition: this file is required; the spider yields the item, passing the data along like a dict.

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html
import scrapy


class YangGuanItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    class_tittle = scrapy.Field()
    img_url = scrapy.Field()
    detail = scrapy.Field()
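A quick illustration (my addition, not from the project code) of why the item can be passed around like a dict: scrapy.Item supports dict-style access, which is exactly what the field assignments in the spider and the dict(item) call in the pipeline rely on:

from yang_guan.items import YangGuanItem

item = YangGuanItem()
item["title"] = "some headline"   # same pattern the spider uses
item["class_tittle"] = "china"
print(dict(item))                 # the plain dict that the pipeline inserts into MongoDB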
