19 03 13: Crawling the whole Huanqiu (环球网) site with the Scrapy framework (results stored in a MongoDB database)

The spider: inside this framework it is written exactly as it would be without a database

# -*- coding: utf-8 -*-
from copy import deepcopy

import scrapy

from yang_guan.items import YangGuanItem


class YgSpider(scrapy.Spider):
    name = 'yg'
    allowed_domains = ['huanqiu.com']
    start_urls = ['http://www.huanqiu.com/']

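    def parse(self, response):  # Home page. Scrapy calls parse by default with the responses for start_urls
        item = YangGuanItem()
        # item = {}
        class_news_urls_li = response.xpath(".//div[@class='navCon']/ul/li/a")
        print(class_news_urls_li)
        for class_news_url in class_news_urls_li:
            item["class_tittle"] = class_news_url.xpath("./text()").extract_first()
            print(item)
            new_url = class_news_url.xpath("./@href").extract_first()
            print(new_url)
            if not new_url:  # skip navigation entries without a link
                continue
            yield scrapy.Request(
                response.urljoin(new_url),  # also resolves relative links
                callback=self.second_class,
                # Requests are handled asynchronously, after this loop has moved on,
                # and item is reused on every iteration, so pass a deep copy.
                meta={"item": deepcopy(item)},
            )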

    def second_class(self, response):  # Second-level (category) page
        item = response.meta["item"]
        print(response.url)
        second_urls = response.xpath(".//div/h2/em")
        for second_url in second_urls:
            second_news_url = second_url.xpath("./a/@href").extract_first()
            if not second_news_url:  # skip entries without a link
                continue
            yield scrapy.Request(
                response.urljoin(second_news_url),
                callback=self.parse_detail_analyze,
                meta={"item": deepcopy(item)},
            )

    def parse_detail_analyze(self, response):  # Third level: scrape the listing details, e.g. http://china.huanqiu.com/leaders/
        item = response.meta["item"]
        li_list = response.xpath("//ul[@class='listPicBox']/li")
        for li in li_list:
            # item = YangGuanItem()
            entry = deepcopy(item)  # one independent item per list entry
            entry["title"] = li.xpath("./h3/a/text()").extract_first()
            entry["img_url"] = li.xpath("./a/img/@src").extract_first()
            entry["detail"] = li.xpath("./h5/text()").extract_first()
            yield entry
        # Pagination: take the last link in the page box and parse it with this same callback
        next_url = response.xpath(".//div[@class='pageBox']/div/a[last()]/@href").extract_first()
        if next_url:  # a page without a next link would otherwise raise in Request(None)
            yield scrapy.Request(
                response.urljoin(next_url),
                callback=self.parse_detail_analyze,
                meta={"item": item},
            )
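The spider passes a deep copy of the item in meta because the callbacks run later, after the loop that builds the requests has already moved on. The following is a minimal sketch of that effect in plain Python, without Scrapy. The `schedule` and `run_callbacks` helpers and the category names are made up for illustration; the list `pending` only stands in for Scrapy's request queue.

```python
from copy import deepcopy


def schedule(categories, copy_item):
    """Queue one pending 'request' per category, the way parse() yields them."""
    item = {}       # a single item object reused across the loop, as in parse()
    pending = []    # stand-in for Scrapy's scheduler queue (illustration only)
    for name in categories:
        item["class_tittle"] = name
        meta_item = deepcopy(item) if copy_item else item
        pending.append({"item": meta_item})
    return pending


def run_callbacks(pending):
    """Run the 'callbacks' only after the loop has finished mutating item."""
    return [meta["item"]["class_tittle"] for meta in pending]


categories = ["china", "world", "mil"]
shared = run_callbacks(schedule(categories, copy_item=False))
copied = run_callbacks(schedule(categories, copy_item=True))
print(shared)  # every callback sees the last category
print(copied)  # each callback sees its own category
```

Without the copy, all three queued entries point at the same object, so every callback reads the value written by the final loop iteration.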

The pipeline setup (pipelines.py)

# -*- coding: utf-8 -*-

# Define your item pipelines here

#

# Don't forget to add your pipeline to the ITEM_PIPELINES setting

# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

import pymongo


class YangGuanPipeline(object):
    def __init__(self):
        # Open the MongoDB connection
        client = pymongo.MongoClient('127.0.0.1', 27017)
        # Select the database 'scrapy_huan_qiu'
        db = client['scrapy_huan_qiu']
        # Select the collection to write to
        self.post = db['zong_huan_qiu']
        print("*"*100)

    def process_item(self, item, spider):
        postItem = dict(item)
        # insert() was removed in PyMongo 4; insert_one() is the current call
        self.post.insert_one(postItem)
        return item
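The pipeline above opens its MongoDB connection in `__init__` and never closes it. One possible alternative, sketched here under assumptions, ties the connection to the spider's lifetime through Scrapy's `open_spider` and `close_spider` pipeline hooks. The class name `MongoPipelineSketch` and its constructor parameters are hypothetical; the host, port, database and collection names are the ones used above.

```python
class MongoPipelineSketch:
    """Hypothetical variant of YangGuanPipeline, not the original code."""

    def __init__(self, host="127.0.0.1", port=27017,
                 db_name="scrapy_huan_qiu", coll_name="zong_huan_qiu"):
        self.host, self.port = host, port
        self.db_name, self.coll_name = db_name, coll_name
        self.client = None
        self.collection = None

    def open_spider(self, spider):
        # Imported lazily so the class can be loaded where pymongo is absent.
        import pymongo
        self.client = pymongo.MongoClient(self.host, self.port)
        self.collection = self.client[self.db_name][self.coll_name]

    def close_spider(self, spider):
        # Release the connection when the crawl ends.
        if self.client is not None:
            self.client.close()

    def process_item(self, item, spider):
        # Store a plain dict copy and pass the item on to later pipelines.
        self.collection.insert_one(dict(item))
        return item
```

To try it, the class would be registered in ITEM_PIPELINES in place of `YangGuanPipeline`; that registration path is likewise an assumption about the project layout.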

The settings (settings.py)

# -*- coding: utf-8 -*-

# Scrapy settings for yang_guan project

#

# For simplicity, this file contains only settings considered important or

# commonly used. You can find more settings consulting the documentation:

#

#     https://doc.scrapy.org/en/latest/topics/settings.html

#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html

#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

# IP proxies, as far as I remember

PROXIES = [

    {'ip_port': '111.11.228.75:80', 'user_pass': ''},

    {'ip_port': '120.198.243.22:80', 'user_pass': ''},

    {'ip_port': '111.8.60.9:8123', 'user_pass': ''},

    {'ip_port': '101.71.27.120:80', 'user_pass': ''},

    {'ip_port': '122.96.59.104:80', 'user_pass': ''},

    {'ip_port': '122.224.249.122:8088', 'user_pass': ''},]

BOT_NAME = 'yang_guan'

SPIDER_MODULES = ['yang_guan.spiders']

NEWSPIDER_MODULE = 'yang_guan.spiders'

# LOG_LEVEL = "WARNING"

# Crawl responsibly by identifying yourself (and your website) on the user-agent

# Browser user agent, to avoid being blocked by anti-crawler checks

USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36'

# Obey robots.txt rules

# Do not obey the robots.txt protocol

ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)

#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)

# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay

# See also autothrottle settings and docs

#DOWNLOAD_DELAY = 3

# The download delay setting will honor only one of:

#CONCURRENT_REQUESTS_PER_DOMAIN = 16

#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)

# COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)

#TELNETCONSOLE_ENABLED = False

# Override the default request headers:

#DEFAULT_REQUEST_HEADERS = {

#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',

#   'Accept-Language': 'en',

#}

# Enable or disable spider middlewares

# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html

#SPIDER_MIDDLEWARES = {

#    'yang_guan.middlewares.YangGuanSpiderMiddleware': 543,

#}

# Enable or disable downloader middlewares

# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html

#DOWNLOADER_MIDDLEWARES = {

#    'yang_guan.middlewares.YangGuanDownloaderMiddleware': 543,

#}

# Enable or disable extensions

# See https://doc.scrapy.org/en/latest/topics/extensions.html

#EXTENSIONS = {

#    'scrapy.extensions.telnet.TelnetConsole': None,

#}

# Configure item pipelines

# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html

# Enable the pipeline. If the items are not defined, nothing can be saved to the database either

ITEM_PIPELINES = {

   'yang_guan.pipelines.YangGuanPipeline': 300,

}

# Log level and writing the log to a file

# LOG_FILE = "dg.log"

# LOG_LEVEL = "DEBUG"

# Enable and configure the AutoThrottle extension (disabled by default)

# See https://doc.scrapy.org/en/latest/topics/autothrottle.html

#AUTOTHROTTLE_ENABLED = True

# The initial download delay

#AUTOTHROTTLE_START_DELAY = 5

# The maximum download delay to be set in case of high latencies

#AUTOTHROTTLE_MAX_DELAY = 60

# The average number of requests Scrapy should be sending in parallel to

# each remote server

#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0

# Enable showing throttling stats for every response received:

#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)

# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings

#HTTPCACHE_ENABLED = True

#HTTPCACHE_EXPIRATION_SECS = 0

#HTTPCACHE_DIR = 'httpcache'

#HTTPCACHE_IGNORE_HTTP_CODES = []

#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
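The settings above define a PROXIES list, but DOWNLOADER_MIDDLEWARES is left commented out, so nothing shown here actually reads that list. A minimal sketch of a downloader middleware that could use it follows. The class name `RandomProxyMiddleware` is hypothetical, and the sketch assumes unauthenticated HTTP proxies (the `user_pass` fields above are all empty). It relies on Scrapy's built-in proxy handling, which reads `request.meta["proxy"]`.

```python
import random

# First two entries copied from the PROXIES setting above.
PROXIES = [
    {"ip_port": "111.11.228.75:80", "user_pass": ""},
    {"ip_port": "120.198.243.22:80", "user_pass": ""},
]


class RandomProxyMiddleware:
    """Hypothetical downloader middleware: one random proxy per request."""

    def __init__(self, proxies=PROXIES, rng=random):
        self.proxies = list(proxies)
        self.rng = rng

    def process_request(self, request, spider):
        proxy = self.rng.choice(self.proxies)
        # Scrapy's HttpProxyMiddleware picks the proxy up from request.meta.
        request.meta["proxy"] = "http://" + proxy["ip_port"]
        return None  # returning None lets Scrapy continue with the request
```

To take effect, a class like this would have to live in the project's middlewares module and be listed in DOWNLOADER_MIDDLEWARES; proxies that need credentials would additionally require a Proxy-Authorization header, which this sketch does not set.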

The item definition (items.py): this is required, because the spider hands the dict-like item on with yield

# -*- coding: utf-8 -*-

# Define here the models for your scraped items

#

# See documentation in:

# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy

class YangGuanItem(scrapy.Item):

    # define the fields for your item here like:

    # name = scrapy.Field()

    title = scrapy.Field()

    class_tittle = scrapy.Field()

    img_url = scrapy.Field()

    detail = scrapy.Field()
