Crawling with CrawlSpider in Scrapy

Target site to crawl:

http://www.chinanews.com/rss/rss_2.html

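After collecting the feed URLs from this index page, the spider follows each one to a second page and extracts the data there.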
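The link-selection step can be sketched outside Scrapy with the standard library. This is only an illustration, assuming (as far as I know) that `LinkExtractor` applies its `allow` pattern as a regex search against each absolute URL. The function name `filter_feed_links` and the sample URLs are hypothetical; the pattern is the one used in the spider's rule.

```python
import re

# Same pattern as the spider's Rule (allow=...), written as a raw string.
ALLOW = re.compile(r'http://www.chinanews.com/rss/.*?\.xml')


def filter_feed_links(urls):
    """Keep only the URLs that look like RSS feed (.xml) links."""
    return [u for u in urls if ALLOW.search(u)]


# Hypothetical links, for illustration only.
sample = [
    'http://www.chinanews.com/rss/scroll-news.xml',
    'http://www.chinanews.com/rss/rss_2.html',
    'http://www.chinanews.com/gn/index.shtml',
]
print(filter_feed_links(sample))
# → ['http://www.chinanews.com/rss/scroll-news.xml']
```

Only the first URL passes, so only feed pages would reach the callback; the index page itself and ordinary article pages are skipped.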

Inspecting the page:

Logic for scraping the data on that page:

The CrawlSpider class:

# -*- coding: utf-8 -*-
import re

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


def first_match(pattern, text):
    """Return the first regex capture found in text, or None if nothing matches."""
    found = re.findall(pattern, text, re.S)
    return found[0] if found else None


class NwSpider(CrawlSpider):
    name = 'nw'
    start_urls = ['http://www.chinanews.com/rss/rss_2.html']

    rules = (
        # Follow every RSS feed (.xml) linked from the index page and
        # hand the response to parse_item.
        Rule(LinkExtractor(allow=r'http://www.chinanews.com/rss/.*?\.xml'), callback='parse_item'),
    )

    def parse_item(self, response):
        # Each <item> element in the feed is one news entry; getall()
        # returns the elements as raw XML strings.
        for node in response.xpath('//item').getall():
            item = {
                'title': first_match(r'<title>(.*?)</title>', node),
                'link': first_match(r'<link>(.*?)</link>', node),
                'desc': first_match(r'<description>(.*?)</description>', node),
                'pub_date': first_match(r'<pubDate>(.*?)</pubDate>', node),
            }
            # Yield instead of print so pipelines and feed exports receive the item.
            yield item
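Pulling fields out of each `<item>` with regular expressions works for simple feeds but can trip over CDATA sections or missing tags. As an alternative, here is a minimal sketch that parses the same four fields with the standard library's `xml.etree.ElementTree` instead of regex. The feed fragment and the `parse_items` helper are made up for illustration and are not taken from the target site.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS fragment, for illustration only.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>First headline</title>
      <link>http://www.example.com/news/1.html</link>
      <description><![CDATA[Short <b>summary</b>]]></description>
      <pubDate>Mon, 01 Jan 2024 08:00:00 +0800</pubDate>
    </item>
    <item>
      <title>Second headline</title>
      <link>http://www.example.com/news/2.html</link>
    </item>
  </channel>
</rss>"""


def parse_items(xml_text):
    """Return one dict per <item>; missing fields come back as None."""
    root = ET.fromstring(xml_text)
    items = []
    for node in root.iter('item'):
        items.append({
            'title': node.findtext('title'),
            'link': node.findtext('link'),
            'desc': node.findtext('description'),
            'pub_date': node.findtext('pubDate'),
        })
    return items


for entry in parse_items(SAMPLE_RSS):
    print(entry)
```

Inside a Scrapy callback the same idea can be expressed with selectors, for example `node.xpath('./title/text()').get()` on each `<item>` selector, which also returns `None` when the tag is absent.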
