Scrapy Crawler Framework: Case Study 1

Sunshine Hotline Government Inquiry Platform (阳光热线问政平台)

http://wz.sun0769.com/index.php/question/questionType?type=4

Scrape each complaint post's number, URL, title, and body text.

items.py

import scrapy

class SunwzItem(scrapy.Item):
    # Post number
    number = scrapy.Field()
    # Post URL
    url = scrapy.Field()
    # Post title
    title = scrapy.Field()
    # Post body text
    content = scrapy.Field()

spiders/sunwz.py



# -*- coding: utf-8 -*-

from scrapy.spiders import CrawlSpider, Rule

from scrapy.linkextractors import LinkExtractor

from Sunwz.items import SunwzItem

class SunwzSpider(CrawlSpider):

    name = 'sunwz'

    num = 0

    allowed_domains = ['wz.sun0769.com']

    start_urls = ['http://wz.sun0769.com/index.php/question/questionType?type=4']

    rules = (
        # Follow pagination links
        Rule(LinkExtractor(allow=r'page')),
        # Follow links back to the listing page
        Rule(LinkExtractor(allow=r'/index\.php/question/questionType\?type=4$')),
        # Parse each post's detail page
        Rule(LinkExtractor(allow=r'/html/question/\d+/\d+\.shtml$'), follow=True, callback='parse_content'),
    )

    xpathDict = {

        'title': '//div[contains(@class, "pagecenter p3")]/div/div/div[contains(@class,"cleft")]/strong/text()',

        'content': '//div[contains(@class, "c1 text14_2")]/text()',

        'content_first': '//div[contains(@class, "contentext")]/text()'

    }

    def parse_content(self, response):
        item = SunwzItem()
        # Try the 'content_first' XPath first; fall back to 'content' if it matches nothing
        content = response.xpath(self.xpathDict['content_first']).extract_first()
        if content is None:
            content = response.xpath(self.xpathDict['content']).extract_first(default='')
        title = response.xpath(self.xpathDict['title']).extract_first(default='')
        # The number is the text after the last ':' in the title's last space-separated token
        number = title.split(' ')[-1].split(':')[-1]
        item['url'] = response.url
        item['number'] = number
        item['title'] = title
        item['content'] = content
        yield item
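The two pieces of pure logic in the spider, matching detail-page URLs and pulling the post number out of the title, can be tried without running a crawl. The following is a minimal sketch written for Python 3 using only the standard library. The regular expressions are copied from the spider's rules; the sample URL and title are hypothetical and only assume that a title ends with a token of the form "label:number".

```python
import re

# Patterns copied from the spider's rules.
LIST_PAGE = re.compile(r'/index\.php/question/questionType\?type=4$')
DETAIL_PAGE = re.compile(r'/html/question/\d+/\d+\.shtml$')


def is_detail_page(url):
    """Return True if the URL matches the detail-page pattern."""
    return DETAIL_PAGE.search(url) is not None


def parse_number(title):
    """Mirror the spider: take the last space-separated token,
    then keep the text after its last ':'."""
    return title.split(' ')[-1].split(':')[-1]


# Hypothetical URL and title, for illustration only.
print(is_detail_page('http://wz.sun0769.com/html/question/201801/123456.shtml'))
print(parse_number('Some complaint title No.:191166'))
```

If the title contains no ':' at all, parse_number returns the last token unchanged, so the result is worth validating before it is stored.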

pipelines.py

import json

import codecs

class JsonWriterPipeline(object):

    def __init__(self):

        self.file = codecs.open('sunwz.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):

        line = json.dumps(dict(item), ensure_ascii=False) + "\n"

        self.file.write(line)

        return item

    def close_spider(self, spider):

        self.file.close()
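The pipeline writes one JSON object per line. As a minimal sketch of that format, independent of Scrapy and written for Python 3, the snippet below serializes a record the same way process_item does and then reads the lines back. The record is hypothetical, and an in-memory buffer stands in for sunwz.json.

```python
import io
import json


def to_json_line(record):
    """Serialize one record as a single JSON line, keeping non-ASCII text readable."""
    return json.dumps(record, ensure_ascii=False) + "\n"


# Hypothetical record, for illustration only.
record = {"number": "191166", "title": "投诉标题", "url": "http://wz.sun0769.com/"}

# An in-memory buffer stands in for the output file.
buffer = io.StringIO()
buffer.write(to_json_line(record))
buffer.write(to_json_line(record))

# Reading back: one json.loads call per line.
rows = [json.loads(line) for line in buffer.getvalue().splitlines()]
print(len(rows), rows[0]["title"])
```

With ensure_ascii=False the Chinese text is written as-is instead of as \uXXXX escapes, which is why the pipeline opens the file with an explicit utf-8 encoding.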

settings.py

ITEM_PIPELINES = {

    'Sunwz.pipelines.JsonWriterPipeline': 300,

}
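The number assigned to each pipeline sets its position in the chain: Scrapy passes items through pipelines from the lowest value to the highest, and values are conventionally chosen in the 0-1000 range. The sketch below only illustrates that ordering in plain Python. 'Sunwz.pipelines.CleanTextPipeline' is a hypothetical second pipeline that does not exist in this project, and execution_order is a helper written for this example, not a Scrapy function.

```python
# Hypothetical settings: CleanTextPipeline is not part of the original project.
ITEM_PIPELINES = {
    'Sunwz.pipelines.JsonWriterPipeline': 300,
    'Sunwz.pipelines.CleanTextPipeline': 100,
}


def execution_order(pipelines):
    """Return pipeline paths sorted by ascending priority value."""
    return [path for path, _ in sorted(pipelines.items(), key=lambda kv: kv[1])]


print(execution_order(ITEM_PIPELINES))
```

Under these assumed values the cleaning step would run before the JSON writer, so the file would contain the cleaned items.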

Create a main.py file in the project root directory for debugging

from scrapy import cmdline

cmdline.execute('scrapy crawl sunwz'.split())

Run the program

py2 main.py
