Scrapy Crawler Framework: Case Study 1

Sunshine Hotline Government Inquiry Platform (阳光热线问政平台)

http://wz.sun0769.com/index.php/question/questionType?type=4

Scrape each complaint post's number, URL, title, and body text.

items.py

import scrapy

class SunwzItem(scrapy.Item):

    number = scrapy.Field()

    url = scrapy.Field()

    title = scrapy.Field()

    content = scrapy.Field()

spiders/sunwz.py



# -*- coding: utf-8 -*-

from scrapy.spiders import CrawlSpider, Rule

from scrapy.linkextractors import LinkExtractor

from Sunwz.items import SunwzItem

class SunwzSpider(CrawlSpider):

    name = 'sunwz'

    num = 0

    allowed_domains = ['wz.sun0769.com']

    start_urls = ['http://wz.sun0769.com/index.php/question/questionType?type=4']

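    rules = (

        # Pagination links: followed, no callback

        Rule(LinkExtractor(allow=r'page')),

        # The listing page itself: followed, no callback

        Rule(LinkExtractor(allow=r'/index\.php/question/questionType\?type=4$')),

        # Individual complaint posts: parsed by parse_content

        Rule(LinkExtractor(allow=r'/html/question/\d+/\d+\.shtml$'), follow=True, callback='parse_content'),

    )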

    xpathDict = {

        'title': '//div[contains(@class, "pagecenter p3")]/div/div/div[contains(@class,"cleft")]/strong/text()',

        'content': '//div[contains(@class, "c1 text14_2")]/text()',

        'content_first': '//div[contains(@class, "contentext")]/text()'

    }

    def parse_content(self, response):

        item = SunwzItem()

        content = response.xpath(self.xpathDict['content_first']).extract()

        if len(content) == 0:

            content = response.xpath(self.xpathDict['content']).extract()[0]

        else:

            content = content[0]

        title = response.xpath(self.xpathDict['title']).extract()[0]

        # The post number is the last space-separated token of the title; keep what follows the final colon

        number = title.split(' ')[-1].split(':')[-1]

        url = response.url

        item['url'] = url

        item['number'] = number

        item['title'] = title

        item['content'] = content

        yield item
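To see what the rules and the number extraction do without running a crawl, the same logic can be sketched in plain Python. This is only an illustration: the sample URLs and the title format are assumptions rather than data taken from the site, and classify_url and extract_number are hypothetical helper names that are not part of Scrapy. The sketch also assumes that LinkExtractor applies each allow pattern as a regular-expression search against the URL.

```python
import re

# Patterns copied from the spider's rules.
PAGINATION = re.compile(r'page')
LIST_PAGE = re.compile(r'/index\.php/question/questionType\?type=4$')
DETAIL_PAGE = re.compile(r'/html/question/\d+/\d+\.shtml$')


def classify_url(url):
    """Hypothetical helper: report which kind of rule a URL would match."""
    if DETAIL_PAGE.search(url):
        return 'detail'      # would be handed to parse_content
    if LIST_PAGE.search(url) or PAGINATION.search(url):
        return 'listing'     # would only be followed
    return 'ignored'         # matches no rule


def extract_number(title):
    """Same steps as parse_content: last space-separated token, then the part after the last colon."""
    return title.split(' ')[-1].split(':')[-1]


if __name__ == '__main__':
    # Assumed sample inputs, for illustration only.
    print(classify_url('http://wz.sun0769.com/index.php/question/questionType?type=4&page=30'))
    print(classify_url('http://wz.sun0769.com/html/question/201706/336590.shtml'))
    print(extract_number('Complaint about road repairs No:191166'))
```

Note that if a title contains no colon in its last token, extract_number returns that whole token unchanged, so the spider's result depends on the title keeping this layout.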

pipelines.py

import json

import codecs

class JsonWriterPipeline(object):

    def __init__(self):

        self.file = codecs.open('sunwz.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):

        line = json.dumps(dict(item), ensure_ascii=False) + "\n"

        self.file.write(line)

        return item

    def close_spider(self, spider):

        self.file.close()
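The pipeline writes one JSON object per line, and ensure_ascii=False keeps Chinese text readable in the file instead of turning it into \uXXXX escapes. The sketch below reproduces that file format with the standard library only, so it can be tried without Scrapy. The function names and the sample record are hypothetical.

```python
import io
import json
import os
import tempfile


def write_json_lines(path, records):
    """Write one JSON object per line, leaving non-ASCII characters unescaped."""
    with io.open(path, 'w', encoding='utf-8') as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + u"\n")


def read_json_lines(path):
    """Read the file back by parsing each non-empty line separately."""
    with io.open(path, encoding='utf-8') as f:
        return [json.loads(line) for line in f if line.strip()]


def round_trip(records):
    """Write records to a temporary file and return what is read back."""
    path = os.path.join(tempfile.mkdtemp(), 'sample.json')
    write_json_lines(path, records)
    return read_json_lines(path)


if __name__ == '__main__':
    # Assumed sample record, for illustration only.
    print(round_trip([{'number': '191166', 'title': u'\u9633\u5149\u70ed\u7ebf'}]))
```

Because each line is its own JSON object, the resulting file has to be read line by line, as read_json_lines does, rather than passed whole to json.load.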

settings.py

ITEM_PIPELINES = {

    'Sunwz.pipelines.JsonWriterPipeline': 300,

}
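The value 300 is the pipeline's priority: Scrapy runs enabled pipelines in ascending order of this number, and values are conventionally chosen in the 0-1000 range. A minimal sketch of that ordering follows. DedupPipeline is a hypothetical second pipeline that does not exist in this project, and run_order is only an illustration, not how Scrapy is implemented.

```python
# Illustration only: DedupPipeline is hypothetical and not part of this project.
ITEM_PIPELINES = {
    'Sunwz.pipelines.JsonWriterPipeline': 300,
    'Sunwz.pipelines.DedupPipeline': 100,
}


def run_order(pipelines):
    """Return pipeline paths sorted by priority, lowest value first."""
    return [path for path, _ in sorted(pipelines.items(), key=lambda kv: kv[1])]


if __name__ == '__main__':
    print(run_order(ITEM_PIPELINES))
```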

Create a main.py file in the project root for debugging

from scrapy import cmdline

cmdline.execute('scrapy crawl sunwz'.split())
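cmdline.execute takes the command as a list of arguments, which is why the string is split first. A plain str.split() is enough here, but it breaks if an argument contains spaces. The sketch below uses shlex.split as one possible alternative. build_argv is a hypothetical helper and the output filename is made up for illustration.

```python
import shlex


def build_argv(command):
    """Split a command line into an argument list, respecting quotes."""
    return shlex.split(command)


if __name__ == '__main__':
    print(build_argv('scrapy crawl sunwz'))
    # A quoted output path stays a single argument.
    print(build_argv('scrapy crawl sunwz -o "sunwz items.json"'))
```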

Run the program

py2 main.py
