Scrapy in Practice: Scraping Newspaper Names and Addresses

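Goal: scrape the names and web addresses (URLs) of newspapers nationwide

Link: http://news.xinhuanet.com/zgjx/2007-09/13/content_6714741.htm

Purpose: practice scraping data with Scrapy

Having covered the basics of Scrapy, let's write about the simplest spider possible.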


1. Create the crawler project

$ cd ~/code/crawler/scrapyProject

$ scrapy startproject newSpapers

2. Create the spider

$ cd newSpapers/

$ scrapy genspider nationalNewspaper news.xinhuanet.com

3. Define the item fields to scrape

$ cat items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class NewspapersItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    name = scrapy.Field()  # newspaper name
    addr = scrapy.Field()  # newspaper URL

4. Write the spider

$ cat spiders/nationalNewspaper.py

# -*- coding: utf-8 -*-

import scrapy

from newSpapers.items import NewspapersItem


class NationalnewspaperSpider(scrapy.Spider):
    name = "nationalNewspaper"
    allowed_domains = ["news.xinhuanet.com"]
    start_urls = ['http://news.xinhuanet.com/zgjx/2007-09/13/content_6714741.htm']

    def parse(self, response):
        # tr[2]: the table row that lists the national newspapers.
        sub_country = response.xpath('//*[@id="Zoom"]/div/table/tbody/tr[2]')
        # tr[4]: the local-newspaper row; selected here but not used below.
        sub2_local = response.xpath('//*[@id="Zoom"]/div/table/tbody/tr[4]')
        # Each <a> inside the national row is one newspaper link.
        tags_a_country = sub_country.xpath('./td/table/tbody/tr/td/p/a')
        items = []
        for each in tags_a_country:
            item = NewspapersItem()
            # extract() returns a list of strings; the pipeline reads element [0].
            item['name'] = each.xpath('./strong/text()').extract()
            item['addr'] = each.xpath('./@href').extract()
            items.append(item)
        return items
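To see what the relative XPath steps in parse() select, the sketch below runs the same row-then-link walk over a small made-up table. It uses the standard library's xml.etree.ElementTree instead of Scrapy's selectors, so it runs offline and only approximates what response.xpath() does. The sample markup, the example.com URLs, and the extract_links helper are assumptions for illustration, not the real page.

```python
import xml.etree.ElementTree as ET

# Made-up markup that mimics the nesting the spider's XPath expects.
SAMPLE = """
<div id="Zoom"><div><table><tbody>
<tr><td>header</td></tr>
<tr><td><table><tbody><tr><td>
<p><a href="http://example.com/a"><strong>Paper A</strong></a></p>
<p><a href="http://example.com/b"><strong>Paper B</strong></a></p>
</td></tr></tbody></table></td></tr>
</tbody></table></div></div>
"""


def extract_links(markup):
    """Return a list of {'name', 'addr'} dicts for links in the second row."""
    root = ET.fromstring(markup.strip())  # root is the <div id="Zoom"> element
    # Positions are 1-based, as in XPath: tr[2] is the second <tr>.
    row = root.find("./div/table/tbody/tr[2]")
    results = []
    for a in row.findall("./td/table/tbody/tr/td/p/a"):
        strong = a.find("./strong")
        name = strong.text if strong is not None else None
        results.append({"name": name, "addr": a.get("href")})
    return results


links = extract_links(SAMPLE)
for link in links:
    print(link["name"], link["addr"])
```

Unlike extract(), this helper returns one string per field rather than a list, so no [0] indexing is needed afterwards.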

5. Configure which pipeline handles the scraped items

$ cat settings.py

……

#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

ITEM_PIPELINES = {'newSpapers.pipelines.NewspapersPipeline':100}
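The number after the pipeline path is its priority: Scrapy passes each item through the enabled pipelines in ascending order of that value, and the values are conventionally chosen from the 0 to 1000 range. A minimal sketch of that ordering follows. CleanupPipeline is a hypothetical second pipeline that does not exist in this project, and run_order is an illustration only, not Scrapy's internal code.

```python
# Same shape as the setting in settings.py, with one hypothetical extra entry.
ITEM_PIPELINES = {
    'newSpapers.pipelines.NewspapersPipeline': 100,
    'newSpapers.pipelines.CleanupPipeline': 50,  # hypothetical, for illustration
}


def run_order(pipelines):
    """List pipeline paths from lowest to highest priority value."""
    ordered = sorted(pipelines.items(), key=lambda pair: pair[1])
    return [path for path, _priority in ordered]


for path in run_order(ITEM_PIPELINES):
    print(path)
```

With only one pipeline enabled, as in this project, the exact value does not matter.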

6. Write the item pipeline

$ cat pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html


class NewspapersPipeline(object):

    def process_item(self, item, spider):
        filename = 'newspaper.txt'
        # Echo each item to the console for debugging.
        print('=================')
        print(item)
        print('================')
        # Append one "name<TAB>address" line per item, encoded as UTF-8.
        with open(filename, 'a', encoding='utf-8') as fp:
            fp.write(item['name'][0] + '\t' + item['addr'][0] + '\n')
        return item
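The pipeline above reopens newspaper.txt for every item and raises IndexError if a link has no name. One possible variant, sketched under assumptions, opens the file once through the open_spider and close_spider hooks that Scrapy calls on item pipelines, and falls back to an empty string for missing values. The class name NewspapersFilePipeline and the filename parameter are hypothetical and not part of the original project.

```python
class NewspapersFilePipeline:
    """Hypothetical variant of NewspapersPipeline that keeps one file handle open."""

    def __init__(self, filename="newspaper.txt"):
        self.filename = filename
        self.fp = None

    def open_spider(self, spider):
        # Scrapy calls this once when the spider starts.
        self.fp = open(self.filename, "a", encoding="utf-8")

    def close_spider(self, spider):
        # Scrapy calls this once when the spider finishes.
        if self.fp is not None:
            self.fp.close()
            self.fp = None

    def process_item(self, item, spider):
        # extract() produces lists; use "" when a list is empty or missing.
        name = (item.get("name") or [""])[0]
        addr = (item.get("addr") or [""])[0]
        self.fp.write(name + "\t" + addr + "\n")
        return item
```

To try it, the ITEM_PIPELINES entry in settings.py would have to point at this class instead of NewspapersPipeline.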

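7. Run the spider (scrapy crawl nationalNewspaper) and check the results

newspaper.txt is created in the directory the crawl was started from, which here was spiders/.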

$ cat spiders/newspaper.txt

人民日报	http://paper.people.com.cn/rmrb/html/2007-09/20/node_17.htm

海外版	http://paper.people.com.cn/rmrbhwb/html/2007-09/20/node_34.htm

光明日报	http://www.gmw.cn/01gmrb/2007-09/20/default.htm

经济日报	http://www.economicdaily.com.cn/no1/

解放军报	http://www.gmw.cn/01gmrb/2007-09/20/default.htm

中国日报	http://pub1.chinadaily.com.cn/cdpdf/cndy/
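Each output line is a newspaper name and a URL separated by a tab, so the file can be read back with the standard csv module. A minimal sketch, assuming that two-column layout; read_newspapers is a hypothetical helper, and the sample lines are copied from the output above.

```python
import csv
import io


def read_newspapers(lines):
    """Parse tab-separated 'name<TAB>url' lines into (name, url) tuples."""
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    # Skip blank or malformed rows that do not have exactly two columns.
    return [(row[0], row[1]) for row in reader if len(row) == 2]


# In-memory stand-in for newspaper.txt, using two lines from the output above.
sample = io.StringIO(
    "人民日报\thttp://paper.people.com.cn/rmrb/html/2007-09/20/node_17.htm\n"
    "经济日报\thttp://www.economicdaily.com.cn/no1/\n"
)
pairs = read_newspapers(sample)
for name, url in pairs:
    print(name, url)
```

For the real file, pass open('newspaper.txt', encoding='utf-8', newline='') in place of the StringIO object.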

