第五课主要内容有：

Scrapy框架结构，组件及工作方式
单页爬取-julyedu.com
拼URL爬取-博客园
循环下页方式爬取-toscrape.com
Scrapy项目相关命令-QQ新闻

1.Scrapy框架结构，组件及工作方式

2.单页爬取-julyedu.com

#by 寒小阳(hanxiaoyang.ml@gmail.com)---七月在线讲师

#Python2

import scrapy

class JulyeduSpider(scrapy.Spider):

    name = "julyedu"

    start_urls = [

        'https://www.julyedu.com/category/index',

    ]

    def parse(self, response):

        for julyedu_class in response.xpath('//div[@class="course_info_box"]'):

            print julyedu_class.xpath('a/h4/text()').extract_first()

            print julyedu_class.xpath('a/p[@class="course-info-tip"][1]/text()').extract_first()

            print julyedu_class.xpath('a/p[@class="course-info-tip"][2]/text()').extract_first()

            print response.urljoin(julyedu_class.xpath('a/img[1]/@src').extract_first())

            print "\n"

            yield {

                'title':julyedu_class.xpath('a/h4/text()').extract_first(),

                'desc': julyedu_class.xpath('a/p[@class="course-info-tip"][1]/text()').extract_first(),

                'time': julyedu_class.xpath('a/p[@class="course-info-tip"][2]/text()').extract_first(),

                'img_url': response.urljoin(julyedu_class.xpath('a/img[1]/@src').extract_first())

            }

3.拼URL爬取-博客园

#by 寒小阳(hanxiaoyang.ml@gmail.com)

import scrapy

class CnBlogSpider(scrapy.Spider):

    name = "cnblogs"

    allowed_domains = ["cnblogs.com"]

    start_urls = [

        'http://www.cnblogs.com/pick/#p%s' % p for p in xrange(1, 11)

        ]

    def parse(self, response):

        for article in response.xpath('//div[@class="post_item"]'):

            print article.xpath('div[@class="post_item_body"]/h3/a/text()').extract_first().strip()

            print response.urljoin(article.xpath('div[@class="post_item_body"]/h3/a/@href').extract_first()).strip()

            print article.xpath('div[@class="post_item_body"]/p/text()').extract_first().strip()

            print article.xpath('div[@class="post_item_body"]/div[@class="post_item_foot"]/a/text()').extract_first().strip()

            print response.urljoin(article.xpath('div[@class="post_item_body"]/div/a/@href').extract_first()).strip()

            print article.xpath('div[@class="post_item_body"]/div[@class="post_item_foot"]/span[@class="article_comment"]/a/text()').extract_first().strip()

            print article.xpath('div[@class="post_item_body"]/div[@class="post_item_foot"]/span[@class="article_view"]/a/text()').extract_first().strip()

            print ""

            yield {

                'title': article.xpath('div[@class="post_item_body"]/h3/a/text()').extract_first().strip(),

                'link': response.urljoin(article.xpath('div[@class="post_item_body"]/h3/a/@href').extract_first()).strip(),

                'summary': article.xpath('div[@class="post_item_body"]/p/text()').extract_first().strip(),

                'author': article.xpath('div[@class="post_item_body"]/div[@class="post_item_foot"]/a/text()').extract_first().strip(),

                'author_link': response.urljoin(article.xpath('div[@class="post_item_body"]/div/a/@href').extract_first()).strip(),

                'comment': article.xpath('div[@class="post_item_body"]/div[@class="post_item_foot"]/span[@class="article_comment"]/a/text()').extract_first().strip(),

                'view': article.xpath('div[@class="post_item_body"]/div[@class="post_item_foot"]/span[@class="article_view"]/a/text()').extract_first().strip(),

            }

4.找到‘下一页’标签进行爬取

import scrapy

class QuotesSpider(scrapy.Spider):

    name = "quotes"

    start_urls = [

        'http://quotes.toscrape.com/tag/humor/',

    ]

    def parse(self, response):

        for quote in response.xpath('//div[@class="quote"]'):

            yield {

                'text': quote.xpath('span[@class="text"]/text()').extract_first(),

                'author': quote.xpath('span/small[@class="author"]/text()').extract_first(),

            }

        next_page = response.xpath('//li[@class="next"]/@herf').extract_first()

        if next_page is not None:

            next_page = response.urljoin(next_page)

            yield scrapy.Request(next_page, callback=self.parse)

5.进入链接，按照链接进行爬取

#by 寒小阳(hanxiaoyang.ml@gmail.com)

import scrapy

class QQNewsSpider(scrapy.Spider):

    name = 'qqnews'

    start_urls = ['http://news.qq.com/society_index.shtml']

    def parse(self, response):

        for href in response.xpath('//*[@id="news"]/div/div/div/div/em/a/@href'):

            full_url = response.urljoin(href.extract())

            yield scrapy.Request(full_url, callback=self.parse_question)

    def parse_question(self, response):

        print response.xpath('//div[@class="qq_article"]/div/h1/text()').extract_first()

        print response.xpath('//span[@class="a_time"]/text()').extract_first()

        print response.xpath('//span[@class="a_catalog"]/a/text()').extract_first()

        print "\n".join(response.xpath('//div[@id="Cnt-Main-Article-QQ"]/p[@class="text"]/text()').extract())

        print ""

        yield {

            'title': response.xpath('//div[@class="qq_article"]/div/h1/text()').extract_first(),

            'content': "\n".join(response.xpath('//div[@id="Cnt-Main-Article-QQ"]/p[@class="text"]/text()').extract()),

            'time': response.xpath('//span[@class="a_time"]/text()').extract_first(),

            'cate': response.xpath('//span[@class="a_catalog"]/a/text()').extract_first(),

        }

七月在线爬虫班学习笔记（五）——scrapy spider的几种爬取方式的更多相关文章

七月在线爬虫班学习笔记（六）——scrapy爬虫整体示例
第六课主要内容: 爬豆瓣文本例程 douban 图片例程 douban_imgs 1.爬豆瓣文本例程 douban 目录结构 douban --douban --spiders --__init__. ...
七月在线爬虫班学习笔记（二）——Python基本语法及面向对象
第二课主要内容如下: 代码格式基本语法关键字循环判断函数容器面向对象文件读写多线程错误处理代码格式 syntax基本语法 a = 1234 print(a) a = 'abcd' ...
【学习笔记】Python 3.6模拟输入并爬取百度前10页密切相关链接
[学习笔记]Python 3.6模拟输入并爬取百度前10页密切相关链接问题描述通过模拟网页,实现百度搜索关键词,然后获得网页中链接的文本,与准备的文本进行比较,如果有相似之处则代表相关链接. me ...
Dynamic CRM 2013学习笔记（十）客户端几种查询数据方式比较
我们经常要在客户端进行数据查询,下面分别比较常用的几种查询方式:XMLHttpRequest, SDK.JQuery, SDK.Rest. XMLHttpRequest是最基本的调用方式,JQuery ...
(3)分布式下的爬虫Scrapy应该如何做-递归爬取方式，数据输出方式以及数据库链接
放假这段时间好好的思考了一下关于Scrapy的一些常用操作,主要解决了三个问题: 1.如何连续爬取 2.数据输出方式 3.数据库链接一,如何连续爬取: 思考:要达到连续爬取,逻辑上无非从以下的方向着 ...
scrapy爬虫框架学习笔记(一)
scrapy爬虫框架学习笔记(一) 1.安装scrapy pip install scrapy 2.新建工程: (1)打开命令行模式 (2)进入要新建工程的目录 (3)运行命令: scrapy sta ...
Scrapy:学习笔记(2)——Scrapy项目
Scrapy:学习笔记(2)——Scrapy项目 1.创建项目创建一个Scrapy项目,并将其命名为“demo” scrapy startproject demo cd demo 稍等片刻后,Scr ...
go微服务框架kratos学习笔记五(kratos 配置中心 paladin config sdk [断剑重铸之日，骑士归来之时])
目录 go微服务框架kratos学习笔记五(kratos 配置中心 paladin config sdk [断剑重铸之日,骑士归来之时]) 静态配置 flag注入在线热加载配置远程配置中心 go微 ...
C#可扩展编程之MEF学习笔记(五)：MEF高级进阶
好久没有写博客了,今天抽空继续写MEF系列的文章.有园友提出这种系列的文章要做个目录,看起来方便,所以就抽空做了一个,放到每篇文章的最后. 前面四篇讲了MEF的基础知识,学完了前四篇,MEF中比较常用 ...

随机推荐

Redhat终端中文乱码解决
文件中的中文以及命令反馈的中文能够正常显示,但是在终端中用ls等命令查看文件时会出现乱码. 我在i18n文件中加了下面两行内容(本来只有第一行),后来就能正常显示了.
python-文字转语音-pyttsx3
pyttsx3 python 文字转语音库,支持英文,中文,可以调节语速.语调等. 安装 pip install pyttsx3 示例 import pyttsx3 teacher = pyttsx3 ...
常量（constant）
在java语言中,主要用final来定义一个常量.常量一旦被初始化不能更改其值. 常量:大写字母和下划线:MAX_VALUE final double PI = 3.14; PI = 3.15;//编 ...
python小游戏
import time,random # 需要的数据和变量放在开头player_list = ['[狂血战士]','[森林箭手]','[光明骑士]','[独行剑客]','[格斗大师]','[枪弹专家] ...
2019 To do List
做好测试不是靠编程,而是靠的是严禁的作风,慎密的逻辑思维,适合的测试流程. 内心有些迷茫的时候,迷茫的是作为测试既然要学那么多编程,为什么不直接去干开发呢?学了编程,用不上,到底有什么用呢? 看了这句 ...
CentOS6.9切换root用户su root输入正确密码后一直提示Incorrect password，如何解决？
su是切换用户命令,su root时,输入正确的root命令,却提示Incorrect password,当前用户为普通用户,遇到此问题该如何解决呢? 如果设置了wheel组,使用su root命令是 ...
使用Python编的猜数字小游戏
import random secret = random.randint(1, 30) guess = 0 tries = 0 print("我叫丁丁,我有一个秘密数字!") p ...
请为CMyString类型编写构造函数、copy构造函数、析构函数和赋值运算符函数。
如下为类型CMyString的声明,请为该类型编写构造函数.copy构造函数.析构函数和赋值运算符函数. class CMyString { public: CMyString(const char* ...
1.1 Django起步
1.1 Django起步 1.1.1. Django简介 Django开发框架(简称Django)诞生的时间是2003年的金秋时节,美国有两位程序员Adrian Holovaty和Simon ...
conda环境复制
配置环境是一个很烦的事,有时候用到服务器需要一遍又一遍的配..太麻烦了,这时候就要用到conda,直接复制已有的环境.事半功倍. 第一种方法:地址复制首先找到要复制的环境的路径:conda info ...

七月在线爬虫班学习笔记（五）——scrapy spider的几种爬取方式

第五课主要内容有：

1.Scrapy框架结构，组件及工作方式

2.单页爬取-julyedu.com

3.拼URL爬取-博客园

4.找到‘下一页’标签进行爬取

5.进入链接，按照链接进行爬取

七月在线爬虫班学习笔记（五）——scrapy spider的几种爬取方式的更多相关文章

随机推荐

热门专题