Python library: Scrapy (a deep rabbit hole, notes unfinished)

Scrapy  A fast, high-level screen-scraping and web-crawling framework

http://scrapy.org/  Official site

https://docs.scrapy.org/en/latest/  Scrapy 1.4 documentation

http://scrapy-chs.readthedocs.io/zh_CN/latest/index.html  Scrapy 0.24 documentation in Chinese

https://www.youtube.com/watch?v=cEBBG_5309c  Scrapy crawler framework tutorial 02: basic use of a Scrapy project  2017-12-18

https://www.youtube.com/watch?v=0Uug1fDa8nw  Morvan Python (莫烦python)  2018-1-27 (stopped at 5:48 last time)

https://www.youtube.com/watch?v=Wa2K7sB7BZE  Morvan: asynchronous loading with Asyncio

https://github.com/MorvanZhou/easy-scraping-tutorial/tree/master/source_code  Morvan's code

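Installation:  installing Scrapy on Win7:  2017-10-19

Environment: Win7, Python 3.6.0, PyCharm 4.5. Python directory: c:/python3/

Scrapy depends on quite a few libraries; at minimum it needs Twisted 14.0, lxml 3.4, and pyOpenSSL 0.14.

Reference article: http://www.cnblogs.com/liuliliuli2017/p/6746440.html  Installing the Scrapy crawler framework on Python 3, with common errors

I ran into trouble installing Twisted. The steps I took:

1. Download the .whl file from http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted (important: this site hosts a huge number of .whl files!)

My machine runs 64-bit Win7, so in theory Twisted-17.9.0-cp36-cp36m-win_amd64.whl should have been the right file, but pip refused to install it. On a blind guess I downloaded Twisted-17.9.0-cp36-cp36m-win32.whl instead. (A wheel has to match the bitness of the Python interpreter rather than of Windows, so the Python installed here was most likely a 32-bit build.) I saved it as C:\Python3\Scripts\Twisted-17.9.0-cp36-cp36m-win32.whl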

python pip3.exe install Twisted-17.9.0-cp36-cp36m-win32.whl

python pip.exe install scrapy
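
A likely explanation for the rejected win_amd64 wheel is that pip matches wheels against the Python interpreter, not against the operating system. The sketch below uses only the standard library to check which kind of interpreter is running. The helper name interpreter_bits is made up for this note.

```python
import struct
import sysconfig


def interpreter_bits() -> int:
    """Return the pointer size of the running interpreter in bits (32 or 64)."""
    return struct.calcsize("P") * 8


if __name__ == "__main__":
    # A 32-bit Python needs a *-win32.whl file even on 64-bit Windows.
    print("Python build:", interpreter_bits(), "bit")
    # Platform string of this interpreter (on Windows: win32 or win-amd64).
    print("Platform:", sysconfig.get_platform())
```

If this reports 32 bit, the win32 wheel is the matching one, whatever the Windows edition.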

Installation:  installing Scrapy on Win10:  2018-9-12

Environment: Win10, Python 3.7.0. Python directory: c:/python3/

Installing scrapy directly still fails, so install Twisted first.

http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted  Download Twisted-18.7.0-cp37-cp37m-win_amd64.whl.

Put it in C:\Python3\Scripts\ and change into that directory

pip install Twisted-18.7.0-cp37-cp37m-win_amd64.whl

pip install scrapy

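Learning in progress:

cd c:\Python3\zz\          # C:\Python3\zz\ is the folder where I keep my projects

python C:/Python3/Scripts/scrapy.exe startproject plant  # create a crawler project named plant

C:\Python3\zz\plant\

├ scrapy.cfg:  the project's configuration file
├ plant/:  the project's Python module; your code goes in here.
├ plant/items.py:  the project's item definitions.
├ plant/pipelines.py:  the project's pipelines.
├ plant/settings.py:  the project's settings.
└ plant/spiders/:  the directory that holds the spider code.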

Edit items.py

import scrapy

class DmozItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
    desc = scrapy.Field()

Write the first spider: create the file C:\Python3\zz\plant\plant\spiders\quotes_spider.py

The next two steps follow the tutorial at https://doc.scrapy.org/en/latest/intro/tutorial.html#creating-a-project , but they raise errors on this machine; I will try again tomorrow.

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)

Change into the project folder and run:

cd c:\Python3\zz\plant\

scrapy crawl quotes

Currently reading:

https://docs.scrapy.org/en/latest/intro/overview.html

Create a file named quotes_spider.py

Run the command:  scrapy runspider quotes_spider.py -o quotes.json  , which produces the file quotes.json

python scrapy.exe runspider c:/Python3/zz/quotes_spider.py -o c:/Python3/zz/quotes.json  (my Win7 install still does not work)

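Although Scrapy can be installed on Windows with pip, the official docs recommend installing Anaconda or Miniconda and using the package from the conda-forge channel, which avoids most installation problems.

https://docs.scrapy.org/en/latest/intro/install.html#intro-install-platform-notes  Platform-specific installation notes

http://docs.continuum.io/anaconda/index  Anaconda

http://conda.pydata.org/docs/install/quick.html  Miniconda

Scrapy is written in pure Python and depends on a few key Python packages (among others):

lxml, an efficient XML and HTML parser
parsel, an HTML/XML data extraction library written on top of lxml
w3lib, a multi-purpose helper for dealing with URLs and web page encodings
twisted, an asynchronous networking framework
cryptography and pyOpenSSL, to deal with various network-level security needs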

The minimum versions Scrapy is tested against are:
Twisted 14.0
lxml 3.4
pyOpenSSL 0.14
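
To see which of these dependencies are actually installed, a small standard-library check can help. This is only a sketch: it assumes Python 3.8 or newer (for importlib.metadata, which the 3.6/3.7 installs above do not have), and the helper name installed_versions is made up here.

```python
from importlib import metadata  # available from Python 3.8

# Minimum versions quoted above, keyed by PyPI distribution name.
MINIMUMS = {"Twisted": "14.0", "lxml": "3.4", "pyOpenSSL": "0.14"}


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, version in installed_versions(MINIMUMS).items():
        shown = version or "not installed"
        print(f"{name}: installed={shown}, minimum={MINIMUMS[name]}")
```

The script only reports versions; comparing them against the minimums is left to the reader.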

Create a new project named tutorial with the command:  scrapy startproject tutorial

├ scrapy.cfg:  the project's configuration file
├ tutorial/:  the project's Python module; your code goes in here.
├ tutorial/items.py:  the project's item definitions.
├ tutorial/pipelines.py:  the project's pipelines.
├ tutorial/settings.py:  the project's settings.
└ tutorial/spiders/:  the directory that holds the spider code.

Create a new spider, quotes_spider.py, saved at: tutorial/tutorial/spiders/quotes_spider.py

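# -*- coding: utf-8 -*-

import scrapy

class QuotesSpider(scrapy.Spider):
    # name identifies the spider. It must be unique within a project:
    # two different spiders cannot share the same name.
    name = "quotes"

    # Must return an iterable of Requests (a list, or a generator function as here)
    # that the spider starts crawling from. Later requests are generated
    # successively from these initial ones.
    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    # Called to handle the response downloaded for each request. The response
    # argument is a TextResponse instance that holds the page content and has
    # further helper methods for working with it.
    # parse() usually parses the response, extracts the scraped data as dicts,
    # and can also find new URLs to follow and create new Requests from them.
    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)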

Run the new spider with the command:  scrapy crawl quotes  (the downloaded pages are saved in whichever directory the command is run from)
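
The file names come from the line page = response.url.split("/")[-2] in parse(). The plain-Python sketch below replays just that step, without Scrapy; the function name page_filename is made up for illustration.

```python
def page_filename(url: str) -> str:
    """Build the output file name the same way parse() does."""
    # 'http://quotes.toscrape.com/page/1/'.split('/') ends with
    # [..., 'page', '1', ''], so index -2 is the page number
    # (the trailing slash produces the empty last element).
    page = url.split("/")[-2]
    return "quotes-%s.html" % page


if __name__ == "__main__":
    for url in (
        "http://quotes.toscrape.com/page/1/",
        "http://quotes.toscrape.com/page/2/",
    ):
        print(url, "->", page_filename(url))
```

Note that the trick relies on the trailing slash: without it, index -2 would pick up "page" instead of the number.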

Open an interactive, command-line-like session with:  scrapy shell "http://quotes.toscrape.com/page/1/"

response.css('title::text').extract() # returns ['Quotes to Scrape']

response.css('title').extract() # returns ['<title>Quotes to Scrape</title>']

response.xpath('//title') # returns [<Selector xpath='//title' data='<title>Quotes to Scrape</title>'>]

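Start the shell, then type the commands below at its prompt (they are shell input, not the contents of a file). On Windows the URL needs double quotes:

scrapy shell "http://quotes.toscrape.com"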

response.css("div.quote")

quote = response.css("div.quote")

quote

tags = quote.css("div.tags a.tag::text").extract()

tags

for quote in response.css("div.quote"):
    text = quote.css("span.text::text").extract_first()
    author = quote.css("small.author::text").extract_first()
    tags = quote.css("div.tags a.tag::text").extract()
    print(dict(text=text, author=author, tags=tags))

Practice, 2017-10-30 and 2017-12-8:  source code is at https://github.com/Germey/ScrapyTutorial

I. Create the project:

scrapy startproject tutorial

II. Create the spider:

cd tutorial

scrapy genspider quotes quotes.toscrape.com  creates a spider in the current project (this generated the spider file \tutorial\tutorial\spiders\quotes.py)

scrapy genspider myfirst frps.eflora.cn  creates a spider in the current project (this generated the spider file \plant\plant\spiders\myfirst.py)

\tutorial\tutorial\spiders\quotes.py  (note the spider's three attributes: name, allowed_domains, start_urls)

# -*- coding: utf-8 -*-

import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'  # unique within a project; used to tell spiders apart
    allowed_domains = ['quotes.toscrape.com']  # domains that may be crawled; initial or follow-up request URLs outside them are filtered out
    start_urls = ['http://quotes.toscrape.com/']  # URLs the spider crawls at startup; they define the initial requests

    def parse(self, response):
        pass
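
To make the allowed_domains idea concrete, here is a plain-Python sketch of a domain check. It illustrates the concept only and is not Scrapy's own filtering code; the function name is_allowed is made up, and treating subdomains as allowed is my assumption about the intended behaviour.

```python
from urllib.parse import urlparse


def is_allowed(url: str, allowed_domains) -> bool:
    """Return True if the URL's host is an allowed domain or a subdomain of one."""
    host = (urlparse(url).hostname or "").lower()
    return any(
        host == domain or host.endswith("." + domain)
        for domain in allowed_domains
    )


if __name__ == "__main__":
    allowed = ["quotes.toscrape.com"]
    for url in (
        "http://quotes.toscrape.com/page/2/",
        "http://example.com/some/page.html",
    ):
        print(url, "->", "keep" if is_allowed(url, allowed) else "filter out")
```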

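1. Turn off ROBOTSTXT_OBEY:

Open settings.py and change line 22 to: ROBOTSTXT_OBEY = False. (With the setting on, Scrapy fetches the site's robots.txt right after starting and uses it to decide which parts of the site may be crawled.)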
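
What "obeying robots.txt" means can be tried out with the standard library's urllib.robotparser. This is only an illustration of the rules format, not the code Scrapy runs internally, and the robots.txt content below is invented for the example.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt, used only for this illustration.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def allowed_by_robots(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check whether user_agent may fetch url under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in (
        "http://example.com/page/1/",
        "http://example.com/private/data.html",
    ):
        verdict = allowed_by_robots(SAMPLE_ROBOTS_TXT, "*", url)
        print(url, "->", "allowed" if verdict else "disallowed")
```

Setting ROBOTSTXT_OBEY = False tells Scrapy to skip this kind of check altogether, so it is worth thinking about whether the target site is fine with that.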

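2. A first trial crawl

scrapy crawl quotes   (run this without changing any code yet)

scrapy -h          # show help for the available commands

scrapy crawl -h    # show more help for a specific command

3. Getting to know the commands:

Global commands: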

scrapy startproject myproject    Create a new project
scrapy settings                  Get settings values
scrapy runspider                 Run a self-contained spider (without creating a project)
scrapy shell                     Interactive scraping console. Example: scrapy shell http://www.example.com/some/page.html
scrapy fetch                     Fetch a URL using the Scrapy downloader. Example: scrapy fetch --nolog http://www.example.com/some/page.html
scrapy view                      Open a URL in the browser, as seen by Scrapy. Example: scrapy view http://www.example.com/some/page.html
scrapy version                   Print the Scrapy version

Project-only commands:

scrapy crawl myfirst     Run a spider
scrapy check myfirst     Check spider contracts
scrapy list              List available spiders
scrapy edit myfirst      Edit a spider using the editor set in EDITOR (not much use)
scrapy parse             Parse a URL (using its spider) and print the results. See http://scrapy-chs.readthedocs.io/zh_CN/latest/topics/commands.html#parse
scrapy genspider         Generate a new spider in the current project using pre-defined templates
scrapy deploy            Deploy the project to a Scrapyd service (listed in the 0.24 docs; newer releases moved this to the separate scrapyd-client package)
scrapy bench             Run a quick benchmark test

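III. Create the Item:

An Item is a container for the scraped data. It is used much like a dictionary, and a plain dictionary would also work, but an Item adds a layer of protection: it guards against misspelled or undefined field names.

To create an Item, subclass scrapy.Item and declare class attributes of type scrapy.Field. Looking at the target site, the content we can extract is text, author, and tags.

Change items.py to: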

import scrapy

class QuoteItem(scrapy.Item):
    text = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()

IV. Parse the Response:

Rewrite the parse method in quotes.py as follows:

def parse(self, response):
    quotes = response.css('.quote')
    for quote in quotes:
        text = quote.css('.text::text').extract_first()
        author = quote.css('.author::text').extract_first()
        tags = quote.css('.tags .tag::text').extract()

http://www.cnblogs.com/qcloud1001/p/6826573.html  Read up to here; continue next time

http://scrapy-chs.readthedocs.io/zh_CN/latest/topics/items.html  Read up to here; continue next time

Scrapy study notes 1, 2, 3:

https://www.imooc.com/article/21838

https://www.imooc.com/article/21839

https://www.imooc.com/article/21840

....
