Python Scrapy (Part 2)
1. Create the project (from the command line)

scrapy startproject xxx
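For reference, startproject generates a layout like this (using test_scrapy, the project name assumed throughout the rest of this post):

test_scrapy/
    scrapy.cfg            # deploy configuration
    test_scrapy/          # the project's Python module
        __init__.py
        items.py          # item definitions (step 2)
        middlewares.py    # spider and downloader middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings (step 3)
        spiders/          # your spiders go here (step 4)
            __init__.py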
2. Write the items file
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class TestScrapyItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    name = scrapy.Field()
    title = scrapy.Field()
    info = scrapy.Field()
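An Item behaves like a dict keyed by the declared fields, which is exactly how the spider below fills it in. A quick sketch of the access pattern (the value 'Zhang San' is just a placeholder):

item = TestScrapyItem()
item['name'] = 'Zhang San'    # assign dict-style, by declared field name
print(item['name'])           # read back the same way
# item['age'] = 18            # would raise KeyError: undeclared fields are rejected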
3. Write the settings file
# -*- coding: utf-8 -*-

# Scrapy settings for test_scrapy project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'test_scrapy'

SPIDER_MODULES = ['test_scrapy.spiders']
NEWSPIDER_MODULE = 'test_scrapy.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'test_scrapy (+http://www.yourdomain.com)'

# Obey robots.txt rules
#ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.87 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    # 'Accept-Language': 'en',
}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'test_scrapy.middlewares.TestScrapySpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'test_scrapy.middlewares.TestScrapyDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'test_scrapy.pipelines.TestScrapyPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
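ITEM_PIPELINES above turns on test_scrapy.pipelines.TestScrapyPipeline (300 is its priority; lower numbers run first). The pipeline file itself is not shown in this post; a minimal sketch of what pipelines.py could look like, writing each item to a JSON-lines file (the filename teachers.json is an assumption for illustration):

# pipelines.py
import json


class TestScrapyPipeline(object):
    def open_spider(self, spider):
        # called once when the spider starts
        self.f = open('teachers.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # called for every item the spider yields
        line = json.dumps(dict(item), ensure_ascii=False) + '\n'
        self.f.write(line)
        return item

    def close_spider(self, spider):
        # called once when the spider finishes
        self.f.close()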
4. Create the spider (generate a custom spider file from the command line, inside the project directory; the name and domain below match the code that follows)

scrapy genspider itcast www.xxxx.cn
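genspider only writes a skeleton into test_scrapy/spiders/; before editing, the generated file looks roughly like this (exact contents vary slightly by Scrapy version):

# -*- coding: utf-8 -*-
import scrapy


class ItcastSpider(scrapy.Spider):
    name = 'itcast'
    allowed_domains = ['www.xxxx.cn']
    start_urls = ['http://www.xxxx.cn/']

    def parse(self, response):
        pass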
Write the code:
import scrapy
from test_scrapy.items import TestScrapyItem


class ItcastSpider(scrapy.Spider):
    # spider name
    name = "itcast"
    # domains the spider is allowed to crawl
    allowed_domains = ['www.xxxx.cn']
    # the spider's starting URL
    start_urls = ["http://www.xxxx.cn/channel/teacher.shtml#ajavaee"]

    def parse(self, response):
        # (response.body is bytes, so the file must be opened in binary mode)
        # with open("teacher.html", "wb") as f:
        #     f.write(response.body)

        # use the built-in XPath support to match the root node of each teacher:
        # the list of nodes holding every teacher's info
        teacher_list = response.xpath('//div[@class="li_txt"]')

        # iterate over the root nodes
        for each in teacher_list:
            # store the data
            item = TestScrapyItem()
            # extract() converts the matched results into unicode strings;
            # without extract() the result is a list of XPath selector objects
            name = each.xpath('./h3/text()').extract()
            title = each.xpath('./h4/text()').extract()
            info = each.xpath('./p/text()').extract()

            item['name'] = name[0]
            item['title'] = title[0]
            item['info'] = info[0]

            yield item
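One caveat about the name[0] indexing above: when an XPath expression matches nothing, extract() returns an empty list and name[0] raises IndexError. A slightly safer variant of those three assignments uses extract_first(), which returns a default you supply instead:

item['name'] = each.xpath('./h3/text()').extract_first('')
item['title'] = each.xpath('./h4/text()').extract_first('')
item['info'] = each.xpath('./p/text()').extract_first('')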
5. Run

scrapy crawl itcast

The argument is the spider's name attribute ("itcast" above), not the filename.
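If you only want the scraped items dumped to a file, Scrapy's built-in feed exports can do that straight from the command line, without a custom pipeline:

scrapy crawl itcast -o teachers.json

The file extension picks the format; .jl (JSON lines), .csv and .xml work the same way.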
OVER!