Scrapy Basics in Practice: Crawling a Proxy Listing Site

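A first hands-on project with Scrapy.

What this walkthrough covers:

1. How to crawl a website with the Scrapy framework.

2. How to parse page source with XPath.

3. How to write a spider file that parses the downloaded source.

4. The basic Scrapy commands: creating a project and generating a spider file from a template.

5. How to configure settings.py to get past basic anti-crawler checks, for example by setting the User-Agent.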

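Goal: crawl the proxy listing site www.xicidaili.com.

1. Create the project with scrapy startproject <project name>:

scrapy startproject xicidailiSpider

How should the project be named? A useful convention is the name of the domain to crawl plus "Spider". For example, a project that crawls www.zhihu.com could be called zhihuSpider.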

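2. In the generated directory, spiders/ holds the spider files. middlewares.py holds the middlewares, of which there are two kinds: downloader middlewares and spider middlewares. pipelines.py is the pipeline file, which processes the data parsed by the spiders. settings.py holds the project settings, such as whether to obey robots.txt and which User-Agent to send.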
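To make the role of pipelines.py concrete, here is a minimal sketch of an item pipeline. The class name ProxyCleanPipeline and the cleaning rules are assumptions made for illustration; the only thing taken from Scrapy is the process_item(item, spider) method that a pipeline is expected to define.

```python
class ProxyCleanPipeline:
    """Hypothetical pipeline: tidy each scraped proxy record."""

    def process_item(self, item, spider):
        # Copy the item so the original dict is left untouched.
        cleaned = dict(item)
        # Strip stray whitespace from every text field that is present.
        for key in ("ip", "port", "country"):
            value = cleaned.get(key)
            if isinstance(value, str):
                cleaned[key] = value.strip()
        # Store the port as an integer when it is purely numeric.
        port = cleaned.get("port")
        if isinstance(port, str) and port.isdigit():
            cleaned["port"] = int(port)
        return cleaned


if __name__ == "__main__":
    raw = {"ip": " 192.0.2.10 ", "port": "8080", "country": None}
    print(ProxyCleanPipeline().process_item(raw, spider=None))
```

A pipeline only runs once it is listed in ITEM_PIPELINES in settings.py, so a class like this would need an entry there next to the commented-out default.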

3. Scrapy can generate a spider from a template:

scrapy genspider <spider name> <domain to crawl>

For example, to crawl www.xicidaili.com, run:

scrapy genspider xicidaili xicidaili.com

When the command finishes, the spiders/ directory contains a new file, xicidaili.py.

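With the project in place, the next step is to analyze the pages to extract.

Three parsing approaches are commonly used:

1. Regular expressions

2. XPath: response.xpath("expression")

3. CSS selectors: response.css("expression")

An XPath syntax tutorial is available from w3school: http://www.w3school.com.cn/xpath/xpath_syntax.asp

Installing the XPath Helper browser extension makes it easy to check whether an XPath expression is correct.

XPath syntax takes practice; it does not stick from reading alone.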
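To practice without a browser or a running crawl, the sketch below compares the regex and XPath approaches on a small table. It uses only Python's standard library as a stand-in: xml.etree.ElementTree supports just a subset of XPath and is not Scrapy's selector, and the SAMPLE markup is invented to mirror the column layout the spider expects (IP in the second cell, port in the third, location link in the fourth).

```python
import re
import xml.etree.ElementTree as ET

# Made-up, well-formed markup; the addresses come from a documentation range.
SAMPLE = """
<table>
  <tr><th>flag</th><th>IP</th><th>Port</th><th>Location</th></tr>
  <tr><td></td><td>192.0.2.10</td><td>8080</td><td><a href="#">Example City</a></td></tr>
  <tr><td></td><td>192.0.2.11</td><td>3128</td><td><a href="#">Sample Town</a></td></tr>
</table>
"""


def parse_with_xpath(markup):
    """Walk every <tr> and read cells by position, as the spider does."""
    root = ET.fromstring(markup.strip())
    rows = []
    for tr in root.findall(".//tr"):
        ip = tr.findtext("./td[2]")
        if ip is None:  # the header row has <th> cells, so no <td> matches
            continue
        rows.append({
            "ip": ip,
            "port": tr.findtext("./td[3]"),
            "country": tr.findtext("./td[4]/a"),
        })
    return rows


def parse_with_regex(markup):
    """Pull (ip, port) pairs straight out of the text with a pattern."""
    pattern = re.compile(r"<td>(\d{1,3}(?:\.\d{1,3}){3})</td><td>(\d+)</td>")
    return pattern.findall(markup)


if __name__ == "__main__":
    print(parse_with_xpath(SAMPLE))
    print(parse_with_regex(SAMPLE))
```

The regex version depends on the exact text between the tags, while the path version follows the table structure, which is one reason XPath is usually the more comfortable choice for tabular pages.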

xicidaili.py

# -*- coding: utf-8 -*-

import scrapy

# Subclass scrapy.Spider

class XicidailiSpider(scrapy.Spider):

    name = 'xicidaili'

    allowed_domains = ['xicidaili.com']

    # One start URL per proxy category; each entry is its own string.
    start_urls = ['https://www.xicidaili.com/nn/',
                  "https://www.xicidaili.com/nt/",
                  "https://www.xicidaili.com/wn/",
                  "https://www.xicidaili.com/wt/"]

    # Parse the response and extract data, URLs, and so on.
    def parse(self, response):
        selectors = response.xpath('//tr')
        for selector in selectors:
            # A leading "." makes the XPath relative to the current node.
            ip = selector.xpath("./td[2]/text()").get()
            port = selector.xpath("./td[3]/text()").get()
            # get() is equivalent to extract_first(); getall() returns every match.
            country = selector.xpath("./td[4]/a/text()").get()
            # print(ip, port, country)
            item = {
                "ip": ip,
                "port": port,
                "country": country,
            }
            yield item

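        # Pagination is disabled here by wrapping it in a string literal;
        # remove the triple quotes to follow the next-page links.
        """
        # Get the link to the next page
        next_page = response.xpath("//a[@class='next_page']/@href").get()
        # next_page is None on the last page
        if next_page:
            # Build the absolute URL with response.urljoin
            next_url = response.urljoin(next_page)
            # Pass the callback itself; do not call it (no parentheses)
            yield scrapy.Request(next_url, callback=self.parse)
        """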
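The pagination step relies on turning a relative href into an absolute URL. response.urljoin behaves much like urllib.parse.urljoin applied to the page's own URL, so the idea can be tried without Scrapy. The helper name next_page_url and the example hrefs below are assumptions for illustration.

```python
from urllib.parse import urljoin


def next_page_url(current_url, href):
    """Return the absolute URL of the next page, or None on the last page."""
    if not href:
        # No next-page link was found, so there is nothing left to crawl.
        return None
    # Resolve the (possibly relative) href against the current page URL.
    return urljoin(current_url, href)


if __name__ == "__main__":
    print(next_page_url("https://www.xicidaili.com/nn/", "/nn/2"))
    print(next_page_url("https://www.xicidaili.com/nn/", None))
```

Joining rather than concatenating strings means the same code works whether the site emits root-relative, page-relative, or already absolute links.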

# -*- coding: utf-8 -*-

# settings.py configuration

# Scrapy settings for xicidailiSpider project

#

# For simplicity, this file contains only settings considered important or

# commonly used. You can find more settings consulting the documentation:

#

#     https://doc.scrapy.org/en/latest/topics/settings.html

#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html

#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'xicidailiSpider'

SPIDER_MODULES = ['xicidailiSpider.spiders']

NEWSPIDER_MODULE = 'xicidailiSpider.spiders'

# Character encoding for exported feed files

FEED_EXPORT_ENCODING = "utf-8"

# Crawl responsibly by identifying yourself (and your website) on the user-agent

#USER_AGENT = 'xicidailiSpider (+http://www.yourdomain.com)'

# Obey robots.txt rules

# Whether to obey robots.txt; this project does not

ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)

#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)

# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay

# See also autothrottle settings and docs

#DOWNLOAD_DELAY = 3

# The download delay setting will honor only one of:

#CONCURRENT_REQUESTS_PER_DOMAIN = 16

#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)

#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)

#TELNETCONSOLE_ENABLED = False

# Override the default request headers:

DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    # Adjacent string literals are joined, so the header has no stray whitespace.
    'User-Agent': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                   'AppleWebKit/537.36 (KHTML, like Gecko) '
                   'Chrome/75.0.3770.100 Safari/537.36'),
}

# Enable or disable spider middlewares

# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html

#SPIDER_MIDDLEWARES = {

#    'xicidailiSpider.middlewares.XicidailispiderSpiderMiddleware': 543,

#}

# Enable or disable downloader middlewares

# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html

#DOWNLOADER_MIDDLEWARES = {

#    'xicidailiSpider.middlewares.XicidailispiderDownloaderMiddleware': 543,

#}

# Enable or disable extensions

# See https://doc.scrapy.org/en/latest/topics/extensions.html

#EXTENSIONS = {

#    'scrapy.extensions.telnet.TelnetConsole': None,

#}

# Configure item pipelines

# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html

#ITEM_PIPELINES = {

#    'xicidailiSpider.pipelines.XicidailispiderPipeline': 300,

#}

# Enable and configure the AutoThrottle extension (disabled by default)

# See https://doc.scrapy.org/en/latest/topics/autothrottle.html

#AUTOTHROTTLE_ENABLED = True

# The initial download delay

#AUTOTHROTTLE_START_DELAY = 5

# The maximum download delay to be set in case of high latencies

#AUTOTHROTTLE_MAX_DELAY = 60

# The average number of requests Scrapy should be sending in parallel to

# each remote server

#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0

# Enable showing throttling stats for every response received:

#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)

# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings

#HTTPCACHE_ENABLED = True

#HTTPCACHE_EXPIRATION_SECS = 0

#HTTPCACHE_DIR = 'httpcache'

#HTTPCACHE_IGNORE_HTTP_CODES = []

#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
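Most of the generated file is commented-out defaults. As a rough summary, the fragment below lists the settings this walkthrough actually changes, plus two throttling options that the template ships commented out. Putting the browser string in USER_AGENT instead of DEFAULT_REQUEST_HEADERS is an alternative shown for comparison, not what the file above does, and the delay value is only illustrative.

```python
# Do not consult robots.txt before crawling (the choice made in this project).
ROBOTSTXT_OBEY = False

# Write exported feeds as UTF-8 so non-ASCII text stays readable.
FEED_EXPORT_ENCODING = "utf-8"

# Alternative to a User-Agent entry in DEFAULT_REQUEST_HEADERS.
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/75.0.3770.100 Safari/537.36"
)

# Optional politeness settings; the numbers are illustrative.
DOWNLOAD_DELAY = 3
AUTOTHROTTLE_ENABLED = True
```

Slowing the crawl down is often as useful as a browser-like User-Agent when a site starts rejecting requests.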

Running the spider

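scrapy crawl xicidaili

The argument is the spider name (the name attribute in xicidaili.py), not the project name, and it must be unique within the project.

To write the results to a file, add --output with a .json or .csv file name:

scrapy crawl xicidaili --output ip.json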
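Once ip.json exists, the records can be read back with the standard json module. The sketch below assumes the export is a JSON array of objects with the ip, port, and country keys yielded by the spider; SAMPLE_EXPORT is made-up data standing in for the real file, and to_proxy_urls is a hypothetical helper.

```python
import json

# Invented stand-in for the contents of ip.json.
SAMPLE_EXPORT = """[
{"ip": "192.0.2.10", "port": "8080", "country": "Example City"},
{"ip": null, "port": null, "country": null},
{"ip": "192.0.2.11", "port": "3128", "country": "Sample Town"}
]"""


def to_proxy_urls(json_text, scheme="http"):
    """Turn exported records into scheme://ip:port strings."""
    records = json.loads(json_text)
    return [
        f"{scheme}://{record['ip']}:{record['port']}"
        for record in records
        # Skip rows where the IP or port is missing, such as a table header.
        if record.get("ip") and record.get("port")
    ]


if __name__ == "__main__":
    for url in to_proxy_urls(SAMPLE_EXPORT):
        print(url)
```

For the real file, the text would come from opening ip.json with encoding="utf-8" and passing its contents to the same function.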
