Building a Distributed Crawler in 21 Days: Crawling the Mini Program Community with the Crawl Spider (Part 8)

8.1. Using the Crawl spider in practice

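Create the project, then generate the spider from the crawl template inside the project directory:

scrapy startproject wxapp

cd wxapp

scrapy genspider -t crawl wxapp_spider "wxapp-union.com"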

wxapp_spider.py

# -*- coding: utf-8 -*-
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from wxapp.items import WxappItem


class WxappSpiderSpider(CrawlSpider):
    name = 'wxapp_spider'
    allowed_domains = ['wxapp-union.com']
    start_urls = ['http://www.wxapp-union.com/portal.php?mod=list&catid=2&page=1']

    rules = (
        # List pages: keep following pagination links, no callback needed.
        Rule(LinkExtractor(allow=r'.+mod=list&catid=\d'), follow=True),
        # Article pages: parse them, but do not follow links found inside.
        Rule(LinkExtractor(allow=r'.+article-.+\.html'), callback="parse_detail", follow=False),
    )

    def parse_detail(self, response):
        # Title, author and publication time come from the article header.
        title = response.xpath("//h1[@class='ph']/text()").get()
        author_p = response.xpath("//p[@class='authors']")
        author = author_p.xpath(".//a/text()").get()
        pub_time = author_p.xpath(".//span/text()").get()
        # The body is spread over many text nodes; join them into one string.
        article_content = response.xpath("//td[@id='article_content']//text()").getall()
        content = "".join(article_content).strip()
        item = WxappItem(title=title, author=author, pub_time=pub_time, content=content)
        return item
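
The two `allow` patterns in `rules` decide which links are followed and which are parsed, so it can help to try them against a few URLs before starting a crawl. The sketch below only uses the standard `re` module and assumes the link extractor applies the patterns with search semantics and that the first matching rule wins; `classify` is a hypothetical helper, and the article URL is made up for illustration.

```python
import re

# The two patterns from the `rules` tuple in wxapp_spider.py.
LIST_PATTERN = re.compile(r'.+mod=list&catid=\d')
DETAIL_PATTERN = re.compile(r'.+article-.+\.html')


def classify(url):
    """Return which rule would pick up `url`: 'list', 'detail', or None.

    Hypothetical helper: rules are checked in the order they are declared.
    """
    if LIST_PATTERN.search(url):
        return 'list'
    if DETAIL_PATTERN.search(url):
        return 'detail'
    return None


# Sample URLs; the article URL is invented for this example.
samples = [
    'http://www.wxapp-union.com/portal.php?mod=list&catid=2&page=2',
    'http://www.wxapp-union.com/article-0000-1.html',
    'http://www.wxapp-union.com/forum.php',
]
for url in samples:
    print(classify(url), url)
```

A URL that comes back as `None` here would be ignored by both rules, which is a quick way to spot a pattern that is too strict.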

items.py

# -*- coding: utf-8 -*-
import scrapy


class WxappItem(scrapy.Item):
    # One field per value extracted in parse_detail.
    title = scrapy.Field()
    author = scrapy.Field()
    pub_time = scrapy.Field()
    content = scrapy.Field()

pipelines.py

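# -*- coding: utf-8 -*-
from scrapy.exporters import JsonLinesItemExporter


class WxappPipeline(object):
    def __init__(self):
        # The exporter writes bytes, so the file is opened in binary mode.
        self.fp = open('wxapp.json', 'wb')
        # ensure_ascii=False keeps Chinese text readable instead of \uXXXX escapes.
        self.exporter = JsonLinesItemExporter(self.fp, ensure_ascii=False, encoding='utf-8')

    def process_item(self, item, spider):
        # Write the item as one JSON object on its own line.
        self.exporter.export_item(item)
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes; release the file handle.
        self.fp.close()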
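
Because the pipeline uses a JSON Lines exporter, `wxapp.json` holds one JSON object per line rather than a single JSON array, so it cannot be loaded with one `json.load` call. The following is a minimal sketch of reading such a file back using only the standard library. `read_json_lines` is a hypothetical helper, and the records are made-up stand-ins written to a temporary file so the example runs without a crawl.

```python
import json
import os
import tempfile


def read_json_lines(path):
    """Load a JSON Lines file into a list of dicts, skipping blank lines."""
    with open(path, encoding='utf-8') as fp:
        return [json.loads(line) for line in fp if line.strip()]


# Stand-in for wxapp.json: two invented records, one object per line,
# written with ensure_ascii=False so non-ASCII text stays readable.
records = [
    {'title': '示例标题', 'author': 'someone', 'pub_time': '2018-01-01', 'content': 'hello'},
    {'title': 'Second', 'author': 'another', 'pub_time': '2018-01-02', 'content': 'world'},
]
path = os.path.join(tempfile.mkdtemp(), 'sample.json')
with open(path, 'w', encoding='utf-8') as fp:
    for record in records:
        fp.write(json.dumps(record, ensure_ascii=False) + '\n')

loaded = read_json_lines(path)
print(len(loaded), loaded[0]['title'])
```

Pointing `read_json_lines` at the real `wxapp.json` after a crawl should work the same way, assuming the file was written as UTF-8 by the pipeline.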

settings.py

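# Do not consult robots.txt for this crawl.
ROBOTSTXT_OBEY = False

# Base delay, in seconds, between consecutive requests to the same site.
DOWNLOAD_DELAY = 1

# Headers sent with every request; the User-Agent mimics a desktop browser.
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36',
}

# Enable the pipeline; lower numbers run earlier when several are configured.
ITEM_PIPELINES = {
    'wxapp.pipelines.WxappPipeline': 300,
}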

start.py

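# Launch the spider from a script, which is convenient when running from an IDE.
from scrapy import cmdline

# Equivalent to typing "scrapy crawl wxapp_spider" in the project directory.
cmdline.execute("scrapy crawl wxapp_spider".split())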
