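1. Create the project

Before you start crawling, you must create a new Scrapy project. Change into the directory where you want to keep the code and run the project creation command.

For example, to keep the project under D:\00Coding\Python\scrapy, open a command prompt, change into that directory, and run:

scrapy startproject tutorial

PS: tutorial can be replaced with any name you like, preferably in English.

This command creates a tutorial directory with the following contents: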

tutorial/
    scrapy.cfg
    tutorial/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            ...

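These files are:

scrapy.cfg: the project's configuration file.

tutorial/: the project's Python module. You will add your code here later.

tutorial/items.py: the project's item definitions.

tutorial/pipelines.py: the project's item pipelines.

tutorial/settings.py: the project's settings file.

tutorial/spiders/: the directory that holds the spider code.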

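2. Define the Item

An Item is the container that holds the scraped data. It is used much like a Python dictionary, and it adds a protection mechanism against undefined-field errors caused by typos. From the site we want to crawl (Sina News in this tutorial), we need the following attributes:

Top-level category URL and top-level category title;

Subcategory URL and subcategory title;

Article link URL and article link title;

Article headline and article content;

Define the corresponding fields in the item by editing the items.py file in the tutorial directory: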

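from scrapy.item import Item, Field


class TutorialItem(Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    parent_title = Field()  # top-level category title
    parent_url = Field()    # top-level category URL
    second_title = Field()  # subcategory title
    second_url = Field()    # subcategory URL
    path = Field()          # local directory where the article is saved
    link_title = Field()    # article link title
    link_url = Field()      # article link URL
    head = Field()          # article headline
    content = Field()       # article content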
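
To make the "protection against typos" concrete without requiring Scrapy to be installed, here is a minimal sketch of the idea. StrictItem is a hypothetical stand-in written for this illustration, not Scrapy's implementation; it only mimics the behaviour that assigning to an undeclared field raises a KeyError.

```python
class StrictItem(dict):
    """Hypothetical stand-in for a Scrapy Item: only declared fields may be set."""

    # Declared field names (a subset of TutorialItem's fields, for brevity)
    fields = ("parent_title", "parent_url", "head", "content")

    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError(f"{type(self).__name__} does not support field: {key}")
        super().__setitem__(key, value)


item = StrictItem()
item["head"] = "Example headline"  # declared field: accepted

try:
    item["haed"] = "Example headline"  # misspelled field: rejected
except KeyError as exc:
    error = str(exc)
```

With a plain dict the misspelled key would be stored silently and the mistake would only show up later, when head turns out to be missing.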

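3. Write the spider

A Spider is a class you write to scrape data from a single website (or a group of websites).

1. sinaSpider.py:

It contains the initial URLs to download, the logic for following links in the pages, and the logic for parsing page content and extracting items. To create a Spider, you must subclass scrapy.Spider and define the following three members:

name: identifies the Spider. The name must be unique; you cannot give the same name to different Spiders.

start_urls: the list of URLs the Spider starts crawling from. The first pages fetched will therefore be among these. Subsequent URLs are extracted from the data retrieved from the initial URLs.

parse(): a method of the spider. When called, the Response object produced for each downloaded initial URL is passed to it as its only argument. The method is responsible for parsing the response data, extracting data (producing items), and producing Request objects for URLs that need further processing.

After crawling the top-level categories we do not save the item yet. Instead, the item is passed on to the subcategory stage, and once the subcategories are crawled we go to the article detail pages to scrape the headline and content of each article:

The overall flow is: parse -> second_parse -> detail_parse

Here is the complete code of sinaSpider: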

# -*- coding: utf-8 -*-
__author__ = 'George'

# Python 3 version: the Python 2 reload(sys)/sys.setdefaultencoding("utf-8") hack is no longer needed
import os

from scrapy.spiders import Spider
from scrapy.http import Request
from scrapy.selector import Selector

from tutorial.items import TutorialItem

base = "d:/dataset/"  # root directory for the category folders


class SinaSpider(Spider):
    name = "sina"
    allowed_domains = ["sina.com.cn"]
    start_urls = [
        "http://news.sina.com.cn/guide/"
    ]  # list of start URLs

    def parse(self, response):
        items = []
        sel = Selector(response)
        big_urls = sel.xpath('//div[@id="tab01"]/div/h3/a/@href').extract()  # top-level category URLs
        big_titles = sel.xpath('//div[@id="tab01"]/div/h3/a/text()').extract()
        second_urls = sel.xpath('//div[@id="tab01"]/div/ul/li/a/@href').extract()  # subcategory URLs
        second_titles = sel.xpath('//div[@id="tab01"]/div/ul/li/a/text()').extract()
        # Skip the first category; the last one is also dropped because it has no link to follow
        for i in range(1, len(big_titles) - 1):
            file_name = base + big_titles[i]
            # Create the directory for this top-level category
            os.makedirs(file_name, exist_ok=True)
            # The original skips the first 19 subcategory links
            for j in range(19, len(second_urls)):
                item = TutorialItem()
                item['parent_title'] = big_titles[i]
                item['parent_url'] = big_urls[i]
                # A subcategory belongs to this category if its URL starts with the category URL
                if_belong = second_urls[j].startswith(item['parent_url'])
                if if_belong:
                    second_file_name = file_name + '/' + second_titles[j]
                    os.makedirs(second_file_name, exist_ok=True)
                    item['second_url'] = second_urls[j]
                    item['second_title'] = second_titles[j]
                    item['path'] = second_file_name
                    items.append(item)
        for item in items:
            yield Request(url=item['second_url'], meta={'item_1': item}, callback=self.second_parse)

    # For each subcategory page returned, issue further requests for the article links
    def second_parse(self, response):
        sel = Selector(response)
        item_1 = response.meta['item_1']
        items = []
        link_urls = sel.xpath('//a/@href').extract()
        for link_url in link_urls:
            # Keep only article pages (.shtml) that belong to the same top-level category
            if_belong = link_url.endswith('.shtml') and link_url.startswith(item_1['parent_url'])
            if if_belong:
                item = TutorialItem()
                item['parent_title'] = item_1['parent_title']
                item['parent_url'] = item_1['parent_url']
                item['second_url'] = item_1['second_url']
                item['second_title'] = item_1['second_title']
                item['path'] = item_1['path']
                item['link_url'] = link_url
                items.append(item)
        for item in items:
            yield Request(url=item['link_url'], meta={'item_2': item}, callback=self.detail_parse)

    def detail_parse(self, response):
        sel = Selector(response)
        item = response.meta['item_2']
        head = sel.xpath('//h1[@id="artibodyTitle"]/text()').extract_first()  # headline as a string
        content_list = sel.xpath('//div[@id="artibody"]/p/text()').extract()
        item['head'] = head
        item['content'] = "".join(content_list)  # concatenate all paragraphs
        yield item
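
The way parse assigns subcategories to top-level categories (a subcategory belongs to a category when its URL starts with the category's URL) can be tried in isolation. The following is a minimal sketch of that prefix matching only; group_by_parent and the example.com URLs are made up for this illustration and are not part of the spider.

```python
def group_by_parent(parent_urls, child_urls):
    """Map each parent URL to the child URLs that start with it."""
    return {
        parent: [child for child in child_urls if child.startswith(parent)]
        for parent in parent_urls
    }


# Made-up URLs standing in for the category and subcategory links
parents = ["http://news.example.com/", "http://sports.example.com/"]
children = [
    "http://news.example.com/china/",
    "http://sports.example.com/nba/",
    "http://news.example.com/world/",
    "http://other.example.com/misc/",  # matches no parent, so it is dropped
]

groups = group_by_parent(parents, children)
```

Links that match no category prefix are simply ignored, which is also what happens in the spider: they never produce an item.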

2. pipelines.py

This file saves the scraped data (as txt). Each file is named after the article link, with every '/' replaced by '_'.

# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html


class SinaPipeline(object):
    def process_item(self, item, spider):
        link_url = item['link_url']
        # Strip the leading "http://" (7 characters) and the trailing ".shtml" (6 characters),
        # then replace '/' with '_' to get a flat file name
        file_name = link_url[7:-6].replace('/', '_')
        file_name += ".txt"
        # Write only the article content, encoded as UTF-8
        with open(item['path'] + '/' + file_name, 'w', encoding='utf-8') as fp:
            fp.write(item['content'])
        return item
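
The slice link_url[7:-6] relies on every link starting with "http://" and ending with ".shtml". As a hedged alternative, the sketch below derives the same file name with the standard library's URL parsing, so it also works for "https://" links. url_to_filename is a hypothetical helper and the article URL is invented for the example.

```python
import posixpath
from urllib.parse import urlparse


def url_to_filename(link_url):
    """Turn an article URL into a flat .txt file name ('/' becomes '_')."""
    parsed = urlparse(link_url)
    # Drop the extension (e.g. ".shtml") from the URL path
    path, _ext = posixpath.splitext(parsed.path)
    return (parsed.netloc + path).replace("/", "_") + ".txt"


# Invented article URL, used only to show the transformation
name = url_to_filename("http://news.sina.com.cn/c/2016-01-01/doc-example.shtml")
```

For "http://" links ending in ".shtml" this produces the same name as the slice in the pipeline.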

3. settings.py

This is the settings file, where you configure things such as the number of concurrent requests and the log level.

# -*- coding: utf-8 -*-
BOT_NAME = 'tutorial'
SPIDER_MODULES = ['tutorial.spiders']
NEWSPIDER_MODULE = 'tutorial.spiders'
ITEM_PIPELINES = {
    'tutorial.pipelines.SinaPipeline': 300,
}
LOG_LEVEL = 'INFO'
ROBOTSTXT_OBEY = True
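
The settings above do not set any concurrency limit, so Scrapy's defaults apply. If you want to control concurrency explicitly, a sketch of the relevant entries in settings.py could look like the following. The setting names are standard Scrapy settings; the values are only illustrative (they happen to match Scrapy's documented defaults) and should be tuned for the target site.

```python
# Maximum number of requests Scrapy performs concurrently (across all domains)
CONCURRENT_REQUESTS = 16

# Maximum number of concurrent requests to any single domain
CONCURRENT_REQUESTS_PER_DOMAIN = 8

# Seconds to wait between consecutive requests to the same site
DOWNLOAD_DELAY = 0
```

Lowering the concurrency values or raising DOWNLOAD_DELAY makes the crawl slower but gentler on the site being crawled.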

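Crawl results

The folders are created according to the categories.

These are the top-level category folders. Now that all the items have been crawled, they need to be saved. Here we only want to keep the content, so the content field of each item is written directly to a txt file.

Each link is converted into a file name, and the file is saved under the category it belongs to.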
