Mastering Python Web Crawlers (精通Python网络爬虫): code notes for a spider that crawls pages automatically

Writing items

 # -*- coding: utf-8 -*-
 # Define here the models for your scraped items
 #
 # See documentation in:
 # https://doc.scrapy.org/en/latest/topics/items.html

 import scrapy


 class AutopjtItem(scrapy.Item):
     # Each field holds a list with one entry per product on the page.
     # Product name
     name = scrapy.Field()
     # Product price
     price = scrapy.Field()
     # Product link
     link = scrapy.Field()
     # Number of product reviews
     comnum = scrapy.Field()
     # Link to the product's review page
     comnum_link = scrapy.Field()

Writing pipelines

 # -*- coding: utf-8 -*-
 import codecs
 import json

 # Define your item pipelines here
 #
 # Don't forget to add your pipeline to the ITEM_PIPELINES setting
 # See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html


 class AutopjtPipeline(object):
     def __init__(self):
         # Output file: one JSON object per line, UTF-8 encoded
         self.file = codecs.open("D:/git/learn_scray/day11/1.json", "wb", encoding="utf-8")

     def process_item(self, item, spider):
         # Each field is a list covering every product on the current page,
         # so walk the lists by index and write one record per product
         for i in range(len(item["name"])):
             name = item["name"][i]
             price = item["price"][i]
             link = item["link"][i]
             comnum = item["comnum"][i]
             comnum_link = item["comnum_link"][i]
             current_content = {"name": name, "price": price, "link": link,
                                "comnum": comnum, "comnum_link": comnum_link}
             # ensure_ascii=False keeps Chinese text readable in the file
             j = json.dumps(current_content, ensure_ascii=False)
             # Append a newline after each record
             line = j + '\n'
             print(line)
             self.file.write(line)
         return item

     def close_spider(self, spider):
         # Called once when the spider finishes
         self.file.close()
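The index-based loop above raises IndexError when one list is shorter than item["name"], for example when a product has no review count. A minimal sketch of the same write step using zip() follows. The helper name write_json_lines and the sample data are hypothetical, and a plain dict stands in for the Scrapy item so the snippet runs without Scrapy installed.

```python
import io
import json


def write_json_lines(item, out):
    """Write one JSON object per product to the file-like object `out`."""
    fields = ("name", "price", "link", "comnum", "comnum_link")
    count = 0
    # zip() walks the parallel lists together and stops at the shortest one,
    # so a short list cannot raise IndexError.
    for values in zip(*(item[field] for field in fields)):
        record = dict(zip(fields, values))
        out.write(json.dumps(record, ensure_ascii=False) + "\n")
        count += 1
    return count


# Made-up sample data standing in for one scraped page (not real output).
sample_item = {
    "name": ["示例商品A", "示例商品B"],
    "price": ["¥10.00", "¥20.00"],
    "link": ["http://example.com/a", "http://example.com/b"],
    "comnum": ["12条评论"],  # deliberately one entry short
    "comnum_link": ["http://example.com/a#c", "http://example.com/b#c"],
}

buffer = io.StringIO()
written = write_json_lines(sample_item, buffer)
print(written)
print(buffer.getvalue(), end="")
```

Note that zip() only avoids the crash. If a value is missing in the middle of a list, the remaining entries still pair up with the wrong product, so extracting the fields per product node would be the more robust fix.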

Writing the automatic spider

 # -*- coding: utf-8 -*-
 import scrapy
 from autopjt.items import AutopjtItem
 from scrapy.http import Request


 class AutospdSpider(scrapy.Spider):
     name = 'autospd'
     allowed_domains = ['dangdang.com']
     # Dangdang "local specialties" category
     start_urls = ['http://category.dangdang.com/pg1-cid10010056.html']

     def parse(self, response):
         item = AutopjtItem()
         print("Entering item")
         # Product titles
         item["name"] = response.xpath("//p[@class='name']/a/@title").extract()
         # Prices
         item["price"] = response.xpath("//span[@class='price_n']/text()").extract()
         # Product links
         item["link"] = response.xpath("//p[@class='name']/a/@href").extract()
         # Review counts
         item["comnum"] = response.xpath("//a[@name='itemlist-review']/text()").extract()
         # Links to the review pages
         item["comnum_link"] = response.xpath("//a[@name='itemlist-review']/@href").extract()
         yield item
         # Queue list pages 1 to 78. Every call to parse() yields the same
         # requests; Scrapy's default duplicate filter skips repeated URLs.
         for i in range(1, 79):
             url = "http://category.dangdang.com/pg" + str(i) + "-cid10010056.html"
             yield Request(url, callback=self.parse)
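Because start_urls already covers page 1, the follow-up URLs could be built once, starting from page 2. The sketch below shows only the URL construction. The helper build_page_urls is hypothetical, and the last page number 78 is taken from the range(1, 79) loop above.

```python
def build_page_urls(first_page, last_page, category_id="cid10010056"):
    """Return the category list-page URLs for first_page..last_page inclusive."""
    template = "http://category.dangdang.com/pg{page}-{category}.html"
    return [
        template.format(page=page, category=category_id)
        for page in range(first_page, last_page + 1)
    ]


# Page 1 is the start URL, so only pages 2 to 78 remain to be requested.
urls = build_page_urls(2, 78)
print(len(urls))
print(urls[0])
print(urls[-1])
```

Inside the spider, each of these URLs would be wrapped in Request(url, callback=self.parse), as in the loop above.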

A detailed explanation of yield:

https://stackoverflow.com/questions/231767/what-does-the-yield-keyword-do
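As a small illustration of what yield does in parse(), the sketch below uses a plain generator that hands back one "item" and then several "requests", in that order. The function parse_page and the dict shapes are made up for illustration and are not Scrapy objects.

```python
def parse_page(page_number, last_page):
    """Yield one item for this page, then one request per later page."""
    # This line runs only when the first value is asked for, not at call time.
    print("parse_page started for page", page_number)
    yield {"type": "item", "page": page_number}
    for next_page in range(page_number + 1, last_page + 1):
        yield {"type": "request", "page": next_page}


results = parse_page(1, 3)  # creates the generator; the body has not run yet
first = next(results)       # runs the body up to the first yield
rest = list(results)        # resumes after the first yield and drains the rest
print(first)
print(rest)
```

Scrapy consumes the generator returned by parse() in the same way: it pulls values one at a time, sends items to the pipeline and schedules requests.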

Configuring settings:

 # -*- coding: utf-8 -*-
 # Scrapy settings for autopjt project
 #
 # For simplicity, this file contains only settings considered important or
 # commonly used. You can find more settings consulting the documentation:
 #
 #     https://doc.scrapy.org/en/latest/topics/settings.html
 #     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
 #     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

 BOT_NAME = 'autopjt'

 SPIDER_MODULES = ['autopjt.spiders']
 NEWSPIDER_MODULE = 'autopjt.spiders'

 # Crawl responsibly by identifying yourself (and your website) on the user-agent
 #USER_AGENT = 'autopjt (+http://www.yourdomain.com)'

 # Obey robots.txt rules
 # The project template sets this to True (obey robots.txt). The crawl worked
 # with it enabled; it can be switched to False if robots.txt blocks the pages.
 ROBOTSTXT_OBEY = True

 # Configure maximum concurrent requests performed by Scrapy (default: 16)
 #CONCURRENT_REQUESTS = 32

 # Configure a delay for requests for the same website (default: 0)
 # See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
 # See also autothrottle settings and docs
 #DOWNLOAD_DELAY = 3
 # The download delay setting will honor only one of:
 #CONCURRENT_REQUESTS_PER_DOMAIN = 16
 #CONCURRENT_REQUESTS_PER_IP = 16

 # Disable cookies (enabled by default)
 COOKIES_ENABLED = False

 # Disable Telnet Console (enabled by default)
 #TELNETCONSOLE_ENABLED = False

 # Override the default request headers:
 #DEFAULT_REQUEST_HEADERS = {
 #   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
 #   'Accept-Language': 'en',
 #}

 # Enable or disable spider middlewares
 # See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
 #SPIDER_MIDDLEWARES = {
 #    'autopjt.middlewares.AutopjtSpiderMiddleware': 543,
 #}

 # Enable or disable downloader middlewares
 # See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
 #DOWNLOADER_MIDDLEWARES = {
 #    'autopjt.middlewares.AutopjtDownloaderMiddleware': 543,
 #}

 # Enable or disable extensions
 # See https://doc.scrapy.org/en/latest/topics/extensions.html
 #EXTENSIONS = {
 #    'scrapy.extensions.telnet.TelnetConsole': None,
 #}

 # Configure item pipelines
 # See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
 ITEM_PIPELINES = {
    'autopjt.pipelines.AutopjtPipeline': 300,
 }

 # Enable and configure the AutoThrottle extension (disabled by default)
 # See https://doc.scrapy.org/en/latest/topics/autothrottle.html
 #AUTOTHROTTLE_ENABLED = True
 # The initial download delay
 #AUTOTHROTTLE_START_DELAY = 5
 # The maximum download delay to be set in case of high latencies
 #AUTOTHROTTLE_MAX_DELAY = 60
 # The average number of requests Scrapy should be sending in parallel to
 # each remote server
 #AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
 # Enable showing throttling stats for every response received:
 #AUTOTHROTTLE_DEBUG = False

 # Enable and configure HTTP caching (disabled by default)
 # See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
 #HTTPCACHE_ENABLED = True
 #HTTPCACHE_EXPIRATION_SECS = 0
 #HTTPCACHE_DIR = 'httpcache'
 #HTTPCACHE_IGNORE_HTTP_CODES = []
 #HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
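Since the spider queues 78 list pages on one site, it may be worth uncommenting a few of the throttling options in the file above. The fragment below is only a sketch of what that could look like in settings.py. The setting names all appear in the generated file, but the values are assumptions chosen for illustration, not values from the original project.

```python
# Possible additions to settings.py; the values are illustrative assumptions.

# Wait about one second between requests to the same site.
DOWNLOAD_DELAY = 1

# Keep the number of parallel requests per domain low.
CONCURRENT_REQUESTS_PER_DOMAIN = 4

# Let AutoThrottle adjust the delay based on observed latency.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5
AUTOTHROTTLE_MAX_DELAY = 60
```

With the settings in place, the spider is started from the project directory with scrapy crawl autospd.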

Final result:
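The pipeline writes one JSON object per line, so the output file can be checked by reading it back line by line. The sketch below does this on an in-memory sample. The helper load_records and the two sample lines are made up to match the pipeline's format and are not the crawl's real output.

```python
import io
import json


def load_records(lines):
    """Parse an iterable of JSON-lines text into a list of dicts."""
    records = []
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines
            records.append(json.loads(line))
    return records


# Two made-up lines in the format the pipeline writes; real output differs.
sample_output = io.StringIO(
    '{"name": "示例商品A", "price": "¥10.00", "link": "http://example.com/a", '
    '"comnum": "12条评论", "comnum_link": "http://example.com/a#c"}\n'
    '\n'
    '{"name": "示例商品B", "price": "¥20.00", "link": "http://example.com/b", '
    '"comnum": "3条评论", "comnum_link": "http://example.com/b#c"}\n'
)

records = load_records(sample_output)
print(len(records))
print(records[0]["name"], records[0]["price"])
```

For the real file, the same function can be given a file object opened with encoding="utf-8" on the path used in the pipeline.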
