Using CrawlSpider in the Python Scrapy framework: writing the crawled content to MySQL

I. First, create the test database in MySQL, together with the site table it will hold
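The post does not show the table definition. A minimal sketch of this step in Python follows, assuming the site table only needs the two columns that the pipeline later inserts into (name and link). The column types, the id key, and the helper name create_schema are assumptions and are not taken from the original.

```python
# Sketch only: column types and the id column are assumptions.
CREATE_DATABASE = "CREATE DATABASE IF NOT EXISTS test DEFAULT CHARACTER SET utf8"

CREATE_TABLE = """
CREATE TABLE IF NOT EXISTS site (
    id   INT AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(255),
    link VARCHAR(255)
)
"""


def create_schema(host="localhost", user="root", password=""):
    """Create the test database and the site table (hypothetical helper)."""
    import pymysql  # imported lazily so the statements above can be read without it

    conn = pymysql.connect(host=host, user=user, password=password, charset="utf8")
    try:
        with conn.cursor() as cur:
            cur.execute(CREATE_DATABASE)
            cur.execute("USE test")
            cur.execute(CREATE_TABLE)
        conn.commit()
    finally:
        conn.close()
```

Calling create_schema() once before the first crawl is enough; IF NOT EXISTS makes repeated calls harmless.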

II. Create the Scrapy project

# scrapy startproject <project name>

scrapy startproject demo4

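III. Enter the project directory and generate the spider file from a spider template

# scrapy genspider -l    # list the available templates

# scrapy genspider -t <template name> <spider file name> <allowed domain>

scrapy genspider -t crawl test sohu.com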

IV. Set up an IP (proxy) pool or a User-Agent pool (middlewares.py)

 # -*- coding: utf-8 -*-
 # Used to pick a random entry from the pools below
 import random
 # Built-in proxy middleware, the base class for the IP pool
 from scrapy.downloadermiddlewares.httpproxy import HttpProxyMiddleware
 # Built-in User-Agent middleware, the base class for the User-Agent pool
 from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware

 # IP pool
 class HTTPPROXY(HttpProxyMiddleware):
     # The constructor must take one argument with a default (ip=''):
     # the inherited from_crawler() passes a single positional argument
     def __init__(self, ip=''):
         self.ip = ip

     def process_request(self, request, spider):
         # Pick a random proxy for every request
         item = random.choice(IPPOOL)
         try:
             print("Current IP: " + item["ipaddr"])
             request.meta["proxy"] = "http://" + item["ipaddr"]
         except Exception as e:
             print(e)

 # IP pool entries
 IPPOOL = [
     {"ipaddr": "182.117.102.10:8118"},
     {"ipaddr": "121.31.102.215:8123"},
     # As written, the first octet exceeds 255; replace with a working proxy
     {"ipaddr": "1222.94.128.49:8118"}
 ]

 # User-Agent pool
 class USERAGENT(UserAgentMiddleware):
     # The constructor must take one argument with a default (user_agent=''),
     # for the same reason as above
     def __init__(self, user_agent=''):
         self.user_agent = user_agent

     def process_request(self, request, spider):
         # Pick a random User-Agent for every request
         item = random.choice(UPPOOL)
         try:
             print("Current User-Agent: " + item)
             # setdefault only sets the header if it is not already present
             request.headers.setdefault('User-Agent', item)
         except Exception as e:
             print(e)

 # User-Agent pool entries
 UPPOOL = [
     "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:52.0) Gecko/20100101 Firefox/52.0",
     "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36",
     "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.79 Safari/537.36 Edge/14.14393"
 ]
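Free proxy lists go stale quickly and are easy to mistype, and the third entry of the pool above has a first octet above 255. A small sanity check can filter such entries before they reach the middleware. This is only a sketch: the helper name is_valid_proxy is hypothetical, and it checks the address format only, not whether the proxy is actually reachable.

```python
import ipaddress


def is_valid_proxy(entry):
    """Return True if entry looks like 'IPv4:port' with a usable port number."""
    host, sep, port = entry.rpartition(":")
    if not sep or not port.isdecimal():
        return False
    try:
        ipaddress.IPv4Address(host)  # raises ValueError for malformed addresses
    except ValueError:
        return False
    return 0 < int(port) < 65536


# Same entries as the IP pool in middlewares.py
pool = [
    {"ipaddr": "182.117.102.10:8118"},
    {"ipaddr": "121.31.102.215:8123"},
    {"ipaddr": "1222.94.128.49:8118"},
]

usable = [p for p in pool if is_valid_proxy(p["ipaddr"])]
rejected = [p["ipaddr"] for p in pool if not is_valid_proxy(p["ipaddr"])]
```

Here rejected contains only the malformed third entry; the middleware could then draw from usable instead of the raw list.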

V. Configure settings.py

 COOKIES_ENABLED = False

 DOWNLOADER_MIDDLEWARES = {
     # Uncomment the next two lines to enable the IP pool
     # 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware':123,
     # 'demo4.middlewares.HTTPPROXY' : 125,
     # The custom middleware has the lower number, so it sets the header first
     'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': 2,
     'demo4.middlewares.USERAGENT': 1
 }

 ITEM_PIPELINES = {
     'demo4.pipelines.Demo4Pipeline': 300,
 }
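In Scrapy, a downloader middleware with a lower number has its process_request called earlier. That is why USERAGENT (1) has to come before the built-in UserAgentMiddleware (2) in this setup: both set the header with setdefault, so whichever runs first wins. A common alternative, shown here as a sketch and not part of the original configuration, is to switch the built-in middleware off by mapping it to None, so the ordering no longer matters.

```python
# Alternative settings.py fragment (sketch; assumes the same project name, demo4).
DOWNLOADER_MIDDLEWARES = {
    # None disables the built-in middleware entirely
    "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": None,
    # With the built-in one disabled, the exact order value is an arbitrary choice
    "demo4.middlewares.USERAGENT": 400,
}
```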

VI. Define the data to be scraped (items.py)

 # -*- coding: utf-8 -*-
 import scrapy

 # Define here the models for your scraped items
 #
 # See documentation in:
 # http://doc.scrapy.org/en/latest/topics/items.html

 class Demo4Item(scrapy.Item):
     # Text of the headline link
     name = scrapy.Field()
     # URL the headline points to
     link = scrapy.Field()

VII. Write the spider file (test.py)

 # -*- coding: utf-8 -*-
 import scrapy
 from scrapy.linkextractors import LinkExtractor
 from scrapy.spiders import CrawlSpider, Rule
 from demo4.items import Demo4Item

 class TestSpider(CrawlSpider):
     name = 'test'
     allowed_domains = ['sohu.com']
     start_urls = ['http://www.sohu.com/']

     rules = (
         # Send every link that matches news.sohu.com to parse_item;
         # follow=False means links on those pages are not extracted in turn
         Rule(LinkExtractor(allow=(r'http://news\.sohu\.com',), allow_domains=('sohu.com',)),
              callback='parse_item', follow=False),
         # Rule(LinkExtractor(allow=('.*?/n.*?shtml'),allow_domains=('sohu.com')), callback='parse_item', follow=False),
     )

     def parse_item(self, response):
         i = Demo4Item()
         # extract() returns a list of strings, one per matched node
         i['name'] = response.xpath('//div[@class="news"]/h1/a/text()').extract()
         i['link'] = response.xpath('//div[@class="news"]/h1/a/@href').extract()
         #i['description'] = response.xpath('//div[@id="description"]').extract()
         return i
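The allow argument of the rule is a regular expression, not a literal URL prefix. The sketch below reproduces the filtering step with the standard re module only, on the assumption that LinkExtractor searches each allow pattern in the absolute URL of a link. The candidate URLs are made up for illustration. It also shows that parentheses alone do not make a tuple, which matters when writing allow=(...) or allow_domains=(...).

```python
import re

# Same idea as allow=... in the Rule: a regular expression tried against each URL.
ALLOW = re.compile(r"http://news\.sohu\.com")

# Hypothetical links, as they might be found on the start page
candidates = [
    "http://news.sohu.com/20170801/n504781234.shtml",
    "http://sports.sohu.com/index.shtml",
    "http://www.sohu.com/",
]

matched = [url for url in candidates if ALLOW.search(url)]

# ('sohu.com') is just a parenthesized string; a one-element tuple needs a trailing comma
as_written = ('sohu.com')
as_tuple = ('sohu.com',)
```

Only the first candidate ends up in matched. Scrapy accepts either a single string or a sequence for these arguments, so both spellings work, but the trailing comma makes the intent explicit.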

VIII. Write the pipeline file (pipelines.py)

 # -*- coding: utf-8 -*-
 import pymysql

 # Define your item pipelines here
 #
 # Don't forget to add your pipeline to the ITEM_PIPELINES setting
 # See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

 class Demo4Pipeline(object):
     def __init__(self):
         # Database connection; uses the test database created in the first step
         self.conn = pymysql.connect(host='localhost', user='root', password='', database='test', charset='utf8')
         self.cur = self.conn.cursor()

     def process_item(self, item, spider):
         # Parameterized statement: the values are passed separately,
         # never formatted into the SQL string
         sql = "insert into site(name,link) values(%s,%s)"
         # zip() pairs each name with its link; empty results produce no rows
         for nam, lin in zip(item["name"], item["link"]):
             self.cur.execute(sql, (nam, lin))
         # One commit per item instead of one per row
         self.conn.commit()
         return item

     def close_spider(self, spider):
         self.cur.close()
         self.conn.close()
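The insert logic can be tried out without a MySQL server. In the sketch below, sqlite3 from the standard library stands in for MySQL, so the placeholder is ? rather than the %s that pymysql expects, and the item values are invented. The names and links are paired with zip(), which stops at the shorter list. Indexing both lists by len(item["name"]) would instead raise IndexError whenever fewer links than names were extracted.

```python
import sqlite3

# sqlite3 is a stand-in for MySQL here; its placeholder is ? instead of pymysql's %s.
item = {
    # Hypothetical scraped values; the third headline has no link on purpose
    "name": ["Headline A", "Headline B", "Headline C"],
    "link": ["http://news.sohu.com/a", "http://news.sohu.com/b"],
}

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE site (name TEXT, link TEXT)")

# zip() stops at the shorter list, so the unmatched third name is skipped
rows = list(zip(item["name"], item["link"]))
conn.executemany("INSERT INTO site (name, link) VALUES (?, ?)", rows)
conn.commit()

stored = conn.execute("SELECT name, link FROM site ORDER BY name").fetchall()
conn.close()
```

After the run, stored holds two rows, one for each complete name and link pair.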

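IX. Summary

1. Confirm that the database connection works before you start writing data. When parameterizing the SQL, pay close attention to the format: pymysql uses %s as the placeholder for every value, without quotes around it, and the values are passed to execute() as a separate tuple.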
