python之scrapy篇(一)

一、首先创建工程(cmd中进行)

scrapy startproject xxx

二、编写Item文件

添加要字段

# -*- coding: utf-8 -*-

# Define here the models for your scraped items

#

# See documentation in:

# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy

class DoubanItem(scrapy.Item):

    # define the fields for your item here like:

    # name = scrapy.Field()

    # 电影标题

    title = scrapy.Field()

    # 电影信息

    info = scrapy.Field()

    # 电影评分

    score = scrapy.Field()

    # 评分人数

    number = scrapy.Field()

    # 简介

    content = scrapy.Field()

# 简介
content = scrapy.Field()

三、进入spider文件(cmd中进行)

scrapy genspider demo 'www.movie.douban.com'

创建完成后进入

编写代码

# -*- coding: utf-8 -*-

import scrapy

from douban.items import DoubanItem

class DoubanmovieSpider(scrapy.Spider):

    name = 'doubanmovie'

    allowed_domains = ['movie.douban.com']

    offset = 0

    url = "https://movie.douban.com/top250?start="

    start_urls = (

        url + str(offset),

    )

    def parse(self, response):

        item = DoubanItem()

        # 标题

        movies = response.xpath("//div[@class='info']")

        for movie in movies:

            name = movie.xpath('div[@class="hd"]/a/span/text()').extract()

            message = movie.xpath('div[@class="bd"]/p/text()').extract()

            star = movie.xpath('div[@class="bd"]/div[@class="star"]/span[@class="rating_num"]/text()').extract()

            number = movie.xpath('div[@class="bd"]/div[@class="star"]/span/text()').extract()

            quote = movie.xpath('div[@class="bd"]/p[@class="quote"]/span/text()').extract()

            if quote:

                quote = quote[0]

            else:

                quote = ''

            item['title'] = ''.join(name)

            item['info'] = quote

            item['score'] = star[0]

            item['content'] = ';'.join(message).replace(' ', '').replace('\n', '')

            item['number'] = number[1].split('人')[0]

            yield item

        if self.offset < 225:

            self.offset += 25

            yield scrapy.Request(self.url + str(self.offset), callback=self.parse)

四、配置setting文件

# -*- coding: utf-8 -*-

# Scrapy settings for douban project

#

# For simplicity, this file contains only settings considered important or

# commonly used. You can find more settings consulting the documentation:

#

#     https://docs.scrapy.org/en/latest/topics/settings.html

#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html

#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'douban'

SPIDER_MODULES = ['douban.spiders']

NEWSPIDER_MODULE = 'douban.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent

USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.87 Safari/537.36"

# Obey robots.txt rules

# ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)

#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)

# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay

# See also autothrottle settings and docs

# DOWNLOAD_DELAY = 3

# The download delay setting will honor only one of:

#CONCURRENT_REQUESTS_PER_DOMAIN = 16

#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)

COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)

#TELNETCONSOLE_ENABLED = False

# Override the default request headers:

DEFAULT_REQUEST_HEADERS = {

'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.87 Safari/537.36',

  # 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',

  # 'Accept-Language': 'en',

}

# Enable or disable spider middlewares

# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html

#SPIDER_MIDDLEWARES = {

#    'douban.middlewares.DoubanSpiderMiddleware': 543,

#}

# Enable or disable downloader middlewares

# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html

DOWNLOADER_MIDDLEWARES = {

   # 'douban.middlewares.DoubanDownloaderMiddleware': 100,

   'douban.middlewares.RandomUserAgent': 100,

}

USER_AGENT = [

   # Opera

   "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36 OPR/26.0.1656.60",

   "Opera/8.0 (Windows NT 5.1; U; en)",

   "Mozilla/5.0 (Windows NT 5.1; U; en; rv:1.8.1) Gecko/20061208 Firefox/2.0.0 Opera 9.50",

   "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; en) Opera 9.50",

   # Firefox

   "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0",

   "Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10",

   # Safari

   "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.57.2 (KHTML, like Gecko) Version/5.1.7 Safari/534.57.2",

   # chrome

   "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36",

   "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11",

   "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.16 (KHTML, like Gecko) Chrome/10.0.648.133 Safari/534.16",

   # 360

   "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36",

   "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko",

   # 淘宝浏览器

   "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.11 TaoBrowser/2.0 Safari/536.11",

   # 猎豹浏览器

   "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER",

   "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; LBBROWSER)",

   "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E; LBBROWSER)",

   # QQ浏览器

   "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; QQBrowser/7.0.3698.400)",

   "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)",

   # sogou浏览器

   "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 SE 2.X MetaSr 1.0",

   "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SV1; QQDownload 732; .NET4.0C; .NET4.0E; SE 2.X MetaSr 1.0)",

   # maxthon浏览器

   "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Maxthon/4.4.3.4000 Chrome/30.0.1599.101 Safari/537.36",

   # UC浏览器

   "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 UBrowser/4.0.3214.0 Safari/537.36",

]

# Enable or disable extensions

# See https://docs.scrapy.org/en/latest/topics/extensions.html

#EXTENSIONS = {

#    'scrapy.extensions.telnet.TelnetConsole': None,

#}

# Configure item pipelines

# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html

ITEM_PIPELINES = {

   # 'douban.pipelines.DoubanPipeline': 300,

   'douban.pipelines.DoubanSqlPipeline': 300,

   'douban.pipelines.DoubanWritePipeline': 250,

}

# 主机名

MYSQL_HOST = "IP"

# 端口号

MYSQL_PORT = 3306

# 数据库用户名

MYSQL_USER = "root"

# 数据库密码

MYSQL_PASSWORD = "Password"

# 数据库名称

MYSQL_DBNAME = "mydouban"

# 存放数据的表名称

MYSQL_TABLENAME = "doubanmovies"

# 数据库编码

MYSQL_CHARSET = "utf8"

# Enable and configure the AutoThrottle extension (disabled by default)

# See https://docs.scrapy.org/en/latest/topics/autothrottle.html

#AUTOTHROTTLE_ENABLED = True

# The initial download delay

#AUTOTHROTTLE_START_DELAY = 5

# The maximum download delay to be set in case of high latencies

#AUTOTHROTTLE_MAX_DELAY = 60

# The average number of requests Scrapy should be sending in parallel to

# each remote server

#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0

# Enable showing throttling stats for every response received:

#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)

# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings

#HTTPCACHE_ENABLED = True

#HTTPCACHE_EXPIRATION_SECS = 0

#HTTPCACHE_DIR = 'httpcache'

#HTTPCACHE_IGNORE_HTTP_CODES = []

#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

配置管道文件

ITEM_PIPELINES = {
# 'douban.pipelines.DoubanPipeline': 300,
'douban.pipelines.DoubanSqlPipeline': 300,
'douban.pipelines.DoubanWritePipeline': 250,
}

User_Agent配置(包含了大多数浏览器)

DOWNLOADER_MIDDLEWARES = {

   # 'douban.middlewares.DoubanDownloaderMiddleware': 100,

   'douban.middlewares.RandomUserAgent': 100,

}

USER_AGENT = [

   # Opera

   "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36 OPR/26.0.1656.60",

   "Opera/8.0 (Windows NT 5.1; U; en)",

   "Mozilla/5.0 (Windows NT 5.1; U; en; rv:1.8.1) Gecko/20061208 Firefox/2.0.0 Opera 9.50",

   "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; en) Opera 9.50",

   # Firefox

   "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0",

   "Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10",

   # Safari

   "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.57.2 (KHTML, like Gecko) Version/5.1.7 Safari/534.57.2",

   # chrome

   "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36",

   "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11",

   "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.16 (KHTML, like Gecko) Chrome/10.0.648.133 Safari/534.16",

   # 360

   "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36",

   "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko",

   # 淘宝浏览器

   "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.11 TaoBrowser/2.0 Safari/536.11",

   # 猎豹浏览器

   "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER",

   "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; LBBROWSER)",

   "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E; LBBROWSER)",

   # QQ浏览器

   "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; QQBrowser/7.0.3698.400)",

   "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)",

   # sogou浏览器

   "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 SE 2.X MetaSr 1.0",

   "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SV1; QQDownload 732; .NET4.0C; .NET4.0E; SE 2.X MetaSr 1.0)",

   # maxthon浏览器

   "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Maxthon/4.4.3.4000 Chrome/30.0.1599.101 Safari/537.36",

   # UC浏览器

   "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 UBrowser/4.0.3214.0 Safari/537.36",

]

# 主机名
MYSQL_HOST = "IP"
# 端口号
MYSQL_PORT = 3306
# 数据库用户名
MYSQL_USER = "root"
# 数据库密码
MYSQL_PASSWORD = "password"
# 数据库名称
MYSQL_DBNAME = "mydouban"
# 存放数据的表名称
MYSQL_TABLENAME = "doubanmovies"
# 数据库编码
MYSQL_CHARSET = "utf8"

五、管道文件pipelines

# -*- coding: utf-8 -*-

# Define your item pipelines here

#

# Don't forget to add your pipeline to the ITEM_PIPELINES setting

# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

import pymongo

import pymysql

from scrapy import settings

import json

import logging

from pymysql import cursors

from twisted.enterprise import adbapi

import time

import copy

#保存到本地

class DoubanWritePipeline(object):

    def __init__(self):

        self.filename = open("douban.json", "wb")

    def process_item(self, item, spider):

        text = json.dumps(dict(item), ensure_ascii=False) + "\n"

        self.filename.write(text.encode("utf-8"))

        return item

    def colse_spider(self, spider):

        self.filename.close()

#保存到mysql(提前建立好数据库)

class DoubanSqlPipeline(object):

    def __init__(self):

        self.conn = pymysql.connect(host='IP', user='root',

                               passwd='password', db='mydouban', charset='utf8')

        self.cur = self.conn.cursor()

    def process_item(self, item, spider):

        title = item["title"]

        info = item["info"]

        score = item["score"]

        number = item["number"]

        content = item["content"]

        # 创建sql语句

        sql = """INSERT INTO doubanmovies (title,info,score,number,content,createtime) VALUES ("{}","{}","{}","{}","{}","{}")""".format(

            pymysql.escape_string(title), pymysql.escape_string(info), pymysql.escape_string(score), pymysql.escape_string(number), pymysql.escape_string(content),pymysql.escape_string(time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())))

        # 执行sql语句

        self.conn.ping(reconnect=True)

        self.cur.execute(sql)

        self.conn.commit()

    def close_spider(self, spider):

        self.cur.close()

        self.conn.close()

运行

scrapy crawl xxx

OVER！

下面展示一下效果...

python之scrapy篇(一)的更多相关文章

python之scrapy篇(三)
一.创建工程(cmd) scrapy startproject xxxx 二.编写item文件 # -*- coding: utf-8 -*- # Define here the models for ...
python之scrapy篇(二)
一.创建工程 scarpy startproject xxx 二.编写iteam文件 # -*- coding: utf-8 -*- # Define here the models for your ...
Python爬虫Scrapy框架入门（0）
想学习爬虫,又想了解python语言,有个python高手推荐我看看scrapy. scrapy是一个python爬虫框架,据说很灵活,网上介绍该框架的信息很多,此处不再赘述.专心记录我自己遇到的问题 ...
[Python爬虫] scrapy爬虫系列 <一>.安装及入门介绍
前面介绍了很多Selenium基于自动测试的Python爬虫程序,主要利用它的xpath语句,通过分析网页DOM树结构进行爬取内容,同时可以结合Phantomjs模拟浏览器进行鼠标或键盘操作.但是,更 ...
python爬虫Scrapy(一)-我爬了boss数据
一.概述学习python有一段时间了,最近了解了下Python的入门爬虫框架Scrapy,参考了文章Python爬虫框架Scrapy入门.本篇文章属于初学经验记录,比较简单,适合刚学习爬虫的小伙伴. ...
Python之Scrapy爬虫框架安装及简单使用
题记:早已听闻python爬虫框架的大名.近些天学习了下其中的Scrapy爬虫框架,将自己理解的跟大家分享.有表述不当之处,望大神们斧正. 一.初窥Scrapy Scrapy是一个为了爬取网站数据,提 ...
dota玩家与英雄契合度的计算器，python语言scrapy爬虫的使用
首发:个人博客,更新&纠错&回复演示地址在这里,代码在这里. 一个dota玩家与英雄契合度的计算器(查看效果),包括两部分代码: 1.python的scrapy爬虫,总体思路是pag ...
python爬虫scrapy框架——人工识别登录知乎倒立文字验证码和数字英文验证码(2)
操作环境:python3 在上一文中python爬虫scrapy框架--人工识别知乎登录知乎倒立文字验证码和数字英文验证码(1)我们已经介绍了用Requests库来登录知乎,本文如果看不懂可以先看之前 ...
python爬虫scrapy项目详解（关注、持续更新）
python爬虫scrapy项目(一) 爬取目标:腾讯招聘网站(起始url:https://hr.tencent.com/position.php?keywords=&tid=0&st ...

随机推荐

关闭Win10窗口拖动到桌面边缘自动缩放功能
关于Django的序列化问题。serializers
在DRF框架里,ModelSerializers是一个重要的组件.大大的帮组我们节省了数据序列化的过程,真的可以说是良心产品.接手的这个项目用的Django,前人的代码都是手动序列化的,为了保证风格的 ...
Typescript + React 高仿 Antd 从零到一打造自己的组件库(完整)
买了张轩老师的课程,感觉很不错,适用于高级进阶,老师讲的通俗易懂,欢迎讨论学习.WX:Jujiu_i
【Vue】 axios同步执行多个请求
问题项目中遇到一个需求,在填写商品的时候,选择商品分类后,加载出商品分类的扩展属性. 这个扩展属性有可能是自定义的数据字典里的单选/多远. 要用第一个axios查询扩展属性,第二个axios 从第一 ...
Javascrip之BOM
重点内容理解windows对象-BOM核心控制窗口.框架.弹出窗口利用location对象中的页面信息利用navigator对象了解浏览器 BOM:浏览器对象模型[Browner Object ...
Linux 服务分类
一,服务分类 1,服务简介与分类 1,服务的分类启动与自启动 1,服务启动:就是在当前系统中让服务运行,并提供功能 2,服务自启动:自启动是指让服务在系统开机或重启动之后,随着系统的启动而自动启动的 ...
Python 反序列化漏洞学习笔记
参考文章一篇文章带你理解漏洞之 Python 反序列化漏洞 Python Pickle/CPickle 反序列化漏洞 Python反序列化安全问题 pickle反序列化初探前言上面看完,请忽略下 ...
springboot使用mybatis拦截进行SQL分页
新建一个类MyPageInterceptor.java(注意在springboot中要添加注解@Component) package com.grand.p1upgrade.mapper.test; ...
验证pdf文件的电子章签名
pom.xml <?xml version="1.0" encoding="UTF-8"?> <project xmlns="htt ...
初入Nginx--配置篇
Nginx的主配置文件为/path/to/nginx/nginx.conf.Nginx.conf的配置文件结构主要由以下几个部分组成: ..... events{ .... } http{ .... ...

python之scrapy篇(一)

python之scrapy篇(一)的更多相关文章

随机推荐

热门专题