1、创建工程

scrapy startproject tencent

2、创建项目

scrapy genspider mahuateng

3、既然保存到数据库,自然要安装pymsql

pip install pymysql

4、settings文件,配置信息,包括数据库等

# -*- coding: utf-8 -*-

# Scrapy settings for tencent project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://doc.scrapy.org/en/latest/topics/settings.html
# https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
# https://doc.scrapy.org/en/latest/topics/spider-middleware.html BOT_NAME = 'tencent' SPIDER_MODULES = ['tencent.spiders']
NEWSPIDER_MODULE = 'tencent.spiders' LOG_LEVEL="WARNING"
LOG_FILE="./qq.log"
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36' # Obey robots.txt rules
#ROBOTSTXT_OBEY = True # Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32 # Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16 # Disable cookies (enabled by default)
#COOKIES_ENABLED = False # Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False # Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
# 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
# 'Accept-Language': 'en',
#} # Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'tencent.middlewares.TencentSpiderMiddleware': 543,
#} # Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
# 'tencent.middlewares.TencentDownloaderMiddleware': 543,
#} # Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#} # Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
'tencent.pipelines.TencentPipeline': 300,
} # Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False # Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage' # 连接数据MySQL
# 数据库地址
MYSQL_HOST = 'localhost'
# 数据库用户名:
MYSQL_USER = 'root'
# 数据库密码
MYSQL_PASSWORD = 'yang156122'
# 数据库端口
MYSQL_PORT = 3306
# 数据库名称
MYSQL_DBNAME = 'test'
# 数据库编码
MYSQL_CHARSET = 'utf8'

5、items.py文件定义数据字段

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html import scrapy class TencentItem(scrapy.Item):
"""
数据字段定义
"""
postId = scrapy.Field()
recruitPostId = scrapy.Field()
recruitPostName = scrapy.Field()
countryName = scrapy.Field()
locationName = scrapy.Field()
categoryName = scrapy.Field()
lastUpdateTime = scrapy.Field() pass

6、mahuateng.py文件主要是抓取数据

# -*- coding: utf-8 -*-
import scrapy import json
import logging
class MahuatengSpider(scrapy.Spider):
name = 'mahuateng'
allowed_domains = ['careers.tencent.com']
start_urls = ['https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1561688387174&countryId=&cityId=&bgIds=&productId=&categoryId=&parentCategoryId=40003&attrId=&keyword=&pageIndex=1&pageSize=10&language=zh-cn&area=cn']
pageNum = 1
def parse(self, response):
"""
数据获取
:param response:
:return:
"""
content = response.body.decode()
content = json.loads(content)
content=content['Data']['Posts']
#删除空字典
for con in content:
#print(con)
for key in list(con.keys()):
if not con.get(key):
del con[key]
#记录每一个岗位信息
# for con in content:
# yield con
#print(type(con))
yield con
#logging.warning(con) #####翻页######
self.pageNum = self.pageNum+1
if self.pageNum<=118:
next_url = "https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1561688387174&countryId=&cityId=&bgIds=&productId=&categoryId=&parentCategoryId=40003&attrId=&keyword=&pageIndex="+str(self.pageNum)+"&pageSize=10&language=zh-cn&area=cn"
yield scrapy.Request(
next_url,
callback=self.parse
)

7、pipelines.py文件主要是对数据进行处理,包括将数据存储到mysql

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html import logging from pymysql import cursors
from twisted.enterprise import adbapi
import time
# from tencent.settings import MYSQL_HOST
# from tencent.settings import MYSQL_USER
# from tencent.settings import MYSQL_PASSWORD
# from tencent.settings import MYSQL_PORT
# from tencent.settings import MYSQL_DBNAME
#
# from tencent.settings import MYSQL_CHARSET
import copy
class TencentPipeline(object):
#函数初始化
def __init__(self,db_pool):
self.db_pool=db_pool @classmethod
def from_settings(cls,settings):
"""类方法,只加载一次,数据库初始化"""
db_params = dict(
host=settings['MYSQL_HOST'],
user=settings['MYSQL_USER'],
password=settings['MYSQL_PASSWORD'],
port=settings['MYSQL_PORT'],
database=settings['MYSQL_DBNAME'],
charset=settings['MYSQL_CHARSET'],
use_unicode=True,
# 设置游标类型
cursorclass=cursors.DictCursor
)
# 创建连接池
db_pool = adbapi.ConnectionPool('pymysql', **db_params)
# 返回一个pipeline对象
return cls(db_pool) def process_item(self, item, spider):
"""
数据处理
:param item:
:param spider:
:return:
"""
myItem={}
myItem["postId"] = item["PostId"]
myItem["recruitPostId"] = item["RecruitPostId"]
myItem["recruitPostName"] = item["RecruitPostName"]
myItem["countryName"] = item["CountryName"]
myItem["locationName"] = item["LocationName"]
myItem["categoryName"] = item["CategoryName"]
myItem["lastUpdateTime"] = item["LastUpdateTime"]
logging.warning(myItem)
# 对象拷贝,深拷贝 --- 这里是解决数据重复问题!!!
asynItem = copy.deepcopy(myItem) # 把要执行的sql放入连接池
query = self.db_pool.runInteraction(self.insert_into, asynItem) # 如果sql执行发送错误,自动回调addErrBack()函数
query.addErrback(self.handle_error, myItem, spider)
return myItem # 处理sql函数
def insert_into(self, cursor, item):
# 创建sql语句
sql = "INSERT INTO tencent (postId,recruitPostId,recruitPostName,countryName,locationName,categoryName,lastUpdateTime) VALUES ('{}','{}','{}','{}','{}','{}','{}')".format(
item['postId'], item['recruitPostId'], item['recruitPostName'], item['countryName'], item['locationName'],
item['categoryName'],item['lastUpdateTime'])
# 执行sql语句
cursor.execute(sql)
# 错误函数 def handle_error(self, failure, item, spider):
# #输出错误信息
print("failure", failure)

8、创建数据库表

Navicat MySQL Data Transfer

Source Server         : 本机
Source Server Version : 50519
Source Host : localhost:3306
Source Database : test Target Server Type : MYSQL
Target Server Version : 50519
File Encoding : 65001 Date: 2019-06-28 12:47:06
*/ SET FOREIGN_KEY_CHECKS=0; -- ----------------------------
-- Table structure for tencent
-- ----------------------------
DROP TABLE IF EXISTS `tencent`;
CREATE TABLE `tencent` (
`id` int(10) NOT NULL AUTO_INCREMENT,
`postId` varchar(100) DEFAULT NULL,
`recruitPostId` varchar(100) DEFAULT NULL,
`recruitPostName` varchar(100) DEFAULT NULL,
`countryName` varchar(100) DEFAULT NULL,
`locationName` varchar(100) DEFAULT NULL,
`categoryName` varchar(100) DEFAULT NULL,
`lastUpdateTime` varchar(100) DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=1181 DEFAULT CHARSET=utf8;

完美收官!

python之scrapy爬取数据保存到mysql数据库的更多相关文章

  1. Python scrapy爬虫数据保存到MySQL数据库

    除将爬取到的信息写入文件中之外,程序也可通过修改 Pipeline 文件将数据保存到数据库中.为了使用数据库来保存爬取到的信息,在 MySQL 的 python 数据库中执行如下 SQL 语句来创建 ...

  2. 爬取伯乐在线文章(四)将爬取结果保存到MySQL

    Item Pipeline 当Item在Spider中被收集之后,它将会被传递到Item Pipeline,这些Item Pipeline组件按定义的顺序处理Item. 每个Item Pipeline ...

  3. 5分钟掌握智联招聘网站爬取并保存到MongoDB数据库

    前言 本次主题分两篇文章来介绍: 一.数据采集 二.数据分析 第一篇先来介绍数据采集,即用python爬取网站数据. 1 运行环境和python库 先说下运行环境: python3.5 windows ...

  4. python爬取数据保存到Excel中

    # -*- conding:utf-8 -*- # 1.两页的内容 # 2.抓取每页title和URL # 3.根据title创建文件,发送URL请求,提取数据 import requests fro ...

  5. 关于爬取数据保存到json文件,中文是unicode解决方式

    流程: 爬取的数据处理为列表,包含字典.里面包含中文, 经过json.dumps,保存到json文件中, 发现里面的中文显示未\ue768这样子 查阅资料发现,json.dumps 有一个参数.ens ...

  6. 爬取网贷之家平台数据保存到mysql数据库

    # coding utf-8 import requests import json import datetime import pymysql user_agent = 'User-Agent: ...

  7. Python+Scrapy+Crawlspider 爬取数据且存入MySQL数据库

    1.Scrapy使用流程 1-1.使用Terminal终端创建工程,输入指令:scrapy startproject ProName 1-2.进入工程目录:cd ProName 1-3.创建爬虫文件( ...

  8. Java爬取51job保存到MySQL并进行分析

    大二下实训课结业作业,想着就爬个工作信息,原本是要用python的,后面想想就用java试试看, java就自学了一个月左右,想要锻炼一下自己面向对象的思想等等的, 然后网上转了一圈,拉钩什么的是动态 ...

  9. 如何将大数据保存到 MySql 数据库

    1. 什么是大数据 1. 所谓大数据, 就是大的字节数据,或大的字符数据. 2. 标准 SQL 中提供了如下类型来保存大数据类型: 字节数据类型: tinyblob(256B), blob(64K), ...

随机推荐

  1. cnblogs排版样式预览

    说明:关于本博主题及样式来源于[GitHub]:本博总体排版目录样式风格参照博文[修仙成神之路]进行预览:参照本博设置可参考博文[设置跟本博一样的效果]本博之前发表过的博文存在样式不协调,后期会逐一完 ...

  2. Ajax返回数据却一直进入error(已经解决)

    做asp.net项目  使用ajax $.ajax({ url: '../Music/Default2.aspx?Types=' + type + '&texts=' + text + '', ...

  3. svn中日志不展示解决方法记录

    一,问题:点击svn查看日志时不显示.筛选时间显示为1970 1,猜想可能没有查看日志权限 2,查看linux 下svn版本库 confg 下三个配制文件 authz ,passwd ,svnserv ...

  4. Hadoop_20_MapReduce程序的运行模式

    1.MapReduce程序的运行模式 1. Windows中运行MapReduce程序 (1)mapreduce程序是被提交给LocalJobRunner在本地以单进程的形式运行 (2)而处理的数据及 ...

  5. Maven 安装依赖包

    Guide to installing 3rd party JARs Although rarely, but sometimes you will have 3rd party JARs that ...

  6. 最小m子段和(动态规划)

    问题描述: 给定n个整数组成的序列,现在要求将序列分割为m段,每段子序列中的数在原序列中连续排列.如何分割才能使这m段子序列的和的最大值达到最小? 输入格式: 第一行给出n,m,表示有n个数分成m段, ...

  7. CCPC2019厦门站游记

    day0 人生第二次坐飞机,快乐 出机场到厦门,周围短袖短裤,我秋衣秋裤,快乐 宾馆条件不错,快乐 day1 早上休闲模式,整版子写模板题,快乐 宾馆有露天平台,有桌椅,有风,快乐 中午报道,领衣服, ...

  8. Derby 数据库 客户端 ij使用

    Derby是开源的.嵌入式的Java数据库程序,ij是Derby提供的客户端工具,相当于其他数据库提供的sqlplus工具. ij是纯Java的程序,不用安装,使用起来就像运行普通的Java应用程序一 ...

  9. Chrome禁用隐藏www和m

    解决方案 打开chrome://flags 启动控制台输入并执行以下内容 [ 'omnibox-ui-hide-steady-state-url-path-query-and-ref', 'omnib ...

  10. Vivado添加coe文件

    直接将.txt文件的后缀改为.coe,并在文件的开头添加如下两行代码即可: memory_initialization_radix=10; memory_initialization_vector=