作为

https://github.com/fanqingsong/web_full_stack_application

子项目的一功能的核心部分，使用scrapy抓取数据，解析完的数据，使用 python requets库，将数据推送到 webservice接口上， webservice接口负责保存数据到mongoDB数据库。

实现步骤：

1、使用requests库，与webservice接口对接。

2、使用scrapy抓取数据。

3、结合1 2 实现完整功能。

Requests库（Save to DB through restful api）

库的安装和快速入门见：

http://docs.python-requests.org/en/master/user/quickstart/#response-content

给出测试通过示例代码：

insert_to_db.py

import requests

resp = requests.get('http://localhost:3000/api/v1/summary')

# ------------- GET --------------
if resp.status_code != 200:
     # This means something went wrong.
     raise ApiError('GET /tasks/ {}'.format(resp.status_code))

for todo_item in resp.json():
     print('{} {}'.format(todo_item['Technology'], todo_item['Count']))

# ------------- POST --------------
Technology = {"Technology": "Django", "Count": "50" }

resp = requests.post('http://localhost:3000/api/v1/summary', json=Technology)
if resp.status_code != 201:
     raise ApiError('POST /Technologys/ {}'.format(resp.status_code))

print("-------------------")
print(resp.text)

print('Created Technology. ID: {}'.format(resp.json()["_id"])

Python VirutalEnv运行环境

https://realpython.com/python-virtual-environments-a-primer/

Create a new virtual environment inside the directory:
# Python 2:

$ virtualenv env

# Python 3

$ python3 -m venv env
Note: By default, this will not include any of your existing site packages.

windows 激活：

env\Scripts\activate

Scrapy（Scratch data）

https://scrapy.org/

An open source and collaborative framework for extracting the data you need from websites.

In a fast, simple, yet extensible way.

https://scrapy-chs.readthedocs.io/zh_CN/0.24/topics/architecture.html

安装和使用参考：

https://www.cnblogs.com/lightsong/p/8732537.html

安装和运行过程报错解决办法：

1、 Scrapy运行ImportError: No module named win32api错误

https://blog.csdn.net/u013687632/article/details/57075514

pip install pypiwin32

2、 error: Microsoft Visual C++ 14.0 is required. Get it with "Microsoft Visual C++ Build Tools": http://landinghub.visualstudio.com/visual-cpp-build-tools

https://www.cnblogs.com/baxianhua/p/8996715.html

1. http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted 下载twisted对应版本的whl文件（我的Twisted‑17.5.0‑cp36‑cp36m‑win_amd64.whl)，cp后面是python版本，amd64代表64位，

2. 运行命令：
pip install C:\Users\CR\Downloads\Twisted-17.5.0-cp36-cp36m-win_amd64.whl

给出示例代码：

quotes_spider.py

import scrapy

class QuotesSpider(scrapy.Spider):
     name = "quotes"
     start_urls = [
         'http://quotes.toscrape.com/tag/humor/',
     ]

def parse(self, response):
         for quote in response.css('div.quote'):
             yield {
                 'text': quote.css('span.text::text').extract_first(),
                 'author': quote.xpath('span/small/text()').extract_first(),
             }

next_page = response.css('li.next a::attr("href")').extract_first()
         if next_page is not None:
             yield response.follow(next_page, self.parse)

在此目录下，运行

scrapy runspider quotes_spider.py -o quotes.json

输出结果

[
{"text": "\u201cThe person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.\u201d", "author": "Jane Austen"},
{"text": "\u201cA day without sunshine is like, you know, night.\u201d", "author": "Steve Martin"},
{"text": "\u201cAnyone who thinks sitting in church can make you a Christian must also think that sitting in a garage can make you a car.\u201d", "author": "Garrison Keillor"},
{"text": "\u201cBeauty is in the eye of the beholder and it may be necessary from time to time to give a stupid or misinformed beholder a black eye.\u201d", "author": "Jim Henson"},
{"text": "\u201cAll you need is love. But a little chocolate now and then doesn't hurt.\u201d", "author": "Charles M. Schulz"},
{"text": "\u201cRemember, we're madly in love, so it's all right to kiss me anytime you feel like it.\u201d", "author": "Suzanne Collins"},
{"text": "\u201cSome people never go crazy. What truly horrible lives they must lead.\u201d", "author": "Charles Bukowski"},
{"text": "\u201cThe trouble with having an open mind, of course, is that people will insist on coming along and trying to put things in it.\u201d", "author": "Terry Pratchett"},
{"text": "\u201cThink left and think right and think low and think high. Oh, the thinks you can think up if only you try!\u201d", "author": "Dr. Seuss"},
{"text": "\u201cThe reason I talk to myself is because I\u2019m the only one whose answers I accept.\u201d", "author": "George Carlin"},
{"text": "\u201cI am free of all prejudice. I hate everyone equally. \u201d", "author": "W.C. Fields"},
{"text": "\u201cA lady's imagination is very rapid; it jumps from admiration to love, from love to matrimony in a moment.\u201d", "author": "Jane Austen"}
]

业务全流程实例

https://github.com/fanqingsong/web_data_visualization

由于zhipin网站对爬虫有反制策略，本例子采用scrapy的官方爬取实例quotes为研究对象。

流程为：

1、爬取数据， scrapy 的两个组件 spider & item pipeline

2、存数据库， requests库的post方法推送数据到 webservice_quotes服务器的api

3、 webservice_quotes将数据保存到mongoDB

4、浏览器访问vue页面，与websocket_quotes服务器建立连接

5、 websocket_quotes定期（每隔1s）从mongoDB中读取数据，推送给浏览器端，缓存为Vue应用的data，data绑定到模板视图

scrapy item pipeline 推送数据到webservice接口

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

import requests

class ScratchZhipinPipeline(object):
     def process_item(self, item, spider):

print("--------------------")
         print(item['text'])
         print(item['author'])
         print("--------------------")

# save to db through web service
         resp = requests.post('http://localhost:3001/api/v1/quote', json=item)
         if resp.status_code != 201:
             raise ApiError('POST /item/ {}'.format(resp.status_code))
         print(resp.text)
         print('Created Technology. ID: {}'.format(resp.json()["_id"]))

return item

爬虫运行: scrapy crawl quotes

webservice运行: npm run webservice_quotes

websocket运行: npm run websocket_quotes

vue调试环境运行： npm run dev

chrome:

db:

Python生成requirement.text文件

http://www.cnblogs.com/zhaoyingjie/p/6645811.html

快速生成requirement.txt的安装文件

(CenterDesigner) xinghe@xinghe:~/PycharmProjects/CenterDesigner$ pip freeze > requirements.txt

安装所需要的文件

pip install -r requirement.txt

web全栈应用【爬取（scrapy）数据 -> 通过restful接口存入数据库 -> websocket推送展示到前台】的更多相关文章

从0开始学爬虫8使用requests/pymysql和beautifulsoup4爬取维基百科词条链接并存入数据库
从0开始学爬虫8使用requests和beautifulsoup4爬取维基百科词条链接并存入数据库 Python使用requests和beautifulsoup4爬取维基百科词条链接并存入数据库参考 ...
python3爬取百度知道的问答并存入数据库(MySQL)
一.链接分析: 以"Linux"为搜索的关键字为例: 首页的链接为:https://zhidao.baidu.com/search?lm=0&rn=10&pn=0& ...
web全栈架构师[笔记] — 02 数据交互
数据交互一.http协议基本特点 1.无状态的协议 2.连接过程:发送连接请求.响应接受.发送请求 3.消息分两块:头.体 http和https 二.form 基本属性 action——提交到哪儿 ...
python爬取网页数据并存储到mysql数据库
#python 3.5 from urllib.request import urlopen from urllib.request import urlretrieve from bs4 impor ...
NodeJs简单七行爬虫--爬取自己Qzone的说说并存入数据库
没有那么难的,嘿嘿,说起来呢其实挺简单的,或者不能叫爬虫,只需要将自己的数据加载到程序里再进行解析就可以了,如果说你的Qzone是向所有人开放的,那么就有一个JSONP的接口,这么说来就简单了,也就不 ...
python爬虫爬取ip记录网站信息并存入数据库
import requests import re import pymysql #10页仔细观察路由 db = pymysql.connect("localhost",&quo ...
python爬取拉勾网数据并进行数据可视化
爬取拉勾网关于python职位相关的数据信息,并将爬取的数据已csv各式存入文件,然后对csv文件相关字段的数据进行清洗,并对数据可视化展示,包括柱状图展示.直方图展示.词云展示等并根据可视化的数据做 ...
Web 全栈大会：万维网之父的数据主权革命
大家好,今天我和大家分享一下由万维网之父发起的一场数据主权革命.什么叫数据主权?很容易理解,现在我们的数据是把持在巨头手里的,你的微信通讯录和聊天记录都无法导出,不管是从人权角度还是从法理角度,这些数 ...
web scraper——简单的爬取数据【二】
web scraper——安装[一] 在上文中我们已经安装好了web scraper现在我们来进行简单的爬取,就来爬取百度的实时热点吧. http://top.baidu.com/buzz?b=1&a ...

随机推荐

ORACLE跨数据库查询的方法
原文地址:http://blog.csdn.net/huzhenwei/article/details/2533869 本文简述了通过创建database link实现Oracle跨数据库查询的方法 ...
pc端移动端拖拽实现
#div1 { width: 100px; height: 100px; background: red; position: absolute; } html <div id="di ...
慢日志查询python flask sqlalchemy慢日志记录
engine = create_engine(ProdConfig.SQLALCHEMY_DATABASE_URI, echo=True) app = Flask(__name__) app.conf ...
centos7下kubernetes（13。kubernetes-探讨service IP）
service cluster IP是一个虚拟IP,是由kubernetes节点上的iptables规则管理的通过iptables-save | grep 10.105.215.156看到与clus ...
基于 HTML5 WebGL 的地铁站 3D 可视化系统
前言工业互联网,物联网,可视化等名词在我们现在信息化的大背景下已经是耳熟能详,日常生活的交通,出行,吃穿等可能都可以用信息化的方式来为我们表达,在传统的可视化监控领域,一般都是基于 Web SCAD ...
Entity Framework Core系列之实战（ASP.NET Core MVC应用程序）
本示例演示在ASP.NET 应用程序中使用EF CORE创建数据库并对其做基本的增删改查操作.当然我们默认你的机器上已经安装了.NET CORE SDK以及合适的IDE.本例使用的是Visual St ...
Python基础知识3-函数、参数及参数解构
函数函数定义.调用函数参数函数参数默认参数函数参数默认值可变参数 keyword-only参数可变参数和参数默认值函数参数参数解构练习: #编写一个函数,能够接受至少2个参数 def ...
Python 使用 matplotlib绘制3D图形
3D图形在数据分析.数据建模.图形和图像处理等领域中都有着广泛的应用,下面将给大家介绍一下如何在Python中使用 matplotlib进行3D图形的绘制,包括3D散点.3D表面.3D轮廓.3D直线( ...
使用ffmpeg视频切片并加密
想达到的目的:将一个mp4视频文件切割为多个ts片段,并在切割过程中对每一个片段使用 AES-128 加密,最后生成一个m3u8的视频索引文件: 电脑环境 Fedora,已经安装了最新的ffmpeg: ...
linux文件目录权限和系统基础优化命令(yum源配置)
一.用户 1.介绍我们都知道linux中有root用户和普通用户,但是同样是普通用户,为什么有些用户的权限却不一样呢?其实这就类似于我们的QQ群,root用户就是QQ群主,他拥有最高的权利,想干什么 ...

web全栈应用【爬取（scrapy）数据 -> 通过restful接口存入数据库 -> websocket推送展示到前台】

Requests库 （Save to DB through restful api）