Scraping Toutiao (今日头条) with requests and MongoDB, using multiple processes
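
The script imports its settings with `from config import *`, but the `config` module itself is not part of the listing. A minimal sketch of what it might look like is given here: the six names are the ones the script actually uses, while every value is an assumption chosen for illustration.

```python
# config.py: a hypothetical example; every value below is an assumption.
MONGO_URL = 'localhost'   # MongoDB host or connection URI
MONGO_DB = 'toutiao'      # database name
MONGO_TABLE = 'toutiao'   # collection name
KEYWORD = '街拍'           # search keyword sent to the index endpoint
GROUP_START = 1           # first result group to fetch
GROUP_END = 20            # last result group to fetch (inclusive)
```

With these values the script would request offsets 20, 40, ..., 400, one per worker task.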

import json
import os
import re
from hashlib import md5
from json.decoder import JSONDecodeError
from multiprocessing import Pool
from urllib.parse import urlencode

import pymongo
import requests
from bs4 import BeautifulSoup
from requests.exceptions import ConnectionError

from config import *

# connect=False defers the connection until first use, so each worker
# process opens its own connection after the pool forks.
client = pymongo.MongoClient(MONGO_URL, connect=False)
db = client[MONGO_DB]

def get_page_index(offset, keyword):
    """Fetch one page of search results as raw JSON text, or None on failure."""
    data = {
        'autoload': 'true',
        'count': 20,
        'cur_tab': 3,
        'format': 'json',
        'keyword': keyword,
        'offset': offset,
    }
    params = urlencode(data)
    base = 'http://www.toutiao.com/search_content/'
    url = base + '?' + params
    try:
        response = requests.get(url)
        if response.status_code == 200:
            return response.text
        return None
    except ConnectionError:
        print('Error occurred')
        return None

def download_image(url):
    """Download a single image and hand its bytes to save_image."""
    print('Downloading', url)
    try:
        response = requests.get(url)
        if response.status_code == 200:
            save_image(response.content)
        return None
    except ConnectionError:
        return None

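def save_image(content):
    """Write image bytes to the working directory, named by their MD5 hash."""
    # Naming by content hash means an identical image is only stored once.
    file_path = '{0}/{1}.{2}'.format(os.getcwd(), md5(content).hexdigest(), 'jpg')
    print(file_path)
    if not os.path.exists(file_path):
        # The with-block closes the file, so no explicit close() is needed.
        with open(file_path, 'wb') as f:
            f.write(content)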

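def parse_page_index(text):
    """Yield the article URLs found in an index page's JSON text."""
    if not text:  # get_page_index returns None on failure
        return
    try:
        data = json.loads(text)
    except JSONDecodeError:
        return
    if data and 'data' in data:
        for item in data.get('data') or []:
            url = item.get('article_url')
            if url:  # skip entries without an article URL
                yield url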

def get_page_detail(url):
    """Fetch an article page as HTML text, or None on failure."""
    try:
        response = requests.get(url)
        if response.status_code == 200:
            return response.text
        return None
    except ConnectionError:
        print('Error occurred')
        return None

def parse_page_detail(html, url):
    """Extract the title and gallery image URLs from an article page."""
    soup = BeautifulSoup(html, 'lxml')
    result = soup.select('title')
    title = result[0].get_text() if result else ''
    # Raw string, so the regex escapes are not read as string escapes.
    images_pattern = re.compile(r'gallery: JSON\.parse\("(.*)"\)', re.S)
    result = re.search(images_pattern, html)
    if result:
        # The gallery JSON is embedded as an escaped string; drop the backslashes.
        data = json.loads(result.group(1).replace('\\', ''))
        if data and 'sub_images' in data:
            sub_images = data.get('sub_images')
            images = [item.get('url') for item in sub_images]
            for image in images:
                download_image(image)
            return {
                'title': title,
                'url': url,
                'images': images,
            }
    return None

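def save_to_mongo(result):
    """Insert one parsed article into the configured collection."""
    # Collection.insert() was removed in PyMongo 4; insert_one() replaces it.
    if db[MONGO_TABLE].insert_one(result):
        print('Successfully Saved to Mongo', result)
        return True
    return False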

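def main(offset):
    """Process one page of search results: fetch, parse, download, store."""
    text = get_page_index(offset, KEYWORD)
    urls = parse_page_index(text)
    for url in urls:
        html = get_page_detail(url)
        if not html:  # the request failed, so there is nothing to parse
            continue
        result = parse_page_detail(html, url)
        if result:
            save_to_mongo(result)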

if __name__ == '__main__':
    pool = Pool()
    # One task per result page; the offset advances in steps of 20.
    groups = [x * 20 for x in range(GROUP_START, GROUP_END + 1)]
    pool.map(main, groups)
    pool.close()
    pool.join()
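
The gallery extraction in `parse_page_detail` can be tried in isolation, without any network access. The sketch below reuses the same regular expression and the same backslash-stripping step; the helper name `extract_gallery_urls` and the sample page fragment are invented for illustration and are not taken from a real Toutiao page.

```python
import json
import re

# Same pattern as in parse_page_detail, written as a raw string.
GALLERY_PATTERN = re.compile(r'gallery: JSON\.parse\("(.*)"\)', re.S)


def extract_gallery_urls(html):
    """Return the image URLs of an embedded gallery, or [] if none is found."""
    match = GALLERY_PATTERN.search(html)
    if not match:
        return []
    # The JSON is embedded as an escaped string literal; strip the backslashes.
    data = json.loads(match.group(1).replace('\\', ''))
    return [item.get('url') for item in data.get('sub_images', [])]


# Hypothetical page fragment, made up for this example only.
sample_html = (
    r'var page = { gallery: JSON.parse("{\"count\":2,\"sub_images\":'
    r'[{\"url\":\"http:\/\/example.com\/a.jpg\"},'
    r'{\"url\":\"http:\/\/example.com\/b.jpg\"}]}"), other: [] };'
)

print(extract_gallery_urls(sample_html))
```

Stripping every backslash is a blunt way to unescape the string: it is enough for URLs, but it would corrupt any value that legitimately contains an escaped character, so it should be treated as a shortcut rather than a general JSON unescape.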
